`sparknlp_jsl.structured_deidentification`#

Utility class that helps to obfuscate tabular data.

Module Contents#

Classes#

StructuredDeidentification

A helper class for obfuscating structured (tabular) data.

class StructuredDeidentification(spark: pyspark.sql.SparkSession, columns: Dict[str, str], columnsSeed: Dict[str, int] = None, obfuscateRefFile: str = '', obfuscateRefSource: str = 'both', days: int = 0, useRandomDateDisplacement: bool = False, dateFormats: List[str] = None, language: str = 'en', idColumn: str = '', region: str = '', keepYear: bool = False, keepMonth: bool = False, unnormalizedDateMode: str = 'obfuscate', keepTextSizeForObfuscation: bool = False, fakerLengthOffset: int = 3, genderAwareness: bool = False, ageRangesByHipaa: bool = False, consistentAcrossNameParts: bool = True, selectiveObfuscateRefSource: Dict[str, str] = {})#

A helper class for obfuscating structured (tabular) data.

Parameters:

columns (dict) –
A dictionary that maps each column to obfuscate to its entity type. Each key is a column name in the dataframe and each value is the entity for that column. The default entities are:
- “location”: A general location.
- “location-other”: A location that is not a country, street, hospital, city, or state.
- “street”: A street.
- “hospital”: The name of a hospital.
- “city”: A city.
- “state”: A state.
- “zip”: A zip code.
- “country”: A country.
- “contact”: A person's contact information.
- “username”: A username.
- “phone”: A phone number.
- “fax”: A fax number.
- “url”: A URL.
- “email”: A person's email address.
- “profession”: A person's profession.
- “name”: A person's name.
- “doctor”: The name of a doctor.
- “patient”: The name of a patient.
- “first_name”: A person's first name.
- “last_name”: A person's last name.
- “id”: A general ID number.
- “bioid”: A system to screen for protein interactions as they occur in living cells.
- “age”: The age of something or someone.
- “organization”: The name of an organization or company.
- “healthplan”: The ID that identifies a health plan.
- “medicalrecord”: The identifier of a medical record.
- “device”: The ID that identifies a device.
- “date”: A general date.
- “ssn”: A Social Security number.
- “ip”: An Internet Protocol (IP) address.
- “passport”: A passport number.
- “dln”: A driver's license number.
- “npi”: A National Provider Identifier.
- “c_card”: A credit card number.
- “iban”: An International Bank Account Number.
- “dea”: A Drug Enforcement Administration (DEA) number.
columnsSeed (dict) – Assigns a seed to each column that you want to obfuscate. The seed is used to randomly select the replacement entities during obfuscation.
obfuscateRefFile (str) – Optional file with your own terms to use for obfuscation. The file maps each entity (the key) to the terms (the value) that will be used to replace it.
days (int) – Number of days by which to displace dates during obfuscation. If not provided, a random integer between 1 and 60 is used.
useRandomDateDisplacement (bool) – If True, date entities are displaced by a random number of days; otherwise the days parameter is used.
dateFormats (List[str]) – List of date formats. Example: [“dd-MM-yyyy”, “dd/MM/yyyy”, “d/M/yyyy”, “d-M-yyyy”]
language (str) –
The language used to select faker entities. Supported values:
- ‘en’ (English)
- ‘de’ (German)
- ‘es’ (Spanish)
- ‘fr’ (French)
- ‘ar’ (Arabic)
- ‘ro’ (Romanian)
Default: ‘en’.
idColumn (str) – The column that contains the ID of the row. If provided, data is obfuscated consistently per idColumn value, which matters most for date entities.
region (str) – Selects a region-specific set of date formats, used mainly when obfuscating dates. Supported values are ‘eu’ (European Union) and ‘us’ (USA). Default is ‘’, which means the given dateFormats are used.
keepYear (bool) – Whether to keep the year intact when obfuscating date entities. If true, the year will remain unchanged during the obfuscation process. Default is False.
keepMonth (bool) – Whether to keep the month intact when obfuscating date entities. If true, the month will remain unchanged during the obfuscation process. Default is False.
unnormalizedDateMode (str) – The mode to use when a date is not in a recognized format. Supported values are ‘mask’, ‘obfuscate’, and ‘skip’. Default: ‘obfuscate’.
keepTextSizeForObfuscation (bool) – Whether the output should maintain the same character length as the input text. Default is False.
fakerLengthOffset (int) – How much length deviation is accepted in the obfuscated text when keepTextSizeForObfuscation is enabled. The value must be greater than 0. Default is 3.
genderAwareness (bool) – Whether to use gender-aware names during obfuscation. This parameter affects only names. Setting it to True might decrease performance. Default is False.
ageRangesByHipaa (bool) – Whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule. Default is False.
consistentAcrossNameParts (bool) –
Param that indicates whether consistency should be enforced across different parts of a name (e.g., first name, middle name, last name).

When set to True, the same transformation or obfuscation will be applied consistently to all parts of the same name entity, even if those parts appear separately.
For example, if “John Smith” is obfuscated as “Liam Brown”, then:
- When the full name “John Smith” appears, it will be replaced with “Liam Brown”
- When “John” or “Smith” appear individually, they will still be obfuscated as “Liam” and “Brown” respectively, ensuring consistency in name transformation.
Default: True
selectiveObfuscateRefSource (Dict[str, str]) – A dictionary of entity names to their obfuscation modes. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation source. Possible values in dict for the obfuscation source are: ‘faker’, ‘both’, ‘file’.
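
The parameters above can be combined as in the following minimal sketch. The column names (NAME, DOB, PHONE), the seed value, and the helper functions `build_deid_kwargs`, `check_deid_kwargs`, and `make_obfuscator` are hypothetical and exist only for illustration. The sketch also assumes that columnsSeed is keyed by column name and that obfuscateRefSource accepts the same sources listed for selectiveObfuscateRefSource. The import path is taken from the module name of this page; creating the object requires a licensed sparknlp_jsl installation, so the import happens lazily inside `make_obfuscator`.

```python
# Allowed values as listed in the parameter descriptions above.
ALLOWED_VALUES = {
    "language": {"en", "de", "es", "fr", "ar", "ro"},
    "region": {"", "eu", "us"},
    "unnormalizedDateMode": {"mask", "obfuscate", "skip"},
    # Assumption: obfuscateRefSource accepts the same sources that are
    # documented for selectiveObfuscateRefSource.
    "obfuscateRefSource": {"faker", "both", "file"},
}


def build_deid_kwargs():
    """Keyword arguments for a hypothetical table with NAME, DOB and PHONE columns."""
    return {
        "columns": {"NAME": "patient", "DOB": "date", "PHONE": "phone"},
        "columnsSeed": {"NAME": 23},  # assumed to be keyed by column name
        "obfuscateRefSource": "faker",
        "days": 5,
        "dateFormats": ["dd/MM/yyyy", "dd-MM-yyyy"],
        "language": "en",
        "keepYear": True,
        # Entities not listed here fall back to obfuscateRefSource.
        "selectiveObfuscateRefSource": {"phone": "faker"},
    }


def check_deid_kwargs(kwargs):
    """Return a list of problems found by comparing kwargs with the documented values."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        if name in kwargs and kwargs[name] not in allowed:
            problems.append(f"{name}={kwargs[name]!r} is not one of {sorted(allowed)}")
    for entity, source in kwargs.get("selectiveObfuscateRefSource", {}).items():
        if source not in ALLOWED_VALUES["obfuscateRefSource"]:
            problems.append(
                f"selectiveObfuscateRefSource[{entity!r}]={source!r} is not a known source"
            )
    if kwargs.get("fakerLengthOffset", 3) <= 0:
        problems.append("fakerLengthOffset must be greater than 0")
    return problems


def make_obfuscator(spark):
    """Create the obfuscator; needs a licensed sparknlp_jsl installation."""
    # Imported lazily so the helpers above can run without the library.
    from sparknlp_jsl.structured_deidentification import StructuredDeidentification

    kwargs = build_deid_kwargs()
    problems = check_deid_kwargs(kwargs)
    if problems:
        raise ValueError("; ".join(problems))
    return StructuredDeidentification(spark, **kwargs)
```

The local check is only a convenience for catching typos before a Spark job starts; the library may accept or reject values differently.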

columns#

instance#

obfuscateRefFile = ''#

spark#

obfuscateColumns(df: pyspark.sql.DataFrame, outputAsArray: bool = True, overwrite: bool = True, suffix: str = '_obfuscated')#

Obfuscate the columns of a dataframe.

Parameters:

df (DataFrame) – The dataframe to obfuscate
outputAsArray (bool) – If True, the output will be an array of strings; otherwise it will be a string. Default: True.
overwrite (bool) – If True, the columns will be overwritten; otherwise the obfuscated columns will be added to the dataframe. Default: True.
suffix (str) – The suffix to add to the obfuscated columns if overwrite is False. Default: “_obfuscated”. It must not be an empty string.

Returns:

A dataframe with the columns obfuscated

Return type:

DataFrame
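
A call to obfuscateColumns might look like the sketch below, which keeps the original columns and adds suffixed ones. The helper names `expected_column_names` and `obfuscate_keeping_originals` are hypothetical, as are the column names in the usage note. The name prediction assumes that each new column is named as the original column name followed by the suffix, which is how the suffix parameter is described above; the column order in the returned dataframe is not documented, so the helper returns a set.

```python
def expected_column_names(df_columns, deid_columns, overwrite=True, suffix="_obfuscated"):
    """Predict the column names of the result, based on the parameter descriptions."""
    if overwrite:
        # Obfuscated values replace the originals, so no new names appear.
        return set(df_columns)
    if not suffix:
        raise ValueError("suffix must not be an empty string when overwrite is False")
    # Assumption: new columns are named <original name> + <suffix>.
    added = {name + suffix for name in deid_columns if name in df_columns}
    return set(df_columns) | added


def obfuscate_keeping_originals(obfuscator, df):
    """Run obfuscateColumns so the originals stay next to the obfuscated values."""
    # `obfuscator` is a StructuredDeidentification instance, `df` a Spark DataFrame.
    return obfuscator.obfuscateColumns(
        df, outputAsArray=False, overwrite=False, suffix="_obfuscated"
    )
```

For a dataframe with columns NAME, DOB, and NOTE where only NAME and DOB are configured for obfuscation, `expected_column_names(["NAME", "DOB", "NOTE"], ["NAME", "DOB"], overwrite=False)` yields the three original names plus NAME_obfuscated and DOB_obfuscated.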
