sparknlp_jsl.annotator.deid.base_deidentification_params#
Module Contents#
Classes#
BaseDeIdentificationParams | Base parameters for DeIdentification annotators.
- class BaseDeIdentificationParams#
Base parameters for DeIdentification annotators.
- Parameters:
mode (str) – Mode for Anonymizer [‘mask’|‘obfuscate’]
dateEntities (list[str]) – List of date entities. Default: [‘DATE’, ‘DOB’, ‘DOD’, ‘EFFDATE’, ‘FISCAL_YEAR’]
obfuscateDate (bool) – When mode==‘obfuscate’, whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When set to True, make sure the dateFormats param fits your needs. If the value is True and obfuscation fails, the unnormalizedDateMode param is applied. When set to False, the date is masked to <DATE>. Default: False
unnormalizedDateMode (str) – The mode to use if the date is not formatted. Options: [mask, obfuscate, skip]. Default: obfuscate.
days (int) – A number of days to obfuscate the dates by displacement. If not provided a random integer between 1 and 60 will be used.
useShiftDays (bool) – Whether to use the random shift day when the document has this in its metadata. Default: False
dateFormats (list[str]) – List of date formats to automatically displace if parsed.
additionalDateFormats (list[str]) – Additional date formats to be considered during date obfuscation. This allows users to specify custom date formats in addition to the default date formats. Default: [].
region (str) – The region to use for date parsing, mainly when obfuscating dates. It decides whether the first or the second part of a date such as 11/11/2023 is read as the day. Options: ‘eu’ for European Union, ‘us’ for the USA. Default: ‘eu’
keepYear (bool) – Whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
keepMonth (bool) – Whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.
obfuscateRefSource (str) – The source used to obfuscate the entities. This property does not apply to date entities. The values are the following: ‘custom’ or ‘file’: takes the fakers from the setCustomFakers function or from the file. ‘faker’: takes the fakers from the Faker module. ‘both’: takes the fakers from the setCustomFakers function and the Faker module.
selectiveObfuscateRefSource (Dict[str, str]) – A dictionary of entity names to their obfuscation modes. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation source. Possible values in dict for the obfuscation source are: ‘custom’, ‘faker’, ‘both’, ‘file’.
language (str) – The language used to select the regex file and some faker entities. The values are the following: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) or ‘ro’ (Romanian). Default: ‘en’.
seed (int) – The seed used to select the entities in obfuscate mode. With a fixed seed, you can repeat an execution several times and get the same output.
maskingPolicy (str) –
- Select the masking policy:
same_length_chars: Replace the obfuscated entity with a masking sequence of asterisks surrounded by square brackets, so that the total length of the masking sequence equals the length of the original sequence. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
entity_labels_without_brackets: Replace the values with the corresponding entity labels without brackets.
same_length_chars_without_brackets: Replace the value with asterisks of the same length, without brackets.
fixedMaskLength (int) – The length of the masking sequence in case of fixed_length_chars masking policy.
sameLengthFormattedEntities (list[str]) – List of formatted entities to generate the same length outputs as original ones during obfuscation. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE.
genderAwareness (bool) – Whether to use gender-aware names during obfuscation. This param affects only names. If True, it might decrease performance. Default: False
ageRanges (list[int]) – List of integers specifying the limits of the age groups to preserve during obfuscation.
ageRangesByHipaa (bool) – Whether to obfuscate ages based on the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule. The HIPAA Privacy Rule mandates that ages from patients older than 90 years must be obfuscated, while ages for patients 90 years or younger can remain unchanged. If True, age entities larger than 90 will be obfuscated as per the HIPAA Privacy Rule and the others will remain unchanged. If False, the ageRanges parameter applies. Default: False.
consistentAcrossNameParts (bool) –
Param that indicates whether consistency should be enforced across different parts of a name (e.g., first name, middle name, last name).
When set to True, the same transformation or obfuscation will be applied consistently to all parts of the same name entity, even if those parts appear separately.
- For example, if “John Smith” is obfuscated as “Liam Brown”, then:
When the full name “John Smith” appears, it will be replaced with “Liam Brown”
When “John” or “Smith” appear individually, they will still be obfuscated as “Liam” and “Brown” respectively, ensuring consistency in name transformation. Default: True
keepTextSizeForObfuscation (bool) – Used to specify that the required obfuscated text length should be same as input length. Default: False.
fakerLengthOffset (int) – Used to specify how much length deviation is accepted in obfuscation output. keepTextSizeForObfuscation param should be enabled. Must be greater than 0. Default is 3.
geoConsistency (bool) – Sets whether to enforce consistent obfuscation across geographical entities: state, city, street, zip and phone. This parameter enables intelligent geographical entity obfuscation that maintains realistic relationships between different geographic components. When enabled, the system ensures that obfuscated addresses form coherent, valid combinations rather than random replacements. Default: False
countryObfuscation (bool) – Whether to obfuscate country entities or not. If True, the country entities will be obfuscated. Default: False.
- additionalDateFormats#
- ageRanges#
- ageRangesByHipaa#
- consistentAcrossNameParts#
- countryObfuscation#
- dateEntities#
- dateFormats#
- days#
- fakerLengthOffset#
- fixedMaskLength#
- genderAwareness#
- geoConsistency#
- keepMonth#
- keepTextSizeForObfuscation#
- keepYear#
- language#
- maskingPolicy#
- mode#
- obfuscateDate#
- obfuscateRefSource#
- region#
- sameLengthFormattedEntities#
- seed#
- unnormalizedDateMode#
- useShiftDays#
- getSelectiveObfuscateRefSource()#
Returns the dictionary of entity names to their obfuscate ref sources.
- setAdditionalDateFormats(formats: list)#
Sets additional date formats to be considered during date obfuscation. This allows users to specify custom date formats in addition to the default date formats.
- Parameters:
formats (list[str]) – List of additional date formats to be considered during date obfuscation.
- setAgeRanges(value: list)#
Sets the list of integers specifying the limits of the age groups to preserve during obfuscation.
- Parameters:
value (List[int]) – List of integers specifying the limits of the age groups to preserve during obfuscation.
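One way to read "age groups to preserve" is that an obfuscated age stays inside the same group as the original. The sketch below illustrates that reading in plain Python. The `obfuscate_age` helper, the example limits, and the half-open `[lower, upper)` group boundaries are all assumptions for illustration, not the library's implementation.

```python
import bisect
import random

def obfuscate_age(age, age_ranges, rng):
    """Return a random age drawn from the same age group as `age`.

    `age_ranges` is a sorted list of group limits. Groups are treated as
    half-open intervals [lower, upper), which is an assumption of this sketch.
    """
    idx = bisect.bisect_right(age_ranges, age)
    if idx == 0 or idx == len(age_ranges):
        # The age falls outside every group: leave it unchanged in this sketch.
        return age
    lower, upper = age_ranges[idx - 1], age_ranges[idx]
    return rng.randrange(lower, upper)

rng = random.Random(42)
limits = [1, 4, 12, 20, 40, 60, 80]  # hypothetical group limits
fake_age = obfuscate_age(35, limits, rng)  # some age in the 20..39 group
```

With these limits, 35 belongs to the group bounded by 20 and 40, so the replacement is always drawn from 20 to 39.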
- setAgeRangesByHipaa(value: bool)#
Sets whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.
The HIPAA Privacy Rule mandates that ages from patients older than 90 years must be obfuscated, while age for patients 90 years or younger can remain unchanged.
- Parameters:
value (bool) – If True, age entities larger than 90 will be obfuscated as per the HIPAA Privacy Rule and the others will remain unchanged. If False, the ageRanges parameter applies. Default: False.
- setConsistentAcrossNameParts(value: bool)#
Sets whether to enforce consistent obfuscation across name parts, even when they appear separately. When set to True, the same transformation or obfuscation will be applied consistently to all parts of the same name entity, even if those parts appear separately.
- For example, if “John Smith” is obfuscated as “Liam Brown”, then:
When the full name “John Smith” appears, it will be replaced with “Liam Brown”
When “John” or “Smith” appear individually, they will still be obfuscated as “Liam” and “Brown” respectively, ensuring consistency in name transformation.
Default: True
- Parameters:
value (bool) – Whether to enforce consistent obfuscation across name parts.
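The "John Smith" example can be pictured as a lookup table that holds both the full name and its parts. This is only a sketch of the idea in plain Python. `build_name_map` is a hypothetical helper, and it assumes the original and fake names split into the same number of whitespace-separated parts.

```python
def build_name_map(original, fake):
    """Map a full name and each of its parts to the matching fake parts."""
    orig_parts = original.split()
    fake_parts = fake.split()
    if len(orig_parts) != len(fake_parts):
        raise ValueError("original and fake names need the same number of parts")
    mapping = {original: fake}
    # Pair parts by position so each part gets its own stable replacement.
    mapping.update(zip(orig_parts, fake_parts))
    return mapping

name_map = build_name_map("John Smith", "Liam Brown")
# name_map["John Smith"] -> "Liam Brown"
# name_map["John"]       -> "Liam"
# name_map["Smith"]      -> "Brown"
```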
- setCountryObfuscation(value: bool)#
Sets whether to obfuscate country entities or not. If True, country entities will be obfuscated. If False, country entities will not be obfuscated.
- Parameters:
value (bool) – Whether to obfuscate country entities or not. Default: False.
- setDateEntities(entities: list)#
Sets list of date entities. Default: [‘DATE’, ‘DOB’, ‘DOD’, ‘EFFDATE’, ‘FISCAL_YEAR’]
- Parameters:
entities (list[str]) – List of date entities.
- setDateFormats(formats: list)#
Sets list of date formats to automatically displace if parsed
- Parameters:
formats (list[str]) – List of date formats to automatically displace if parsed
- setDays(day: int)#
Sets the number of days by which dates are displaced during obfuscation. If not provided, a random integer between 1 and 60 will be used.
- Parameters:
day (int) – Number of days by which dates are displaced during obfuscation.
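Date displacement can be illustrated with the standard library alone. The `shift_date` helper below is hypothetical and uses Python `strptime` format codes, which may differ from the pattern syntax the annotator expects in dateFormats. It only mirrors the documented rule that a random integer between 1 and 60 is used when no value is given.

```python
import random
from datetime import datetime, timedelta

def shift_date(text, fmt, days=None, rng=None):
    """Displace a date string by `days`, keeping its original format."""
    if days is None:
        # Documented fallback: a random integer between 1 and 60.
        days = (rng or random).randint(1, 60)
    shifted = datetime.strptime(text, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)

shifted = shift_date("2023-01-15", "%Y-%m-%d", days=10)  # "2023-01-25"
```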
- setFakerLengthOffset(value)#
Sets how much length deviation is accepted in the obfuscation output. Requires keepTextSizeForObfuscation to be enabled. Must be greater than 0. Default: 3
- Parameters:
value (int) – Integer value to specify length deviation.
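To make the length-offset idea concrete, here is a minimal sketch of choosing a replacement whose length is within a given offset of the original. `pick_fake` and its candidate list are hypothetical, and the tie-breaking rule (closest length wins) is an assumption rather than documented behavior.

```python
def pick_fake(original, candidates, offset=3):
    """Pick a replacement whose length is within `offset` of the original."""
    if offset <= 0:
        raise ValueError("offset must be greater than 0")
    target = len(original)
    in_range = [c for c in candidates if abs(len(c) - target) <= offset]
    if not in_range:
        return None
    # Prefer an exact length match, otherwise the closest length in range.
    return min(in_range, key=lambda c: abs(len(c) - target))

choice = pick_fake("Smith", ["Li", "Brown", "Montgomery"])  # "Brown"
```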
- setFixedMaskLength(length)#
Sets the length of the masking sequence in case of fixed_length_chars masking policy. Default: 7
- Parameters:
length (int) – The length of the masking sequence in case of fixed_length_chars masking policy.
- setGenderAwareness(value: bool)#
Sets whether to use gender-aware names during obfuscation. This param affects only names. If True, it might decrease performance. Default: False
- Parameters:
value (bool) – Whether to use gender-aware names during obfuscation. Default: False
- setGeoConsistency(value: bool)#
Sets whether to enforce consistent obfuscation across geographical entities: state, city, street, zip and phone.
Functionality Overview#
This parameter enables intelligent geographical entity obfuscation that maintains realistic relationships between different geographic components. When enabled, the system ensures that obfuscated addresses form coherent, valid combinations rather than random replacements.
Supported Entity Types#
The following geographical entities are processed with priority order:
state (Priority: 0) - US state names
city (Priority: 1) - City names
zip (Priority: 2) - Zip codes
street (Priority: 3) - Street addresses
phone (Priority: 4) - Phone numbers
Language Requirement#
IMPORTANT: Geographic consistency is only applied when:
geoConsistency parameter is set to True
AND language parameter is set to "en"
For non-English configurations, this feature is automatically disabled regardless of the parameter setting.
Consistency Algorithm#
When geographical entities come from the chunk columns:
Entity Grouping: All geographic entities are identified and grouped by type
Fake Address Selection: A consistent set of fake US addresses is selected using hash-based deterministic selection to ensure reproducibility
Priority-Based Mapping: Entities are mapped to fake addresses following the priority order (state → city → zip → street → phone)
Consistent Replacement: All entities of the same type within a document use the same fake address pool, maintaining geographical coherence
Parameter Interactions#
IMPORTANT: Enabling this parameter automatically disables:
keepTextSizeForObfuscation - Text size preservation is not maintained
consistentObfuscation - Standard consistency rules are overridden
file-based fakers
This is necessary because geographic consistency requires specific fake address selection that may not preserve original text lengths or follow standard obfuscation patterns.
Examples
Basic usage:
>>> from sparknlp_jsl.annotator import DeIdentification
>>> deid = DeIdentification() \
...     .setInputCols(["sentence", "token", "ner_chunk"]) \
...     .setOutputCol("deidentified") \
...     .setGeoConsistency(True) \
...     .setLanguage("en")
- Parameters:
value (bool) – Whether to enforce consistent obfuscation across geographical entities. Default: False.
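The consistency steps described for setGeoConsistency (hash-based deterministic selection, then priority-ordered mapping) can be sketched in plain Python. Everything below is illustrative: the address pool, the `select_address` and `obfuscate_geo` helpers, and the choice of SHA-256 over the highest-priority entity are assumptions, not the library's actual algorithm or data.

```python
import hashlib

# Hypothetical fake-address pool; a real pool would be far larger.
FAKE_ADDRESSES = [
    {"state": "Ohio", "city": "Dayton", "zip": "45402",
     "street": "12 Elm St", "phone": "937-555-0101"},
    {"state": "Texas", "city": "Austin", "zip": "78701",
     "street": "48 Oak Ave", "phone": "512-555-0102"},
    {"state": "Oregon", "city": "Salem", "zip": "97301",
     "street": "7 Pine Rd", "phone": "503-555-0103"},
]
# Priority order taken from the documentation: state, city, zip, street, phone.
PRIORITY = ["state", "city", "zip", "street", "phone"]

def select_address(entities, pool=FAKE_ADDRESSES):
    """Pick one fake address deterministically from the document's entities."""
    # Hash the highest-priority entity present so equal input gives equal output.
    for entity_type in PRIORITY:
        if entity_type in entities:
            key = f"{entity_type}:{entities[entity_type]}"
            break
    else:
        raise ValueError("no geographic entity found")
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return pool[int(digest, 16) % len(pool)]

def obfuscate_geo(entities):
    """Replace each geographic entity with the matching field of one address."""
    address = select_address(entities)
    return {t: address[t] for t in PRIORITY if t in entities}

doc_entities = {"city": "Boston", "state": "Massachusetts", "zip": "02108"}
result = obfuscate_geo(doc_entities)
```

Because every field comes from the same pool entry, the replaced state, city, and zip always belong together, and repeating the call on the same input returns the same result.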
- setKeepMonth(value: bool)#
Sets whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.
- Parameters:
value (bool) – Whether to keep the month intact when obfuscating date entities.
- setKeepTextSizeForObfuscation(value: bool)#
Sets whether the output should keep the same character length as the input text. If True, the output keeps the input length when a same-length replacement is available; otherwise the length might vary. If False, the output will be completely random. Default: False
- Parameters:
value (bool) – Whether to keep the same text length when obfuscating entities.
- setKeepYear(value: bool)#
Sets whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
- Parameters:
value (bool) – Whether to keep the year intact when obfuscating date entities.
- setLanguage(lang: str)#
Sets the language used to select the regex file and some faker entities. The values are the following: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) or ‘ro’ (Romanian). Default: ‘en’
- Parameters:
lang (str) – The language used to select the regex file and some faker entities. Default: ‘en’.
- setMaskingPolicy(mask: str)#
- Sets the masking policy:
same_length_chars: Replace the obfuscated entity with a masking sequence of asterisks surrounded by square brackets, so that the total length of the masking sequence equals the length of the original sequence. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
entity_labels_without_brackets: Replace the values with the corresponding entity labels without brackets.
same_length_chars_without_brackets: Replace the value with asterisks of the same length, without brackets.
- Parameters:
mask (str) – The masking policy
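The five policies are easiest to compare side by side. The function below is a plain-Python sketch of the documented outputs, not the annotator's code. `mask_entity` is a hypothetical name, the angle-bracket label form follows the `<DATE>` example used elsewhere on this page, and the default length of 7 follows the setFixedMaskLength default.

```python
def mask_entity(text, label, policy, fixed_mask_length=7):
    """Return the masked form of `text` under the given masking policy."""
    if policy == "same_length_chars":
        if len(text) < 3:
            # Too short to fit brackets: asterisks only.
            return "*" * len(text)
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "same_length_chars_without_brackets":
        return "*" * len(text)
    if policy == "fixed_length_chars":
        return "*" * fixed_mask_length
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "entity_labels_without_brackets":
        return label
    raise ValueError(f"unknown masking policy: {policy}")

mask_entity("Smith", "NAME", "same_length_chars")                   # "[***]"
mask_entity("Jo", "NAME", "same_length_chars")                      # "**"
mask_entity("Smith", "NAME", "same_length_chars_without_brackets")  # "*****"
mask_entity("Smith", "NAME", "fixed_length_chars")                  # "*******"
mask_entity("Smith", "NAME", "entity_labels")                       # "<NAME>"
mask_entity("Smith", "NAME", "entity_labels_without_brackets")      # "NAME"
```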
- setMode(mode: str)#
Sets mode for Anonymizer [‘mask’|’obfuscate’]
- Parameters:
mode (str) – Mode for Anonymizer [‘mask’|’obfuscate’]
- setObfuscateDate(value: bool)#
Sets whether to obfuscate dates when mode==‘obfuscate’. This param helps in consistency to make dateFormats more visible. When set to True, make sure the dateFormats param fits your needs. If the value is True and obfuscation fails, the unnormalizedDateMode param is applied. When set to False, the date is masked to <DATE>. Default: False
- Parameters:
value (bool) – When mode==‘obfuscate’, whether to obfuscate dates or not. Default: False.
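The interplay between obfuscateDate, dateFormats, and unnormalizedDateMode reads as a small decision flow, sketched here with the standard library. `deidentify_date` is hypothetical, the formats are Python `strptime` codes rather than the library's pattern syntax, and treating ‘skip’ as "return the text unchanged" is an assumption. How the ‘obfuscate’ fallback rewrites an unparsed date is not described here, so the sketch leaves it unimplemented.

```python
from datetime import datetime, timedelta

def deidentify_date(text, formats, obfuscate_date, days,
                    unnormalized_mode="obfuscate"):
    """Illustrate the documented date handling in obfuscate mode."""
    if not obfuscate_date:
        # obfuscateDate=False: the date is masked.
        return "<DATE>"
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # First matching format wins; shift and re-emit in the same format.
        return (parsed + timedelta(days=days)).strftime(fmt)
    # No format matched: unnormalizedDateMode decides what happens.
    if unnormalized_mode == "mask":
        return "<DATE>"
    if unnormalized_mode == "skip":
        return text  # assumption: the entity is left as it is
    raise NotImplementedError("fallback obfuscation is not specified here")

formats = ["%d/%m/%Y", "%Y-%m-%d"]
deidentify_date("2023-01-15", formats, True, 5)            # "2023-01-20"
deidentify_date("2023-01-15", formats, False, 5)           # "<DATE>"
deidentify_date("mid January", formats, True, 5, "mask")   # "<DATE>"
```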
- setObfuscateRefSource(source: str)#
Sets the source used to obfuscate the entities. This property does not apply to date entities. The values are the following:
custom: Takes the entities from the setCustomFakers function.
faker: Takes the entities from the Faker module.
both: Takes the entities randomly from the setCustomFakers function and the Faker module.
- Parameters:
source (str) – The source of obfuscation to obfuscate the entities. Default: faker.
- setRegion(value: str)#
Sets the region used to select particular dateFormats. This property is mainly used when obfuscating dates. For example, it decides whether the first or the second part of 11/11/2023 is read as the day. The values are the following: ‘eu’ for European Union, ‘us’ for the USA. Default: ‘eu’
- Parameters:
value (str) – The region used to select date formats. Options: ‘eu’ for European Union, ‘us’ for the USA. Default: ‘eu’
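The effect of the region on an ambiguous date can be shown with two `strptime` formats. The mapping below (day-first for ‘eu’, month-first for ‘us’) follows the description above. `REGION_FORMATS` and `parse_by_region` are hypothetical names, and the real annotator may use a longer list of formats per region.

```python
from datetime import datetime

# Assumed mapping: day-first for 'eu', month-first for 'us'.
REGION_FORMATS = {"eu": "%d/%m/%Y", "us": "%m/%d/%Y"}

def parse_by_region(text, region="eu"):
    """Parse a slash-separated date according to the region's day/month order."""
    return datetime.strptime(text, REGION_FORMATS[region])

eu = parse_by_region("03/04/2023", "eu")  # 3 April 2023
us = parse_by_region("03/04/2023", "us")  # 4 March 2023
```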
- setSameLengthFormattedEntities(value: list)#
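Sets the list of formatted entities for which obfuscation generates outputs of the same length as the original ones.
- Parameters:
value (List[str]) – List of formatted entities. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE.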
- setSeed(s)#
Sets the seed used to select the entities in obfuscate mode. With a fixed seed, you can repeat an execution several times and get the same output.
- Parameters:
s (int) – The seed to select the entities on obfuscate mode.
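Seeded selection is what makes repeated runs produce the same replacements. The sketch below shows the general principle with Python's `random.Random`. The name list and `pick_names` are hypothetical, and the annotator's own random source is not described here.

```python
import random

FAKE_NAMES = ["Liam Brown", "Ava Jones", "Noah Davis", "Mia Clark"]  # hypothetical

def pick_names(seed, count=3):
    """Draw `count` fake names from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [rng.choice(FAKE_NAMES) for _ in range(count)]

# The same seed always yields the same sequence of picks.
same = pick_names(42) == pick_names(42)  # True
```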
- setSelectiveObfuscateRefSource(source: dict)#
A dictionary of entity names to their obfuscation modes. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation source.
Example:#
>>> selective_sources = {
...     'PHONE': 'file',
...     'ADDRESS': 'both'
... }
>>> deid.setObfuscateRefSource('faker').setSelectiveObfuscateRefSource(selective_sources)
- Parameters:
source (dict[str, str]) – A dictionary of entity names to their obfuscation modes. The keys are entity names and the values are the obfuscation sources.
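The fallback rule (use the per-entity source if present, otherwise obfuscateRefSource) amounts to a dictionary lookup with a default. A minimal sketch, with `resolve_source` as a hypothetical helper and the allowed values taken from the parameter description:

```python
ALLOWED_SOURCES = {"custom", "faker", "both", "file"}

def resolve_source(entity, selective_sources, default_source="faker"):
    """Return the obfuscation source for `entity`, falling back to the default."""
    source = selective_sources.get(entity, default_source)
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unsupported obfuscation source: {source}")
    return source

selective_sources = {"PHONE": "file", "ADDRESS": "both"}
resolve_source("PHONE", selective_sources)  # "file"
resolve_source("NAME", selective_sources)   # "faker" (not in the map)
```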
- setStaticObfuscationPairs(pairs: list)#
Sets the static obfuscation pairs that will be used for de-identification. Each pair must have exactly 3 elements: [original, entityType, fake].
Example:#
>>> pairs = [
...     ["John Doe", "PERSON", "Jane Smith"],
...     ["Los Angeles", "LOCATION", "New York City"],
... ]
>>> deid.setStaticObfuscationPairs(pairs)
- Parameters:
pairs (list) – List of static obfuscation pairs. Each pair must have exactly 3 elements: [original, entityType, fake].
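Since every pair must have exactly three elements, it can help to check the list before passing it on. The helper below is a hypothetical pre-check in plain Python. It also builds a lookup keyed by (original, entityType), which is one plausible way to picture how such pairs are applied, not a description of the library's internals.

```python
def build_static_lookup(pairs):
    """Validate [original, entityType, fake] pairs and index them for lookup."""
    lookup = {}
    for pair in pairs:
        if len(pair) != 3:
            raise ValueError(
                f"expected [original, entityType, fake], got {pair!r}"
            )
        original, entity_type, fake = pair
        lookup[(original, entity_type)] = fake
    return lookup

pairs = [
    ["John Doe", "PERSON", "Jane Smith"],
    ["Los Angeles", "LOCATION", "New York City"],
]
lookup = build_static_lookup(pairs)
# lookup[("John Doe", "PERSON")] -> "Jane Smith"
```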
- setUnnormalizedDateMode(mode: str)#
Sets the mode to use if the date is not formatted. Options: [mask, obfuscate, skip]. Default: obfuscate.
- Parameters:
mode (str) – The mode to use if the date is not formatted.
- setUseShiftDays(value: bool)#
Sets whether to use the random shift day when the document has this in its metadata. Default: False
- Parameters:
value (bool) – Whether to use the random shift day when the document has this in its metadata. Default: False