sparknlp_jsl.annotator.deid.deidentication_params#

Module Contents#

Classes#

DeIdentificationParams

It is a base class that contains all the params that are common between DeIdentificationModel and DeIdentification annotators.

class DeIdentificationParams#

It is a base class that contains all the params that are common between DeIdentificationModel and DeIdentification annotators.

Parameters:
  • mode – Mode for Anonymizer [‘mask’|‘obfuscate’]

  • obfuscateDate – When mode is ‘obfuscate’, whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When set to true, make sure the dateFormats param fits your needs. If the value is true and obfuscation fails, the unnormalizedDateMode param is applied. When set to false, dates are masked to <DATE>. Default: false

  • dateTag – Tag representing dates in the obfuscate reference file (default: DATE)

  • days – Number of days by which to displace dates during obfuscation. If not provided, a random integer between 1 and 60 will be used.

  • dateToYear – True if we want the model to transform dates into years, False otherwise.

  • minYear – Minimum year to be used when transforming dates into years. Default: ‘1929’

  • dateFormats – List of date formats to automatically displace if parsed

  • consistentObfuscation – Whether to replace very similar entities in a document with the same randomized term (default: true) The similarity is based on the Levenshtein Distance between the words.

  • sameEntityThreshold – Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).

  • obfuscateRefSource – The source of the fake values used to obfuscate the entities. This property does not apply to date entities. The values are the following: ‘file’: takes the entities from the obfuscatorRefFile; ‘faker’: takes the entities from the Faker module; ‘both’: takes the entities from the obfuscatorRefFile and the Faker module randomly.

  • regexOverride – If true, prioritize the regex entities; if false, prioritize the NER entities.

  • language – The language used to select the regex file and some faker entities. The values are the following: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) or ‘ro’ (Romanian). Default: ‘en’.

  • seed – The seed used to select the entities in obfuscate mode. With the same seed, you can repeat an execution several times and get the same output.

  • ignoreRegex – Select whether to skip the regex file loaded in the model. If true, the default regex file will not be used. The default value is false.

  • isRandomDateDisplacement – Whether to displace date entities by a random number of days derived from the seed parameter. If true, a random displacement is used; if false, the days parameter is used. The default value is false.

  • mappingsColumn – The name of the mapping column that will return the annotation chunks with the fake entities.

  • returnEntityMappings – Whether to return the mapping column.

  • blackList – List of entities that will be ignored in the regex rules. The rest will be processed. The default values are “IBAN”, “ZIP”, “NPI”, “DLN”, “PASSPORT”, “C_CARD”, “DEA”, “SSN”, “IP”.

  • blackListEntities – List of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Default: []

  • maskingPolicy

    Select the masking policy

    • same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, with the total length of the masking sequence equal to the length of the original sequence. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets will be returned.

    • entity_labels: Replace the values with the corresponding entity labels.

    • fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.

  • sameLengthFormattedEntities – List of formatted entities to generate the same length outputs as original ones during obfuscation. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE, CONTACT, ACCOUNT.

  • genderAwareness – Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False

  • ageRangesByHipaa – Whether to obfuscate ages based on the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule. The HIPAA Privacy Rule mandates that ages of patients older than 90 years must be obfuscated, while ages of patients 90 years or younger can remain unchanged. If True, age entities larger than 90 will be obfuscated as per the HIPAA Privacy Rule and the others will remain unchanged. If False, the ageRanges parameter is used. Default: False.

  • obfuscationStrategyOnException – Sets the obfuscation strategy to be applied when an exception occurs. Four possible values are supported:

    • “mask”: The original chunk is replaced with a masking pattern.

    • “default”: The original chunk is replaced with a default faker.

    • “skip”: The original chunk is not replaced with any faker.

    • “exception”: Throws the exception.

    The default obfuscation strategy is “default”.

  • metadataMaskingPolicy – If specified, the metadata includes the masked form of the document. Select one of the following masking policies if you want to return the masked form in the metadata:

    • ‘entity_labels’: Replace the values with the corresponding entity labels.

    • ‘same_length_chars’: Replace the entity with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are used.

    • ‘fixed_length_chars’: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.

    Default: “”

  • obfuscateByAgeGroups – Whether to obfuscate ages based on age groups. When True, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When False, the age ranges specified in the ageRanges parameter will be used to obfuscate ages. Default: False

  • ageGroups (dict[str, List[int]]) – A dictionary of age groups to obfuscate ages. For this parameter to be active, the obfuscateByAgeGroups parameter must be true. If the given ageGroups do not fully contain the ages, the ages continue to be obfuscated according to the ageRanges parameter. The dictionary should contain the age group name as the key and a list of two integers as the value. The first integer is the lower bound of the age group, and the second integer is the upper bound of the age group. Default age groups are as follows in the English language:

    >>> { "baby": [0, 1],
    ...   "toddler": [1, 4],
    ...   "child": [4, 13],
    ...   "teenager": [13, 20],
    ...   "adult": [20, 65],
    ...   "senior": [65, 100] }

  • keepYear (bool) – Whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.

ageRanges#
ageRangesByHipaa#
blackList#
blackListEntities#
consistentObfuscation#
dateFormats#
dateTag#
dateToYear#
days#
fixedMaskLength#
genderAwareness#
ignoreRegex#
isRandomDateDisplacement#
keepYear#
language#
mappingsColumn#
maskingPolicy#
metadataMaskingPolicy#
minYear#
mode#
obfuscateByAgeGroups#
obfuscateDate#
obfuscateRefSource#
obfuscationStrategyOnException#
outputAsDocument#
regexOverride#
region#
returnEntityMappings#
sameEntityThreshold#
sameLengthFormattedEntities#
seed#
unnormalizedDateMode#
useShifDays#
zipCodeTag#
getBlackList()#

Gets the value of blackList or its default value.

getSameLengthFormattedEntities()#

Returns the sameLengthFormattedEntities value.

getUseShiftDays()#

Returns the useShiftDays value.

setAgeGroups(value: dict)#

Sets a dictionary of age groups to obfuscate ages. For this parameter to be active, the obfuscateByAgeGroups parameter must be true. If the given ageGroups do not fully contain the ages, the ages continue to be obfuscated according to the ageRanges parameter. The dictionary should contain the age group name as the key and a list of two integers as the value. The first integer is the lower bound of the age group, and the second integer is the upper bound of the age group. Default age groups are as follows in the English language:

Default and example dictionary#

>>> { "baby": [0, 1],
...   "toddler": [1, 4],
...   "child": [4, 13],
...   "teenager": [13, 20],
...   "adult": [20, 65],
...   "senior": [65, 100] }
Parameters:

value (dict[str, List[int]]) – A dictionary of age groups to obfuscate ages.
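
As a rough illustration of how such a dictionary might be consulted, the sketch below maps an age to a group name. It is plain Python and not the library's implementation: the helper name `find_age_group` is hypothetical, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption made only so that boundary ages such as 1 or 4 resolve to a single group.

```python
from typing import Dict, List, Optional

# Default age groups as listed in the documentation above.
DEFAULT_AGE_GROUPS: Dict[str, List[int]] = {
    "baby": [0, 1],
    "toddler": [1, 4],
    "child": [4, 13],
    "teenager": [13, 20],
    "adult": [20, 65],
    "senior": [65, 100],
}


def find_age_group(age: int, groups: Dict[str, List[int]]) -> Optional[str]:
    """Return the name of the first group containing `age`, else None.

    Assumption of this sketch: lower bound inclusive, upper bound exclusive.
    """
    for name, (lower, upper) in groups.items():
        if lower <= age < upper:
            return name
    # Not covered by any group; per the docs, obfuscation would then
    # continue according to the ageRanges parameter.
    return None
```

An age outside every group (for example 120 with the defaults) yields None here, which corresponds to the documented case where the ageRanges parameter takes over.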

setAgeRanges(s)#

Sets the list of integers specifying the limits of the age groups to preserve during obfuscation.

Parameters:

s (List[int]) – List of integers specifying the limits of the age groups to preserve during obfuscation.

setAgeRangesByHipaa(value: bool)#

Sets whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.

The HIPAA Privacy Rule mandates that ages from patients older than 90 years must be obfuscated, while age for patients 90 years or younger can remain unchanged.

Parameters:

value (bool) – If True, age entities larger than 90 will be obfuscated as per HIPAA Privacy Rule, the others will remain unchanged. If False, ageRanges parameter is valid. Default: False.

setBlackList(s)#

Sets a list of entities that will be ignored in the regex file. The rest will be processed. The default values are “IBAN”, “ZIP”, “NPI”, “DLN”, “PASSPORT”, “C_CARD”, “DEA”, “SSN”, “IP”.

Parameters:

s (list) – List of entities that will be ignored in the regex file. The rest will be processed.

setBlackListEntities(value)#

Sets a list of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Default: []

Parameters:

value (list) – List of entities coming from NER or regex rules that will be ignored for masking or obfuscation.

setConsistentObfuscation(s)#

Sets whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein Distance between the words.

Parameters:

s (bool) – Whether to replace very similar entities in a document with the same randomized term. The similarity is based on the Levenshtein Distance between the words.

setDateFormats(s)#

Sets the list of date formats to automatically displace if parsed.

Parameters:

s (List[str]) – List of date formats to automatically displace if parsed

setDateTag(tag: str)#

Sets the tag representing dates in the obfuscate reference file (default: DATE).

Parameters:

tag (str) – Tag representing dates in the obfuscate reference file (default: DATE)

setDateToYear(s)#

Sets whether to transform dates into years.

Parameters:

s (bool) – True if we want the model to transform dates into years, False otherwise.

setDays(d)#

Sets the number of days by which to displace dates during obfuscation.

Parameters:

d (int) – Number of days by which to displace dates during obfuscation.
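
To make date displacement concrete, here is a minimal sketch in plain Python. It is not the annotator's code: the function name `displace_date` and its `seed` argument are hypothetical, and `fmt` is a Python `strptime` pattern, whereas the library's dateFormats may use a different pattern syntax. The fallback to a random integer between 1 and 60 mirrors the behavior described for the days parameter.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def displace_date(text: str, fmt: str, days: Optional[int] = None,
                  seed: Optional[int] = None) -> str:
    """Shift a date string by `days` and return it in the same format.

    If `days` is None, draw a random integer between 1 and 60.
    """
    if days is None:
        days = random.Random(seed).randint(1, 60)
    parsed = datetime.strptime(text, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)


print(displace_date("2023-11-11", "%Y-%m-%d", days=10))  # 2023-11-21
```

Because the shifted value is written back with the same pattern, the output keeps the shape of the input, which is what makes displaced dates look natural in the de-identified text.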

setFixedMaskLength(length)#

Fixed mask length: this is the length of the masking sequence that will be used when the ‘fixed_length_chars’ masking policy is selected.

Parameters:

length (int) – The mask length

setGenderAwareness(l)#

Sets whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False

Parameters:

l (bool) – Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False

setIgnoreRegex(s)#

Sets whether to ignore the regex file loaded in the model. If the value is ‘True’, the regex rules are not used, which can increase performance but might decrease accuracy. Default: False.

Parameters:

s (bool) – Whether to ignore the regex rules. If the value is ‘True’, it can increase performance but might decrease accuracy. Default: False.

setIsRandomDateDisplacement(s)#

Sets whether to use random displacement in dates.

Parameters:

s (bool) – Whether to use random displacement in dates

setKeepYear(value: bool)#

Sets whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.

Parameters:

value (bool) – Whether to keep the year intact when obfuscating date entities.

setLanguage(lang: str)#

Sets the language used to select the regex file and some faker entities. The values are the following: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) or ‘ro’ (Romanian). Default: ‘en’

Parameters:

lang (str) – The language used to select the regex file and some faker entities: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) or ‘ro’ (Romanian). Default: ‘en’.

setMappingsColumn(s)#

Sets the name of the mapping column that will return the annotation chunks with the fake entities.

Parameters:

s (str) – Name of the mapping column that will return the annotation chunks with the fake entities

setMaskingPolicy(mask: str)#

Sets the masking policy

  • same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, with the total length of the masking sequence equal to the length of the original sequence. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets will be returned.

  • entity_labels: Replace the values with the corresponding entity labels.

  • fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.

Parameters:

mask (str) – The masking policy
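
The three policies can be sketched in a few lines of plain Python. This is only an illustration of the rules described above, not the annotator's implementation: the function name `mask_entity`, the `<LABEL>` rendering of entity labels (modeled on the `<DATE>` example elsewhere on this page), and the default `fixed_length` of 4 are assumptions.

```python
def mask_entity(text: str, label: str, policy: str, fixed_length: int = 4) -> str:
    """Mask `text` according to one of the three documented policies."""
    if policy == "same_length_chars":
        # Entities shorter than 3 chars get asterisks without brackets.
        if len(text) < 3:
            return "*" * len(text)
        # Brackets count toward the total length, hence len - 2 asterisks.
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "entity_labels":
        return f"<{label}>"  # assumed rendering of the label
    if policy == "fixed_length_chars":
        return "*" * fixed_length  # fixed_length default is hypothetical
    raise ValueError(f"Unknown masking policy: {policy}")


print(mask_entity("Smith", "NAME", "same_length_chars"))  # [***]
```

In the real annotator the fixed length is configured through setFixedMaskLength rather than passed per call.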

setMetadataMaskingPolicy(value: str)#

Sets the metadata masking policy. If specified, the metadata includes the masked form of the document. Select one of the following masking policies if you want to return the masked form in the metadata:

  • ‘entity_labels’: Replace the values with the corresponding entity labels.

  • ‘same_length_chars’: Replace the entity with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are used.

  • ‘fixed_length_chars’: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.

  • Default: “”

Parameters:

value (str) – If specified, the metadata includes the masked form of the document.

setMinYear(s)#

Sets minimum year to be used when transforming dates into years. Default: ‘1929’

Parameters:

s (int) – Minimum year to be used when transforming dates into years. Default: ‘1929’

setMode(mode: str)#

Sets the mode for Anonymizer [‘mask’|‘obfuscate’]

Parameters:

mode (str) – Mode for Anonymizer [‘mask’|‘obfuscate’]

setObfuscateByAgeGroups(value: bool)#

Sets whether to obfuscate ages based on age groups. When True, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When False, the age ranges specified in the ageRanges parameter will be used to obfuscate ages.

Parameters:

value (bool) – Whether to obfuscate ages based on age groups.

setObfuscateDate(value)#

Sets whether to obfuscate dates when mode is ‘obfuscate’. This param helps in consistency to make dateFormats more visible. When set to true, make sure the dateFormats param fits your needs. If the value is true and obfuscation fails, the unnormalizedDateMode param is applied. When set to false, dates are masked to <DATE>. Default: false

Parameters:

value (bool) – Whether to obfuscate dates when mode is ‘obfuscate’. Default: false

setObfuscateRefSource(s)#

Sets the source used to obfuscate entities [‘both’|‘faker’|‘file’]

Parameters:

s (str) – Source used to obfuscate entities [‘both’|‘faker’|‘file’]

setObfuscationStrategyOnException(value: str)#

Sets the obfuscation strategy to be applied when an exception occurs. Four possible values are supported:

  • “mask”: The original chunk is replaced with a masking pattern.

  • “default”: The original chunk is replaced with a default faker.

  • “skip”: The original chunk is not replaced with any faker.

  • “exception”: Throws the exception.

The default obfuscation strategy is “default”.

Parameters:

value (str) – The obfuscation strategy to set. Should be one of [“mask”, “skip”, “default”, “exception”].
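
The four strategies amount to a small dispatch around a failing obfuscation call. The sketch below shows that control flow in plain Python under stated assumptions: `obfuscate_chunk`, `faker` and `default_faker` are hypothetical names, and the `<LABEL>` masking pattern is an assumed stand-in for whatever masking pattern the annotator applies.

```python
from typing import Callable


def obfuscate_chunk(chunk: str, label: str,
                    faker: Callable[[str], str],
                    default_faker: Callable[[str], str],
                    strategy: str = "default") -> str:
    """Obfuscate `chunk`; on failure, apply the chosen fallback strategy."""
    try:
        return faker(chunk)
    except Exception:
        if strategy == "mask":
            return f"<{label}>"          # assumed masking pattern
        if strategy == "default":
            return default_faker(chunk)  # replace with a default fake value
        if strategy == "skip":
            return chunk                 # leave the original chunk in place
        if strategy == "exception":
            raise                        # propagate the original error
        raise ValueError(f"Unknown strategy: {strategy}")
```

Note that “skip” leaves the original text untouched, so it trades privacy for robustness, while “exception” is the only option that surfaces the failure to the caller.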

setOutputAsDocument(l)#

Set whether to return all sentences joined into a single document

Parameters:

l (bool) – Whether to return all sentences joined into a single document

setRegexOverride(s)#

Sets whether to prioritize regex over NER entities. If the value is true, prioritize the regex entities; if the value is false, prioritize the NER entities. The default value is false.

Parameters:

s (bool) – Whether to prioritize regex over NER entities

setRegion(s)#

Sets the region used to select particular dateFormats. This property is mainly used when obfuscating dates, for example to decide whether the first or the second part of 11/11/2023 is the day. The values are the following: ‘eu’ for the European Union, ‘us’ for the USA. Default: ‘eu’

Parameters:

s (str) – The region used to select date formats. Options: ‘eu’ for the European Union, ‘us’ for the USA. Default: ‘eu’
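
The ambiguity that region resolves is easy to reproduce with the standard library. The sketch below is illustrative only: the `REGION_FORMATS` mapping and `parse_by_region` helper are hypothetical and use Python `strptime` patterns, not the library's own date format lists.

```python
from datetime import datetime

# Hypothetical mapping from region to a day-first or month-first pattern.
REGION_FORMATS = {"eu": "%d/%m/%Y", "us": "%m/%d/%Y"}


def parse_by_region(text: str, region: str = "eu") -> datetime:
    """Parse a slash-separated date, reading day or month first by region."""
    return datetime.strptime(text, REGION_FORMATS[region])


print(parse_by_region("03/04/2023", "eu").strftime("%d %B %Y"))  # 03 April 2023
print(parse_by_region("03/04/2023", "us").strftime("%d %B %Y"))  # 04 March 2023
```

The same string is read as 3 April under ‘eu’ and as 4 March under ‘us’, so a date shifted under the wrong region would land on a different calendar day.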

setReturnEntityMappings(s)#

Sets whether to return the mapping column.

Parameters:

s (bool) – Whether to return the mapping column.

setSameEntityThreshold(s)#

Sets similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).

Parameters:

s (float) – Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
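
For intuition about the threshold, the sketch below computes a Levenshtein-based similarity in plain Python. The documentation only states that similarity is based on the Levenshtein distance; normalizing by the longer string's length, and the helper names used here, are assumptions of this example rather than the library's formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0.0, 1.0]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_same_entity(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when two mentions are similar enough to share one replacement."""
    return similarity(a, b) >= threshold
```

Under this normalization, "Jonathan Smith" and "Jonathan Smyth" differ by one edit over 14 characters (similarity about 0.93) and would be treated as the same entity at the default 0.9, while "John" and "Jon" (0.75) would not.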

setSameLengthFormattedEntities(s)#

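Sets the list of formatted entities to generate the same length outputs as original ones during obfuscation.

Parameters:

s (List[str]) – List of formatted entities. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE, CONTACT, ACCOUNT.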
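
One way to picture "same length output" for a formatted entity such as PHONE is character-wise replacement that keeps separators in place. The sketch below is a guess at the general idea, not the library's algorithm; `same_length_fake` and its `seed` argument are hypothetical.

```python
import random
import string


def same_length_fake(value: str, seed: int = 0) -> str:
    """Replace digits and letters with random ones, keeping length and layout."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            # Preserve the case of the original letter.
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as '-', ' ' or '('
    return "".join(out)
```

A value like "(555) 123-4567" comes back with the same 14 characters of layout, parentheses and hyphen included, which keeps downstream format checks working on the de-identified text.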

setSeed(s)#

Sets the seed used to select the entities in obfuscate mode. With the same seed, you can repeat an execution several times and get the same output.

Parameters:

s (int) – The seed to select the entities on obfuscate mode.
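
The reproducibility that a seed provides can be demonstrated with the standard library alone. This sketch does not use the annotator; `FAKE_NAMES` is a made-up pool and `pick_fakes` is a hypothetical helper standing in for the selection of replacement entities.

```python
import random
from typing import List

# Hypothetical pool of replacement names, for illustration only.
FAKE_NAMES: List[str] = ["Alex Moore", "Dana Reyes", "Sam Patel", "Kim Novak"]


def pick_fakes(n: int, seed: int) -> List[str]:
    """Pick `n` replacement names using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.choice(FAKE_NAMES) for _ in range(n)]


# Two runs with the same seed select the same replacements.
print(pick_fakes(3, seed=42) == pick_fakes(3, seed=42))  # True
```

Using a dedicated `random.Random(seed)` instance, rather than the module-level generator, keeps the selection independent of any other random calls in the program.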

setUnnormalizedDateMode(s)#

Sets the mode to use if the date is not formatted.

Parameters:

s (str) – The mode to use if the date is not formatted. [mask, obfuscate, skip] Default: obfuscate

setUseShifDays(s)#
setUseShiftDays(s)#

Sets if you want to use the random shift day when the document has this in its metadata. Default: False

Parameters:

s (bool) – Whether to use the random shift day when the document has this in its metadata. Default: False

setZipCodeTag(tag: str)#

Tag representing zip codes in the obfuscate reference file (default: ZIP)

Parameters:

tag (str) – Tag representing zip codes in the obfuscate reference file (default: ZIP)