`sparknlp_jsl.legal.chunk_classification.deid.deIdentification`#

Module Contents#

Classes#

`DeIdentification`	Contains all the methods for training a DeIdentificationModel model.
`DeIdentificationModel`	The DeIdentificationModel can obfuscate or mask entities that contain personal information.

class DeIdentification#

Bases: sparknlp_jsl.annotator.deid.DeIdentification

Contains all the methods for training a DeIdentificationModel model.

This annotator can obfuscate or mask entities that contain personal information. The entities can be defined with a file of regex patterns passed to setRegexPatternsDictionary, where each line maps an entity to a regex.

Input Annotation types	Output Annotation type
`DOCUMENT, CHUNK, TOKEN`	`DOCUMENT`

Parameters:

regexPatternsDictionary – Dictionary with regular expression patterns that match some protected entity
mode – Mode for the Anonymizer ['mask'|'obfuscate']
obfuscateDate – When mode=='obfuscate', whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When setting it to true, make sure the dateFormats param fits the needs (default: false)
obfuscateRefFile – File with the terms to be used for Obfuscation
refFileFormat – Format of the reference file
refSep – Sep character in refFile
dateTag – Tag representing dates in the obfuscate reference file (default: DATE)
days – Number of days to obfuscate the dates by displacement. If not provided a random integer between 1 and 60 will be used
dateToYear – True if we want the model to transform dates into years, False otherwise.
minYear – Minimum year to be used when transforming dates into years.
dateFormats – List of date formats to automatically displace if parsed
consistentObfuscation – Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.
sameEntityThreshold – Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
obfuscateRefSource – The source used to obfuscate the entities. It does not apply to date entities. The values are the following: 'file': Takes the entities from the obfuscateRefFile. 'faker': Takes the entities from the Faker module. 'both': Takes the entities from the obfuscateRefFile and the Faker module randomly.
regexOverride – If true, prioritize the regex entities; if false, prioritize the NER entities.
language – The language used to select the regex file and some faker entities. The values are the following: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'.
seed – The seed used to select the entities in obfuscate mode. With the same seed, an execution can be repeated several times with the same output.
ignoreRegex – Whether to ignore the regex file loaded in the model. If true, the default regex file will not be used. The default value is false.
isRandomDateDisplacement – Whether to use a random number of displacement days for date entities. The random number is based on the seed param. If true, random displacement days are used; if false, the days param is used. The default value is false.
mappingsColumn – This is the mapping column that will return the Annotations chunks with the fake entities.
returnEntityMappings – Whether to return the mapping column.
blackList – List of entities ignored for masking or obfuscation. The default values are: "SSN", "PASSPORT", "DLN", "NPI", "C_CARD", "IBAN", "DEA".
sameLengthFormattedEntities – List of formatted entities to generate the same length outputs as original ones during obfuscation. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, ZIP, VIN, SSN, DLN, LICENSE, PLATE
maskingPolicy – Select the masking policy: - same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, so that the total length of the masking sequence equals the length of the original sequence. Example: Smith -> [***]. If the entity is less than 3 chars (like Jo, or 5), asterisks without brackets will be returned. - entity_labels: Replace the values with the corresponding entity labels. - fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
selectiveObfuscationModesPath – Path to the JSON dictionary that contains the selective obfuscation modes. The modes are the following:

'obfuscate': Replace the values with random values.
'mask_same_length_chars': Replace the name with asterisks of the same length minus two, plus brackets on both ends.
'mask_entity_labels': Replace the values with the entity label.
'mask_fixed_length_chars': Replace the name with a fixed number of asterisks. You can also invoke setFixedMaskLength().
'skip': Skip the values (leave them intact).

Entities that are not listed in the dictionary are de-identified according to setMode().
entityCasingModesPath – Path to the JSON dictionary that contains the entity casing modes. The modes are the following:

'lowercase': Converts all characters to lower case using the rules of the default locale.
'uppercase': Converts all characters to upper case using the rules of the default locale.
'capitalize': Converts the first character to upper case and converts others to lower case.
'titlecase': Converts the first character in every token to upper case and converts others to lower case.
genderAwareness – Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False
keepYear (bool) – Whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
keepMonth (bool) – Whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
...
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setUseAbbreviations(True)
...
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
...
>>> embeddings = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
...
>>> # NER entities
>>> clinical_sensitive_entities = MedicalNerModel \
...     .pretrained("ner_deid_enriched", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
...
>>> nerConverter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")
...
>>> # Deidentification: the regex dictionary holds custom patterns for custom entities,
>>> # and the obfuscation reference file holds custom replacement terms for the entities.
>>> deIdentification = DeIdentification() \
...     .setInputCols(["ner_chunk", "token", "sentence"]) \
...     .setOutputCol("dei") \
...     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
...     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
...     .setRefFileFormat("csv") \
...     .setRefSep("#") \
...     .setMode("obfuscate") \
...     .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
...     .setObfuscateDate(True) \
...     .setDateTag("DATE") \
...     .setDays(5) \
...     .setObfuscateRefSource("file")
...
>>> # Pipeline
>>> data = spark.createDataFrame([
...     ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
...     ]).toDF("text")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     clinical_sensitive_entities,
...     nerConverter,
...     deIdentification
... ])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("dei.result").show(truncate = False)
 +--------------------------------------------------------------------------------------------------+
 |result                                                                                            |
 +--------------------------------------------------------------------------------------------------+
 |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
 +--------------------------------------------------------------------------------------------------+

additionalDateFormats#

ageRanges#

ageRangesByHipaa#

blackList#

blackListEntities#

combineRegexPatterns#

consistentAcrossNameParts#

consistentObfuscation#

countryObfuscation#

dateEntities#

dateFormats#

dateTag#

dateToYear#

days#

deidMarkers#

doExceptionHandling#

enableDefaultObfuscationEquivalents#

entityCasingModesPath#

fakerLengthOffset#

fixedMaskLength#

genderAwareness#

geoConsistency#

getter_attrs = []#

groupByCol#

ignoreRegex#

inputAnnotatorTypes#

inputCols#

isRandomDateDisplacement#

keepMonth#

keepTextSizeForObfuscation#

keepYear#

language#

lazyAnnotator#

mappingsColumn#

maskingPolicy#

maxRandomDisplacementDays#

metadataMaskingPolicy#

minYear#

mode#

name = 'DeIdentification'#

obfuscateByAgeGroups#

obfuscateDate#

obfuscateRefFile#

obfuscateRefSource#

obfuscateZipByHipaa#

obfuscateZipKeepDigits#

obfuscationEquivalentsResource#

obfuscationStrategyOnException#

optionalInputAnnotatorTypes = []#

outputAnnotatorType = 'document'#

outputAsDocument#

outputCol#

refFileFormat#

refSep#

regexOverride#

regexPatternsDictionary#

regexPatternsDictionaryAsJsonString#

region#

returnEntityMappings#

sameEntityThreshold#

sameLengthFormattedEntities#

seed#

selectiveObfuscationModesPath#

skipLPInputColsValidation = True#

staticObfuscationPairsResource#

uid = ''#

unnormalizedDateMode#

useShifDays#

useShiftDays#

zipCodeTag#

clear(param: pyspark.ml.param.Param) → None#: Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) → JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:: extra (dict, optional) – Extra parameters to copy to the new instance
Returns:: Copy of this instance
Return type:: JavaParams

explainParam(param: str | Param) → str#: Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() → str#: Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) → pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:: extra (dict, optional) – extra param values
Returns:: merged param map
Return type:: dict

fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) → M#

fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) → List[M]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:

dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) → Iterator[Tuple[int, M]]#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:

dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator

getChunkMatching()#: Gets the value of chunkMatching or its default value.

getDefaultObfuscationEquivalents()#: Returns the default obfuscation equivalents for common entities.

getDeidMarkers()#: Returns the current deid markers as a tuple (prefix, suffix).

getInputCols()#: Gets current column names of input annotations.

getLazyAnnotator()#: Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) → Any#
getOrDefault(param: Param[T]) → T: Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#: Gets output column name of annotations.

getParam(paramName: str) → Param#: Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:: paramName (str) – Name of the parameter

getSelectiveObfuscateRefSource()#: Returns the dictionary of entity names to their obfuscate ref sources.

getUseShiftDays()#: Return the useShiftDays value.

hasDefault(param: str | Param[Any]) → bool#: Checks whether a param has a default value.

hasParam(paramName: str) → bool#: Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#

isDefined(param: str | Param[Any]) → bool#: Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) → bool#: Checks whether a param is explicitly set by user.

classmethod load(path: str) → RL#: Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#: Returns an MLReader instance for this class.

save(path: str) → None#: Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) → None#: Sets a parameter in the embedded param map.

setAdditionalDateFormats(formats: list)#

Sets additional date formats to be considered during date obfuscation. This allows users to specify custom date formats in addition to the default date formats.

Parameters:: formats (list[str]) – List of additional date formats to be considered during date obfuscation.

setAgeGroups(value: dict)#

Sets a dictionary of age groups to obfuscate ages. For this parameter to be active, the obfuscateByAgeGroups parameter must be true. If the given ageGroups do not fully cover the ages, the uncovered ages continue to be obfuscated according to the ageRanges parameter. The dictionary should contain the age group name as the key and a list of two integers as the value. The first integer is the lower bound of the age group, and the second integer is the upper bound of the age group. The default age groups in the English language are as follows:

Default and example dictionary#

>>> { "baby": [0, 1],
...   "toddler": [1, 4],
...   "child": [4, 13],
...   "teenager": [13, 20],
...   "adult": [20, 65],
...   "senior": [65, 100] }

Parameters:: value (dict[str, List[int]]) – A dictionary of age groups to obfuscate ages.
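The lookup described above can be sketched in plain Python. This is only an illustration, not the annotator's implementation: `age_group` is a hypothetical helper, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption, since the source does not say how the shared boundaries (for example 1 or 13) are resolved.

```python
# Default age groups as listed above.
AGE_GROUPS = {
    "baby": [0, 1],
    "toddler": [1, 4],
    "child": [4, 13],
    "teenager": [13, 20],
    "adult": [20, 65],
    "senior": [65, 100],
}

def age_group(age, groups=AGE_GROUPS):
    """Return the name of the group containing `age`, or None if no group matches.

    Assumption: lower bound inclusive, upper bound exclusive.
    """
    for name, (lower, upper) in groups.items():
        if lower <= age < upper:
            return name
    # Not covered by any group: per the text, ageRanges would apply instead.
    return None

print(age_group(25))
print(age_group(100))
```

Under these assumptions an age outside every group returns None, which corresponds to the case where obfuscation falls back to the ageRanges parameter.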

setAgeRanges(value: list)#

Sets the list of integers specifying the limits of the age groups to preserve during obfuscation.

Parameters:: value (List[int]) – List of integers specifying the limits of the age groups to preserve during obfuscation.

setAgeRangesByHipaa(value: bool)#

Sets whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.

The HIPAA Privacy Rule mandates that ages of patients older than 90 years must be obfuscated, while ages of patients 90 years or younger can remain unchanged.

Parameters:: value (bool) – If True, age entities larger than 90 will be obfuscated as per the HIPAA Privacy Rule, and the others will remain unchanged. If False, the ageRanges parameter applies. Default: False.

setBlackList(s)#

Sets a list of entities that will be ignored in the regex file. The rest will be processed. The default values are: “IBAN”, “ZIP”, “NPI”, “DLN”, “PASSPORT”, “C_CARD”, “DEA”, “SSN”, “IP”, “DEA”.

Parameters:: s (list) – List of entities that will be ignored in the regex file. The rest will be processed.

setBlackListEntities(value)#

Sets a list of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Default: []

Parameters:: value (list) – List of entities coming from NER or regex rules that will be ignored for masking or obfuscation.

setChunkMatching(chunkMatching)#

Sets entity chunk matching configuration for de-identification pipelines.

This method enables matching of entity chunks (e.g., “NAME”, “DATE”) across multiple rows or within grouped data. Useful when certain entity labels are missing in some rows and need to be inferred from others in the same group.

Parameters:: chunkMatching (dict[str, float]) – A dictionary specifying entity labels and associated confidence thresholds for chunk matching logic.

Notes

When applying the method across multiple rows, the groupByCol parameter is required.

setCombineRegexPatterns(value)#

Sets whether to use both the loaded regex file and the default regex file.

If the value is True, both files will be used. If the value is False, either the loaded file or the default file will be used. Default: False.

Parameters:: value (bool) – Whether to combine the regex files or not. If the value is True, both files will be used. Default: False.

setConsistentAcrossNameParts(value: bool)#

Sets whether to enforce consistent obfuscation across name parts, even when they appear separately. When set to True, the same transformation or obfuscation will be applied consistently to all parts of the same name entity, even if those parts appear separately.

For example, if “John Smith” is obfuscated as “Liam Brown”, then:

When the full name “John Smith” appears, it will be replaced with “Liam Brown”
When “John” or “Smith” appear individually, they will still be obfuscated as “Liam” and “Brown” respectively, ensuring consistency in name transformation.

Default: True

Parameters:: value (bool) – Whether to enforce consistent obfuscation across name parts.

setConsistentObfuscation(s)#

Sets whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein Distance between the words.

Parameters:: s (str) – Whether to replace very similar entities in a document with the same randomized term. The similarity is based on the Levenshtein Distance between the words.
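To make the similarity check concrete, here is a minimal pure-Python sketch of a Levenshtein-based comparison against a threshold such as sameEntityThreshold. The helper names are hypothetical, and normalizing the distance by the longer string's length is an assumption; the source only states that the similarity is based on the Levenshtein distance.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Similarity in [0.0, 1.0]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def same_entity(a, b, threshold=0.9):
    """True when two mentions would be treated as the same entity."""
    return similarity(a, b) >= threshold

print(same_entity("Jonathan Smith", "Jonathon Smith"))
print(same_entity("John", "Joan"))
```

With this normalization, short strings need an exact match to clear a 0.9 threshold, while longer names tolerate a single-character difference.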

setCountryObfuscation(value: bool)#

Sets whether to obfuscate country entities or not. If True, country entities will be obfuscated. If False, country entities will not be obfuscated.

Parameters:: value (bool) – Whether to obfuscate country entities or not. Default: False.

setDateEntities(entities: list)#

Sets list of date entities. Default: [‘DATE’, ‘DOB’, ‘DOD’, ‘EFFDATE’, ‘FISCAL_YEAR’]

Parameters:: entities (list[str]) – List of date entities.

setDateFormats(formats: list)#

Sets list of date formats to automatically displace if parsed

Parameters:: formats (list[str]) – List of date formats to automatically displace if parsed

setDateTag(tag: str)#

Sets the tag that represents date entities (default: DATE)

Parameters:: tag (str) – Tag that represents date entities (default: DATE)

setDateToYear(s)#

Sets whether to transform dates into years.

Parameters:: s (bool) – True if we want the model to transform dates into years, False otherwise.

setDays(day: int)#

Sets the number of days by which dates are displaced during obfuscation. If not provided, a random integer between 1 and 60 will be used.

Parameters:: day (int) – Number of days by which dates are displaced during obfuscation.
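As a rough illustration of displacement by a fixed number of days, the sketch below shifts a date string and writes it back in the format it was parsed with. It is plain Python rather than the annotator's implementation: `shift_date` is a hypothetical helper, and the `strptime` patterns are assumed stand-ins for the `MM/dd/yy` and `yyyy-MM-dd` patterns used with dateFormats.

```python
from datetime import datetime, timedelta

# Assumed Python equivalents of the patterns "MM/dd/yy" and "yyyy-MM-dd".
FORMATS = ["%m/%d/%y", "%Y-%m-%d"]

def shift_date(text, days, formats=FORMATS):
    """Return `text` displaced by `days`, or None if no format parses it."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate format
        return (parsed + timedelta(days=days)).strftime(fmt)
    return None

print(shift_date("01/13/93", 5))    # 01/18/93
print(shift_date("2079-11-09", 5))  # 2079-11-14
```

A string that matches none of the formats returns None here, which mirrors the idea that only dates parsed by one of the configured formats are displaced.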

setDeidMarkers(value)#

Sets the markers used to wrap deidentified entities in the output text.

The first element is used as the prefix marker and the second as the suffix marker. For example, setting markers to (“<DEID>”, “</DEID>”) will transform “John Doe” into “<DEID>John Doe</DEID>” in the deidentified text.

Defaults to (“”, “”), meaning no markers are added.

Parameters:: value (list, tuple, or dict) – A pair of markers (prefix, suffix) as a list/tuple, or a dict with ‘start’ (prefix) and/or ‘end’ (suffix) keys. If only one key is provided, the other defaults to “”. Valid dict keys are ‘start’ and ‘end’ only.
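The accepted input shapes can be illustrated with a small sketch that normalizes them to a (prefix, suffix) pair and wraps a piece of text. The helpers `normalize_markers` and `wrap` are hypothetical and written in plain Python; they only mirror the rules stated above and are not the annotator's own code.

```python
def normalize_markers(value):
    """Turn a list/tuple pair or a dict with 'start'/'end' keys into (prefix, suffix)."""
    if isinstance(value, dict):
        unknown = set(value) - {"start", "end"}
        if unknown:
            raise ValueError(f"Invalid marker keys: {sorted(unknown)}")
        # A missing key defaults to the empty string.
        return value.get("start", ""), value.get("end", "")
    if isinstance(value, (list, tuple)) and len(value) == 2:
        return value[0], value[1]
    raise ValueError("Markers must be a (prefix, suffix) pair or a dict")

def wrap(text, markers=("", "")):
    """Wrap `text` with the prefix and suffix markers."""
    prefix, suffix = normalize_markers(markers)
    return f"{prefix}{text}{suffix}"

print(wrap("John Doe", ("<DEID>", "</DEID>")))  # <DEID>John Doe</DEID>
print(wrap("John Doe", {"start": "["}))         # [John Doe
```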

setDoExceptionHandling(value: bool)#

If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next item. This comes with a performance penalty.

Parameters:: value (bool) – If True, exceptions are handled.

setEnableDefaultObfuscationEquivalents(value: bool)#

Sets whether to enable default obfuscation equivalents for common entities. This parameter allows the system to automatically include a set of predefined common English name equivalents. Default is False.

Parameters:: value (bool) – Whether to enable default obfuscation equivalents for common entities. Default is False.

setEntityCasingModes(path)#

Sets the path to a JSON file that contains the dictionary of entity casing modes.

'lowercase': Converts all characters to lower case using the rules of the default locale.
'uppercase': Converts all characters to upper case using the rules of the default locale.
'capitalize': Converts the first character to upper case and converts others to lower case.
'titlecase': Converts the first character in every token to upper case and converts others to lower case.

Parameters:: path (str) – Path to the JSON dictionary that contains the entity casing modes.
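The four casing modes can be approximated with standard Python string methods. This sketch only illustrates the transformations described above; `apply_casing` is a hypothetical helper, splitting tokens on single spaces is an assumption, and the layout of the JSON file itself is not shown because the source does not specify it.

```python
CASING_MODES = {
    "lowercase": str.lower,
    "uppercase": str.upper,
    # First character upper case, all others lower case.
    "capitalize": str.capitalize,
    # First character of every space-separated token upper case, others lower case.
    "titlecase": lambda s: " ".join(token.capitalize() for token in s.split(" ")),
}

def apply_casing(text, mode):
    """Apply one of the casing modes to `text`."""
    return CASING_MODES[mode](text)

print(apply_casing("gregory HOUSE", "capitalize"))  # Gregory house
print(apply_casing("gregory HOUSE", "titlecase"))   # Gregory House
```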

setFakerLengthOffset(value)#

It specifies how much length deviation is accepted in obfuscation, with keepTextSizeForObfuscation enabled. Must be greater than 0. Default: 3

Parameters:: value (int) – Integer value to specify length deviation.

setFixedMaskLength(length)#

Sets the length of the masking sequence in case of the fixed_length_chars masking policy. Default: 7

Parameters:: length (int) – The length of the masking sequence in case of the fixed_length_chars masking policy.

setForceInputTypeValidation(etfm)#

setGenderAwareness(value: bool)#

Sets whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False

Parameters:: value (bool) – Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False

setGeoConsistency(value: bool)#

Sets whether to enforce consistent obfuscation across geographical entities: state, city, street, zip and phone.

Functionality Overview#

This parameter enables intelligent geographical entity obfuscation that maintains realistic relationships between different geographic components. When enabled, the system ensures that obfuscated addresses form coherent, valid combinations rather than random replacements.

Supported Entity Types#

The following geographical entities are processed with priority order:

state (Priority: 0) - US state names
city (Priority: 1) - City names
zip (Priority: 2) - Zip codes
street (Priority: 3) - Street addresses
phone (Priority: 4) - Phone numbers

Language Requirement#

IMPORTANT: Geographic consistency is only applied when:

geoConsistency parameter is set to True AND
language parameter is set to "en"

For non-English configurations, this feature is automatically disabled regardless of the parameter setting.

Consistency Algorithm#

When geographical entities come from the chunk columns:

Entity Grouping: All geographic entities are identified and grouped by type
Fake Address Selection: A consistent set of fake US addresses is selected using hash-based deterministic selection to ensure reproducibility
Priority-Based Mapping: Entities are mapped to fake addresses following the priority order (state → city → zip → street → phone)
Consistent Replacement: All entities of the same type within a document use the same fake address pool, maintaining geographical coherence

Parameter Interactions#

IMPORTANT: Enabling this parameter automatically disables:

keepTextSizeForObfuscation - Text size preservation is not maintained
consistentObfuscation - Standard consistency rules are overridden
file-based fakers

This is necessary because geographic consistency requires specific fake address selection that may not preserve original text lengths or follow standard obfuscation patterns.

Examples

Basic usage:

>>> from sparknlp_jsl.annotator import DeIdentification
>>> deid = DeIdentification() \
...     .setInputCols(["sentence", "token", "ner_chunk"]) \
...     .setOutputCol("deidentified") \
...     .setGeoConsistency(True) \
...     .setLanguage("en")

Parameters:: value (bool) – Whether to enforce consistent obfuscation across geographical entities. Default is False.

setGroupByCol(value)#

Sets the column name used to group the dataset. This parameter is used in conjunction with consistentObfuscation to ensure consistent obfuscation within each group. When groupByCol is set, the dataset is partitioned into groups based on the values of the specified column.

Default: “” (empty string, meaning no grouping)

Notes

The column name must be a valid string in the input dataset.
The column must be of StringType.
This functionality can change the order of the dataset, so use it with caution.
This functionality is not supported by LightPipeline.

Parameters:: value (str) – The column name used to group the dataset.

setIgnoreRegex(s)#

Sets whether to ignore the regex file. If the value is True, it can increase performance but might decrease accuracy. Default: False.

Parameters:: s (bool) – Whether to ignore the regex file. If the value is True, it can increase performance but might decrease accuracy. Default: False.

setInputCols(*value)#

Sets column names of input annotations.

Parameters:: *value (List[str]) – Input columns for the annotator

setIsRandomDateDisplacement(s)#

Sets if you want to use random displacement in dates

Parameters:: s (bool) – Boolean value to select if you want to use random displacement in dates

setKeepMonth(value: bool)#

Sets whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.

Parameters:: value (bool) – Whether to keep the month intact when obfuscating date entities.

setKeepTextSizeForObfuscation(value: bool)#

Specifies whether the obfuscated output should keep the same character length as the input text. If True, a replacement of the same length is used when one is available; otherwise the length might vary. If False, the replacement is chosen at random regardless of length. Default: False

Parameters:: value (bool) – Whether to keep the text length the same when obfuscating entities.

setKeepYear(value: bool)#

Sets whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.

Parameters:: value (bool) – Whether to keep the year intact when obfuscating date entities.

setLanguage(lang: str)#

The language used to select the regex file and some faker entities. The values are the following: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'

Parameters:: lang (str) – The language used to select the regex file and some faker entities. Default:’en’.

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:: value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMappingsColumn(value: str)#

Sets the name of mapping column that will return the Annotations chunks with the fake entities. You can change the name of the column with this property. Default is ‘aux’.

Parameters:: value (str) – Mapping column that will return the Annotations chunks with the fake entities

setMaskingPolicy(mask: str)#

Sets the masking policy:

same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, so that the total length of the masking sequence equals the length of the original sequence. Example: Smith -> [***]. If the entity is less than 3 chars (like Jo, or 5), asterisks without brackets will be returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
entity_labels_without_brackets: Replace the values with the corresponding entity labels without brackets.
same_length_chars_without_brackets: Replace the name with asterisks of the same length without brackets.
Parameters:: mask (str) – The masking policy
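The policies listed above can be sketched as a single function. This is an illustration in plain Python, not the annotator's code: `mask` is a hypothetical helper, the angle brackets around entity labels are assumed from the `<AGE>` and `<DATE>` forms that appear elsewhere on this page, and the default fixed length of 7 is taken from setFixedMaskLength.

```python
def mask(text, label, policy, fixed_length=7):
    """Return the masked form of `text` for an entity with the given `label`."""
    if policy == "same_length_chars":
        # Entities shorter than 3 characters get asterisks without brackets.
        if len(text) < 3:
            return "*" * len(text)
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "same_length_chars_without_brackets":
        return "*" * len(text)
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    if policy == "entity_labels":
        return f"<{label}>"  # assumed bracket style
    if policy == "entity_labels_without_brackets":
        return label
    raise ValueError(f"Unknown masking policy: {policy}")

print(mask("Smith", "NAME", "same_length_chars"))  # [***]
print(mask("Jo", "NAME", "same_length_chars"))     # **
```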

setMaxRandomDisplacementDays(days: int)#

Sets maximum number of days for random date displacement. Default is 1825.

Parameters:: days (int) – Maximum number of days for random date displacement.

setMetadataMaskingPolicy(value: str)#

Sets the metadata masking policy. If specified, the metadata includes the masked form of the document. Select one of the following masking policies to return the masked form in the metadata:

'entity_labels': Replace the values with the entity label.
'same_length_chars': Replace the name with asterisks of the same length minus two, plus brackets on both ends. If the entity is less than 3 chars (like Jo, or 5), asterisks without brackets are used.
'fixed_length_chars': Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
'same_length_chars_without_brackets': Replace the name with asterisks of the same length without brackets.
'entity_labels_without_brackets': Replace the values with the entity label without brackets.
Default: ""

Parameters:: value (str) – If specified, the metadata includes the masked form of the document.

setMinYear(s)#

Sets the minimum year to be used when transforming dates into years. Default: 1929

Parameters:: s (int) – Minimum year to be used when transforming dates into years. Default: 1929

setMode(mode: str)#

Sets the mode for the Anonymizer ['mask'|'obfuscate']

Parameters:: mode (str) – Mode for the Anonymizer ['mask'|'obfuscate']

setObfuscateByAgeGroups(value: bool)#

Sets whether to obfuscate ages based on age groups. When True, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When False, the age ranges specified in the ageRanges parameter will be used to obfuscate ages.

Parameters:: value (bool) – Whether to obfuscate ages based on age groups.

setObfuscateDate(value: bool)#

When mode=='obfuscate', whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When setting it to True, make sure the dateFormats param fits the needs. If the value is True and obfuscation fails, the unnormalizedDateMode param will be activated. When set to False, the date will be masked to <DATE>. Default: False

Parameters:: value (bool) – When mode==’obfuscate’ whether to obfuscate dates or not. Default: False.

setObfuscateRefFile(f)#

Set file with the terms to be used for Obfuscation

Parameters:: f (str) – File with the terms to be used for Obfuscation

setObfuscateRefSource(source: str)#

Sets the source used to obfuscate the entities. This property does not apply to date entities. The values are the following:

custom: Takes the entities from the setCustomFakers function.
faker: Takes the entities from the Faker module.
both: Takes the entities from the setCustomFakers function and the Faker module randomly.

Parameters:: source (str) – The source of obfuscation to obfuscate the entities. Default: faker.

setObfuscateZipByHipaa(value: bool)#

Sets whether to apply HIPAA Safe Harbor ZIP code obfuscation rules.

Behavior#

True:
Apply HIPAA Safe Harbor rules for ZIP/ZIP+4 codes:
Extract the first five digits from the input (accepting formats like “12345”, “12345-6789”, “123456789”, and other tolerant forms).

If the first three-digit ZIP prefix is in the HIPAA restricted list (the 17 prefixes derived from 2000 Census data), the ZIP is suppressed to the canonical value “000**”.

Otherwise, the ZIP is generalized to the first three digits followed by “**” (i.e. XXX**). The +4 portion will be masked with asterisks if present.
False:
HIPAA-specific ZIP masking is not applied. Instead, the component’s default or user-defined ZIP obfuscation rules will be used.

Parameters:: value (bool) – If True, apply HIPAA Safe Harbor ZIP obfuscation rules. If False, skip HIPAA-specific rules and use the default/custom ZIP obfuscation.
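The rule described above can be sketched in plain Python. This is only an illustration, not the annotator's code: the function name `hipaa_zip` is hypothetical, the prefix set is the list of 17 three-digit prefixes commonly cited for Safe Harbor (assumed here to match the library's list), and the handling of the +4 portion is left out.

```python
import re

# Assumption: the 17 restricted three-digit ZIP prefixes commonly cited for
# HIPAA Safe Harbor (derived from 2000 Census data).
RESTRICTED_PREFIXES = frozenset({
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
})


def hipaa_zip(raw: str) -> str:
    """Generalize a ZIP or ZIP+4 string to its Safe Harbor form (sketch)."""
    digits = re.sub(r"\D", "", raw)  # tolerate "12345-6789", "123456789", ...
    if len(digits) < 5:
        raise ValueError(f"not a ZIP code: {raw!r}")
    prefix = digits[:3]
    # Restricted prefixes are suppressed; all others keep their first 3 digits.
    return "000**" if prefix in RESTRICTED_PREFIXES else prefix + "**"


print(hipaa_zip("12345-6789"))
print(hipaa_zip("03601"))
```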

setObfuscateZipKeepDigits(value: int)#

Sets the number of leading ZIP code digits to preserve when applying HIPAA-based ZIP code obfuscation.

This parameter is only effective when obfuscateZipByHipaa is enabled.

Behavior#

Preserves the first value digits of the ZIP code.

Masks all remaining digits— including any ZIP+4 portion—with asterisks (*).

Default: 3.

Allowed range: 0 to 5.

Examples

With the default value = 3: 12345 → 123**
With value = 2: 12345 → 12***

This setting overrides the default HIPAA Safe Harbor ZIP generalization pattern (XXX**) and allows clients to customize how many leading digits remain unmasked, enabling expert-determination–based deidentification flows.

Parameters:: value (int) – Number of ZIP digits to preserve before masking. Must be between 0 and 5 (inclusive).
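A minimal sketch of the keep-digits behavior, written in plain Python for illustration. The helper name `mask_zip` is hypothetical, and keeping separators such as the hyphen of a ZIP+4 code in place is an assumption, since the output layout of the +4 portion is not specified here.

```python
def mask_zip(zip_code: str, keep_digits: int = 3) -> str:
    """Keep the first `keep_digits` digits and mask the rest (sketch)."""
    if not 0 <= keep_digits <= 5:
        raise ValueError("keep_digits must be between 0 and 5 (inclusive)")
    masked, seen = [], 0
    for ch in zip_code:
        if ch.isdigit():
            masked.append(ch if seen < keep_digits else "*")
            seen += 1
        else:
            # Assumption: non-digit separators are left untouched.
            masked.append(ch)
    return "".join(masked)


print(mask_zip("12345"))       # default: keep 3 digits
print(mask_zip("12345", 2))    # keep 2 digits
```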

setObfuscationEquivalents(equivalents)#

Sets variant-to-canonical entity mappings to ensure consistent obfuscation.

This function allows you to define equivalence rules for entity variants that should be obfuscated the same way. For example, the names “Alex” and “Alexander” will always be mapped to the same obfuscated value if they are linked to the same canonical form.

It accepts a list of string triplets, where each triplet defines:

variant: A non-standard, short, or alternative form of a value (e.g., “Alex”)
entityType: The type of the entity (e.g., “NAME”, “STATE”, “COUNTRY”)
canonical: The standardized form all variants map to (e.g., “Alexander”)

This is especially useful in de-identification tasks to ensure consistent replacement of semantically identical values. It also allows cross-variant normalization across different occurrences of sensitive data.

Notes

Both variant and entityType comparisons are case-insensitive. For example, “alex”, “Alex”, and “ALEX” are treated as the same variant.

Example:#

>>> equivalents = [
...     ["Alex", "NAME", "Alexander"],
...     ["Rob", "NAME", "Robert"],
...     ["CA", "STATE", "California"],
...     ["Calif.", "STATE", "California"],
... ]
>>> my_deid_transformer.setObfuscationEquivalents(equivalents)

Parameters:: equivalents (list) – List of [variant, entityType, canonical] triplets
Raises:: ValueError – If any entry does not have exactly 3 elements
Returns:: self
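The triplet contract (exactly three elements, case-insensitive variant and entity type) can be sketched as a plain lookup table. The helpers `build_equivalents` and `canonical_form` are hypothetical names used only for illustration; they do not reflect how the annotator stores the mappings internally.

```python
def build_equivalents(triplets):
    """Validate [variant, entityType, canonical] triplets and index them."""
    table = {}
    for entry in triplets:
        if len(entry) != 3:
            raise ValueError(
                f"expected [variant, entityType, canonical], got {entry!r}"
            )
        variant, entity_type, canonical = entry
        # Keys are lower-cased so lookups are case-insensitive.
        table[(variant.lower(), entity_type.lower())] = canonical
    return table


def canonical_form(table, text, entity_type):
    """Return the canonical form of `text`, or `text` itself if unmapped."""
    return table.get((text.lower(), entity_type.lower()), text)


table = build_equivalents([
    ["Alex", "NAME", "Alexander"],
    ["Calif.", "STATE", "California"],
])
print(canonical_form(table, "ALEX", "name"))
```

Because both "Alex" and "Alexander" resolve to the same canonical form, a consistent obfuscator keyed on that form would give them the same replacement.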

setObfuscationEquivalentsResource(path, read_as=ReadAs.TEXT, options=None)#

Loads variant-to-canonical entity mappings from an external resource file.

This method provides an alternative way to set obfuscation equivalents by reading them from an external file (e.g., CSV or plain text). Each line in the file should contain a triplet in the form:

variant,entityType,canonical

For example:

Rob,NAME,Robert
CA,STATE,California
Calif.,STATE,California

This is useful when managing large lists of mappings outside of Python code.

Parameters:

path (str) – Path to the external resource file containing the mappings.
read_as (str, optional) – Reading mode for the resource (e.g., ReadAs.TEXT or ReadAs.SPARK), by default ReadAs.TEXT.
options (dict, optional) – Options for reading the resource. For CSV files, set {“delimiter”: “,”}. Default is {“delimiter”: “,”}.

Returns:

This transformer instance with the obfuscationEquivalentsResource parameter set.

Return type:

self

Notes

The resource file must have one mapping per line.
variant and entityType comparisons are case-insensitive during processing.
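A resource file in the layout described above can be produced with the standard csv module. This sketch only writes and re-reads the file; the file name and temporary directory are arbitrary, and the commented-out call at the end shows where the documented setter would be used on an existing annotator instance (called `deid` here as a placeholder).

```python
import csv
import os
import tempfile

rows = [
    ["Rob", "NAME", "Robert"],
    ["CA", "STATE", "California"],
    ["Calif.", "STATE", "California"],
]

# Write one variant,entityType,canonical mapping per line.
path = os.path.join(tempfile.mkdtemp(), "equivalents.csv")
with open(path, "w", newline="", encoding="utf-8") as handle:
    csv.writer(handle).writerows(rows)

# Read the file back to check its layout.
with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)

# With an annotator instance available, the file would then be passed as:
# deid.setObfuscationEquivalentsResource(path, ReadAs.TEXT, {"delimiter": ","})
```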

setObfuscationStrategyOnException(value: str)#

Sets the obfuscation strategy to be applied when an exception occurs. Four possible values are supported:

“mask”: The original chunk is replaced with a masking pattern.
“default”: The original chunk is replaced with a default faker.
“skip”: The original chunk is not replaced with any faker.
“exception”: Throws the exception.

The default obfuscation strategy is “default”.

Parameters:: value (str) – The obfuscation strategy to set. Should be one of [“mask”, “skip”, “default”, “exception”].

setOutputAsDocument(l)#

Sets whether to return all sentences joined into a single document.

Parameters:: l (bool) – Whether to return all sentences joined into a single document

setOutputCol(value)#

Sets output column name of annotations.

Parameters:: value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:: paramName (str) – Name of the parameter

setRefFileFormat(f)#

Sets format of the reference file

Parameters:: f (str) – Format of the reference file

setRefSep(c)#

Sets separator character in refFile

Parameters:: c (str) – Separator character in refFile

setRegexOverride(s)#

Sets whether to prioritize regex entities over NER entities. If the value is True, the regex entities take priority; if the value is False, the NER entities take priority. Default: False.

Parameters:: s (bool) – Whether to prioritize regex over ner entities

setRegexPatternsDictionary(path, read_as=ReadAs.TEXT, options=None)#

Sets dictionary with regular expression patterns that match some protected entity.

Parameters:

path (str) – Path where the dictionary is
read_as (ReadAs) – Format of the file
options (dict) – Dictionary with the options to read the file.

setRegexPatternsDictionaryAsJsonString(json)#

Sets dictionary with regular expression patterns as JSON that match some protected entity.

Parameters:: json (str) – regex(s) as JSON format.

setRegion(value: str)#

Selects the region-specific dateFormats. This is mainly used when obfuscating dates, for example to decide whether the first or the second part of 11/11/2023 is the day. The values are the following:

‘eu’ for the European Union
‘us’ for the USA

Default: ‘eu’

Parameters:: value (str) – The region used to select date formats. Options: ‘eu’ for European Union, ‘us’ for the USA. Default: ‘eu’
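The ambiguity that the region setting resolves is easy to see with the standard library. In this sketch the mapping `REGION_FORMATS` and the helper `parse_by_region` are hypothetical, and the Python strptime patterns only stand in for the day-first and month-first date formats the annotator selects.

```python
from datetime import datetime

# Assumption: 'eu' reads day first, 'us' reads month first.
REGION_FORMATS = {"eu": "%d/%m/%Y", "us": "%m/%d/%Y"}


def parse_by_region(text: str, region: str = "eu"):
    """Parse an ambiguous numeric date according to the region (sketch)."""
    return datetime.strptime(text, REGION_FORMATS[region]).date()


# The same string is a different calendar day depending on the region.
print(parse_by_region("03/04/2023", "eu").isoformat())
print(parse_by_region("03/04/2023", "us").isoformat())
```

Any day shift applied afterwards therefore lands on a different date depending on the region.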

setReturnEntityMappings(value: bool)#

Sets whether to return the mapping column that contains the fake entities.

Parameters:: value (bool) – Whether to return the mapping column.

setSameEntityThreshold(s)#

Sets similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).

Parameters:: s (float) – Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
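To give a feel for what a 0.9 threshold means, here is a small pure-Python sketch. The edit distance is the standard Levenshtein algorithm; normalizing it by the length of the longer string is an assumption, since the exact similarity formula used by the annotator is not stated here. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_entity(a: str, b: str, threshold: float = 0.9) -> bool:
    return similarity(a, b) >= threshold


print(same_entity("Jonathan Smith", "Jonathan Smyth"))
print(same_entity("John", "Jon"))
```

Under this normalization a single edit only passes the 0.9 threshold for strings of ten characters or more, so short variants such as "Jon" are not merged with "John".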

setSameLengthFormattedEntities(value: list)#

Sets list of formatted entities to generate the same length outputs as original ones during obfuscation

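Parameters:: value (List[str]) – List of formatted entities to generate the same length outputs as original ones during obfuscation. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE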

setSeed(s)#

Sets the seed to select the entities on obfuscate mode. With the seed, you can replay an execution several times with the same output.

Parameters:: s (int) – The seed to select the entities on obfuscate mode.

setSelectiveObfuscateRefSource(source: dict)#

A dictionary of entity names to their obfuscation modes. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation source.

Example:#

>>> selective_sources = {
... 'PHONE': 'file',
... 'ADDRESS': 'both'
... }
>>> deid.setObfuscateRefSource('faker').setSelectiveObfuscateRefSource(selective_sources)

Parameters:: source (dict[str, str]) – A dictionary of entity names to their obfuscation modes. The keys are entity names and the values are the obfuscation sources.

setSelectiveObfuscationModes(path)#

Sets the modes for multi-mode de-identification, given either directly as a dict or as the path to a JSON file containing the dictionary of modes.

‘obfuscate’: Replace the values with random values.
‘mask_same_length_chars’: Replace the name with asterisks of the same length minus two, plus brackets on both ends.
‘mask_entity_labels’: Replace the values with the corresponding entity labels.
‘mask_fixed_length_chars’: Replace the name with a fixed number of asterisks. The length can be set with “setFixedMaskLength()”.
‘mask_entity_labels_without_brackets’: Replace the values with the corresponding entity labels, without brackets.
‘mask_same_length_chars_without_brackets’: Replace the name with asterisks of the same length, without brackets.
‘skip’: Leave the values intact.

Entities not listed in the dictionary are de-identified according to setMode().

Parameters:

path (dict or str) –

If dict: direct mapping of obfuscation modes to entity type lists, e.g., {“obfuscate”: [“NAME”, “STATE”]}
If str: path to an external file containing the mappings

Raises:: ValueError – If the input is not a dictionary or string.
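The dict form maps each mode to a list of entity types, with everything else falling back to the global mode. The sketch below illustrates that resolution in plain Python; `mode_for_entity` is a hypothetical helper, and serializing the same dict with json.dumps is an assumption about what the external file would contain.

```python
import json

modes = {
    "obfuscate": ["NAME", "STATE"],
    "mask_entity_labels": ["DATE"],
    "skip": ["AGE"],
}


def mode_for_entity(modes: dict, entity: str, default_mode: str) -> str:
    """Resolve the mode for one entity type, falling back to setMode()."""
    for mode, entities in modes.items():
        if entity in entities:
            return mode
    return default_mode


print(mode_for_entity(modes, "DATE", default_mode="mask"))
print(mode_for_entity(modes, "PHONE", default_mode="mask"))

# Assumed file content when a path is passed instead of a dict.
modes_json = json.dumps(modes, indent=2)
```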

setStaticObfuscationPairs(pairs: list)#

Sets the static obfuscation pairs used for de-identification. Each pair must contain exactly three elements: [original, entityType, fake].

Example:#

>>> pairs = [
...     ["John Doe", "PERSON", "Jane Smith"],
...     ["Los Angeles", "LOCATION", "New York City"],
...   ]

Parameters:: pairs (list) – List of static obfuscation pairs. Each pair must contain exactly three elements: [original, entityType, fake].

setStaticObfuscationPairsResource(path, read_as=ReadAs.TEXT, options=None)#

Sets the file with static obfuscation pairs. Each pair must have exactly 3 elements: [original, entityType, fake]. The first element is the original value, the second is the entity type, and the third is the fake value. A delimiter separates the elements in each line; you can specify it in the options parameter.

Parameters:

path (str) – Path to the external resource
read_as (str, optional) – How to read the resource, by default ReadAs.TEXT
options (dict, optional) – Options for reading the resource, by default {“format”: “text”}

setUnnormalizedDateMode(mode: str)#

Sets the mode to use if the date is not formatted. Options: [mask, obfuscate, skip]. Default: obfuscate.

Parameters:: mode (str) – The mode to use if the date is not formatted.

setUseShifDays(s)#

setUseShiftDays(s)#

Sets if you want to use the random shift day when the document has this in its metadata. Default: False

Parameters:: s (bool) – Whether to use the random shift day when the document has this in its metadata. Default: False

setZipCodeTag(tag: str)#

Tag representing zip codes in the obfuscate reference file (default: ZIP)

Parameters:: tag (str) – Tag representing zip codes in the obfuscate reference file (default: ZIP)

write() → JavaMLWriter#: Returns an MLWriter instance for this ML instance.

class DeIdentificationModel(classname='com.johnsnowlabs.legal.chunk_classification.deid.DeIdentificationModel', java_model=None)#

Bases: sparknlp_jsl.annotator.deid.DeIdentificationModel

The DeIdentificationModel model can obfuscate or mask the entities that contain personal information.

These can be set with a file of regex patterns with setRegexPatternsDictionary, where each line is a mapping of entity to regex.

Input Annotation types	Output Annotation type
`DOCUMENT, CHUNK, TOKEN`	`DOCUMENT`

Parameters:

regexPatternsDictionary – Dictionary with regular expression patterns that match some protected entity
mode – Mode for Anonymizer [‘mask’|’obfuscate’]
obfuscateDate – When mode==’obfuscate’ whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When setting to true, make sure dateFormats param fits the needs (default: false)
dateTag – Tag representing dates in the obfuscate reference file (default: DATE)
days – Number of days to obfuscate the dates by displacement. If not provided a random integer between 1 and 60 will be used
dateToYear – True if we want the model to transform dates into years, False otherwise.
minYear – Minimum year to be used when transforming dates into years.
dateFormats – List of date formats to automatically displace if parsed
consistentObfuscation – Whether to replace very similar entities in a document with the same randomized term (default: true) The similarity is based on the Levenshtein Distance between the words.
sameEntityThreshold – Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
obfuscateRefSource – The source used to obfuscate the entities. It does not apply to date entities. The values are the following: file: Takes the entities from the obfuscatorRefFile. faker: Takes the entities from the Faker module. both: Takes the entities from the obfuscatorRefFile and the Faker module randomly.
regexOverride – If true, prioritize the regex entities; if false, prioritize the NER entities.
language – The language used to select the regex file and some faker entities. The values are the following: ‘en’(English), ‘de’(German), ‘es’(Spanish), ‘fr’(French), ‘ar’(Arabic) or ‘ro’(Romanian). Default:’en’.
seed – The seed used to select the entities on obfuscate mode. With the seed, you can replay an execution several times with the same output.
ignoreRegex – Select whether to ignore the regex file loaded in the model. If true, the default regex file will not be used. The default value is false.
isRandomDateDisplacement – Whether to use random displacement days for date entities. The random number is based on the [[DeIdentificationParams.seed]]. If true, use random displacement days for date entities; if false, use the [[DeIdentificationParams.days]]. The default value is false.
mappingsColumn – This is the mapping column that will return the Annotations chunks with the fake entities.
returnEntityMappings – With this property you select if you want to return mapping column
blackList – List of entities ignored for masking or obfuscation. The default values are: “SSN”, “PASSPORT”, “DLN”, “NPI”, “C_CARD”, “IBAN”, “DEA”
regexEntities – Keep the regex entities used in the regexPatternsDictionary
maskingPolicy – Select the masking policy: - same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, with the total length of the masking sequence equal to the length of the original sequence. Example: Smith -> [***]. If the entity is shorter than 3 characters (like Jo, or 5), asterisks without brackets are returned. - entity_labels: Replace the values with the corresponding entity labels. - fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
fixedMaskLength – This is the length of the masking sequence that will be used when the ‘fixed_length_chars’ masking policy is selected.
genderAwareness – Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False
sameLengthFormattedEntities – List of formatted entities to generate the same length outputs as original ones during obfuscation. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE
keepYear (bool) – Whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
keepMonth (bool) – Whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
...
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setUseAbbreviations(True)
...
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
...
>>> embeddings = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
...
>>> # NER entities
>>> clinical_sensitive_entities = MedicalNerModel \
...     .pretrained("ner_deid_enriched", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
...
>>> nerConverter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")
...
>>> # Deidentification
>>> deIdentification = DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models") \
...     .setInputCols(["ner_chunk", "token", "sentence"]) \
...     .setOutputCol("dei") \
...     .setMode("obfuscate") \
...     .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
...     .setObfuscateDate(True) \
...     .setDateTag("DATE") \
...     .setDays(5) \
...     .setObfuscateRefSource("both")
>>> data = spark.createDataFrame([
...     ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
...     ]).toDF("text")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     clinical_sensitive_entities,
...     nerConverter,
...     deIdentification
... ])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("dei.result").show(truncate = False)
 +--------------------------------------------------------------------------------------------------+
 |result                                                                                            |
 +--------------------------------------------------------------------------------------------------+
 |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
 +--------------------------------------------------------------------------------------------------+

additionalDateFormats#

ageRanges#

ageRangesByHipaa#

blackList#

blackListEntities#

consistentAcrossNameParts#

consistentObfuscation#

countryObfuscation#

dateEntities#

dateFormats#

dateTag#

dateToYear#

days#

deidMarkers#

enableDefaultObfuscationEquivalents#

fakerLengthOffset#

fixedMaskLength#

genderAwareness#

geoConsistency#

getter_attrs = []#

groupByCol#

ignoreRegex#

inputAnnotatorTypes#

inputCols#

isRandomDateDisplacement#

keepMonth#

keepTextSizeForObfuscation#

keepYear#

language#

lazyAnnotator#

mappingsColumn#

maskingPolicy#

maxRandomDisplacementDays#

metadataMaskingPolicy#

minYear#

mode#

name = 'DeIdentificationModel'#

obfuscateByAgeGroups#

obfuscateDate#

obfuscateRefSource#

obfuscateZipByHipaa#

obfuscateZipKeepDigits#

obfuscationStrategyOnException#

optionalInputAnnotatorTypes = []#

outputAnnotatorType = 'document'#

outputAsDocument#

outputCol#

regexEntities#

regexOverride#

region#

returnEntityMappings#

sameEntityThreshold#

sameLengthFormattedEntities#

seed#

skipLPInputColsValidation = True#

uid = ''#

unnormalizedDateMode#

useShifDays#

useShiftDays#

zipCodeTag#

clear(param: pyspark.ml.param.Param) → None#: Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) → JP#

Parameters:: extra (dict, optional) – Extra parameters to copy to the new instance
Returns:: Copy of this instance
Return type:: JavaParams

explainParam(param: str | Param) → str#: Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() → str#: Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) → pyspark.ml._typing.ParamMap#

Parameters:: extra (dict, optional) – extra param values
Returns:: merged param map
Return type:: dict

getChunkMatching()#: Gets the value of chunkMatching or its default value.

getDefaultObfuscationEquivalents()#: Returns the default obfuscation equivalents for common entities.

getDeidMarkers()#: Returns the current deid markers as a tuple (prefix, suffix).

getInputCols()#: Gets current column names of input annotations.

getLazyAnnotator()#: Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) → Any#
getOrDefault(param: Param[T]) → T: Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#: Gets output column name of annotations.

getParam(paramName: str) → Param#: Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:: paramName (str) – Name of the parameter

getRegexEntities()#: Return the regexEntities value.

getSelectiveObfuscateRefSource()#: Returns the dictionary of entity names to their obfuscate ref sources.

getUseShiftDays()#: Return the useShiftDays value.

hasDefault(param: str | Param[Any]) → bool#: Checks whether a param has a default value.

hasParam(paramName: str) → bool#: Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#

isDefined(param: str | Param[Any]) → bool#: Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) → bool#: Checks whether a param is explicitly set by user.

classmethod load(path: str) → RL#: Reads an ML instance from the input path, a shortcut of read().load(path).

static pretrained(name='legner_deid', lang='en', remote_loc='legal/models')#

Download a pre-trained DeIdentificationModel.

Parameters:

name (str) – Name of the pre-trained model, by default “legner_deid”
lang (str) – Language of the pre-trained model, by default “en”
remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.

Returns:

A pre-trained DeIdentificationModel.

Return type:

DeIdentificationModel

classmethod read()#: Returns an MLReader instance for this class.

save(path: str) → None#: Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) → None#: Sets a parameter in the embedded param map.

setAdditionalDateFormats(formats: list)#

Sets additional date formats to be considered during date obfuscation. This allows users to specify custom date formats in addition to the default date formats.

Parameters:: formats (list[str]) – List of additional date formats to be considered during date obfuscation.

setAgeGroups(value: dict)#

Default and example dictionary#

>>> { "baby": [0, 1],
...   "toddler": [1, 4],
...   "child": [4, 13],
...   "teenager": [13, 20],
...   "adult": [20, 65],
...   "senior": [65, 100] }

Parameters:: value (dict[str, List[int]]) – A dictionary of age groups to obfuscate ages.
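Looking up the group for a given age can be sketched as follows. Because neighbouring groups in the default dictionary share a boundary (for example 1 ends "baby" and starts "toddler"), this sketch assumes the lower bound is inclusive and the upper bound exclusive; the annotator's actual boundary handling is not stated here, and `age_group` is a hypothetical helper.

```python
AGE_GROUPS = {
    "baby": [0, 1],
    "toddler": [1, 4],
    "child": [4, 13],
    "teenager": [13, 20],
    "adult": [20, 65],
    "senior": [65, 100],
}


def age_group(age: int, groups: dict = AGE_GROUPS):
    """Return the name of the group containing `age`, or None (sketch)."""
    for name, (low, high) in groups.items():
        # Assumption: low is inclusive, high is exclusive.
        if low <= age < high:
            return name
    return None


print(age_group(25))
print(age_group(13))
```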

setAgeRanges(value: list)#

Sets the list of integers specifying the limits of the age groups to preserve during obfuscation.

Parameters:: value (List[int]) – List of integers specifying the limits of the age groups to preserve during obfuscation.

setAgeRangesByHipaa(value: bool)#

Sets whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.

The HIPAA Privacy Rule mandates that ages of patients older than 90 years must be obfuscated, while ages of patients 90 years or younger can remain unchanged.

Parameters:: value (bool) – If True, age entities larger than 90 will be obfuscated as per the HIPAA Privacy Rule, and the others will remain unchanged. If False, the ageRanges parameter is used. Default: False.

setBlackList(s)#

Parameters:: s (list) – List of entities that will be ignored in the regex file. The rest will be processed.

setBlackListEntities(value)#

Sets a list of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Default: []

Parameters:: value (list) – List of entities coming from NER or regex rules that will be ignored for masking or obfuscation.

setChunkMatching(chunkMatching)#

Sets entity chunk matching configuration for de-identification pipelines.

Parameters:: chunkMatching (dict[str, float]) – A dictionary specifying entity labels and associated confidence thresholds for chunk matching logic.

Notes

When applying the method across multiple rows, the groupByCol parameter is required.

setConsistentAcrossNameParts(value: bool)#

Sets whether to enforce consistent obfuscation across name parts. For example, if “John Smith” is obfuscated as “Liam Brown”, then:

When the full name “John Smith” appears, it will be replaced with “Liam Brown”
When “John” or “Smith” appear individually, they will still be obfuscated as “Liam” and “Brown” respectively, ensuring consistency in name transformation.

Default: True

Parameters:: value (bool) – Whether to enforce consistent obfuscation across name parts.
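The behavior in the example can be pictured as one mapping that covers both the full name and its parts. This is a simplified sketch with a hypothetical helper, `name_part_mapping`, and it assumes the original and the fake name split into the same number of whitespace-separated parts.

```python
def name_part_mapping(original: str, fake: str) -> dict:
    """Map a full name and each of its parts to the fake counterpart."""
    original_parts, fake_parts = original.split(), fake.split()
    if len(original_parts) != len(fake_parts):
        raise ValueError("sketch assumes the same number of name parts")
    mapping = {original: fake}
    mapping.update(zip(original_parts, fake_parts))
    return mapping


mapping = name_part_mapping("John Smith", "Liam Brown")
print(mapping)
```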

setConsistentObfuscation(s)#

Sets whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein Distance between the words.

Parameters:: s (bool) – Whether to replace very similar entities in a document with the same randomized term. The similarity is based on the Levenshtein Distance between the words.

setCountryObfuscation(value: bool)#

Sets whether to obfuscate country entities or not. If True, country entities will be obfuscated. If False, country entities will not be obfuscated.

Parameters:: value (bool) – Whether to obfuscate country entities or not. Default: False.

setDateEntities(entities: list)#

Sets list of date entities. Default: [‘DATE’, ‘DOB’, ‘DOD’, ‘EFFDATE’, ‘FISCAL_YEAR’]

Parameters:: entities (list[str]) – List of date entities.

setDateFormats(formats: list)#

Sets list of date formats to automatically displace if parsed

Parameters:: formats (list[str]) – List of date formats to automatically displace if parsed

setDateTag(tag: str)#

Sets the tag representing dates in the obfuscate reference file (default: DATE)

Parameters:: tag (str) – Tag representing dates in the obfuscate reference file (default: DATE)

setDateToYear(s)#

Sets whether to transform dates into years.

Parameters:: s (bool) – True if we want the model to transform dates into years, False otherwise.

setDays(day: int)#

Sets the number of days by which dates are displaced during obfuscation. If not provided, a random integer between 1 and 60 will be used.

Parameters:: day (int) – Number of days by which dates are displaced during obfuscation.
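Date displacement amounts to parsing the date, adding a fixed number of days, and writing it back in the same format. The sketch below reproduces the shift of 5 days from the class example using only the standard library; `shift_date` is a hypothetical helper, and the Python strptime patterns stand in for the Java-style patterns ("MM/dd/yy", "yyyy-MM-dd") that setDateFormats expects.

```python
from datetime import datetime, timedelta


def shift_date(text: str, fmt: str, days: int) -> str:
    """Displace a date by `days` and keep its original format (sketch)."""
    shifted = datetime.strptime(text, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


print(shift_date("01/13/93", "%m/%d/%y", 5))
print(shift_date("2079-11-09", "%Y-%m-%d", 5))
```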

setDeidMarkers(value)#

Sets the markers used to wrap deidentified entities in the output text.

Defaults to (“”, “”), meaning no markers are added.

Parameters:: value (list, tuple, or dict) – A pair of markers (prefix, suffix) as a list/tuple, or a dict with ‘start’ (prefix) and/or ‘end’ (suffix) keys. If only one key is provided, the other defaults to “”. Valid dict keys are ‘start’ and ‘end’ only.
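The accepted input shapes can be illustrated by normalizing them to a (prefix, suffix) pair. This is a sketch of the documented contract rather than the library's code; `normalize_markers` and `wrap` are hypothetical helpers.

```python
def normalize_markers(value):
    """Turn a list, tuple, or dict into a (prefix, suffix) tuple (sketch)."""
    if isinstance(value, dict):
        unknown = set(value) - {"start", "end"}
        if unknown:
            raise ValueError(f"invalid marker keys: {sorted(unknown)}")
        # A missing key defaults to the empty string.
        return (value.get("start", ""), value.get("end", ""))
    if isinstance(value, (list, tuple)) and len(value) == 2:
        return (value[0], value[1])
    raise ValueError("expected a (prefix, suffix) pair or a dict")


def wrap(text: str, markers) -> str:
    prefix, suffix = normalize_markers(markers)
    return f"{prefix}{text}{suffix}"


print(wrap("Liam Brown", ["<<", ">>"]))
print(wrap("Liam Brown", {"start": "**"}))
```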

setEnableDefaultObfuscationEquivalents(value: bool)#

Parameters:: value (bool) – Whether to enable default obfuscation equivalents for common entities. Default is False.

setFakerLengthOffset(value)#

It specifies how much length deviation is accepted in obfuscation, with keepTextSizeForObfuscation enabled. Must be greater than 0. Default: 3

Parameters:: value (int) – Integer value to specify length deviation.

setFixedMaskLength(length)#

Sets the length of the masking sequence in case of fixed_length_chars masking policy. Default: 7

Parameters:: length (int) – The length of the masking sequence in case of fixed_length_chars masking policy.

setForceInputTypeValidation(etfm)#

setGenderAwareness(value: bool)#

Sets whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False

Parameters:: value (bool) – Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False

setGeoConsistency(value: bool)#

Sets whether to enforce consistent obfuscation across geographical entities: state, city, street, zip and phone.

Functionality Overview#

Supported Entity Types#

The following geographical entities are processed with priority order:

state (Priority: 0) - US state names
city (Priority: 1) - City names
zip (Priority: 2) - Zip codes
street (Priority: 3) - Street addresses
phone (Priority: 4) - Phone numbers

Language Requirement#

IMPORTANT: Geographic consistency is only applied when:

geoConsistency parameter is set to True AND
language parameter is set to "en"

For non-English configurations, this feature is automatically disabled regardless of the parameter setting.

Consistency Algorithm#

When geographical entities come from the chunk columns:

Entity Grouping: All geographic entities are identified and grouped by type
Fake Address Selection: A consistent set of fake US addresses is selected using hash-based deterministic selection to ensure reproducibility
Priority-Based Mapping: Entities are mapped to fake addresses following the priority order (state → city → zip → street → phone)
Consistent Replacement: All entities of the same type within a document use the same fake address pool, maintaining geographical coherence

Parameter Interactions#

IMPORTANT: Enabling this parameter automatically disables:

keepTextSizeForObfuscation - Text size preservation is not maintained
consistentObfuscation - Standard consistency rules are overridden
file-based fakers

This is necessary because geographic consistency requires specific fake address selection that may not preserve original text lengths or follow standard obfuscation patterns.

Examples

Basic usage:

>>> from sparknlp_jsl.annotator import DeIdentification
>>> deid = DeIdentification() \
...     .setInputCols(["sentence", "token", "ner_chunk"]) \
...     .setOutputCol("deidentified") \
...     .setGeoConsistency(True) \
...     .setLanguage("en")

Parameters:: value (bool) – Whether to enforce consistent obfuscation across geographical entities. Default is False.
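The idea of hash-based deterministic selection followed by priority-ordered mapping can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the placeholder address pool, the use of SHA-256, the document key, and the helper names. The annotator's real pool and hashing scheme are not described in this reference.

```python
import hashlib

# Hypothetical placeholder pool; the real fake addresses are not shown here.
FAKE_ADDRESSES = [
    {"state": "State A", "city": "City A", "zip": "00001",
     "street": "1 Example St", "phone": "555-0100"},
    {"state": "State B", "city": "City B", "zip": "00002",
     "street": "2 Example Ave", "phone": "555-0101"},
]

# Priority order from the list above: state, city, zip, street, phone.
PRIORITY = ["state", "city", "zip", "street", "phone"]


def pick_address(key: str, pool=FAKE_ADDRESSES) -> dict:
    """Pick the same pool entry every time for the same key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


def consistent_fakes(key: str, entity_types) -> dict:
    """Map the entity types found in a document to one coherent address."""
    address = pick_address(key)
    return {t: address[t] for t in PRIORITY if t in entity_types}


print(consistent_fakes("doc-1", {"city", "state", "zip"}))
```

Because every entity type is drawn from the same selected address, the replaced state, city, and ZIP stay geographically coherent, and the selection is reproducible across runs.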

setGroupByCol(value)#

Default: “” (empty string, meaning no grouping)

Notes

The column name must be a valid string in the input dataset.
The column must be of StringType.
This functionality can change the order of the dataset, so use it with caution.
This functionality is not supported by LightPipeline.

Parameters:: value (str) – The column name used to group the dataset.

setIgnoreRegex(s)#

Sets whether to ignore the regex rules. If the value is True, it can increase performance but might decrease accuracy. Default: False.

Parameters:: s (bool) – Whether to ignore the regex rules. If True, it can increase performance but might decrease accuracy. Default: False.

setInputCols(*value)#

Sets column names of input annotations.

Parameters:: *value (List[str]) – Input columns for the annotator

setIsRandomDateDisplacement(s)#

Sets whether to use random displacement for dates.

Parameters:: s (bool) – Boolean value to select if you want to use random displacement in dates

setKeepMonth(value: bool)#

Parameters:: value (bool) – Whether to keep the month intact when obfuscating date entities.

setKeepTextSizeForObfuscation(value: bool)#

Parameters:: value (bool) – Whether to keep the text length the same when obfuscating entities.

setKeepYear(value: bool)#

Parameters:: value (bool) – Whether to keep the year intact when obfuscating date entities.

setLanguage(lang: str)#

Parameters:: lang (str) – The language used to select the regex file and some faker entities. Default: ‘en’.

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:: value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMappingsColumn(value: str)#

Sets the name of mapping column that will return the Annotations chunks with the fake entities. You can change the name of the column with this property. Default is ‘aux’.

Parameters:: value (str) – Mapping column that will return the Annotations chunks with the fake entities

setMaskingPolicy(mask: str)#

Sets the masking policy:

same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, so that the total length of the masking sequence matches the length of the original text.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
entity_labels_without_brackets: Replace the values with the corresponding entity labels, without brackets.
same_length_chars_without_brackets: Replace the obfuscated entity with asterisks of the same length as the original text, without brackets.

Parameters:: mask (str) – The masking policy
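
To make the policies concrete, here is a small plain-Python sketch of what each one produces. It is an illustration under assumptions, not the annotator's code: the helper `mask_entity` is hypothetical, the angle brackets around labels and the default fixed length of 4 are assumed, and the short-entity rule is borrowed from the setMetadataMaskingPolicy description.

```python
def mask_entity(text, label, policy, fixed_length=4):
    """Return the masked form of `text` under the given policy name."""
    if policy == "same_length_chars":
        # Brackets count towards the total length; entities shorter than
        # 3 characters are masked with plain asterisks (assumed rule).
        if len(text) < 3:
            return "*" * len(text)
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "same_length_chars_without_brackets":
        return "*" * len(text)
    if policy == "fixed_length_chars":
        # `fixed_length` is an assumed default for this sketch.
        return "*" * fixed_length
    if policy == "entity_labels":
        # Angle brackets are an assumption about the bracket style.
        return f"<{label}>"
    if policy == "entity_labels_without_brackets":
        return label
    raise ValueError(f"Unknown masking policy: {policy}")


for policy in ("same_length_chars", "fixed_length_chars", "entity_labels"):
    print(policy, "->", mask_entity("John Smith", "NAME", policy))
```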

setMaxRandomDisplacementDays(days: int)#

Sets maximum number of days for random date displacement. Default is 1825.

Parameters:: days (int) – Maximum number of days for random date displacement.

setMetadataMaskingPolicy(value: str)#

Sets metadata masking policy. If specified, the metadata includes the masked form of the document. Select one of the following masking policies to return the masked form in the metadata:

‘entity_labels’: Replace the values with the corresponding entity labels.
‘same_length_chars’: Replace the entity with asterisks and surrounding square brackets, keeping the original length (length minus two asterisks, plus a bracket on each end). If the entity is shorter than 3 characters (like Jo, or 5), asterisks are used without brackets.
‘fixed_length_chars’: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
‘same_length_chars_without_brackets’: Replace the entity with asterisks of the same length, without brackets.
‘entity_labels_without_brackets’: Replace the values with the corresponding entity labels, without brackets.
Default: “”

Parameters:: value (str) – If specified, the metadata includes the masked form of the document.

setMinYear(s)#

Sets minimum year to be used when transforming dates into years. Default: ‘1929’

Parameters:: s (int) – Minimum year to be used when transforming dates into years. Default: ‘1929’

setMode(mode: str)#

Sets mode for Anonymizer [‘mask’|’obfuscate’]

Parameters:: mode (str) – Mode for Anonymizer [‘mask’|’obfuscate’]

setObfuscateByAgeGroups(value: bool)#

Parameters:: value (bool) – Whether to obfuscate ages based on age groups.

setObfuscateDate(value: bool)#

Parameters:: value (bool) – When mode==’obfuscate’ whether to obfuscate dates or not. Default: False.

setObfuscateRefSource(source: str)#

Sets the source used to obfuscate the entities. This property does not apply to date entities. The values are the following:

custom: Takes the entities from the setCustomFakers function.
faker: Takes the entities from the Faker module.
both: Takes the entities from the setCustomFakers function and the Faker module randomly.

Parameters:: source (str) – The source of obfuscation to obfuscate the entities. Default: faker.

setObfuscateZipByHipaa(value: bool)#

Sets whether to apply HIPAA Safe Harbor ZIP code obfuscation rules.

Behavior#

True:
Apply HIPAA Safe Harbor rules for ZIP/ZIP+4 codes:
Extract the first five digits from the input (accepting formats like “12345”, “12345-6789”, “123456789”, and other tolerant forms).

If the first three-digit ZIP prefix is in the HIPAA restricted list (the 17 prefixes derived from 2000 Census data), the ZIP is suppressed to the canonical value “000**”.

Otherwise, the ZIP is generalized to the first three digits followed by “**” (i.e. XXX**). The +4 portion will be masked with asterisks if present.
False:
HIPAA-specific ZIP masking is not applied. Instead, the component’s default or user-defined ZIP obfuscation rules will be used.

Parameters:: value (bool) – If True, apply HIPAA Safe Harbor ZIP obfuscation rules. If False, skip HIPAA-specific rules and use the default/custom ZIP obfuscation.
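
The True branch can be sketched in plain Python as follows. This is an illustration of the rule as described, not the annotator's code: the helper `hipaa_zip` is hypothetical, the prefix list is the one commonly cited from HHS Safe Harbor guidance and should be verified against the current guidance, and the rendering of the masked +4 portion is not modeled.

```python
import re

# The 17 restricted three-digit prefixes (2000 Census data), as commonly
# cited from HHS guidance. Verify against the current guidance before use.
RESTRICTED_PREFIXES = {
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
}


def hipaa_zip(raw):
    """Generalize a ZIP or ZIP+4 string to its Safe Harbor form."""
    # Tolerant parsing (assumption): drop every non-digit, keep the first five.
    digits = re.sub(r"\D", "", raw)
    if len(digits) < 5:
        raise ValueError(f"Not a ZIP code: {raw!r}")
    prefix = digits[:3]
    if prefix in RESTRICTED_PREFIXES:
        return "000**"  # suppressed to the canonical value
    return prefix + "**"  # generalized to the first three digits


print(hipaa_zip("12345-6789"))
print(hipaa_zip("03601"))
```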

setObfuscateZipKeepDigits(value: int)#

Sets the number of leading ZIP code digits to preserve when applying HIPAA-based ZIP code obfuscation.

This parameter is only effective when obfuscateZipByHipaa is enabled.

Behavior#

Preserves the first value digits of the ZIP code.

Masks all remaining digits, including any ZIP+4 portion, with asterisks (*).

Default: 3.

Allowed range: 0 to 5.

Examples

value = 3 (default): 12345 → 123**
value = 2: 12345 → 12***

Parameters:: value (int) – Number of ZIP digits to preserve before masking. Must be between 0 and 5 (inclusive).
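
A minimal plain-Python sketch of the digit-preserving rule is shown below. The helper `mask_zip` is hypothetical, and keeping non-digit separators such as the hyphen in a ZIP+4 code is an assumption. The documented behavior only states that the remaining digits are masked with asterisks.

```python
def mask_zip(zip_code, keep_digits=3):
    """Keep the first `keep_digits` digits and mask every later digit."""
    if not 0 <= keep_digits <= 5:
        raise ValueError("keep_digits must be between 0 and 5 (inclusive)")
    kept = 0
    masked = []
    for ch in zip_code:
        if not ch.isdigit():
            masked.append(ch)  # separators are kept as-is (assumption)
        elif kept < keep_digits:
            masked.append(ch)
            kept += 1
        else:
            masked.append("*")
    return "".join(masked)


print(mask_zip("12345"))     # same input as the documented default example
print(mask_zip("12345", 2))  # same input as the documented value = 2 example
```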

setObfuscationEquivalents(equivalents)#

Sets variant-to-canonical entity mappings to ensure consistent obfuscation.

It accepts a list of string triplets, where each triplet defines:

variant: A non-standard, short, or alternative form of a value (e.g., “Alex”)
entityType: The type of the entity (e.g., “NAME”, “STATE”, “COUNTRY”)
canonical: The standardized form all variants map to (e.g., “Alexander”)

Example:#

>>> equivalents = [
...     ["Alex", "NAME", "Alexander"],
...     ["Rob", "NAME", "Robert"],
...     ["CA", "STATE", "California"],
...     ["Calif.", "STATE", "California"],
... ]
>>> my_deid_transformer.setObfuscationEquivalents(equivalents)

Parameters:: equivalents (list) – List of [variant, entityType, canonical] triplets.
Raises:: ValueError – If any entry does not have exactly 3 elements.
Returns:: self
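
The sketch below shows, in plain Python, how such triplets could be validated and used to normalize variants before obfuscation. It is an illustration only: the helpers `build_equivalents` and `canonicalize` are hypothetical, and the case-insensitive matching is an assumption. Only the three-element check is taken from the description above.

```python
def build_equivalents(equivalents):
    """Validate the triplets and index them by (variant, entityType)."""
    lookup = {}
    for entry in equivalents:
        if len(entry) != 3:
            raise ValueError(
                f"Expected [variant, entityType, canonical], got {entry!r}"
            )
        variant, entity_type, canonical = entry
        # Case-insensitive keys are an assumption made for this sketch.
        lookup[(variant.lower(), entity_type.upper())] = canonical
    return lookup


def canonicalize(text, entity_type, lookup):
    """Return the canonical form of `text`, or `text` itself if unmapped."""
    return lookup.get((text.lower(), entity_type.upper()), text)


lookup = build_equivalents([
    ["Alex", "NAME", "Alexander"],
    ["CA", "STATE", "California"],
    ["Calif.", "STATE", "California"],
])
print(canonicalize("Calif.", "STATE", lookup))
```

Mapping every variant to one canonical form first means that a consistent obfuscation step downstream sees "CA" and "Calif." as the same entity.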

setObfuscationStrategyOnException(value: str)#

Sets the obfuscation strategy to be applied when an exception occurs. Four possible values are supported:

“mask”: The original chunk is replaced with a masking pattern.
“default”: The original chunk is replaced with a default faker.
“skip”: The original chunk is not replaced with any faker.
“exception”: Throws the exception.

The default obfuscation strategy is “default”.

Parameters:: value (str) – The obfuscation strategy to set. Should be one of [“mask”, “skip”, “default”, “exception”].
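
The four strategies can be read as a fallback dispatch around the faker call. The plain-Python sketch below illustrates that reading. The helper `obfuscate_with_fallback`, the `DEFAULT_FAKES` table, and the `<LABEL>` masking pattern are all hypothetical stand-ins, not the library's internals.

```python
# Hypothetical default replacements used when the regular faker fails.
DEFAULT_FAKES = {"NAME": "John Doe", "CITY": "Springfield"}

STRATEGIES = ("mask", "default", "skip", "exception")


def obfuscate_with_fallback(chunk, label, faker, strategy="default"):
    """Call `faker(chunk, label)` and apply `strategy` if it raises."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy}")
    try:
        return faker(chunk, label)
    except Exception:
        if strategy == "exception":
            raise  # propagate the original error
        if strategy == "mask":
            return f"<{label}>"  # assumed masking pattern
        if strategy == "skip":
            return chunk  # leave the original chunk untouched
        return DEFAULT_FAKES.get(label, f"<{label}>")  # "default"


def failing_faker(chunk, label):
    """Stand-in faker that always fails, to exercise the fallbacks."""
    raise RuntimeError("no fake value available")


for strategy in ("mask", "default", "skip"):
    print(strategy, "->", obfuscate_with_fallback("Jon", "NAME", failing_faker, strategy))
```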

setOutputAsDocument(l)#

Sets whether to return all sentences joined into a single document.

Parameters:: l (bool) – Whether to return all sentences joined into a single document

setOutputCol(value)#

Sets output column name of annotations.

Parameters:: value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:: paramName (str) – Name of the parameter

setParams()#

setRegexOverride(s)#

Sets whether to prioritize regex entities over NER entities. If the value is True, the regex entities take priority; if the value is False, the NER entities take priority. The default value is False.

Parameters:: s (bool) – Whether to prioritize regex over ner entities

setRegion(value: str)#

Parameters:: value (str) – The region used to select date formats. Options: ‘eu’ for the European Union, ‘us’ for the USA. Default: ‘eu’

setReturnEntityMappings(value: bool)#

Enable to return mapping column that contains the fake entities.

Parameters:: value (bool) – Whether to return the mapping column.

setSameEntityThreshold(s)#

Sets similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).

Parameters:: s (float) – Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
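
As an illustration of how a Levenshtein-based threshold behaves, the plain-Python sketch below normalizes the edit distance by the longer string's length. That normalization and the helper names are assumptions. The documentation only states that the similarity is based on the Levenshtein distance.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0.0, 1.0]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_entity(a, b, threshold=0.9):
    """Treat two mentions as the same entity when similarity >= threshold."""
    return similarity(a, b) >= threshold


print(same_entity("Jonathan Smith", "Jonathan Smyth"))
print(same_entity("John", "Joan"))
```

Under this normalization a single typo in a long name stays above 0.9, while one changed letter in a four-letter name does not, so short entities need an exact or near-exact match.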

setSameLengthFormattedEntities(value: list)#

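Sets the list of formatted entities for which obfuscation generates outputs of the same length as the original ones.

Parameters:: value (List[str]) – Formatted entities whose obfuscated output keeps the original length. A fixed set of formatted entities is supported, and that set is also the default.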

setSeed(s)#

Sets the seed used to select the entities in obfuscate mode. With a fixed seed, you can replay an execution several times and get the same output.

Parameters:: s (int) – The seed to select the entities on obfuscate mode.
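
The effect of a fixed seed can be illustrated with Python's standard `random` module. This is a generic sketch of seeded selection, not the annotator's internals; the `FAKE_NAMES` pool and the helper `pick_fakes` are hypothetical.

```python
import random

# Hypothetical pool of replacement names (illustration only).
FAKE_NAMES = ["Alice Brown", "David Clark", "Maria Lopez", "Peter Novak"]


def pick_fakes(entities, seed):
    """Pick one fake name per entity using a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, global state untouched
    return [rng.choice(FAKE_NAMES) for _ in entities]


originals = ["John Doe", "Jane Roe", "Sam Poe"]
first_run = pick_fakes(originals, seed=42)
second_run = pick_fakes(originals, seed=42)
print(first_run == second_run)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the sequence reproducible even if other code draws random numbers in between.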

setSelectiveObfuscateRefSource(source: dict)#

Example:#

>>> selective_sources = {
... 'PHONE': 'file',
... 'ADDRESS': 'both'
... }
>>> deid.setObfuscateRefSource('faker').setSelectiveObfuscateRefSource(selective_sources)

Parameters:: source (dict[str, str]) – A dictionary of entity names to their obfuscation modes. The keys are entity names and the values are the obfuscation sources.

setSelectiveObfuscationModes(value: dict)#

Sets the dictionary of modes to enable multi-mode deIdentification.

‘obfuscate’: Replace the values with random values.
‘mask_same_length_chars’: Replace the entity with asterisks and surrounding square brackets, keeping the original length (length minus two asterisks, plus a bracket on each end).
‘mask_entity_labels’: Replace the values with the corresponding entity labels.
‘mask_fixed_length_chars’: Replace the entity with a fixed number of asterisks. The length can be set with setFixedMaskLength().
‘mask_entity_labels_without_brackets’: Replace the values with the corresponding entity labels, without brackets.
‘mask_same_length_chars_without_brackets’: Replace the entity with asterisks of the same length, without brackets.
‘skip’: Leave the values intact.

Entities that are not listed in the dictionary are de-identified according to the mode parameter.

Example:#

>>> DeIdentification() \
...     .setMode('mask') \
...     .setSelectiveObfuscationModes({'obfuscate': ['PHONE', 'email'],
...                                   'mask_entity_labels': ['NAME', 'CITY'],
...                                   'skip': ['id']})

Parameters:: value (dict[str, list[str]]) – The dictionary of modes to enable multi-mode deIdentification.
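
The fallback rule (entities not listed in the dictionary follow the mode parameter) can be sketched in plain Python by inverting the dictionary. The helper `resolve_modes` is hypothetical, and matching entity names case-insensitively is an assumption.

```python
def resolve_modes(selective_modes, default_mode):
    """Return a function mapping an entity label to its effective mode."""
    by_entity = {}
    for mode, entities in selective_modes.items():
        for entity in entities:
            # Case-insensitive labels are an assumption of this sketch.
            by_entity[entity.upper()] = mode

    def mode_for(entity):
        # Entities missing from the dictionary fall back to the global mode.
        return by_entity.get(entity.upper(), default_mode)

    return mode_for


mode_for = resolve_modes(
    {
        "obfuscate": ["PHONE", "email"],
        "mask_entity_labels": ["NAME", "CITY"],
        "skip": ["id"],
    },
    default_mode="mask",
)
for label in ("PHONE", "NAME", "ID", "DATE"):
    print(label, "->", mode_for(label))
```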

setStaticObfuscationPairs(pairs: list)#

Example:#

>>> pairs = [
...     ["John Doe", "PERSON", "Jane Smith"],
...     ["Los Angeles", "LOCATION", "New York City"],
...   ]

Parameters:: pairs (list) – List of static obfuscation pairs. Each pair must have exactly 3 elements: [original, entityType, fake].

setUnnormalizedDateMode(mode: str)#

Sets the mode to use if the date is not formatted. Options: [mask, obfuscate, skip]. Default: obfuscate.

Parameters:: mode (str) – The mode to use if the date is not formatted.

setUseShifDays(s)#

setUseShiftDays(s)#

Sets if you want to use the random shift day when the document has this in its metadata. Default: False

Parameters:: s (bool) – Whether to use the random shift day when the document has this in its metadata. Default: False

setZipCodeTag(tag: str)#

Tag representing zip codes in the obfuscate reference file (default: ZIP)

Parameters:: tag (str) – Tag representing zip codes in the obfuscate reference file (default: ZIP)

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) → pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:

dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() → JavaMLWriter#: Returns an MLWriter instance for this ML instance.
