sparknlp_jsl.annotator.DeIdentification#
- class sparknlp_jsl.annotator.DeIdentification[source]#
Bases:
AnnotatorApproach
Contains all the methods for training a DeIdentificationModel. This module can obfuscate or mask entities that contain personal information. These can be set with a file of regex patterns with setRegexPatternsDictionary, where each line is a mapping of entity to regex.
Input Annotation types
Output Annotation type
DOCUMENT, CHUNK, TOKEN
DOCUMENT
- Parameters:
- regexPatternsDictionary
Dictionary with regular expression patterns that match some protected entity
- mode
Mode for the Anonymizer ['mask'|'obfuscate']
- obfuscateDate
When mode=='obfuscate', whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When setting to true, make sure the dateFormats param fits the needs (default: false).
- obfuscateRefFile
File with the terms to be used for Obfuscation
- refFileFormat
Format of the reference file
- refSep
Sep character in refFile
- dateTag
Tag representing dates in the obfuscate reference file (default: DATE)
- days
Number of days to obfuscate the dates by displacement. If not provided a random integer between 1 and 60 will be used
- dateToYear
True if we want the model to transform dates into years, False otherwise.
- minYear
Minimum year to be used when transforming dates into years.
- dateFormats
List of date formats to automatically displace if parsed
- consistentObfuscation
Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.
- sameEntityThreshold
Similarity threshold [0.0-1.0] to consider two appearances of an entity as
the same
(default: 0.9).
- obfuscateRefSource
The source used to obfuscate the entities (this does not apply to date entities). The values are the following: file: take the entities from the obfuscateRefFile; faker: take the entities from the Faker module; both: take the entities randomly from the obfuscateRefFile and the Faker module.
- regexOverride
If true, prioritize the regex entities; if false, prioritize the NER entities.
- seed
The seed used to select the entities in obfuscate mode. With the seed you can replay an execution several times with the same output.
- ignoreRegex
Whether to ignore the regex file loaded in the model. If true, the default regex file will not be used (default: false).
- isRandomDateDisplacement
Whether to use a random number of displacement days for date entities; the random number is based on the seed param. If true, random displacement days are used for date entities; if false, the days param is used (default: false).
- mappingsColumn
The mapping column that will return the annotation chunks with the fake entities.
- returnEntityMappings
Whether to return the mappings column.
- blackList
List of entities ignored for masking or obfuscation. The default values are: "SSN", "PASSPORT", "DLN", "NPI", "C_CARD", "IBAN", "DEA"
- maskingPolicy
- Select the masking policy:
same_length_chars: Replace the entity with a masking sequence of asterisks wrapped in square brackets, with the same total length as the original entity. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the entity with a masking sequence composed of a fixed number of asterisks.
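As a plain-Python illustration (not the library's implementation; the helper name is hypothetical), the three policies produce output like this for the entity "Smith" labelled NAME:

```python
def mask(entity, label, policy, fixed_length=4):
    """Illustrative sketch of the three masking policies (not library code)."""
    if policy == "same_length_chars":
        # Brackets count toward the total length; entities shorter than
        # 3 chars get asterisks without brackets.
        if len(entity) < 3:
            return "*" * len(entity)
        return "[" + "*" * (len(entity) - 2) + "]"
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"unknown policy: {policy}")

print(mask("Smith", "NAME", "same_length_chars"))   # [***]  (same length as "Smith")
print(mask("Smith", "NAME", "entity_labels"))       # <NAME>
print(mask("Smith", "NAME", "fixed_length_chars"))  # ****
```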
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setUseAbbreviations(True)
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")

NER entities:
>>> clinical_sensitive_entities = MedicalNerModel \
...     .pretrained("ner_deid_enriched", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
>>> nerConverter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")

Deidentification, using a file with custom regex patterns for custom entities (setRegexPatternsDictionary) and a file with custom obfuscator names for the entities (setObfuscateRefFile):
>>> deIdentification = DeIdentification() \
...     .setInputCols(["ner_chunk", "token", "sentence"]) \
...     .setOutputCol("dei") \
...     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
...     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
...     .setRefFileFormat("csv") \
...     .setRefSep("#") \
...     .setMode("obfuscate") \
...     .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
...     .setObfuscateDate(True) \
...     .setDateTag("DATE") \
...     .setDays(5) \
...     .setObfuscateRefSource("file")

Pipeline:
>>> data = spark.createDataFrame([
...     ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
... ]).toDF("text")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     clinical_sensitive_entities,
...     nerConverter,
...     deIdentification
... ])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("dei.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+
Methods
__init__(): Initializes the annotator.
clear(param): Clears a param from the param map if it has been explicitly set.
copy([extra]): Creates a copy of this instance with the same uid and some extra params.
explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
explainParams(): Returns the documentation of all params with their optionally default values and user-supplied values.
extractParamMap([extra]): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
fit(dataset[, params]): Fits a model to the input dataset with optional parameters.
fitMultiple(dataset, paramMaps): Fits a model to the input dataset for each param map in paramMaps.
getBlackList(): Gets the list of entities ignored for masking or obfuscation.
getInputCols(): Gets current column names of input annotations.
getLazyAnnotator(): Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value.
getOutputCol(): Gets output column name of annotations.
getParam(paramName): Gets a param by its name.
getParamValue(paramName): Gets the value of a parameter.
hasDefault(param): Checks whether a param has a default value.
hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
isDefined(param): Checks whether a param is explicitly set by user or has a default value.
isSet(param): Checks whether a param is explicitly set by user.
load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
read(): Returns an MLReader instance for this class.
save(path): Save this ML instance to the given path, a shortcut of 'write().save(path)'.
set(param, value): Sets a parameter in the embedded param map.
setBlackList(s): Sets the list of entities ignored for masking or obfuscation (default: "SSN", "PASSPORT", "DLN", "NPI", "C_CARD", "IBAN", "DEA").
setConsistentObfuscation(s): Sets whether to replace very similar entities in a document with the same randomized term (default: true); similarity is based on the Levenshtein distance between the words.
setDateFormats(s): Sets the list of date formats to automatically displace if parsed.
setDateTag(t): Sets the tag representing dates in the obfuscate reference file (default: DATE).
setDateToYear(s): Sets whether to transform dates into years.
setDays(d): Sets the number of days by which to displace dates during obfuscation.
setFixedMaskLength(length): Sets the length of the masking sequence used with the 'fixed_length_chars' masking policy.
setIgnoreRegex(s): Sets whether to ignore the regex file loaded in the model.
setInputCols(*value): Sets column names of input annotations.
setIsRandomDateDisplacement(s): Sets whether to use random date displacement.
setLanguage(l): Sets the language used to select the regex file and some Faker entities: 'en' (English), 'de' (German) or 'es' (Spanish).
setLazyAnnotator(value): Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
setMappingsColumn(s): Sets the name of the mapping column that will return the annotation chunks with the fake entities.
setMaskingPolicy(m): Sets the masking policy.
setMinYear(s): Sets the minimum year to be used when transforming dates into years.
setMode(m): Sets the mode for the Anonymizer ['mask'|'obfuscate'].
setObfuscateDate(value): Sets whether to obfuscate dates when mode is 'obfuscate'.
setObfuscateRefFile(f): Sets the file with the terms to be used for obfuscation.
setObfuscateRefSource(s): Sets the obfuscation source ['both'|'faker'|'file'].
setOutputCol(value): Sets output column name of annotations.
setParamValue(paramName): Sets the value of a parameter.
setRefFileFormat(f): Sets the format of the reference file.
setRefSep(c): Sets the separator character in the reference file.
setRegexOverride(s): Sets whether to prioritize regex entities over NER entities.
setRegexPatternsDictionary(path[, read_as, ...]): Sets the dictionary with regular expression patterns that match some protected entity.
setReturnEntityMappings(s): Sets whether to return the mappings column.
setSameEntityThreshold(s): Sets the similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
setSeed(s): Sets the seed used to select the entities in obfuscate mode.
write(): Returns an MLWriter instance for this ML instance.
Attributes
blackList
consistentObfuscation
dateFormats
dateTag
dateToYear
days
fixedMaskLength
getter_attrs
ignoreRegex
inputCols
isRandomDateDisplacement
language
lazyAnnotator
mappingsColumn
maskingPolicy
minYear
mode
name
obfuscateDate
obfuscateRefFile
obfuscateRefSource
outputCol
params
refFileFormat
refSep
regexOverride
regexPatternsDictionary
returnEntityMappings
sameEntityThreshold
seed
- clear(param)#
Clears a param from the param map if it has been explicitly set.
- copy(extra=None)#
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- explainParam(param)#
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()#
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra=None)#
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra – extra param values
- Returns:
merged param map
- fit(dataset, params=None)#
Fits a model to the input dataset with optional parameters.
- Parameters:
dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
New in version 1.3.0.
- fitMultiple(dataset, paramMaps)#
Fits a model to the input dataset for each param map in paramMaps.
- Parameters:
dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
.paramMaps – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
New in version 2.3.0.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param)#
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName)#
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
- paramNamestr
Name of the parameter
- hasDefault(param)#
Checks whether a param has a default value.
- hasParam(paramName)#
Tests whether this instance contains a param with a given (string) name.
- isDefined(param)#
Checks whether a param is explicitly set by user or has a default value.
- isSet(param)#
Checks whether a param is explicitly set by user.
- classmethod load(path)#
Reads an ML instance from the input path, a shortcut of read().load(path).
- property params#
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- classmethod read()#
Returns an MLReader instance for this class.
- save(path)#
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param, value)#
Sets a parameter in the embedded param map.
- setBlackList(s)[source]#
Sets the list of entities ignored for masking or obfuscation.
- Parameters:
- slist
List of entities ignored for masking or obfuscation. The default values are: "SSN", "PASSPORT", "DLN", "NPI", "C_CARD", "IBAN", "DEA"
- setConsistentObfuscation(s)[source]#
Sets whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.
- Parameters:
- sstr
Whether to replace very similar entities in a document with the same randomized term. The similarity is based on the Levenshtein distance between the words.
- setDateFormats(s)[source]#
Sets list of date formats to automatically displace if parsed
- Parameters:
- slist
List of date formats to automatically displace if parsed
- setDateTag(t)[source]#
Sets tag representing dates in the obfuscate reference file (default: DATE)
- Parameters:
- tstr
Tag representing dates in the obfuscate reference file (default: DATE)
- setDateToYear(s)[source]#
Sets whether to transform dates into years.
- Parameters:
- sbool
True if we want the model to transform dates into years, False otherwise.
- setDays(d)[source]#
Sets the number of days by which to displace dates during obfuscation.
- Parameters:
- dint
Number of days by which to displace dates during obfuscation
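For intuition, a minimal sketch of the displacement itself, simplified to a single hard-coded format (the library handles the formats given in dateFormats):

```python
from datetime import datetime, timedelta

def displace(date_str, days, fmt="%m/%d/%y"):
    # Parse the date, shift it by `days` days, and re-serialize it
    # in the same format.
    parsed = datetime.strptime(date_str, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)

print(displace("01/13/93", 5))  # 01/18/93
```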
- setFixedMaskLength(length)[source]#
Sets the fixed mask length: the length of the masking sequence used when the 'fixed_length_chars' masking policy is selected.
- Parameters:
- lengthint
The mask length
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
- *valuestr
Input columns for the annotator
- setIsRandomDateDisplacement(s)[source]#
Sets whether to use random date displacement.
- Parameters:
- sbool
Whether to use random date displacement
- setLanguage(l)[source]#
Sets the language used to select the regex file and some Faker entities: 'en' (English), 'de' (German) or 'es' (Spanish).
- Parameters:
- lstr
The language used to select the regex file and some Faker entities: 'en' (English), 'de' (German) or 'es' (Spanish)
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
- valuebool
Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMappingsColumn(s)[source]#
Sets the name of the mapping column that will return the annotation chunks with the fake entities.
- Parameters:
- sstr
Mapping column that will return the annotation chunks with the fake entities
- setMaskingPolicy(m)[source]#
- Sets the masking policy:
same_length_chars: Replace the entity with a masking sequence of asterisks wrapped in square brackets, with the same total length as the original entity. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the entity with a masking sequence composed of a fixed number of asterisks.
- Parameters:
- mstr
The masking policy
- setMinYear(s)[source]#
Sets minimum year to be used when transforming dates into years.
- Parameters:
- sint
Minimum year to be used when transforming dates into years.
- setMode(m)[source]#
Sets the mode for the Anonymizer ['mask'|'obfuscate']
- Parameters:
- mstr
Mode for the Anonymizer ['mask'|'obfuscate']
- setObfuscateDate(value)[source]#
Sets whether to obfuscate dates when mode is 'obfuscate'.
- Parameters:
- valuebool
When mode=='obfuscate', whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When setting to true, make sure the dateFormats param fits the needs (default: false). When set to false, dates will be masked to <DATE>.
- setObfuscateRefFile(f)[source]#
Sets the file with the terms to be used for obfuscation.
- Parameters:
- fstr
File with the terms to be used for obfuscation
- setObfuscateRefSource(s)[source]#
Sets the obfuscation source ['both'|'faker'|'file']
- Parameters:
- sstr
The obfuscation source ['both'|'faker'|'file']
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
- valuestr
Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
- paramNamestr
Name of the parameter
- setRefFileFormat(f)[source]#
Sets format of the reference file
- Parameters:
- fstr
Format of the reference file
- setRefSep(c)[source]#
Sets separator character in refFile
- Parameters:
- cstr
Separator character in refFile
- setRegexOverride(s)[source]#
Sets whether to prioritize regex entities over NER entities.
- Parameters:
- sbool
Whether to prioritize regex entities over NER entities
- setRegexPatternsDictionary(path, read_as='TEXT', options=None)[source]#
Sets dictionary with regular expression patterns that match some protected entity
- Parameters:
- pathstr
Path where the dictionary is located
- read_as: ReadAs
Format of the file
- options: dict
Dictionary with the options to read the file.
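To illustrate the idea of an entity-to-regex mapping (the patterns below are hypothetical examples, not the contents of any shipped dictionary file):

```python
import re

# Hypothetical entity-to-regex mapping, like the one loaded from a
# regex patterns dictionary: each entry maps an entity label to a pattern.
patterns = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
}

def find_protected(text):
    # Return (label, match) pairs for every pattern hit in the text.
    hits = []
    for label, pat in patterns.items():
        for m in re.finditer(pat, text):
            hits.append((label, m.group()))
    return hits

print(find_protected("SSN 123-45-6789, call 555-123-4567."))
# [('SSN', '123-45-6789'), ('PHONE', '555-123-4567')]
```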
- setReturnEntityMappings(s)[source]#
Sets whether to return the mappings column.
- Parameters:
- sbool
Whether to return the mappings column.
- setSameEntityThreshold(s)[source]#
Sets similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
- Parameters:
- sfloat
Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
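A hedged sketch of the consistency check behind this threshold: two surface forms are treated as the same entity when their Levenshtein-based similarity meets the threshold (the helper names and the exact normalization are illustrative, not the library's internals):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def same_entity(a: str, b: str, threshold: float = 0.9) -> bool:
    # Normalize the distance by the longer string's length.
    similarity = 1 - levenshtein(a, b) / max(len(a), len(b))
    return similarity >= threshold

print(same_entity("Oliveira", "Oliveira"))  # True
print(same_entity("Oliveira", "Oliveyra"))  # False: similarity 0.875 < 0.9
```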
- setSeed(s)[source]#
Sets the seed used to select the entities in obfuscate mode.
- Parameters:
- sint
The seed used to select the entities in obfuscate mode
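The reproducibility this gives you can be illustrated with a plain-Python sketch (names are hypothetical; the library's internal selection logic is more involved):

```python
import random

# With a fixed seed, the same sequence of replacement picks is produced
# on every run, so obfuscation output is repeatable.
FAKE_NAMES = ["Gregory House", "Jane Doe", "John Smith"]

def pick_replacements(entities, seed):
    rng = random.Random(seed)  # local RNG so runs do not interfere
    return [rng.choice(FAKE_NAMES) for _ in entities]

run1 = pick_replacements(["Oliveira", "Smith"], seed=42)
run2 = pick_replacements(["Oliveira", "Smith"], seed=42)
assert run1 == run2  # identical output across executions with the same seed
```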
- uid#
A unique id for the object.
- write()#
Returns an MLWriter instance for this ML instance.