sparknlp_jsl.annotator.DeIdentification#
- class sparknlp_jsl.annotator.DeIdentification[source]#
Bases:
AnnotatorApproach
Contains all the methods for training a DeIdentificationModel. This module can obfuscate or mask entities that contain personal information. These can be set with a file of regex patterns with setRegexPatternsDictionary, where each line is a mapping of entity to regex.
Input Annotation types
Output Annotation type
DOCUMENT, CHUNK, TOKEN
DOCUMENT
- Parameters:
- regexPatternsDictionary
Dictionary with regular expression patterns that match some protected entity
- mode
Mode for the Anonymizer ['mask'|'obfuscate']
- obfuscateDate
When mode=='obfuscate', whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When setting to true, make sure the dateFormats param fits the needs (default: false).
- obfuscateRefFile
File with the terms to be used for Obfuscation
- refFileFormat
Format of the reference file
- refSep
Sep character in refFile
- dateTag
Tag representing dates in the obfuscate reference file (default: DATE)
- days
Number of days to obfuscate the dates by displacement. If not provided a random integer between 1 and 60 will be used
- dateToYear
True if we want the model to transform dates into years, False otherwise.
- minYear
Minimum year to be used when transforming dates into years.
- dateFormats
List of date formats to automatically displace if parsed
- consistentObfuscation
Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.
- sameEntityThreshold
Similarity threshold [0.0-1.0] to consider two appearances of an entity as
the same
(default: 0.9).
- obfuscateRefSource
The source used to obfuscate the entities (this does not apply to date entities). The values are the following: file: take the entities from the obfuscateRefFile; faker: take the entities from the Faker module; both: take the entities randomly from the obfuscateRefFile and the Faker module.
- regexOverride
If true, prioritize the regex entities; if false, prioritize the NER entities.
- seed
The seed used to select the entities in obfuscate mode. With the seed you can replay an execution several times with the same output.
- ignoreRegex
Whether to ignore the regex file loaded in the model. If true, the default regex file will not be used (default: false).
- isRandomDateDisplacement
Whether to use a random number of displacement days for date entities; the random number is based on the seed param. If true, random displacement days are used for date entities; if false, the days param is used (default: false).
- mappingsColumn
The mapping column that will return the annotation chunks with the fake entities.
- returnEntityMappings
Whether to return the mappings column.
- blackList
List of entities ignored for masking or obfuscation. The default values are: "SSN", "PASSPORT", "DLN", "NPI", "C_CARD", "IBAN", "DEA"
- maskingPolicy
- Select the masking policy:
same_length_chars: Replace the entity with a masking sequence of asterisks wrapped in square brackets, with the same total length as the original entity. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the entity with a masking sequence composed of a fixed number of asterisks.
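As a plain-Python illustration (not the library's implementation; the helper name is hypothetical), the three policies produce output like this for the entity "Smith" labelled NAME:

```python
def mask(entity, label, policy, fixed_length=4):
    """Illustrative sketch of the three masking policies (not library code)."""
    if policy == "same_length_chars":
        # Brackets count toward the total length; entities shorter than
        # 3 chars get asterisks without brackets.
        if len(entity) < 3:
            return "*" * len(entity)
        return "[" + "*" * (len(entity) - 2) + "]"
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"unknown policy: {policy}")

print(mask("Smith", "NAME", "same_length_chars"))   # [***]  (same length as "Smith")
print(mask("Smith", "NAME", "entity_labels"))       # <NAME>
print(mask("Smith", "NAME", "fixed_length_chars"))  # ****
```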
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setUseAbbreviations(True)
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")

NER entities:
>>> clinical_sensitive_entities = MedicalNerModel \
...     .pretrained("ner_deid_enriched", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
>>> nerConverter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")

Deidentification, using a file with custom regex patterns for custom entities (setRegexPatternsDictionary) and a file with custom obfuscator names for the entities (setObfuscateRefFile):
>>> deIdentification = DeIdentification() \
...     .setInputCols(["ner_chunk", "token", "sentence"]) \
...     .setOutputCol("dei") \
...     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
...     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
...     .setRefFileFormat("csv") \
...     .setRefSep("#") \
...     .setMode("obfuscate") \
...     .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
...     .setObfuscateDate(True) \
...     .setDateTag("DATE") \
...     .setDays(5) \
...     .setObfuscateRefSource("file")

Pipeline:
>>> data = spark.createDataFrame([
...     ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
... ]).toDF("text")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     clinical_sensitive_entities,
...     nerConverter,
...     deIdentification
... ])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("dei.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+
Methods
__init__(): Initializes the annotator.
clear(param): Clears a param from the param map if it has been explicitly set.
copy([extra]): Creates a copy of this instance with the same uid and some extra params.
explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
explainParams(): Returns the documentation of all params with their optionally default values and user-supplied values.
extractParamMap([extra]): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
fit(dataset[, params]): Fits a model to the input dataset with optional parameters.
fitMultiple(dataset, paramMaps): Fits a model to the input dataset for each param map in paramMaps.
getBlackList(): Gets the list of entities ignored for masking or obfuscation.
getInputCols(): Gets current column names of input annotations.
getLazyAnnotator(): Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value.
getOutputCol(): Gets output column name of annotations.
getParam(paramName): Gets a param by its name.
getParamValue(paramName): Gets the value of a parameter.
hasDefault(param): Checks whether a param has a default value.
hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
isDefined(param): Checks whether a param is explicitly set by user or has a default value.
isSet(param): Checks whether a param is explicitly set by user.
load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
read(): Returns an MLReader instance for this class.
save(path): Save this ML instance to the given path, a shortcut of 'write().save(path)'.
set(param, value): Sets a parameter in the embedded param map.
setBlackList(s): Sets the list of entities ignored for masking or obfuscation (default: "SSN", "PASSPORT", "DLN", "NPI", "C_CARD", "IBAN", "DEA").
setConsistentObfuscation(s): Sets whether to replace very similar entities in a document with the same randomized term (default: true); similarity is based on the Levenshtein distance between the words.
setDateFormats(s): Sets the list of date formats to automatically displace if parsed.
setDateTag(t): Sets the tag representing dates in the obfuscate reference file (default: DATE).
setDateToYear(s): Sets whether to transform dates into years.
setDays(d): Sets the number of days by which to displace dates during obfuscation.
setFixedMaskLength(length): Sets the length of the masking sequence used with the 'fixed_length_chars' masking policy.
setIgnoreRegex(s): Sets whether to ignore the regex file loaded in the model.
setInputCols(*value): Sets column names of input annotations.
setIsRandomDateDisplacement(s): Sets whether to use random date displacement.
setLanguage(l): Sets the language used to select the regex file and some Faker entities: 'en' (English), 'de' (German) or 'es' (Spanish).
setLazyAnnotator(value): Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
setMappingsColumn(s): Sets the name of the mapping column that will return the annotation chunks with the fake entities.
setMaskingPolicy(m): Sets the masking policy.
setMinYear(s): Sets the minimum year to be used when transforming dates into years.
setMode(m): Sets the mode for the Anonymizer ['mask'|'obfuscate'].
setObfuscateDate(value): Sets whether to obfuscate dates when mode is 'obfuscate'.
setObfuscateRefFile(f): Sets the file with the terms to be used for obfuscation.
setObfuscateRefSource(s): Sets the obfuscation source ['both'|'faker'|'file'].
setOutputCol(value): Sets output column name of annotations.
setParamValue(paramName): Sets the value of a parameter.
setRefFileFormat(f): Sets the format of the reference file.
setRefSep(c): Sets the separator character in the reference file.
setRegexOverride(s): Sets whether to prioritize regex entities over NER entities.
setRegexPatternsDictionary(path[, read_as, ...]): Sets the dictionary with regular expression patterns that match some protected entity.
setReturnEntityMappings(s): Sets whether to return the mappings column.
setSameEntityThreshold(s): Sets the similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
setSeed(s): Sets the seed used to select the entities in obfuscate mode.
write(): Returns an MLWriter instance for this ML instance.
Attributes
blackList
consistentObfuscation
dateFormats
dateTag
dateToYear
days
fixedMaskLength
getter_attrs
ignoreRegex
inputCols
isRandomDateDisplacement
language
lazyAnnotator
mappingsColumn
maskingPolicy
minYear
mode
name
obfuscateDate
obfuscateRefFile
obfuscateRefSource
outputCol
params
refFileFormat
refSep
regexOverride
regexPatternsDictionary
returnEntityMappings
sameEntityThreshold
seed
- clear(param)#
Clears a param from the param map if it has been explicitly set.
- copy(extra=None)#
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- explainParam(param)#
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()#
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra=None)#
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra – extra param values
- Returns:
merged param map
- fit(dataset, params=None)#
Fits a model to the input dataset with optional parameters.
- Parameters:
dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
New in version 1.3.0.
- fitMultiple(dataset, paramMaps)#
Fits a model to the input dataset for each param map in paramMaps.
- Parameters:
dataset – input dataset, which is an instance of
pyspark.sql.DataFrame
.paramMaps – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
New in version 2.3.0.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param)#
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName)#
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
- paramNamestr
Name of the parameter
- hasDefault(param)#
Checks whether a param has a default value.
- hasParam(paramName)#
Tests whether this instance contains a param with a given (string) name.
- isDefined(param)#
Checks whether a param is explicitly set by user or has a default value.
- isSet(param)#
Checks whether a param is explicitly set by user.
- classmethod load(path)#
Reads an ML instance from the input path, a shortcut of read().load(path).
- property params#
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- classmethod read()#
Returns an MLReader instance for this class.
- save(path)#
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param, value)#
Sets a parameter in the embedded param map.
- setBlackList(s)[source]#
Sets the list of entities ignored for masking or obfuscation.
- Parameters:
- slist
List of entities ignored for masking or obfuscation. The default values are: "SSN", "PASSPORT", "DLN", "NPI", "C_CARD", "IBAN", "DEA"
- setConsistentObfuscation(s)[source]#
Sets whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.
- Parameters:
- sstr
Whether to replace very similar entities in a document with the same randomized term. The similarity is based on the Levenshtein distance between the words.
- setDateFormats(s)[source]#
Sets list of date formats to automatically displace if parsed
- Parameters:
- slist
List of date formats to automatically displace if parsed
- setDateTag(t)[source]#
Sets tag representing dates in the obfuscate reference file (default: DATE)
- Parameters:
- tstr
Tag representing dates in the obfuscate reference file (default: DATE)
- setDateToYear(s)[source]#
Sets whether to transform dates into years.
- Parameters:
- sbool
True if we want the model to transform dates into years, False otherwise.
- setDays(d)[source]#
Sets the number of days by which to displace dates during obfuscation.
- Parameters:
- dint
Number of days by which to displace dates during obfuscation
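For intuition, a minimal sketch of the displacement itself, simplified to a single hard-coded format (the library handles the formats given in dateFormats):

```python
from datetime import datetime, timedelta

def displace(date_str, days, fmt="%m/%d/%y"):
    # Parse the date, shift it by `days` days, and re-serialize it
    # in the same format.
    parsed = datetime.strptime(date_str, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)

print(displace("01/13/93", 5))  # 01/18/93
```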
- setFixedMaskLength(length)[source]#
Sets the fixed mask length: the length of the masking sequence used when the 'fixed_length_chars' masking policy is selected.
- Parameters:
- lengthint
The mask length
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
- *valuestr
Input columns for the annotator
- setIsRandomDateDisplacement(s)[source]#
Sets whether to use random date displacement.
- Parameters:
- sbool
Whether to use random date displacement
- setLanguage(l)[source]#
Sets the language used to select the regex file and some Faker entities: 'en' (English), 'de' (German) or 'es' (Spanish).
- Parameters:
- lstr
The language used to select the regex file and some Faker entities: 'en' (English), 'de' (German) or 'es' (Spanish)
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
- valuebool
Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMappingsColumn(s)[source]#
Sets the name of the mapping column that will return the annotation chunks with the fake entities.
- Parameters:
- sstr
Mapping column that will return the annotation chunks with the fake entities
- setMaskingPolicy(m)[source]#
- Sets the masking policy:
same_length_chars: Replace the entity with a masking sequence of asterisks wrapped in square brackets, with the same total length as the original entity. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the entity with a masking sequence composed of a fixed number of asterisks.
- Parameters:
- mstr
The masking policy
- setMinYear(s)[source]#
Sets minimum year to be used when transforming dates into years.
- Parameters:
- sint
Minimum year to be used when transforming dates into years.
- setMode(m)[source]#
Sets the mode for the Anonymizer ['mask'|'obfuscate']
- Parameters:
- mstr
Mode for the Anonymizer ['mask'|'obfuscate']
- setObfuscateDate(value)[source]#
Sets whether to obfuscate dates when mode is 'obfuscate'.
- Parameters:
- valuebool
When mode=='obfuscate', whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When setting to true, make sure the dateFormats param fits the needs (default: false). When set to false, dates will be masked to <DATE>.
- setObfuscateRefFile(f)[source]#
Sets the file with the terms to be used for obfuscation.
- Parameters:
- fstr
File with the terms to be used for obfuscation
- setObfuscateRefSource(s)[source]#
Sets the obfuscation source ['both'|'faker'|'file']
- Parameters:
- sstr
The obfuscation source ['both'|'faker'|'file']
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
- valuestr
Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
- paramNamestr
Name of the parameter
- setRefFileFormat(f)[source]#
Sets format of the reference file
- Parameters:
- fstr
Format of the reference file
- setRefSep(c)[source]#
Sets separator character in refFile
- Parameters:
- cstr
Separator character in refFile
- setRegexOverride(s)[source]#
Sets whether to prioritize regex entities over NER entities.
- Parameters:
- sbool
Whether to prioritize regex entities over NER entities
- setRegexPatternsDictionary(path, read_as='TEXT', options=None)[source]#
Sets dictionary with regular expression patterns that match some protected entity
- Parameters:
- pathstr
Path where the dictionary is located
- read_as: ReadAs
Format of the file
- options: dict
Dictionary with the options to read the file.
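To illustrate the idea of an entity-to-regex mapping (the patterns below are hypothetical examples, not the contents of any shipped dictionary file):

```python
import re

# Hypothetical entity-to-regex mapping, like the one loaded from a
# regex patterns dictionary: each entry maps an entity label to a pattern.
patterns = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
}

def find_protected(text):
    # Return (label, match) pairs for every pattern hit in the text.
    hits = []
    for label, pat in patterns.items():
        for m in re.finditer(pat, text):
            hits.append((label, m.group()))
    return hits

print(find_protected("SSN 123-45-6789, call 555-123-4567."))
# [('SSN', '123-45-6789'), ('PHONE', '555-123-4567')]
```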
- setReturnEntityMappings(s)[source]#
Sets whether to return the mappings column.
- Parameters:
- sbool
Whether to return the mappings column.
- setSameEntityThreshold(s)[source]#
Sets similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
- Parameters:
- sfloat
Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
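A hedged sketch of the consistency check behind this threshold: two surface forms are treated as the same entity when their Levenshtein-based similarity meets the threshold (the helper names and the exact normalization are illustrative, not the library's internals):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def same_entity(a: str, b: str, threshold: float = 0.9) -> bool:
    # Normalize the distance by the longer string's length.
    similarity = 1 - levenshtein(a, b) / max(len(a), len(b))
    return similarity >= threshold

print(same_entity("Oliveira", "Oliveira"))  # True
print(same_entity("Oliveira", "Oliveyra"))  # False: similarity 0.875 < 0.9
```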
- setSeed(s)[source]#
Sets the seed used to select the entities in obfuscate mode.
- Parameters:
- sint
The seed used to select the entities in obfuscate mode
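The reproducibility this gives you can be illustrated with a plain-Python sketch (names are hypothetical; the library's internal selection logic is more involved):

```python
import random

# With a fixed seed, the same sequence of replacement picks is produced
# on every run, so obfuscation output is repeatable.
FAKE_NAMES = ["Gregory House", "Jane Doe", "John Smith"]

def pick_replacements(entities, seed):
    rng = random.Random(seed)  # local RNG so runs do not interfere
    return [rng.choice(FAKE_NAMES) for _ in entities]

run1 = pick_replacements(["Oliveira", "Smith"], seed=42)
run2 = pick_replacements(["Oliveira", "Smith"], seed=42)
assert run1 == run2  # identical output across executions with the same seed
```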
- uid#
A unique id for the object.
- write()#
Returns an MLWriter instance for this ML instance.