sparknlp_jsl.annotator.deid.deIdentification
#
Module Contents#
Classes#
DeIdentification | Contains all the methods for training a DeIdentificationModel model.
DeIdentificationModel | Removes or obfuscates personal information from the input text.
- class DeIdentification#
Bases: sparknlp_jsl.common.AnnotatorApproachInternal, sparknlp_jsl.annotator.deid.deidentication_params.DeIdentificationParams, sparknlp_jsl.annotator.handle_exception_params.HandleExceptionParams
Contains all the methods for training a DeIdentificationModel model.
This annotator can obfuscate or mask entities that contain personal information. Entities can be defined with a file of regex patterns passed to setRegexPatternsDictionary, where each line maps an entity to a regex.
Notes
If the mode is set to ‘obfuscate’, the DeIdentificationModel utilizes java.security.SecureRandom for generating fake data. You can select a generation algorithm by configuring the system environment variable SPARK_NLP_JSL_SEED_ALGORITHM. The chosen algorithm may impact the generation of fake data, performance, and potential blocking issues. For information about standard RNG algorithm names, refer to the SecureRandom section in the Number Generation Algorithms. The default algorithm is ‘SHA1PRNG’.
Input Annotation types: DOCUMENT, CHUNK, TOKEN
Output Annotation type: DOCUMENT
The configuration params for this module are in the class DeIdentificationParams.
- Parameters:
regexPatternsDictionary – Dictionary with regular expression patterns that match some protected entity
obfuscateRefFile – File with the terms to be used for Obfuscation
refFileFormat – Format of the reference file
refSep – Separator character in refFile
selectiveObfuscationModesPath –
- Path to the JSON file that contains the selective obfuscation modes:
‘obfuscate’: Replace the values with random values.
‘mask_same_length_chars’: Replace the value with asterisks of the same length minus two, plus brackets on both ends.
‘mask_entity_labels’: Replace the values with the entity label.
‘mask_fixed_length_chars’: Replace the value with a fixed number of asterisks. The length can be set with setFixedMaskLength().
‘skip’: Leave the values intact.
Entities that are not listed in the dictionary are de-identified according to the mode parameter.
entityCasingModesPath –
- Path to the JSON file that contains the entity casing modes:
‘lowercase’: Converts all characters to lower case using the rules of the default locale.
‘uppercase’: Converts all characters to upper case using the rules of the default locale.
‘capitalize’: Converts the first character to upper case and converts the others to lower case.
‘titlecase’: Converts the first character in every token to upper case and converts the others to lower case.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> # Assumes an active SparkSession named spark
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
...
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setUseAbbreviations(True)
...
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
...
>>> embeddings = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
...
Ner entities
>>> clinical_sensitive_entities = MedicalNerModel \
...     .pretrained("ner_deid_enriched", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
...
>>> # The output column must match the chunk column consumed by DeIdentification
>>> nerConverter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")
Deidentification
>>> # setRegexPatternsDictionary: file with custom regex patterns for custom entities
>>> # setObfuscateRefFile: file with custom obfuscator names for the entities
>>> deIdentification = DeIdentification() \
...     .setInputCols(["ner_chunk", "token", "sentence"]) \
...     .setOutputCol("dei") \
...     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
...     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
...     .setRefFileFormat("csv") \
...     .setRefSep("#") \
...     .setMode("obfuscate") \
...     .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
...     .setObfuscateDate(True) \
...     .setDateTag("DATE") \
...     .setDays(5) \
...     .setObfuscateRefSource("file")
Pipeline
>>> data = spark.createDataFrame([
...     ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
... ]).toDF("text")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     clinical_sensitive_entities,
...     nerConverter,
...     deIdentification
... ])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("dei.result").show(truncate = False)
+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+
- ageRanges#
- ageRangesByHipaa#
- blackList#
- blackListEntities#
- combineRegexPatterns#
- consistentObfuscation#
- dateFormats#
- dateTag#
- dateToYear#
- days#
- doExceptionHandling#
- entityCasingModesPath#
- fixedMaskLength#
- genderAwareness#
- getter_attrs = []#
- ignoreRegex#
- inputAnnotatorTypes#
- inputCols#
- isRandomDateDisplacement#
- keepMonth#
- keepYear#
- language#
- lazyAnnotator#
- mappingsColumn#
- maskingPolicy#
- metadataMaskingPolicy#
- minYear#
- mode#
- name = 'DeIdentification'#
- obfuscateByAgeGroups#
- obfuscateDate#
- obfuscateRefFile#
- obfuscateRefSource#
- obfuscationStrategyOnException#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'document'#
- outputAsDocument#
- outputCol#
- refFileFormat#
- refSep#
- regexOverride#
- regexPatternsDictionary#
- regexPatternsDictionaryAsJsonString#
- region#
- returnEntityMappings#
- sameEntityThreshold#
- sameLengthFormattedEntities#
- seed#
- selectiveObfuscationModesPath#
- skipLPInputColsValidation = True#
- uid = ''#
- unnormalizedDateMode#
- useShifDays#
- zipCodeTag#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
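The precedence described above (default param values < user-supplied values < extra) can be pictured with plain dictionaries. This is a minimal sketch of the ordering rule only, not PySpark's implementation; the function name merge_param_maps and the sample values are hypothetical and do not reflect the annotator's real defaults.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win on conflicts."""
    merged = dict(defaults)          # lowest priority
    merged.update(user_supplied)     # overrides defaults
    merged.update(extra or {})       # highest priority
    return merged


# Hypothetical values, chosen only to show the ordering.
defaults = {"mode": "mask", "days": 0}
user_supplied = {"mode": "obfuscate"}
extra = {"days": 5}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```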
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M #
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]] #
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
- getBlackList()#
Gets the value of blackList or its default value.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getSameLengthFormattedEntities()#
Returns the sameLengthFormattedEntities value.
- getUseShiftDays()#
Return the useShiftDays value.
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setAgeGroups(value: dict)#
Sets a dictionary of age groups to obfuscate ages. For this parameter to be active, the obfuscateByAgeGroups parameter must be true. If the given ageGroups do not fully cover the ages, the remaining ages continue to be obfuscated according to the ageRanges parameter. The dictionary should contain the age group name as the key and a list of two integers as the value. The first integer is the lower bound of the age group, and the second integer is the upper bound of the age group. The default age groups for the English language are as follows:
Default and example dictionary
>>> {
...     "baby": [0, 1],
...     "toddler": [1, 4],
...     "child": [4, 13],
...     "teenager": [13, 20],
...     "adult": [20, 65],
...     "senior": [65, 100]
... }
- Parameters:
value (dict[str, List[int]]) – A dictionary of age groups to obfuscate ages.
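To picture how such a dictionary assigns an age to a group, here is a minimal pure-Python sketch. It is not the library's implementation: the function name find_age_group is hypothetical, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption, since the text above does not say how boundary ages are resolved.

```python
DEFAULT_AGE_GROUPS = {
    "baby": [0, 1],
    "toddler": [1, 4],
    "child": [4, 13],
    "teenager": [13, 20],
    "adult": [20, 65],
    "senior": [65, 100],
}


def find_age_group(age, age_groups=DEFAULT_AGE_GROUPS):
    """Return the name of the group containing `age`, or None if no group covers it."""
    for name, (lower, upper) in age_groups.items():
        # Assumption: lower bound inclusive, upper bound exclusive.
        if lower <= age < upper:
            return name
    # Not covered by any group; per the description, ageRanges would apply instead.
    return None


print(find_age_group(25))
print(find_age_group(120))
```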
- setAgeRanges(s)#
Sets list of integers specifying limits of the age groups to preserve during obfuscation
- Parameters:
s (List[str]) – Limits of the age groups to preserve during obfuscation
- setAgeRangesByHipaa(value: bool)#
Sets whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.
The HIPAA Privacy Rule mandates that the ages of patients older than 90 years must be obfuscated, while the ages of patients 90 years or younger can remain unchanged.
- Parameters:
value (bool) – If True, age entities larger than 90 are obfuscated as per the HIPAA Privacy Rule and the others remain unchanged. If False, the ageRanges parameter applies. Default: False.
- setBlackList(s)#
Sets a list of entities that will be ignored in the regex file. The rest will be processed. The default values are “IBAN”, “ZIP”, “NPI”, “DLN”, “PASSPORT”, “C_CARD”, “DEA”, “SSN”, “IP”.
- Parameters:
s (list) – List of entities that will be ignored in the regex file. The rest will be processed.
- setBlackListEntities(value)#
Sets a list of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Default: []
- Parameters:
value (list) – List of entities coming from NER or regex rules that will be ignored for masking or obfuscation.
- setCombineRegexPatterns(value)#
Sets whether to use both the loaded regex file and the default regex file.
If the value is ‘True’, both files are used. If the value is ‘False’, either the loaded file or the default file is used. Default: False.
- Parameters:
value (bool) – Whether to combine the regex files. If the value is ‘True’, both files are used. Default: False.
- setConsistentObfuscation(s)#
Sets whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein Distance between the words.
- Parameters:
s (str) – Whether to replace very similar entities in a document with the same randomized term. The similarity is based on the Levenshtein Distance between the words.
- setDateFormats(s)#
Sets list of date formats to automatically displace if parsed
- Parameters:
s (str) – List of date formats to automatically displace if parsed
- setDateTag(tag: str)#
Sets the tag that identifies date entities coming from NER (default: DATE)
- Parameters:
tag (str) – Tag that identifies date entities coming from NER (default: DATE)
- setDateToYear(s)#
Sets whether to transform dates into years.
- Parameters:
s (bool) – True if we want the model to transform dates into years, False otherwise.
- setDays(d)#
Sets the number of days by which dates are displaced during obfuscation.
- Parameters:
d (int) – Number of days by which dates are displaced during obfuscation.
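The effect of a fixed displacement can be reproduced with the standard library. The sketch below only illustrates the idea and is not the annotator's code: shift_date is a hypothetical helper, and it uses Python strptime patterns ("%m/%d/%y", "%Y-%m-%d") as stand-ins for the Java-style formats "MM/dd/yy" and "yyyy-MM-dd" used in the class example.

```python
from datetime import datetime, timedelta


def shift_date(text, days, formats):
    """Shift `text` by `days` using the first format that parses it.

    Returns the shifted date rendered in the same format, or None when
    no format matches.
    """
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate format
        return (parsed + timedelta(days=days)).strftime(fmt)
    return None


PY_FORMATS = ["%m/%d/%y", "%Y-%m-%d"]

print(shift_date("01/13/93", 5, PY_FORMATS))
print(shift_date("2079-11-09", 5, PY_FORMATS))
```

With a displacement of 5 days, the two dates from the class example move to 01/18/93 and 2079-11-14, matching the output shown there.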
- setDoExceptionHandling(value: bool)#
If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next input. This comes with a performance penalty.
- Parameters:
value (bool) – If True, exceptions are handled.
- setEntityCasingModes(path)#
Sets the path to a JSON file that contains a dictionary of entity casing modes.
‘lowercase’: Converts all characters to lower case using the rules of the default locale.
‘uppercase’: Converts all characters to upper case using the rules of the default locale.
‘capitalize’: Converts the first character to upper case and converts the others to lower case.
‘titlecase’: Converts the first character in every token to upper case and converts the others to lower case.
- Parameters:
path (str) – Path to the JSON file that contains the entity casing modes.
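The four casing modes can be mimicked in a few lines of Python. This is a sketch of the described behavior under simple assumptions (tokens are split on single spaces, and Python's own case rules stand in for the JVM default locale); apply_casing is a hypothetical name, not part of the library.

```python
def apply_casing(text, mode):
    """Apply one of the casing modes described above to `text`."""
    if mode == "lowercase":
        return text.lower()
    if mode == "uppercase":
        return text.upper()
    if mode == "capitalize":
        # First character upper case, everything else lower case.
        return text.capitalize()
    if mode == "titlecase":
        # First character of every space-separated token upper case.
        return " ".join(token.capitalize() for token in text.split(" "))
    raise ValueError("unknown casing mode: " + mode)


for mode in ("lowercase", "uppercase", "capitalize", "titlecase"):
    print(mode, "->", apply_casing("john SMITH", mode))
```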
- setFixedMaskLength(length)#
Fixed mask length: this is the length of the masking sequence that will be used when the ‘fixed_length_chars’ masking policy is selected.
- Parameters:
length (int) – The mask length
- setForceInputTypeValidation(etfm)#
- setGenderAwareness(l)#
Sets whether to use gender-aware names during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False
- Parameters:
l (str) – Whether to use gender-aware names during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False
- setIgnoreRegex(s)#
Sets whether to ignore regex rules. If the value is ‘True’, it can increase performance but might decrease accuracy. Default: False.
- Parameters:
s (bool) – Whether to ignore regex rules. If the value is ‘True’, it can increase performance but might decrease accuracy. Default: False.
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setIsRandomDateDisplacement(s)#
Sets if you want to use random displacement in dates
- Parameters:
s (bool) – Boolean value to select if you want to use random displacement in dates
- setKeepMonth(value: bool)#
Sets whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.
- Parameters:
value (bool) – Whether to keep the month intact when obfuscating date entities.
- setKeepYear(value: bool)#
Sets whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
- Parameters:
value (bool) – Whether to keep the year intact when obfuscating date entities.
- setLanguage(lang: str)#
The language used to select the regex file and some faker entities. The supported values are ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) and ‘ro’ (Romanian). Default: ‘en’
- Parameters:
lang (str) – The language used to select the regex file and some faker entities: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) or ‘ro’ (Romanian). Default: ‘en’.
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMappingsColumn(s)#
Sets the name of mapping column that will return the Annotations chunks with the fake entities
- Parameters:
s (str) – Mapping column that will return the Annotations chunks with the fake entities
- setMaskingPolicy(mask: str)#
Sets the masking policy
same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, so that the total length of the masking sequence equals the length of the original sequence. For example, Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
- Parameters:
mask (str) – The masking policy
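The three policies can be expressed as a small pure-Python function. It is a sketch of the rules as written above, not the library's code: mask_entity is a hypothetical name, the default fixed length of 4 is an arbitrary illustration value, and the angle-bracket label form follows the <AGE> token shown in the class example.

```python
def mask_entity(text, label, policy, fixed_length=4):
    """Mask `text` according to one of the masking policies described above."""
    if policy == "same_length_chars":
        if len(text) < 3:
            # Too short for brackets: asterisks only.
            return "*" * len(text)
        # Brackets count toward the total length, hence len(text) - 2 asterisks.
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "entity_labels":
        return "<" + label + ">"
    if policy == "fixed_length_chars":
        # fixed_length is an illustration value, not a documented default.
        return "*" * fixed_length
    raise ValueError("unknown masking policy: " + policy)


print(mask_entity("Smith", "NAME", "same_length_chars"))
print(mask_entity("Jo", "NAME", "same_length_chars"))
print(mask_entity("Smith", "NAME", "entity_labels"))
print(mask_entity("Smith", "NAME", "fixed_length_chars"))
```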
- setMetadataMaskingPolicy(value: str)#
Sets the metadata masking policy. If specified, the metadata includes the masked form of the document. Select one of the following masking policies to return the masked form in the metadata:
‘entity_labels’: Replace the values with the entity label.
‘same_length_chars’: Replace the value with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are used.
‘fixed_length_chars’: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
Default: “”
- Parameters:
value (str) – If specified, the metadata includes the masked form of the document.
- setMinYear(s)#
Sets minimum year to be used when transforming dates into years. Default: ‘1929’
- Parameters:
s (int) – Minimum year to be used when transforming dates into years. Default: ‘1929’
- setMode(mode: str)#
Sets mode for Anonymizer [‘mask’|’obfuscate’]
- Parameters:
mode (str) – Mode for Anonymizer [‘mask’|’obfuscate’]
- setObfuscateByAgeGroups(value: bool)#
Sets whether to obfuscate ages based on age groups. When True, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When False, the age ranges specified in the ageRanges parameter will be used to obfuscate ages.
- Parameters:
value (bool) – Whether to obfuscate ages based on age groups.
- setObfuscateDate(value)#
Sets whether to obfuscate dates when mode == ‘obfuscate’. This param helps in consistency to make dateFormats more visible. When set to true, make sure the dateFormats param fits your needs. If the value is true and obfuscation fails, the unnormalizedDateMode param is applied. When set to false, dates are masked as <DATE>. Default: false
- Parameters:
value (str) – Whether to obfuscate dates when mode == ‘obfuscate’. Default: false
- setObfuscateRefFile(f)#
Set file with the terms to be used for Obfuscation
- Parameters:
f (str) – File with the terms to be used for Obfuscation
- setObfuscateRefSource(s)#
Sets the source used for obfuscation values [‘both’|‘faker’|‘file’]
- Parameters:
s (str) – Source used for obfuscation values [‘both’|‘faker’|‘file’]
- setObfuscationStrategyOnException(value: str)#
Sets the obfuscation strategy to be applied when an exception occurs. Four possible values are supported:
“mask”: The original chunk is replaced with a masking pattern.
“default”: The original chunk is replaced with a default faker.
“skip”: The original chunk is not replaced with any faker.
“exception”: Throws the exception.
The default obfuscation strategy is “default”.
- Parameters:
value (str) – The obfuscation strategy to set. Should be one of [“mask”, “skip”, “default”, “exception”].
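The four strategies amount to a fallback decision taken when the faker fails. The sketch below is a loose illustration only, not the annotator's implementation: obfuscate_chunk, failing_faker, the <LABEL> masking pattern and the DEFAULT_FAKES table are all hypothetical stand-ins.

```python
# Hypothetical fallback values standing in for the "default faker".
DEFAULT_FAKES = {"NAME": "John Doe", "PHONE": "000-000-0000"}


def obfuscate_chunk(chunk, label, faker, strategy="default"):
    """Replace `chunk` using `faker`, falling back per `strategy` on failure."""
    try:
        return faker(chunk)
    except Exception:
        if strategy == "mask":
            return "<" + label + ">"               # masking pattern (assumed form)
        if strategy == "default":
            return DEFAULT_FAKES.get(label, "<" + label + ">")
        if strategy == "skip":
            return chunk                           # leave the original text
        if strategy == "exception":
            raise                                  # propagate the original error
        raise ValueError("unknown strategy: " + strategy)


def failing_faker(chunk):
    """A faker that always fails, used to trigger the fallback paths."""
    raise RuntimeError("no fake value available for " + chunk)


for strategy in ("mask", "default", "skip"):
    print(strategy, "->", obfuscate_chunk("Oliveira", "NAME", failing_faker, strategy))
```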
- setOutputAsDocument(l)#
Set whether to return all sentences joined into a single document
- Parameters:
l (str) – Whether to return all sentences joined into a single document
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setRefFileFormat(f)#
Sets format of the reference file
- Parameters:
f (str) – Format of the reference file
- setRefSep(c)#
Sets separator character in refFile
- Parameters:
c (str) – Separator character in refFile
- setRegexOverride(s)#
Sets whether to prioritize regex over ner entities. If the value is true, prioritize the regex entities; if the value is false, prioritize the ner. The default value is false.
- Parameters:
s (bool) – Whether to prioritize regex over ner entities
- setRegexPatternsDictionary(path, read_as=ReadAs.TEXT, options=None)#
Sets dictionary with regular expression patterns that match some protected entity.
- Parameters:
path (str) – Path where the dictionary is
read_as (ReadAs) – Format of the file
options (dict) – Dictionary with the options to read the file.
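The class description says each line of the patterns file maps an entity to a regex. The following sketch writes and parses such a file with the standard library to show the idea. The exact file layout is an assumption: a single space between the entity name and the pattern is used here, and load_regex_patterns is a hypothetical helper, not a library function.

```python
import os
import re
import tempfile


def load_regex_patterns(path, delimiter=" "):
    """Parse 'ENTITY<delimiter>regex' lines into {entity: [compiled patterns]}."""
    patterns = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            entity, pattern = line.split(delimiter, 1)
            patterns.setdefault(entity, []).append(re.compile(pattern))
    return patterns


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "regex_patterns.txt")
    with open(path, "w", encoding="utf-8") as handle:
        # Assumed layout: entity name, one space, then the regex.
        handle.write("SSN \\d{3}-\\d{2}-\\d{4}\n")
        handle.write("ID #\\s*\\d{7}\n")
    patterns = load_regex_patterns(path)

matches = [m.group(0) for m in patterns["SSN"][0].finditer("SSN on file: 123-45-6789")]
print(matches)
```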
- setRegexPatternsDictionaryAsJsonString(json)#
Sets dictionary with regular expression patterns as JSON that match some protected entity.
- Parameters:
json (str) – regex(s) as JSON format.
- setRegion(s)#
Selects the date formats used when obfuscating dates. For example, it decides whether the first or the second part of 11/11/2023 is the day. The values are: ‘eu’ for the European Union, ‘us’ for the USA. Default: ‘eu’
- Parameters:
s (str) – The region used to select date formats. Options: ‘eu’ for the European Union, ‘us’ for the USA. Default: ‘eu’
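The ambiguity that the region resolves is easy to see with an example date. This sketch uses Python's datetime with one hypothetical format per region (day first for ‘eu’, month first for ‘us’); the library's actual format lists are not shown in this page, so REGION_FORMATS and parse_by_region are assumptions made for illustration.

```python
from datetime import date, datetime

# Assumed single representative format per region.
REGION_FORMATS = {
    "eu": "%d/%m/%Y",  # day first
    "us": "%m/%d/%Y",  # month first
}


def parse_by_region(text, region="eu"):
    """Parse a slash-separated date according to the region's day/month order."""
    return datetime.strptime(text, REGION_FORMATS[region]).date()


eu_date = parse_by_region("03/04/2023", "eu")
us_date = parse_by_region("03/04/2023", "us")
print(eu_date, us_date)
```

The same text yields 3 April 2023 under ‘eu’ and 4 March 2023 under ‘us’, so a day-based displacement lands on different dates depending on the region.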
- setReturnEntityMappings(s)#
Sets if you want to return mapping column
- Parameters:
s (bool) – Whether to return the mapping column.
- setSameEntityThreshold(s)#
Sets similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
- Parameters:
s (float) – Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
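How a similarity threshold in [0.0, 1.0] can decide whether two mentions are the same entity is sketched below. setConsistentObfuscation states that similarity is based on the Levenshtein distance; the normalization used here (1 minus distance divided by the longer length) is an assumption, and all function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_same_entity(a, b, threshold=0.9):
    return similarity(a, b) >= threshold


print(is_same_entity("Gregory House", "Gregory Hause"))
print(is_same_entity("John Smith", "Jane Smith"))
```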
- setSameLengthFormattedEntities(s)#
Sets the list of formatted entities whose obfuscated outputs should have the same length as the original values.
- Parameters:
s (List[str]) – List of formatted entities whose obfuscated outputs should have the same length as the original values.
- setSeed(s)#
Sets the seed used to select the entities in obfuscate mode. With a fixed seed, you can repeat an execution several times and get the same output.
- Parameters:
s (int) – The seed to select the entities on obfuscate mode.
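The reproducibility that a seed provides can be demonstrated with Python's random module. This is only an analogy: per the class notes, the annotator relies on java.security.SecureRandom, and FAKE_NAMES and pick_fake_names below are hypothetical.

```python
import random

# Hypothetical pool of replacement values.
FAKE_NAMES = ["Alice Moore", "Brian Kelly", "Carla Diaz", "David Chen", "Emma Novak"]


def pick_fake_names(count, seed):
    """Pick `count` fake names using a generator initialised with `seed`."""
    rng = random.Random(seed)  # isolated generator, so the global state is untouched
    return [rng.choice(FAKE_NAMES) for _ in range(count)]


first_run = pick_fake_names(5, seed=42)
second_run = pick_fake_names(5, seed=42)
print(first_run == second_run)
```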
- setSelectiveObfuscationModes(path)#
Sets the path to a JSON file that contains a dictionary of modes, enabling multi-mode de-identification.
‘obfuscate’: Replace the values with random values.
‘mask_same_length_chars’: Replace the value with asterisks of the same length minus two, plus brackets on both ends.
‘mask_entity_labels’: Replace the values with the entity label.
‘mask_fixed_length_chars’: Replace the value with a fixed number of asterisks. The length can be set with setFixedMaskLength().
‘skip’: Leave the values intact.
Entities that are not listed in the dictionary are de-identified according to setMode().
- Parameters:
path (str) – Path to the JSON file that contains the selective obfuscation modes
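A file of this kind can be prepared with the standard json module. The layout below, where each mode maps to a list of entity labels, is an assumption about the expected structure, and the entity names are examples only. The lookup helper mode_for is hypothetical and simply mirrors the fallback to the global mode described above.

```python
import json
import os
import tempfile

# Assumed layout: mode name -> list of entity labels handled with that mode.
selective_modes = {
    "obfuscate": ["PHONE"],
    "mask_entity_labels": ["ID"],
    "mask_same_length_chars": ["NAME"],
    "mask_fixed_length_chars": ["ZIP"],
    "skip": ["DATE"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selective_modes.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(selective_modes, handle, indent=2)
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)

# Invert the mapping so each entity label points at its mode.
entity_to_mode = {entity: mode for mode, entities in loaded.items() for entity in entities}


def mode_for(entity, default_mode="mask"):
    """Return the selective mode for `entity`, or the global mode if it is unlisted."""
    return entity_to_mode.get(entity, default_mode)


print(mode_for("NAME"))
print(mode_for("HOSPITAL"))
```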
- setUnnormalizedDateMode(s)#
Sets the mode to use if the date is not formatted.
- Parameters:
s (str) – The mode to use if the date is not formatted. [mask, obfuscate, skip] Default: obfuscate
- setUseShifDays(s)#
- setUseShiftDays(s)#
Sets if you want to use the random shift day when the document has this in its metadata. Default: False
- Parameters:
s (bool) – Whether to use the random shift day when the document has this in its metadata. Default: False
- setZipCodeTag(tag: str)#
Tag representing zip codes in the obfuscate reference file (default: ZIP)
- Parameters:
tag (str) – Tag representing zip codes in the obfuscate reference file (default: ZIP)
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class DeIdentificationModel(classname='com.johnsnowlabs.nlp.annotators.deid.DeIdentificationModel', java_model=None)#
Bases: sparknlp_jsl.common.AnnotatorModelInternal, sparknlp_jsl.annotator.deid.deidentication_params.DeIdentificationParams, sparknlp_jsl.annotator.handle_exception_params.HandleExceptionParams
Removes or obfuscates personal information from the input text.
The DeIdentificationModel can obfuscate or mask entities that contain personal information. Entities can be defined with a file of regex patterns passed to setRegexPatternsDictionary, where each line maps an entity to a regex.
Notes
If the mode is set to ‘obfuscate’, the DeIdentificationModel utilizes java.security.SecureRandom for generating fake data. You can select a generation algorithm by configuring the system environment variable SPARK_NLP_JSL_SEED_ALGORITHM. The chosen algorithm may impact the generation of fake data, performance, and potential blocking issues. For information about standard RNG algorithm names, refer to the SecureRandom section in the Number Generation Algorithms. The default algorithm is ‘SHA1PRNG’.
Input Annotation types: DOCUMENT, CHUNK, TOKEN
Output Annotation type: DOCUMENT
The configuration params for this module are in the class DeIdentificationParams.
- Parameters:
regexEntities – Keep the regex entities used in the regexPatternDictionary
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> # Assumes an active SparkSession named spark
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
...
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setUseAbbreviations(True)
...
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
...
>>> embeddings = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
...
Ner entities
>>> clinical_sensitive_entities = MedicalNerModel \
...     .pretrained("ner_deid_enriched", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
...
>>> # The output column must match the chunk column consumed by DeIdentificationModel
>>> nerConverter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")
...
Deidentification
>>> deIdentification = DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models") \
...     .setInputCols(["ner_chunk", "token", "sentence"]) \
...     .setOutputCol("dei") \
...     .setMode("obfuscate") \
...     .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
...     .setObfuscateDate(True) \
...     .setDateTag("DATE") \
...     .setDays(5) \
...     .setObfuscateRefSource("both")
>>> data = spark.createDataFrame([
...     ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
... ]).toDF("text")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     clinical_sensitive_entities,
...     nerConverter,
...     deIdentification
... ])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("dei.result").show(truncate = False)
+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+
- ageRanges#
- ageRangesByHipaa#
- blackList#
- blackListEntities#
- consistentObfuscation#
- dateFormats#
- dateTag#
- dateToYear#
- days#
- doExceptionHandling#
- fixedMaskLength#
- genderAwareness#
- getter_attrs = []#
- ignoreRegex#
- inputAnnotatorTypes#
- inputCols#
- isRandomDateDisplacement#
- keepMonth#
- keepYear#
- language#
- lazyAnnotator#
- mappingsColumn#
- maskingPolicy#
- metadataMaskingPolicy#
- minYear#
- mode#
- name = 'DeIdentificationModel'#
- obfuscateByAgeGroups#
- obfuscateDate#
- obfuscateRefSource#
- obfuscationStrategyOnException#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'document'#
- outputAsDocument#
- outputCol#
- regexEntities#
- regexOverride#
- region#
- returnEntityMappings#
- sameEntityThreshold#
- sameLengthFormattedEntities#
- seed#
- skipLPInputColsValidation = True#
- uid = ''#
- unnormalizedDateMode#
- useShifDays#
- zipCodeTag#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getBlackList()#
Gets the value of blackList or its default value.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getRegexEntities()#
Return the regexEntities value.
- getSameLengthFormattedEntities()#
Returns the sameLengthFormattedEntities value.
- getUseShiftDays()#
Return the useShiftDays value.
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='deidentify_enriched_clinical', lang='en', remote_loc='clinical/models')#
Downloads and loads a pretrained model.
- Parameters:
name (str, optional) – Name of the pretrained model, by default “deidentify_enriched_clinical”
lang (str, optional) – Language of the pretrained model, by default “en”
remote_loc (str, optional) – Optional remote address of the resource, by default clinical/models. Will use Spark NLP's repositories otherwise.
- Returns:
The restored model
- Return type:
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setAgeGroups(value: dict)#
Sets a dictionary of age groups to obfuscate ages. For this parameter to be active, the obfuscateByAgeGroups parameter must be true. If the given ageGroups do not fully contain the ages, the ages continue to be obfuscated according to the ageRanges parameter. The dictionary should contain the age group name as the key and a list of two integers as the value. The first integer is the lower bound of the age group, and the second integer is the upper bound of the age group. The default age groups in English are as follows:
Default and example dictionary#
>>> {
...     "baby": [0, 1],
...     "toddler": [1, 4],
...     "child": [4, 13],
...     "teenager": [13, 20],
...     "adult": [20, 65],
...     "senior": [65, 100]
... }
- Parameters:
value (dict[str, List[int]]) – A dictionary of age groups to obfuscate ages.
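As a rough illustration of how an age could be matched against such a dictionary, the sketch below looks up the first group containing a given age. The helper `find_age_group` is hypothetical, and treating the upper bound as exclusive is an assumption made here because adjacent default groups share a boundary value; the library may resolve boundaries differently.

```python
from typing import Dict, List, Optional

# Default groups quoted from the setAgeGroups documentation.
AGE_GROUPS: Dict[str, List[int]] = {
    "baby": [0, 1],
    "toddler": [1, 4],
    "child": [4, 13],
    "teenager": [13, 20],
    "adult": [20, 65],
    "senior": [65, 100],
}


def find_age_group(age: int, groups: Dict[str, List[int]] = AGE_GROUPS) -> Optional[str]:
    """Return the first group whose [lower, upper) range contains age.

    The exclusive upper bound is an assumption of this sketch.
    """
    for name, (lower, upper) in groups.items():
        if lower <= age < upper:
            return name
    # Age not covered by any group; per the documentation, the ageRanges
    # parameter would then govern obfuscation.
    return None
```

An age outside every group yields `None`, which corresponds to the documented fallback to ageRanges.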
- setAgeRanges(s)#
Sets the list of integers specifying the limits of the age groups to preserve during obfuscation.
- Parameters:
s (List[str]) – Limits of the age groups to preserve during obfuscation
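One way to picture "limits of the age groups" is as sorted cut points that split ages into ranges. The sketch below finds the range an age falls into, which is the range an obfuscated age would need to stay within. The helper `age_range` and the limit values are hypothetical; the library's default limits and boundary handling are not stated here.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def age_range(age: int, limits: List[int]) -> Tuple[Optional[int], Optional[int]]:
    """Return the (lower, upper) limits surrounding age.

    None marks an open end below the first or above the last limit.
    """
    ordered = sorted(limits)
    idx = bisect_right(ordered, age)  # number of limits <= age
    lower = ordered[idx - 1] if idx > 0 else None
    upper = ordered[idx] if idx < len(ordered) else None
    return lower, upper


# Hypothetical limits chosen only for illustration.
EXAMPLE_LIMITS = [1, 4, 12, 20, 40, 60, 80]
```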
- setAgeRangesByHipaa(value: bool)#
Sets whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.
The HIPAA Privacy Rule mandates that the ages of patients older than 90 years must be obfuscated, while the ages of patients 90 years or younger can remain unchanged.
- Parameters:
value (bool) – If True, age entities greater than 90 will be obfuscated as per the HIPAA Privacy Rule, and the others will remain unchanged. If False, the ageRanges parameter applies. Default: False.
- setBlackList(s)#
Sets a list of entities that will be ignored in the regex file. The rest will be processed. The default values are “IBAN”, “ZIP”, “NPI”, “DLN”, “PASSPORT”, “C_CARD”, “DEA”, “SSN”, “IP”.
- Parameters:
s (list) – List of entities that will be ignored in the regex file. The rest will be processed.
- setBlackListEntities(value)#
Sets a list of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Default: []
- Parameters:
value (list) – List of entities coming from NER or regex rules that will be ignored for masking or obfuscation.
- setConsistentObfuscation(s)#
Sets whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein Distance between the words.
- Parameters:
s (str) – Whether to replace very similar entities in a document with the same randomized term. The similarity is based on the Levenshtein Distance between the words.
- setDateFormats(s)#
Sets list of date formats to automatically displace if parsed
- Parameters:
s (str) – List of date formats to automatically displace if parsed
- setDateTag(tag: str)#
Sets the tag representing the date entity from NER (default: DATE)
- Parameters:
tag (str) – Tag representing the date entity from NER (default: DATE)
- setDateToYear(s)#
Sets whether to transform dates into years.
- Parameters:
s (bool) – True if we want the model to transform dates into years, False otherwise.
- setDays(d)#
Sets the number of days by which dates are displaced during obfuscation.
- Parameters:
d (int) – Number of days by which dates are displaced during obfuscation.
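Date displacement by a fixed number of days can be sketched with the standard library. This is only an illustration of the idea; `shift_date` and the ISO date format are assumptions, and the library parses dates according to its own dateFormats parameter.

```python
from datetime import datetime, timedelta


def shift_date(text: str, days: int, fmt: str = "%Y-%m-%d") -> str:
    """Parse text with fmt, displace it by days, and format it back."""
    parsed = datetime.strptime(text, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)
```

Because the same format is used for parsing and output, the displaced date keeps the layout of the original, and a negative `days` value moves the date backwards.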
- setDoExceptionHandling(value: bool)#
If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next one. This comes with a performance penalty.
- Parameters:
value (bool) – If True, exceptions are handled.
- setFixedMaskLength(length)#
Fixed mask length: this is the length of the masking sequence that will be used when the ‘fixed_length_chars’ masking policy is selected.
- Parameters:
length (int) – The mask length
- setForceInputTypeValidation(etfm)#
- setGenderAwareness(l)#
Sets whether to use gender-aware names during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False
- Parameters:
l (str) – Whether to use gender-aware names during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False
- setIgnoreRegex(s)#
Sets whether to ignore the regex rules. If True, performance can increase but accuracy might decrease. Default: False.
- Parameters:
s (bool) – Whether to ignore the regex rules. If True, performance can increase but accuracy might decrease. Default: False.
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setIsRandomDateDisplacement(s)#
Sets if you want to use random displacement in dates
- Parameters:
s (bool) – Boolean value to select if you want to use random displacement in dates
- setKeepMonth(value: bool)#
Sets whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.
- Parameters:
value (bool) – Whether to keep the month intact when obfuscating date entities.
- setKeepYear(value: bool)#
Sets whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
- Parameters:
value (bool) – Whether to keep the year intact when obfuscating date entities.
- setLanguage(lang: str)#
The language used to select the regex file and some faker entities. The values are the following: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) or ‘ro’ (Romanian). Default: ‘en’
- Parameters:
lang (str) – The language used to select the regex file and some faker entities: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic) or ‘ro’ (Romanian). Default: ‘en’.
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMappingsColumn(s)#
Sets the name of mapping column that will return the Annotations chunks with the fake entities
- Parameters:
s (str) – Mapping column that will return the Annotations chunks with the fake entities
- setMaskingPolicy(mask: str)#
Sets the masking policy:
same_length_chars: Replace the obfuscated entity with a masking sequence of asterisks enclosed in square brackets, so that the masking sequence has the same total length as the original entity. For example, Smith -> [***]. If the entity is shorter than 3 characters (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
- Parameters:
mask (str) – The masking policy
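The three policies can be mimicked in a few lines of plain Python. This is a sketch of the documented behavior, not the library's implementation: `mask_entity` is a hypothetical helper, the `<LABEL>` rendering of entity labels is assumed from the `<DATE>` form mentioned for setObfuscateDate, and the default `fixed_length` of 4 is arbitrary.

```python
def mask_entity(text: str, label: str, policy: str, fixed_length: int = 4) -> str:
    """Return a masked replacement for text under the given policy."""
    if policy == "same_length_chars":
        if len(text) < 3:
            # Too short for brackets: asterisks only, same length.
            return "*" * len(text)
        # Two brackets plus asterisks, same total length as the original.
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "entity_labels":
        # Assumed rendering of the entity label.
        return f"<{label}>"
    if policy == "fixed_length_chars":
        # fixed_length plays the role of the value set via setFixedMaskLength.
        return "*" * fixed_length
    raise ValueError(f"Unknown masking policy: {policy}")
```

With `same_length_chars`, the result always has the same number of characters as the input, which keeps character offsets in the surrounding text stable.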
- setMetadataMaskingPolicy(value: str)#
Sets metadata masking policy. If specified, the metadata includes the masked form of the document. Select the following masking policy if you want to return mask form in the metadata:
‘entity_labels’: Replace the values with the entity label.
‘same_length_chars’: Replace the name with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 characters (like Jo, or 5), asterisks without brackets are used.
‘fixed_length_chars’: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
Default: “”
- Parameters:
value (str) – If specified, the metadata includes the masked form of the document.
- setMinYear(s)#
Sets minimum year to be used when transforming dates into years. Default: ‘1929’
- Parameters:
s (int) – Minimum year to be used when transforming dates into years. Default: ‘1929’
- setMode(mode: str)#
Sets mode for Anonymizer [‘mask’|’obfuscate’]
- Parameters:
mode (str) – Mode for Anonymizer [‘mask’|’obfuscate’]
- setObfuscateByAgeGroups(value: bool)#
Sets whether to obfuscate ages based on age groups. When True, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When False, the age ranges specified in the ageRanges parameter will be used to obfuscate ages.
- Parameters:
value (bool) – Whether to obfuscate ages based on age groups.
- setObfuscateDate(value)#
When mode==‘obfuscate’, sets whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. When set to true, make sure the dateFormats param fits your needs. If the value is true and obfuscation fails, the unnormalizedDateMode param is activated. When set to false, the date is masked to <DATE>. Default: false
- Parameters:
value (str) – When mode==‘obfuscate’, whether to obfuscate dates or not. Default: false
- setObfuscateRefSource(s)#
Sets the mode for selecting the obfuscation source [‘both’|‘faker’|‘file’]
- Parameters:
s (str) – Mode for selecting the obfuscation source [‘both’|‘faker’|‘file’]
- setObfuscationStrategyOnException(value: str)#
Sets the obfuscation strategy to be applied when an exception occurs. Four possible values are supported:
“mask”: The original chunk is replaced with a masking pattern.
“default”: The original chunk is replaced with a default faker.
“skip”: The original chunk is not replaced with any faker.
“exception”: Throws the exception.
The default obfuscation strategy is “default”.
- Parameters:
value (str) – The obfuscation strategy to set. Should be one of [“mask”, “skip”, “default”, “exception”].
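The four strategies amount to a fallback decision taken when generating a fake value fails. The sketch below shows that control flow only; `obfuscate_with_fallback`, the `faker` callable, and the placeholder strings are hypothetical stand-ins, not library APIs or actual output values.

```python
from typing import Callable


def obfuscate_with_fallback(
    chunk: str,
    faker: Callable[[str], str],
    strategy: str = "default",
    default_fake: str = "<DEFAULT_FAKE>",  # placeholder for a default faker value
    mask: str = "****",                    # placeholder for a masking pattern
) -> str:
    """Apply faker to chunk and fall back according to strategy on failure."""
    if strategy not in ("mask", "default", "skip", "exception"):
        raise ValueError(f"Unknown strategy: {strategy}")
    try:
        return faker(chunk)
    except Exception:
        if strategy == "mask":
            return mask          # replace with a masking pattern
        if strategy == "default":
            return default_fake  # replace with a default fake value
        if strategy == "skip":
            return chunk         # leave the original chunk untouched
        raise                    # "exception": propagate the error
```

Note that "skip" leaves the original text in the output, so it trades privacy for robustness, while "exception" stops processing at the failing chunk.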
- setOutputAsDocument(l)#
Set whether to return all sentences joined into a single document
- Parameters:
l (str) – Whether to return all sentences joined into a single document
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setRegexOverride(s)#
Sets whether to prioritize regex over ner entities. If the value is true, prioritize the regex entities; if the value is false, prioritize the ner. The default value is false.
- Parameters:
s (bool) – Whether to prioritize regex over ner entities
- setRegion(s)#
With this property, you can select particular dateFormats. This property is especially used when obfuscating dates, for example to decide whether the first or the second part of 11/11/2023 is the day. The values are the following: ‘eu’ for the European Union, ‘us’ for the USA. Default: ‘eu’
- Parameters:
s (str) – The region used to select date formats. Options: ‘eu’ for the European Union, ‘us’ for the USA. Default: ‘eu’
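The ambiguity that the region resolves can be seen by parsing the same string day-first and month-first. The format strings and the `parse_date` helper below are assumptions for illustration; the actual formats associated with each region are defined by the library.

```python
from datetime import datetime

# Assumed mapping from region to a slash-separated date format.
REGION_FORMATS = {
    "eu": "%d/%m/%Y",  # day first
    "us": "%m/%d/%Y",  # month first
}


def parse_date(text: str, region: str = "eu") -> datetime:
    """Parse a slash-separated date using the region's field order."""
    return datetime.strptime(text, REGION_FORMATS[region])


eu_date = parse_date("03/04/2023", "eu")  # 3 April 2023
us_date = parse_date("03/04/2023", "us")  # 4 March 2023
```

The two results are a month apart, so any day displacement applied afterwards would produce different obfuscated dates depending on the region.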
- setReturnEntityMappings(s)#
Sets whether to return the mapping column.
- Parameters:
s (bool) – Whether to return the mapping column.
- setSameEntityThreshold(s)#
Sets similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
- Parameters:
s (float) – Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9).
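To make the threshold concrete, the sketch below computes a Levenshtein-based similarity in [0.0, 1.0] and compares it with 0.9. Normalizing the edit distance by the longer string's length is an assumption of this example; the documentation only states that similarity is based on the Levenshtein distance. The helper names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def same_entity(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two mentions as the same entity when similarity >= threshold."""
    return similarity(a, b) >= threshold
```

Under this normalization, "Jonathan Smith" and "Jonathan Smyth" differ by one edit over 14 characters and count as the same entity, whereas "John" and "Jane" do not.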
- setSameLengthFormattedEntities(s)#
Sets the list of formatted entities for which obfuscation generates outputs of the same length as the original ones.
- Parameters:
s (List[str]) – List of formatted entities for which obfuscation generates outputs of the same length as the original ones
- setSeed(s)#
Sets the seed to select the entities on obfuscate mode. With the seed, you can replay an execution several times and get the same output.
- Parameters:
s (int) – The seed to select the entities on obfuscate mode.
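The effect of a fixed seed can be demonstrated with Python's own `random` module. This only illustrates the reproducibility idea; it does not reflect the library's generator, and `FAKE_NAMES` and `pick_fakes` are hypothetical.

```python
import random
from typing import List

# Hypothetical pool of replacement names.
FAKE_NAMES = ["Alice Moore", "Brian Cole", "Carla Diaz", "Derek Hunt"]


def pick_fakes(seed: int, count: int = 3) -> List[str]:
    """Draw count fake names from a generator initialized with seed."""
    rng = random.Random(seed)  # independent generator, no global state
    return [rng.choice(FAKE_NAMES) for _ in range(count)]


first_run = pick_fakes(seed=42)
second_run = pick_fakes(seed=42)
```

Both runs produce the same sequence because each starts from an identically seeded generator.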
- setUnnormalizedDateMode(s)#
Sets the mode to use if the date is not formatted.
- Parameters:
s (str) – The mode to use if the date is not formatted. [mask, obfuscate, skip] Default: obfuscate
- setUseShifDays(s)#
- setUseShiftDays(s)#
Sets if you want to use the random shift day when the document has this in its metadata. Default: False
- Parameters:
s (bool) – Whether to use the random shift day when the document has this in its metadata. Default: False
- setZipCodeTag(tag: str)#
Tag representing zip codes in the obfuscate reference file (default: ZIP)
- Parameters:
tag (str) – Tag representing zip codes in the obfuscate reference file (default: ZIP)
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.