sparknlp_jsl.annotator.deid.name_obfuscator#

Module Contents#

Classes#

NameChunkObfuscator

Pretrained model to obfuscate names in a text.

NameChunkObfuscatorApproach

Contains all the methods for training a NameChunkObfuscator model.

class NameChunkObfuscator(classname='com.johnsnowlabs.nlp.annotators.deid.NameChunkObfuscator', java_model=None)#

Bases: sparknlp_jsl.common.AnnotatorModelInternal

Pretrained model to obfuscate names in a text.

Contains all the parameters to transform a dataset with an input annotation of type CHUNK into its obfuscated version by replacing the given chunks. This module can replace name, patient, and doctor entities with consistent fake names, leaving other entities unchanged.

Input Annotation types

Output Annotation type

CHUNK

CHUNK

Parameters:
  • obfuscateRefFile – File with the faker names to be used for obfuscation

  • refFileFormat – Format of the reference file

  • refSep – Separator character in refFile

  • seed – The seed used to select names during obfuscation. With a seed, you can replay an execution several times with the same output.

  • obfuscateRefSource – Mode for selecting the obfuscation source [‘both’|’faker’|’file’]

  • language – The language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), ‘ro’ (Romanian). Default: ‘en’

  • sameLength – Whether to select replacement names with the same length as the original ones during obfuscation. Default: true

  • nameEntities – The entities to select during obfuscation. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’

  • genderAwareness – Whether to use gender-aware names during obfuscation. This param affects only names. If true, it might decrease performance. Default: False
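The interplay of seed, sameLength, and the faker pool can be pictured with a small, library-free sketch. This is plain Python, not the spark-nlp-jsl API; the name pool and helper function are hypothetical, and the real annotator's selection logic may differ:

```python
import random
import zlib

# Hypothetical pool of faker names; in the real annotator this comes from
# the built-in fakers and/or the file set via obfuscateRefFile.
FAKER_NAMES = ["Mike", "Anna", "Charlestine", "Lowery", "Jonathan", "Eve"]

def obfuscate_name(original, seed=42, same_length=True):
    """Pick a deterministic replacement for `original` from the pool."""
    pool = FAKER_NAMES
    if same_length:
        # Prefer replacements with the same length as the original,
        # falling back to the whole pool if none match.
        same = [n for n in pool if len(n) == len(original)]
        pool = same or pool
    # Deriving the RNG state from (seed, original) makes the mapping
    # consistent: the same input name always maps to the same fake name,
    # and rerunning with the same seed replays the same output.
    rng = random.Random(seed ^ zlib.crc32(original.encode()))
    return rng.choice(pool)

print(obfuscate_name("John"))   # a 4-letter name, i.e. "Mike" or "Anna"
```

With `same_length=False` any name in the pool may be chosen, which mirrors setting `sameLength` to false on the annotator.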

genderAwareness#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
language#
lazyAnnotator#
nameEntities#
obfuscateRefSource#
optionalInputAnnotatorTypes = []#
outputAnnotatorType = 'chunk'#
outputCol#
sameLength#
seed#
skipLPInputColsValidation = True#
uid = ''#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setForceInputTypeValidation(etfm)#
setGenderAwareness(l)#

Sets whether to use gender-aware names during obfuscation. This param affects only names. If true, it might decrease performance. Default: False

Parameters:

l (bool) – Whether to use gender-aware names during obfuscation

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLanguage(l)#

Sets the language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), ‘ro’ (Romanian). Default: ‘en’

Parameters:

l (str) – The language used to select faker names

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setNameEntities(n)#

Sets the entities to select during obfuscation. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’

Parameters:

n (List[str]) – The name entities to select during obfuscation

setObfuscateRefSource(s)#

Sets the mode for selecting the obfuscation source [‘both’|’faker’|’file’]. Default: ‘both’

Parameters:

s (str) – Mode for selecting the obfuscation source [‘both’|’faker’|’file’]
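The three modes can be pictured as choosing which name pools feed the replacement step. The following plain-Python sketch uses hypothetical pools and is not the library's implementation:

```python
# Hypothetical name pools; the real annotator builds these from its
# built-in fakers and from the file given via obfuscateRefFile.
FAKER_POOL = ["Mike", "Anna", "Jonathan"]
FILE_POOL = ["Charlestine", "Lowery"]

def candidate_pool(mode):
    """Return the candidate replacement names for an obfuscateRefSource mode."""
    if mode == "faker":
        return list(FAKER_POOL)
    if mode == "file":
        return list(FILE_POOL)
    if mode == "both":
        # Default mode: names from the reference file plus the built-in fakers.
        return FILE_POOL + FAKER_POOL
    raise ValueError("obfuscateRefSource must be 'both', 'faker' or 'file'")

print(candidate_pool("file"))   # ['Charlestine', 'Lowery']
```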

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setSameLength(l)#

Sets whether to select replacement names with the same length as the original ones during obfuscation. Example: ‘John’ –> ‘Mike’. Default: true

Parameters:

l (bool) – Whether to select same-length names during obfuscation

setSeed(s)#
Sets the seed used to select names during obfuscation. With a seed, you can replay an execution several times with the same output.

Parameters:

s (int) – The seed used to select names during obfuscation

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.

class NameChunkObfuscatorApproach#

Bases: sparknlp_jsl.common.AnnotatorApproachInternal

Contains all the methods for training a NameChunkObfuscator model. This module can replace name, patient, and doctor entities with consistent fake names.

Input Annotation types

Output Annotation type

CHUNK

CHUNK

Parameters:
  • obfuscateRefFile – File with the faker names to be used for obfuscation

  • refFileFormat – Format of the reference file

  • refSep – Separator character in refFile

  • seed – The seed used to select names during obfuscation. With a seed, you can replay an execution several times with the same output.

  • obfuscateRefSource – Mode for selecting the obfuscation source [‘both’|’faker’|’file’]

  • language – The language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), ‘ro’ (Romanian). Default: ‘en’

  • sameLength – Whether to select replacement names with the same length as the original ones during obfuscation. Default: true

  • nameEntities – The entities to select during obfuscation. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’

  • genderAwareness – Whether to use gender-aware names during obfuscation. This param affects only names. If true, it might decrease performance. Default: False

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
...
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setUseAbbreviations(True)
...
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
...
>>> embeddings = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
...
 NER entities
>>> clinical_sensitive_entities = MedicalNerModel \
...     .pretrained("ner_deid_enriched", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
...
>>> nerConverter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")
 NameChunkObfuscatorApproach
>>> nameChunkObfuscator = NameChunkObfuscatorApproach() \
...     .setInputCols(["ner_chunk"]) \
...     .setOutputCol("replacement") \
...     .setRefFileFormat("csv") \
...     .setRefSep("#") \
...     .setObfuscateRefFile("obfuscator_names.txt") \
...     .setObfuscateRefSource("both") \
...     .setLanguage("en")
...
>>> replacer_name = Replacer() \
...     .setInputCols("replacement", "sentence") \
...     .setOutputCol("obfuscated_name") \
...     .setUseReplacement(True)
Pipeline
>>> data = spark.createDataFrame([
...     ["John Davies is a 62 y.o. patient admitted.He was seen by attending physician Dr. Lorand and was scheduled for emergency assessment."]
...     ]).toDF("text")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     clinical_sensitive_entities,
...     nerConverter,
...     nameChunkObfuscator,
...     replacer_name
... ])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("obfuscated_name.result").show(truncate = False)
  +-------------------------------------------------------------------------------------------------------------------------------------+
  |result                                                                                                                               |
  +-------------------------------------------------------------------------------------------------------------------------------------+
  |[Charlestine is a 62 y.o. patient admitted.He was seen by attending physician Dr. Lowery and was scheduled for emergency assessment.]|
  +-------------------------------------------------------------------------------------------------------------------------------------+
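The obfuscator_names.txt file used above is supplied via setObfuscateRefFile with setRefFileFormat("csv") and setRefSep("#"). As a rough sketch, such a file can be parsed line by line on the separator. This reader is hypothetical, assuming #-separated name fields per line; consult the library documentation for the exact expected layout:

```python
import io

def read_ref_names(handle, sep="#"):
    """Split each line of a refSep-separated reference file into name fields."""
    names = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        names.extend(field.strip() for field in line.split(sep) if field.strip())
    return names

# Simulate a tiny reference file in memory.
sample = io.StringIO("Charlestine#Lowery\nMarta#Jonas\n")
print(read_ref_names(sample))   # ['Charlestine', 'Lowery', 'Marta', 'Jonas']
```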
genderAwareness#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
language#
lazyAnnotator#
nameEntities#
obfuscateRefFile#
obfuscateRefSource#
optionalInputAnnotatorTypes = []#
outputAnnotatorType = 'chunk'#
outputCol#
refFileFormat#
refSep#
sameLength#
seed#
skipLPInputColsValidation = True#
uid = ''#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M#
fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]]#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • paramMaps (collections.abc.Sequence) – A Sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setForceInputTypeValidation(etfm)#
setGenderAwareness(l)#

Sets whether to use gender-aware names during obfuscation. This param affects only names. If true, it might decrease performance. Default: False

Parameters:

l (bool) – Whether to use gender-aware names during obfuscation

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLanguage(l)#

Sets the language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), ‘ro’ (Romanian). Default: ‘en’

Parameters:

l (str) – The language used to select faker names

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setNameEntities(n)#

Sets the entities to select during obfuscation. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’

Parameters:

n (List[str]) – The name entities to select during obfuscation

setObfuscateRefFile(f)#

Sets the file with the terms to be used for obfuscation

Parameters:

f (str) – File with the terms to be used for obfuscation

setObfuscateRefSource(s)#

Sets the mode for selecting the obfuscation source [‘both’|’faker’|’file’]. Default: ‘both’

Parameters:

s (str) – Mode for selecting the obfuscation source [‘both’|’faker’|’file’]

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setRefFileFormat(f)#

Sets format of the reference file

Parameters:

f (str) – Format of the reference file

setRefSep(c)#

Sets separator character in refFile

Parameters:

c (str) – Separator character in refFile

setSameLength(l)#

Sets whether to select replacement names with the same length as the original ones during obfuscation. Example: ‘John’ –> ‘Mike’. Default: true

Parameters:

l (bool) – Whether to select same-length names during obfuscation

setSeed(s)#
Sets the seed used to select names during obfuscation. With a seed, you can replay an execution several times with the same output.

Parameters:

s (int) – The seed used to select names during obfuscation

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.