sparknlp_jsl.annotator.deid.name_obfuscator
Module Contents#
Classes#
NameChunkObfuscator | Pretrained model to obfuscate names in a text.
NameChunkObfuscatorApproach | Contains all the methods for training a NameChunkObfuscator model.
- class NameChunkObfuscator(classname='com.johnsnowlabs.nlp.annotators.deid.NameChunkObfuscator', java_model=None)#
Bases:
sparknlp_jsl.common.AnnotatorModelInternal
Pretrained model to obfuscate names in a text.
Contains all the parameters needed to transform a dataset with an input annotation of type CHUNK into its obfuscated version by obfuscating the given chunks. This module replaces name, patient, and doctor entities with consistent fake names and leaves all other entities unchanged.
Input Annotation types: CHUNK
Output Annotation type: CHUNK
- Parameters:
obfuscateRefFile – File with the faker names to be used for obfuscation
refFileFormat – Format of the reference file
refSep – Separator character in refFile
seed – The seed used to select names during obfuscation. With a fixed seed, you can repeat an execution several times and get the same output.
obfuscateRefSource – Mode for selecting the obfuscation source [‘both’ | ‘faker’ | ‘file’]
language – The language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), or ‘ro’ (Romanian). Default: ‘en’
sameLength – Whether to select replacement names with the same length as the original ones during obfuscation. Default: true
nameEntities – The entities to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’
genderAwareness – Whether to use gender-aware names during obfuscation. This param affects only names. Setting it to true might decrease performance. Default: False
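The interplay of seed, sameLength, and "consistent" replacement can be pictured with a small pure-Python sketch. It is not the annotator's algorithm: the FAKE_NAMES pool and the pick_replacement helper are hypothetical, and the hashing scheme is only one assumed way to make the same (seed, name) pair always map to the same fake name.

```python
import hashlib
import random

# Hypothetical replacement pool; the real annotator draws from its own
# faker lists and/or the reference file.
FAKE_NAMES = ["Mike", "Anna", "Luis", "Omar", "Peter", "Sarah", "Jonas", "Elena"]


def pick_replacement(name, seed, same_length=True):
    """Pick a stable replacement for `name` (illustrative only)."""
    pool = [n for n in FAKE_NAMES if n.lower() != name.lower()]
    if same_length:
        same = [n for n in pool if len(n) == len(name)]
        pool = same or pool  # fall back when no candidate has that length
    # Derive the RNG state from the seed and the name, so the same
    # (seed, name) pair always yields the same replacement.
    digest = hashlib.sha256(f"{seed}:{name.lower()}".encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    return rng.choice(pool)


first = pick_replacement("John", seed=42)
second = pick_replacement("John", seed=42)
print(first, second)  # the two values are identical and four characters long
```

In this sketch, a different seed may select a different name, while a fixed seed reproduces the same mapping on every run.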
- genderAwareness#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- language#
- lazyAnnotator#
- nameEntities#
- obfuscateRefSource#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'chunk'#
- outputCol#
- sameLength#
- seed#
- skipLPInputColsValidation = True#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
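The precedence rule described for extractParamMap (default param values < user-supplied values < extra) can be illustrated with plain dictionaries. This is a sketch of the ordering only; merge_param_maps is a hypothetical helper, and real param maps are keyed by Param objects rather than strings.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three layers so that later layers win on conflicts."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged


defaults = {"language": "en", "sameLength": True, "obfuscateRefSource": "both"}
user_supplied = {"language": "de", "seed": 10}
extra = {"seed": 99}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

Here language comes from the user-supplied layer, seed from extra, and the remaining keys keep their defaults.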
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
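The difference between isSet, hasDefault, isDefined, and getOrDefault is easy to mix up. The toy class below mirrors the behaviour described in the entries above using two dictionaries; TinyParams is hypothetical and string-keyed, and it is not how pyspark stores params internally.

```python
class TinyParams:
    """Toy model of the param lookups described above (illustrative only)."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default param values
        self._user = {}                  # explicitly set values

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Removes only the explicitly set value; the default stays.
        self._user.pop(name, None)

    def isSet(self, name):
        return name in self._user

    def hasDefault(self, name):
        return name in self._defaults

    def isDefined(self, name):
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} is neither set nor has a default")


params = TinyParams({"language": "en"})
params.set("seed", 42)
print(params.isSet("language"), params.isDefined("language"))
```

A param with only a default is defined but not set; a param with neither makes getOrDefault raise.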
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setForceInputTypeValidation(etfm)#
- setGenderAwareness(l)#
Sets whether to use gender-aware names during obfuscation. This param affects only names. Setting it to true might decrease performance. Default: False
- Parameters:
l (bool) – Whether to use gender-aware names during obfuscation.
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLanguage(l)#
Sets the language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), or ‘ro’ (Romanian). Default: ‘en’
- Parameters:
l (str) – The language used to select faker names: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), or ‘ro’ (Romanian)
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setNameEntities(n)#
Sets the entities to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’
- Parameters:
n (List[str]) – The entities to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’
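The effect of nameEntities, where listed entities are replaced and all other chunks stay the same, can be sketched in plain Python. The obfuscate_chunks helper, the (text, entity) tuples, and the "<NAME>" placeholder are hypothetical stand-ins for the annotator's CHUNK annotations and fake names.

```python
def obfuscate_chunks(chunks, name_entities, replace):
    """Replace only chunks whose entity label is listed in name_entities."""
    allowed = {entity.upper() for entity in name_entities}
    return [
        (replace(text) if entity.upper() in allowed else text, entity)
        for text, entity in chunks
    ]


chunks = [("John Davies", "PATIENT"), ("Lorand", "DOCTOR"), ("62", "AGE")]

# Only PATIENT chunks are replaced; DOCTOR and AGE chunks pass through unchanged.
masked = obfuscate_chunks(chunks, ["PATIENT"], lambda text: "<NAME>")
print(masked)
```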
- setObfuscateRefSource(s)#
Sets the mode for selecting the obfuscation source [‘both’ | ‘faker’ | ‘file’]. Default: ‘both’
- Parameters:
s (str) – Mode for selecting the obfuscation source [‘both’ | ‘faker’ | ‘file’]
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setSameLength(l)#
Sets whether to select replacement names with the same length as the original ones during obfuscation. Example: ‘John’ -> ‘Mike’. Default: true
- Parameters:
l (bool) – Whether to select replacement names with the same length as the original ones during obfuscation. Default: true
- setSeed(s)#
Sets the seed used to select names during obfuscation. With a fixed seed, you can repeat an execution several times and get the same output.
- Parameters:
s (int) – The seed used to select names during obfuscation.
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class NameChunkObfuscatorApproach#
Bases:
sparknlp_jsl.common.AnnotatorApproachInternal
Contains all the methods for training a NameChunkObfuscator model. This module replaces name, patient, and doctor entities with consistent fake names.
Input Annotation types: CHUNK
Output Annotation type: CHUNK
- Parameters:
obfuscateRefFile – File with the faker names to be used for obfuscation
refFileFormat – Format of the reference file
refSep – Separator character in refFile
seed – The seed used to select names during obfuscation. With a fixed seed, you can repeat an execution several times and get the same output.
obfuscateRefSource – Mode for selecting the obfuscation source [‘both’ | ‘faker’ | ‘file’]
language – The language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), or ‘ro’ (Romanian). Default: ‘en’
sameLength – Whether to select replacement names with the same length as the original ones during obfuscation. Default: true
nameEntities – The entities to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’
genderAwareness – Whether to use gender-aware names during obfuscation. This param affects only names. Setting it to true might decrease performance. Default: False
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
...
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setUseAbbreviations(True)
...
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
...
>>> embeddings = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
...
Ner entities
>>> clinical_sensitive_entities = MedicalNerModel \
...     .pretrained("ner_deid_enriched", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
...
>>> nerConverter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")
...
NameChunkObfuscatorApproach
>>> nameChunkObfuscator = NameChunkObfuscatorApproach() \
...     .setInputCols(["ner_chunk"]) \
...     .setOutputCol("replacement") \
...     .setRefFileFormat("csv") \
...     .setRefSep("#") \
...     .setObfuscateRefFile("obfuscator_names.txt") \
...     .setObfuscateRefSource("both") \
...     .setLanguage("en")
...
>>> replacer_name = Replacer() \
...     .setInputCols("replacement", "sentence") \
...     .setOutputCol("obfuscated_name") \
...     .setUseReplacement(True)
...
Pipeline
>>> data = spark.createDataFrame([
...     ["John Davies is a 62 y.o. patient admitted.He was seen by attending physician Dr. Lorand and was scheduled for emergency assessment."]
... ]).toDF("text")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     clinical_sensitive_entities,
...     nerConverter,
...     nameChunkObfuscator,
...     replacer_name
... ])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("obfuscated_name.result").show(truncate = False)
+-------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------+
|[Charlestine is a 62 y.o. patient admitted.He was seen by attending physician Dr. Lowery and was scheduled for emergency assessment.]|
+-------------------------------------------------------------------------------------------------------------------------------------+
- genderAwareness#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- language#
- lazyAnnotator#
- nameEntities#
- obfuscateRefFile#
- obfuscateRefSource#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'chunk'#
- outputCol#
- refFileFormat#
- refSep#
- sameLength#
- seed#
- skipLPInputColsValidation = True#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M #
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]] #
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
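Because fitMultiple yields (index, model) pairs whose index values may not be sequential, callers typically place each model back at its index instead of appending. The sketch below imitates that contract with a thread pool; fit_multiple and the string "models" are hypothetical stand-ins, not the pyspark implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fit_multiple(fit_one, param_maps):
    """Yield (index, model) pairs as each fit finishes (illustrative only)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(fit_one, pm): i for i, pm in enumerate(param_maps)}
        for future in as_completed(futures):
            # Completion order is not guaranteed to match submission order.
            yield futures[future], future.result()


param_maps = [{"seed": 1}, {"seed": 2}, {"seed": 3}]
models = [None] * len(param_maps)

# Store each model at the index it was fitted for, whatever the arrival order.
for index, model in fit_multiple(lambda pm: f"model(seed={pm['seed']})", param_maps):
    models[index] = model

print(models)
```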
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setForceInputTypeValidation(etfm)#
- setGenderAwareness(l)#
Sets whether to use gender-aware names during obfuscation. This param affects only names. Setting it to true might decrease performance. Default: False
- Parameters:
l (bool) – Whether to use gender-aware names during obfuscation.
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLanguage(l)#
Sets the language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), or ‘ro’ (Romanian). Default: ‘en’
- Parameters:
l (str) – The language used to select faker names: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘ar’ (Arabic), or ‘ro’ (Romanian)
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setNameEntities(n)#
Sets the entities to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’
- Parameters:
n (List[str]) – The entities to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’
- setObfuscateRefFile(f)#
Sets the file with the terms to be used for obfuscation
- Parameters:
f (str) – File with the terms to be used for obfuscation
- setObfuscateRefSource(s)#
Sets the mode for selecting the obfuscation source [‘both’ | ‘faker’ | ‘file’]. Default: ‘both’
- Parameters:
s (str) – Mode for selecting the obfuscation source [‘both’ | ‘faker’ | ‘file’]
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setRefFileFormat(f)#
Sets format of the reference file
- Parameters:
f (str) – Format of the reference file
- setRefSep(c)#
Sets the separator character in refFile
- Parameters:
c (str) – Separator character in refFile
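How obfuscateRefFile, refFileFormat, and refSep fit together can be pictured with a small reader for a delimited reference file. The "name#ENTITY" row layout and the sample names are assumptions made for illustration, since the file layout is not specified here; load_ref_names is a hypothetical helper and not part of sparknlp_jsl.

```python
import csv
import os
import tempfile


def load_ref_names(path, sep="#"):
    """Group replacement names by entity label from a delimited file."""
    names = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter=sep):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            name, entity = row[0].strip(), row[1].strip().upper()
            names.setdefault(entity, []).append(name)
    return names


# Assumed layout: one "<fake name>#<entity>" pair per line.
sample = "Marvin Marlowe#PATIENT\nJeffrey Dibbles#PATIENT\nValerie Aho#DOCTOR\n"

with tempfile.TemporaryDirectory() as tmp:
    ref_path = os.path.join(tmp, "obfuscator_names.txt")
    with open(ref_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    ref_names = load_ref_names(ref_path, sep="#")

print(ref_names)
```

The separator passed to the reader plays the role of refSep; a file written with a different separator would need the matching value.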
- setSameLength(l)#
Sets whether to select replacement names with the same length as the original ones during obfuscation. Example: ‘John’ -> ‘Mike’. Default: true
- Parameters:
l (bool) – Whether to select replacement names with the same length as the original ones during obfuscation. Default: true
- setSeed(s)#
Sets the seed used to select names during obfuscation. With a fixed seed, you can repeat an execution several times and get the same output.
- Parameters:
s (int) – The seed used to select names during obfuscation.
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.