sparknlp_jsl.annotator.fhir.fhir_deIdentification
#
Contains classes for the FhirDeIdentification.
Module Contents#
Classes#
A Spark Transformer for de-identifying FHIR resources according to configurable privacy rules.
- class FhirDeIdentification(classname='com.johnsnowlabs.nlp.annotators.deid.fhir.FhirDeIdentification', java_model=None)#
Bases: sparknlp.internal.AnnotatorTransformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol, sparknlp.internal.ParamsGettersSetters
A Spark Transformer for de-identifying FHIR resources according to configurable privacy rules.
Performs field-level obfuscation on FHIR JSON documents using FHIR Path expressions. Supports R4, R5, and DSTU3 FHIR versions with type-aware de-identification strategies. Additionally, supports different parser types (JSON, XML) for FHIR resources.
- Parameters:
fhirVersion (str) – The FHIR version to use for de-identification. Options: [‘R4’|’R5’|’DSTU3’]
parserType (str) – The parser type to use for de-identification. Options: [‘JSON’|’XML’]
mode (str) – Mode for Anonymizer ['mask'|'obfuscate']
dateEntities (list[str]) – List of date entities. Default: [‘DATE’, ‘DOB’, ‘DOD’]
obfuscateDate (bool) – When mode is 'obfuscate', whether to obfuscate dates or not. This param helps consistency by making the role of dateFormats more visible. When set to True, make sure the dateFormats param fits your needs. If the value is True and obfuscation fails, the unnormalizedDateMode param is applied. When set to False, the date is masked to <DATE>. Default: False
unnormalizedDateMode (str) – The mode to use if the date is not formatted. Options: [mask, obfuscate, skip]. Default: obfuscate.
days (int) – Number of days by which dates are displaced during obfuscation. If not provided, a random integer between 1 and 60 is used.
dateFormats (list[str]) – List of date formats to automatically displace if parsed.
obfuscateRefSource (str) – The source of the replacement values used to obfuscate the entities. This property does not apply to date entities. The values are the following:
custom: Takes the entities from the setCustomFakers function.
faker: Takes the entities from the Faker module.
both: Takes the entities from the setCustomFakers function and the Faker module randomly.
language (str) – The language used to select the regex file and some faker entities. The values are the following: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'.
seed (int) – The seed used to select the entities in obfuscate mode. With the same seed, you can reproduce an execution several times with the same output.
maskingPolicy (str) –
- Select the masking policy:
same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, so that the total length of the masking sequence equals the length of the original sequence. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
fixedMaskLength (int) – The length of the masking sequence in case of fixed_length_chars masking policy.
sameLengthFormattedEntities (list[str]) – List of formatted entities to generate the same length outputs as original ones during obfuscation. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE.
genderAwareness (bool) – Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is True, it might decrease performance. Default: False
ageRanges (list[int]) – List of integers specifying the limits of the age groups to preserve during obfuscation.
selectiveObfuscationModes (dict[str, list[str]]) –
- The dictionary of modes to enable multi-mode deIdentification.
'obfuscate': Replace the values with random values.
'mask_same_length_chars': Replace the value with asterisks enclosed in square brackets, so that the total length (brackets included) matches the original.
'mask_entity_labels': Replace the values with the corresponding entity labels.
'mask_fixed_length_chars': Replace the value with a fixed number of asterisks. The length can be set with setFixedMaskLength().
'skip': Leave the values intact.
Entities that are not listed in the dictionary are de-identified according to the mode param.
customFakers (dict[str, list[str]]) – The dictionary of custom fakers to specify the obfuscation terms for the entities. You can specify the entity and the terms to be used for obfuscation.
keepYear (bool) – Whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
keepMonth (bool) – Whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.
Example
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.annotator import *
>>> rules = {
...     "Patient.birthDate": "DATE",
...     "Patient.name.given": "FIRST_NAME",
...     "Patient.name.family": "LAST_NAME",
...     "Patient.telecom.value": "EMAIL",
...     "Patient.gender": "GENDER"
... }
>>> fhir = (
...     FhirDeIdentification()
...     .setInputCol("text")
...     .setOutputCol("deid")
...     .setMode("obfuscate")
...     .setMappingRules(rules)
...     .setFhirVersion("R4")
...     .setParserType("JSON")
...     .setDays(20)
...     .setSeed(88)
...     .setCustomFakers(
...         {
...             "GENDER": ["female", "other"]
...         }
...     )
...     .setObfuscateRefSource("both")
... )
>>> john_doe = "..."
>>> df = spark.createDataFrame([[john_doe]]).toDF("text")
>>> result_df = fhir.transform(df).cache()
>>> result_df.selectExpr("text").show(truncate=False)
Original text:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"resourceType": "Patient","id": "example","name": [{"use": "official","family": "Doe","given": ["John","Michael"]}],"telecom": [{"system": "email","value": "john.doe@example.com"},{"system": "url","value": "http://johndoe.com"}],"birthDate": "1970-01-01","gender": "male"}|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>> result_df.selectExpr("deid").show(truncate=False)
De-identified text:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|deid |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Killings","given":["Ellison","Isidor"]}],"telecom":[{"system":"email","value":"Bryton@yahoo.com"},{"system":"url","value":"Aurora@google.com"}],"gender":"other","birthDate":"1970-01-21"}|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- ageRanges#
- dateEntities#
- dateFormats#
- days#
- fhirVersion#
- fixedMaskLength#
- genderAwareness#
- getter_attrs = []#
- inputCol :Param[str]#
- keepMonth#
- keepYear#
- language#
- maskingPolicy#
- mode#
- name = 'FhirDeIdentification'#
- obfuscateRefSource#
- outputAnnotatorType = None#
- outputCol :Param[str]#
- parserType#
- sameLengthFormattedEntities#
- seed#
- uid = ''#
- unnormalizedDateMode#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- deidentify(input)#
De-identifies the input FHIR resources.
- deidentify_list(input)#
De-identifies the input FHIR resources.
- Parameters:
input (list[str]) – List of FHIR resources to be de-identified.
- deidentify_str(input)#
De-identifies the input FHIR resources.
- Parameters:
input (str) – FHIR resource to be de-identified.
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
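The precedence described above can be pictured with plain dictionaries. This is a standalone sketch, not the PySpark implementation: `merge_param_maps` is a hypothetical helper and the sample values are illustrative only.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win on conflicts."""
    merged = dict(defaults)        # lowest priority: default values
    merged.update(user_supplied)   # user-supplied values override defaults
    merged.update(extra or {})     # extra values override everything else
    return merged


# Illustrative values only; they are not read from the annotator.
defaults = {"mode": "mask", "fhirVersion": "R4", "parserType": "JSON"}
user_supplied = {"mode": "obfuscate"}
extra = {"fhirVersion": "R5"}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

Each key takes its value from the highest-priority map that defines it, which is the ordering default param values < user-supplied values < extra.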
- getInputCol() str #
Gets the value of inputCol or its default value.
- getMappingRules()#
Gets FHIR field de-identification rules for primitive type obfuscation.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol() str #
Gets the value of outputCol or its default value.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='', lang='en', remote_loc='clinical/models')#
Downloads and loads a pretrained model.
- Parameters:
name (str, optional) – Name of the pretrained model, by default “”
lang (str, optional) – Language of the pretrained model, by default “en”
remote_loc (str, optional) – Optional remote address of the resource, by default clinical/models.
- Returns:
The restored model
- Return type:
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setAgeRanges(value: list)#
Sets the list of integers specifying the limits of the age groups to preserve during obfuscation.
- Parameters:
value (List[int]) – List of integers specifying the limits of the age groups to preserve during obfuscation.
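One way to read "limits of the age groups" is that an age is replaced by another age from the same group. The sketch below illustrates that reading in plain Python. It is an assumption, not the library's algorithm: `obfuscate_age`, the half-open group boundaries, the sample limits, and the handling of ages outside every group are all hypothetical.

```python
import bisect
import random


def obfuscate_age(age, age_ranges, rng):
    """Return a random age drawn from the same group as `age`."""
    # age_ranges holds ascending group limits; each group is assumed to be
    # the half-open interval [limit_i, limit_{i+1}).
    idx = bisect.bisect_right(age_ranges, age)
    if idx == 0 or idx == len(age_ranges):
        return age  # outside every group: left unchanged in this sketch
    low, high = age_ranges[idx - 1], age_ranges[idx]
    return rng.randrange(low, high)  # low <= result < high


limits = [1, 4, 12, 20, 40, 60, 80]  # illustrative limits
rng = random.Random(88)
new_age = obfuscate_age(35, limits, rng)
print(new_age)
```

With these limits, an input of 35 stays inside the 20 to 39 group, so the coarse age band is preserved while the exact age changes.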
- setCustomFakers(value: dict)#
Sets the dictionary of custom fakers to specify the obfuscation terms for the entities. You can specify the entity and the terms to be used for obfuscation.
Example:#
>>> FhirDeIdentification() \
...     .setObfuscateRefSource('custom') \
...     .setCustomFakers({'NAME': ['John', 'Doe', 'Jane'],
...                       'CITY': ['New York', 'Los Angeles'],
...                       'SCHOOL': ['Oxford', 'Harvard']})
- Parameters:
value (dict[str, list[str]]) – The dictionary of custom fakers to specify the obfuscation terms for the entities.
- setDateEntities(entities: list)#
Sets list of date entities. Default: [‘DATE’, ‘DOB’, ‘DOD’]
- Parameters:
entities (list[str]) – List of date entities.
- setDateFormats(formats: list)#
Sets list of date formats to automatically displace if parsed
- Parameters:
formats (list[str]) – List of date formats to automatically displace if parsed
- setDays(day: int)#
Sets the number of days by which dates are displaced during obfuscation. If not provided, a random integer between 1 and 60 is used.
- Parameters:
day (int) – Number of days by which dates are displaced.
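Date displacement can be sketched with the standard library: try each known format, and shift the date by the configured number of days when one parses. This is an illustration only, not the annotator's code. `displace_date` is a hypothetical helper, and the formats use Python `strptime` directives, which may differ from the pattern syntax the dateFormats param expects.

```python
from datetime import datetime, timedelta


def displace_date(text, days, date_formats):
    """Shift `text` by `days` using the first format that parses it.

    Returns None when no format matches; in the annotator that case is
    governed by unnormalizedDateMode, which this sketch does not model.
    """
    for fmt in date_formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
        return (parsed + timedelta(days=days)).strftime(fmt)
    return None


FORMATS = ["%m/%d/%Y", "%Y-%m-%d"]  # Python-style patterns (assumption)
shifted = displace_date("1970-01-01", 20, FORMATS)
print(shifted)
```

The shifted value is written back in the same format it was parsed with, which matches the class example above, where a birth date of 1970-01-01 becomes 1970-01-21 with 20 days of displacement.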
- setFhirVersion(version: str)#
Sets the FHIR version to use for de-identification. Options: [‘R4’|’R5’|’DSTU3’] Default: ‘R4’
- Parameters:
version (str) – The FHIR version to use for de-identification.
- setFixedMaskLength(length)#
Sets the length of the masking sequence in case of fixed_length_chars masking policy. Default: 7
- Parameters:
length (int) – The length of the masking sequence in case of fixed_length_chars masking policy.
- setGenderAwareness(value: bool)#
Sets whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is True, it might decrease performance. Default: False
- Parameters:
value (bool) – Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is True, it might decrease performance. Default: False
- setInputCol(value: str)#
Sets input column name.
- Parameters:
value (str) – Name of the input column
- setKeepMonth(value: bool)#
Sets whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.
- Parameters:
value (bool) – Whether to keep the month intact when obfuscating date entities.
- setKeepYear(value: bool)#
Sets whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
- Parameters:
value (bool) – Whether to keep the year intact when obfuscating date entities.
- setLanguage(lang: str)#
The language used to select the regex file and some faker entities. The values are the following: ‘en’(English), ‘de’(German), ‘es’(Spanish), ‘fr’(French), ‘ar’(Arabic) or ‘ro’(Romanian). Default:’en’
- Parameters:
lang (str) – The language used to select the regex file and some faker entities. Default:’en’.
- setMappingRules(rules: dict)#
Sets FHIR field de-identification rules for primitive type obfuscation.
Overview#
Defines how specific FHIR elements should be de-identified using FHIR Path syntax. Supports all FHIR primitive types with built-in obfuscation strategies.
Rule Format#
>>> {
...     "ResourceType.field.path": "SupportedEntityClass",
... }
- Parameters:
rules (Dict[str, str]) – A mapping between FHIR paths and target primitive types. Keys must use standard FHIR Path notation (dot-delimited). Values must be one of the supported de-identification entity classes.
- Raises:
IllegalArgumentException – If an unsupported primitive type is provided, a malformed FHIR path is detected, or a non-primitive field is targeted.
Example
>>> FhirDeIdentification() \
...     .setMappingRules({
...         "Patient.birthDate": "Date",
...         "Patient.name.given": "Name",
...         "Patient.telecom.value": "Email",
...         "Patient.address.city": "City",
...     })
Notes
Paths are case-sensitive and must match FHIR element names exactly.
Array elements should use standard FHIR Path syntax (e.g., Patient.name.given).
Only primitive types are supported for de-identification.
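To make the rule format concrete, the sketch below walks dot-delimited paths over a parsed FHIR JSON resource and replaces the primitive values they address, descending through arrays along the way. It is a simplified, standalone illustration and not the annotator's implementation: `apply_rules`, `_walk`, and the `<ENTITY>` placeholder replacement are hypothetical, and only string primitives are handled.

```python
import json


def _walk(node, fields, entity, replace):
    """Follow `fields` through dicts and lists, replacing string leaves."""
    if isinstance(node, list):
        for item in node:  # arrays are traversed element by element
            _walk(item, fields, entity, replace)
        return
    if not isinstance(node, dict) or fields[0] not in node:
        return
    key, rest = fields[0], fields[1:]
    child = node[key]
    if rest:
        _walk(child, rest, entity, replace)
    elif isinstance(child, list):
        node[key] = [replace(v, entity) if isinstance(v, str) else v for v in child]
    elif isinstance(child, str):
        node[key] = replace(child, entity)


def apply_rules(resource, rules, replace):
    """Apply each "ResourceType.field.path" rule to a matching resource."""
    for path, entity in rules.items():
        resource_type, *fields = path.split(".")
        if fields and resource.get("resourceType") == resource_type:
            _walk(resource, fields, entity, replace)
    return resource


patient = json.loads(
    '{"resourceType": "Patient", "name": [{"family": "Doe", '
    '"given": ["John", "Michael"]}], "birthDate": "1970-01-01", "gender": "male"}'
)
rules = {"Patient.birthDate": "DATE", "Patient.name.given": "FIRST_NAME"}
masked = apply_rules(patient, rules, lambda value, entity: f"<{entity}>")
print(json.dumps(masked))
```

Fields without a rule, such as `name.family` and `gender` here, are left untouched, and a rule whose resource type does not match the document is skipped.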
- setMaskingPolicy(mask: str)#
- Sets the masking policy:
same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, so that the total length of the masking sequence equals the length of the original sequence. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned.
entity_labels: Replace the values with the corresponding entity labels.
fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
- Parameters:
mask (str) – The masking policy
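The three policies can be illustrated with a small pure-Python function that follows the descriptions above. It is a sketch for intuition, not the annotator's code: `mask_value` is a hypothetical helper, and the `<LABEL>` form for entity_labels is inferred from the `<DATE>` example in the obfuscateDate description.

```python
def mask_value(text, entity, policy, fixed_mask_length=7):
    """Mask `text` according to one of the three documented policies."""
    if policy == "same_length_chars":
        if len(text) < 3:
            return "*" * len(text)  # too short for brackets
        return "[" + "*" * (len(text) - 2) + "]"  # brackets count toward length
    if policy == "entity_labels":
        return f"<{entity}>"
    if policy == "fixed_length_chars":
        return "*" * fixed_mask_length
    raise ValueError(f"unknown masking policy: {policy}")


for policy in ("same_length_chars", "entity_labels", "fixed_length_chars"):
    print(policy, "->", mask_value("Smith", "NAME", policy))
```

The default of 7 for `fixed_mask_length` mirrors the default documented for setFixedMaskLength.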
- setMode(mode: str)#
Sets mode for Anonymizer [‘mask’|’obfuscate’]
- Parameters:
mode (str) – Mode for Anonymizer [‘mask’|’obfuscate’]
- setObfuscateRefSource(source: str)#
Sets the source of the replacement values used to obfuscate the entities. This property does not apply to date entities. The values are the following:
custom: Takes the entities from the setCustomFakers function.
faker: Takes the entities from the Faker module.
both: Takes the entities from the setCustomFakers function and the Faker module randomly.
- Parameters:
source (str) – The source of obfuscation to obfuscate the entities. Default: faker.
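The three sources can be thought of as different candidate pools for the replacement value, with a seeded random generator making the choice repeatable. The sketch below shows that idea under stated assumptions: `pick_replacement` and `BUILTIN_FAKERS` are hypothetical stand-ins, and merging the two pools for 'both' is one plausible reading of "randomly", not confirmed library behavior.

```python
import random

# Stand-in for the built-in Faker data; the values are made up for the sketch.
BUILTIN_FAKERS = {"CITY": ["Springfield", "Riverton"]}


def pick_replacement(entity, source, custom_fakers, rng):
    """Choose a replacement term for `entity` from the configured source."""
    custom = custom_fakers.get(entity, [])
    builtin = BUILTIN_FAKERS.get(entity, [])
    if source == "custom":
        pool = custom
    elif source == "faker":
        pool = builtin
    elif source == "both":
        pool = custom + builtin  # assumption: draw from the merged pool
    else:
        raise ValueError(f"unknown source: {source}")
    return rng.choice(pool) if pool else None


custom_fakers = {"CITY": ["New York", "Los Angeles"]}
first = pick_replacement("CITY", "both", custom_fakers, random.Random(88))
second = pick_replacement("CITY", "both", custom_fakers, random.Random(88))
print(first, second)
```

Because both calls start from a generator seeded with the same value, they select the same term, which is the reproducibility that the seed param is documented to provide.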
- setOutputCol(value: str)#
Sets output column name.
- Parameters:
value (str) – Name of the Output Column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setParserType(parser: str)#
Sets the parser type to use for de-identification. Options: [‘JSON’|’XML’] Default: ‘JSON’
- Parameters:
parser (str) – The parser type to use for de-identification.
- setSameLengthFormattedEntities(value: list)#
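Sets the list of formatted entities to generate the same length outputs as original ones during obfuscation.
- Parameters:
value (List[str]) – List of formatted entities. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE.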
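A same-length replacement for a formatted entity such as a phone number can be sketched by swapping each digit or letter for a random one of the same kind and leaving separators in place. How the annotator actually generates these values is not documented here, so treat this as an illustration of the idea only; `same_length_fake` is a hypothetical helper.

```python
import random
import string


def same_length_fake(text, rng):
    """Replace digits and letters at random, keeping length and separators."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # punctuation and spaces are preserved
    return "".join(out)


original = "(555) 010-4477"
fake = same_length_fake(original, random.Random(88))
print(fake)
```

The output keeps the parentheses, space, and hyphen where they were, so downstream code that validates the field's shape continues to work.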
- setSeed(s)#
Sets the seed used to select the entities in obfuscate mode. With the same seed, you can reproduce an execution several times with the same output.
- Parameters:
s (int) – The seed to select the entities on obfuscate mode.
- setSelectiveObfuscationModes(value: dict)#
- Sets the dictionary of modes to enable multi-mode deIdentification.
'obfuscate': Replace the values with random values.
'mask_same_length_chars': Replace the value with asterisks enclosed in square brackets, so that the total length (brackets included) matches the original.
'mask_entity_labels': Replace the values with the corresponding entity labels.
'mask_fixed_length_chars': Replace the value with a fixed number of asterisks. The length can be set with setFixedMaskLength().
'skip': Leave the values intact.
Entities that are not listed in the dictionary are de-identified according to the mode param.
Example:#
>>> FhirDeIdentification() \
...     .setMode('mask') \
...     .setSelectiveObfuscationModes({'obfuscate': ['PHONE', 'email'],
...                                    'mask_entity_labels': ['NAME', 'CITY'],
...                                    'skip': ['id']})
- Parameters:
value (dict[str, list[str]]) – The dictionary of modes to enable multi-mode deIdentification.
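The fallback to the global mode can be pictured as a lookup over the inverted dictionary. The sketch below is a plain-Python illustration, not the annotator's logic: `resolve_modes` is a hypothetical helper, and entity names are matched exactly as written because the case-handling rules are not stated here.

```python
def resolve_modes(selective_modes, default_mode):
    """Return a function mapping an entity name to its effective mode."""
    entity_to_mode = {}
    for mode, entities in selective_modes.items():
        for entity in entities:
            entity_to_mode[entity] = mode  # invert mode -> entities
    return lambda entity: entity_to_mode.get(entity, default_mode)


mode_for = resolve_modes(
    {
        "obfuscate": ["PHONE", "email"],
        "mask_entity_labels": ["NAME", "CITY"],
        "skip": ["id"],
    },
    default_mode="mask",
)
for entity in ("PHONE", "NAME", "id", "DATE"):
    print(entity, "->", mode_for(entity))
```

Here DATE is not listed under any mode, so it falls back to the value passed as `default_mode`, which plays the role of the mode param.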
- setUnnormalizedDateMode(mode: str)#
Sets the mode to use if the date is not formatted. Options: [mask, obfuscate, skip]. Default: obfuscate.
- Parameters:
mode (str) – The mode to use if the date is not formatted.
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.