sparknlp_jsl.annotator.fhir.fhir_deIdentification#

Contains classes for the FhirDeIdentification.

Module Contents#

Classes#

FhirDeIdentification

A Spark Transformer for de-identifying FHIR resources according to configurable privacy rules.

class FhirDeIdentification(classname='com.johnsnowlabs.nlp.annotators.deid.fhir.FhirDeIdentification', java_model=None)#

Bases: sparknlp_jsl.annotator.fhir.base_fhir_deidentification.BaseFhirDeIdentification

A Spark Transformer for de-identifying FHIR resources according to configurable privacy rules.

Performs field-level obfuscation on FHIR documents using FHIR Path expressions. Supports the R4, R5, and DSTU3 FHIR versions with type-aware de-identification strategies, and both JSON and XML parser types for FHIR resources.

Parameters:
  • fhirVersion (str) – The FHIR version to use for de-identification. Options: ['R4'|'R5'|'DSTU3']

  • parserType (str) – The parser type to use for de-identification. Options: ['JSON'|'XML']

Example

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.annotator import *
>>> rules = {
...      "Patient.birthDate" : "DATE",
...      "Patient.name.given" : "FIRST_NAME",
...      "Patient.name.family" : "LAST_NAME",
...      "Patient.telecom.value" : "EMAIL",
...      "Patient.gender" : "GENDER"
...    }
>>> fhir = (
...        FhirDeIdentification()
...          .setInputCol("text")
...          .setOutputCol("deid")
...          .setMode("obfuscate")
...          .setMappingRules(rules)
...          .setFhirVersion("R4")
...          .setParserType("JSON")
...          .setDays(20)
...          .setSeed(88)
...          .setCustomFakers(
...              {
...                  "GENDER": ["female", "other"]
...              }
...          )
...          .setObfuscateRefSource("both")
...    )
>>> john_doe = "..."
>>> df = spark.createDataFrame([[john_doe]]).toDF("text")
>>> result_df = fhir.transform(df).cache()
>>> result_df.selectExpr("text").show(truncate=False)
Original text:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"resourceType": "Patient","id": "example","name": [{"use": "official","family": "Doe","given": ["John","Michael"]}],"telecom": [{"system": "email","value": "john.doe@example.com"},{"system": "url","value": "http://johndoe.com"}],"birthDate": "1970-01-01","gender": "male"}|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>> result_df.selectExpr("deid").show(truncate=False)
De-identified text:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|deid                                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Killings","given":["Ellison","Isidor"]}],"telecom":[{"system":"email","value":"Bryton@yahoo.com"},{"system":"url","value":"Aurora@google.com"}],"gender":"other","birthDate":"1970-01-21"}|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
additionalDateFormats#
ageRanges#
ageRangesByHipaa#
blackListEntities#
consistentAcrossNameParts#
countryObfuscation#
dateEntities#
dateFormats#
days#
enableDefaultObfuscationEquivalents#
fakerLengthOffset#
fhirVersion#
fixedMaskLength#
genderAwareness#
geoConsistency#
getter_attrs = []#
inputCol :Param[str]#
isRandomDateDisplacement#
keepMonth#
keepTextSizeForObfuscation#
keepYear#
language#
mappingsColumn#
maskingPolicy#
maxRandomDisplacementDays#
mode#
name = 'FhirDeIdentification'#
obfuscateDate#
obfuscateRefSource#
obfuscateZipByHipaa#
obfuscateZipKeepDigits#
outputAnnotatorType = None#
outputCol :Param[str]#
parserType#
region#
returnEntityMappings#
sameLengthFormattedEntities#
seed#
uid = ''#
unnormalizedDateMode#
useShiftDays#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

deidentify(input)#

De-identifies the input FHIR resources.

deidentify_list(input)#

De-identifies the input FHIR resources.

Parameters:

input (list[str]) – List of FHIR resources to be de-identified.

deidentify_str(input)#

De-identifies the input FHIR resources.

Parameters:

input (str) – FHIR resource to be de-identified.

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict
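
The precedence ordering can be pictured with plain dictionaries. This is a minimal sketch rather than the PySpark implementation; the parameter values below are made up for illustration and are not the component's real defaults.

```python
# Hypothetical values; plain dicts stand in for the PySpark param maps.
defaults = {"mode": "mask", "fhirVersion": "R4", "parserType": "JSON"}
user_supplied = {"mode": "obfuscate"}
extra = {"parserType": "XML"}

# Later dicts win on conflicts: defaults < user-supplied < extra.
merged = {**defaults, **user_supplied, **extra}
print(merged)
```

Keys that appear only in the defaults survive unchanged, while conflicting keys take the value from the rightmost map.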

getDefaultObfuscationEquivalents()#

Returns the default obfuscation equivalents for common entities.

getInputCol() str#

Gets the value of inputCol or its default value.

getMappingRules()#

Gets FHIR field de-identification rules for primitive type obfuscation.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol() str#

Gets the value of outputCol or its default value.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

getSelectiveObfuscateRefSource()#

Returns the dictionary of entity names to their obfuscate ref sources.

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static pretrained(name='', lang='en', remote_loc='clinical/models')#

Downloads and loads a pretrained model.

Parameters:
  • name (str, optional) – Name of the pretrained model, by default “”

  • lang (str, optional) – Language of the pretrained model, by default “en”

  • remote_loc (str, optional) – Optional remote address of the resource, by default clinical/models.

Returns:

The restored model

Return type:

FhirDeIdentification

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setAdditionalDateFormats(formats: list)#

Sets additional date formats to be considered during date obfuscation. This allows users to specify custom date formats in addition to the default date formats.

Parameters:

formats (list[str]) – List of additional date formats to be considered during date obfuscation.

setAgeRanges(value: list)#

Sets the list of integers specifying the limits of the age groups to preserve during obfuscation.

Parameters:

value (List[int]) – List of integers specifying the limits of the age groups to preserve during obfuscation.
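
One way to read "limits of the age groups" is that an obfuscated age stays inside the group that contains the original age. The sketch below illustrates that reading only; the limits, the half-open group boundaries, and the helper names are assumptions, not the component's documented behavior.

```python
import bisect
import random

def age_group(age, limits):
    """Return the (low, high) bounds of the group containing age.

    Assumes limits are sorted ascending and each group is [low, high).
    A missing bound is reported as None.
    """
    i = bisect.bisect_right(limits, age)
    low = limits[i - 1] if i > 0 else None
    high = limits[i] if i < len(limits) else None
    return low, high

def obfuscate_age(age, limits, rng):
    """Pick a replacement age from the same group as the original."""
    low, high = age_group(age, limits)
    if low is None or high is None:
        return age  # outside the configured groups: left unchanged in this sketch
    return rng.randrange(low, high)

limits = [1, 4, 12, 20, 40, 60, 80]  # hypothetical limits
print(age_group(35, limits))         # (20, 40)
print(obfuscate_age(35, limits, random.Random(0)))
```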

setAgeRangesByHipaa(value: bool)#

Sets whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.

The HIPAA Privacy Rule mandates that ages of patients older than 90 years be obfuscated, while ages of patients 90 years or younger can remain unchanged.

Parameters:

value (bool) – If True, age entities greater than 90 are obfuscated as per the HIPAA Privacy Rule and the others remain unchanged. If False, the ageRanges parameter applies. Default: False.

abstract setBlackListEntities(value)#

Not supported for FHIR De-Identification.

setConsistentAcrossNameParts(value: bool)#

Sets whether to enforce consistent obfuscation across name parts, even when they appear separately. When set to True, the same transformation or obfuscation will be applied consistently to all parts of the same name entity, even if those parts appear separately.

For example, if “John Smith” is obfuscated as “Liam Brown”, then:
  • When the full name “John Smith” appears, it will be replaced with “Liam Brown”

  • When “John” or “Smith” appear individually, they will still be obfuscated as “Liam” and “Brown” respectively, ensuring consistency in name transformation.

Default: True

Parameters:

value (bool) – Whether to enforce consistent obfuscation across name parts.
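
The "John Smith" example above can be sketched with a per-part lookup table. This is an illustration of the idea only; the helper names are hypothetical and the component's internal matching is not described here.

```python
def build_part_mapping(original, fake):
    """Map each part of the original name to the part at the same position in the fake name."""
    original_parts = original.split()
    fake_parts = fake.split()
    if len(original_parts) != len(fake_parts):
        raise ValueError("sketch only handles names with the same number of parts")
    return dict(zip(original_parts, fake_parts))

def obfuscate_name(text, mapping):
    """Replace known name parts and leave unknown tokens untouched."""
    return " ".join(mapping.get(part, part) for part in text.split())

mapping = build_part_mapping("John Smith", "Liam Brown")
print(obfuscate_name("John Smith", mapping))  # Liam Brown
print(obfuscate_name("John", mapping))        # Liam
print(obfuscate_name("Smith", mapping))       # Brown
```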

abstract setCountryObfuscation(value: bool)#

Not supported for FHIR De-Identification.

setCustomFakers(value: dict)#

Sets the dictionary of custom fakers to specify the obfuscation terms for the entities. You can specify the entity and the terms to be used for obfuscation.

Example:#

>>> FhirDeIdentification() \
...     .setObfuscateRefSource('custom') \
...     .setCustomFakers({'NAME': ['John', 'Doe', 'Jane'],
...                       'CITY': ['New York', 'Los Angeles'],
...                       'SCHOOL': ['Oxford', 'Harvard']})

Parameters:

value (dict[str, list[str]]) – The dictionary of custom fakers to specify the obfuscation terms for the entities.

setDateEntities(entities: list)#

Sets list of date entities. Default: [‘DATE’, ‘DOB’, ‘DOD’, ‘EFFDATE’, ‘FISCAL_YEAR’]

Parameters:

entities (list[str]) – List of date entities.

setDateFormats(formats: list)#

Sets the list of date formats that are automatically displaced when a date is parsed successfully.

Parameters:

formats (list[str]) – List of date formats that are automatically displaced when a date is parsed successfully.

setDays(day: int)#

Sets the number of days by which dates are displaced during obfuscation. If not provided, a random integer between 1 and 60 is used.

Parameters:

day (int) – Number of days by which dates are displaced during obfuscation.
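
Fixed date displacement can be sketched with the standard library. The helper name is hypothetical and the sketch only handles ISO dates (YYYY-MM-DD); it mirrors the class example, where setDays(20) turns the birthDate 1970-01-01 into 1970-01-21.

```python
from datetime import date, timedelta

def displace_date(iso_date, days):
    """Shift an ISO-8601 date (YYYY-MM-DD) forward by a fixed number of days."""
    shifted = date.fromisoformat(iso_date) + timedelta(days=days)
    return shifted.isoformat()

print(displace_date("1970-01-01", 20))  # 1970-01-21
```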

setEnableDefaultObfuscationEquivalents(value: bool)#

Sets whether to enable default obfuscation equivalents for common entities. This parameter allows the system to automatically include a set of predefined common English name equivalents. Default is False.

Parameters:

value (bool) – Whether to enable default obfuscation equivalents for common entities. Default is False.

setFakerLengthOffset(value)#

Sets how much length deviation is accepted during obfuscation when keepTextSizeForObfuscation is enabled. Must be greater than 0. Default: 3

Parameters:

value (int) – Integer value to specify length deviation.

setFhirVersion(version: str)#

Sets the FHIR version to use for de-identification. Options: [‘R4’|’R5’|’DSTU3’] Default: ‘R4’

Parameters:

version (str) – The FHIR version to use for de-identification.

setFixedMaskLength(length)#

Sets the length of the masking sequence used by the fixed_length_chars masking policy. Default: 7

Parameters:

length (int) – The length of the masking sequence used by the fixed_length_chars masking policy.

setGenderAwareness(value: bool)#

Sets whether to use gender-aware names during obfuscation. This param affects only names. Setting it to True might decrease performance. Default: False

Parameters:

value (bool) – Whether to use gender-aware names during obfuscation. This param affects only names. Setting it to True might decrease performance. Default: False

abstract setGeoConsistency(value: bool)#

Not supported for FHIR De-Identification.

setInputCol(value: str)#

Sets input column name.

Parameters:

value (str) – Name of the input column

setIsRandomDateDisplacement(s)#

Sets whether to use random displacement for dates.

Parameters:

s (bool) – Whether to use random displacement for dates.

setKeepMonth(value: bool)#

Sets whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.

Parameters:

value (bool) – Whether to keep the month intact when obfuscating date entities.

setKeepTextSizeForObfuscation(value: bool)#

Sets whether the obfuscated output should keep the same character length as the input text. If True, the replacement has the same length as the original when a candidate of that length is available; otherwise the length may vary. If False, replacements are chosen without regard to length. Default: False

Parameters:

value (bool) – Whether to keep the text length the same when obfuscating entities.

setKeepYear(value: bool)#

Sets whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.

Parameters:

value (bool) – Whether to keep the year intact when obfuscating date entities.

setLanguage(lang: str)#

The language used to select the regex file and some faker entities. The values are the following: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'

Parameters:

lang (str) – The language used to select the regex file and some faker entities. Default: 'en'.

setMappingRules(rules: dict)#

Sets FHIR field de-identification rules for primitive type obfuscation.

Overview#

Defines how specific FHIR elements should be de-identified using FHIR Path syntax. Supports all FHIR primitive types with built-in obfuscation strategies.

Rule Format#

>>>    {
...        "ResourceType.field.path": "SupportedEntityClass",
...    }
Parameters:

rules (Dict[str, str]) – A mapping between FHIR paths and target entity classes. Keys must use standard FHIR Path notation (dot-delimited). Values must be one of the supported de-identification entity classes.

Raises:

IllegalArgumentException – If an unsupported primitive type is provided, a malformed FHIR path is detected, or a non-primitive field is targeted.

Example

>>> FhirDeIdentification() \
...        .setMappingRules({
...            "Patient.birthDate": "Date",
...            "Patient.name.given": "Name",
...            "Patient.telecom.value": "Email",
...            "Patient.address.city": "City",
...        })

Notes

  • Paths are case-sensitive and must match FHIR element names exactly

  • Array elements should use standard FHIR Path syntax (e.g., Patient.name.given)

  • Only primitive types are supported for de-identification
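
To see which values a dot-delimited rule key points at, the sketch below walks a parsed JSON resource and flattens arrays at each step. It is a deliberately simplified stand-in for FHIRPath evaluation, not what the component runs; resolve_path is a hypothetical helper and the Patient resource is an abridged copy of the one in the class example.

```python
import json

def resolve_path(resource, path):
    """Collect the values reached by a dot-delimited path.

    Only plain dotted paths are handled; lists are flattened at every step.
    """
    resource_type, *fields = path.split(".")
    if resource.get("resourceType") != resource_type:
        return []
    nodes = [resource]
    for field in fields:
        next_nodes = []
        for node in nodes:
            if not isinstance(node, dict) or field not in node:
                continue
            child = node[field]
            # Arrays contribute each of their items to the next step.
            next_nodes.extend(child if isinstance(child, list) else [child])
        nodes = next_nodes
    return nodes

patient = json.loads(
    '{"resourceType": "Patient", "id": "example", '
    '"name": [{"use": "official", "family": "Doe", "given": ["John", "Michael"]}], '
    '"telecom": [{"system": "email", "value": "john.doe@example.com"}], '
    '"birthDate": "1970-01-01", "gender": "male"}'
)

for path in ("Patient.birthDate", "Patient.name.given", "Patient.telecom.value"):
    print(path, "->", resolve_path(patient, path))
```

Each rule key therefore selects zero or more primitive values, which is why a single key such as Patient.name.given covers every given name in every name entry.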

abstract setMappingsColumn(value: str)#

Not supported for FHIR De-Identification.

setMaskingPolicy(mask: str)#
Sets the masking policy:
  • same_length_chars: Replace the obfuscated entity with a masking sequence of asterisks surrounded by square brackets, so that the total length of the masking sequence equals the length of the original sequence. For example, Smith -> [***]. If the entity is shorter than 3 characters (like Jo, or 5), asterisks without brackets are returned.

  • entity_labels: Replace the values with the corresponding entity labels.

  • fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.

  • entity_labels_without_brackets: Replace the values with the corresponding entity labels, without brackets.

  • same_length_chars_without_brackets: Replace the values with asterisks of the same length, without brackets.

Parameters:

mask (str) – The masking policy
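
The asterisk-based policies can be sketched in a few lines. This is an illustration under stated assumptions, not the component's code: mask_value is a hypothetical helper, values shorter than 3 characters are assumed to yield one asterisk per character, and the entity-label policies are left out because their exact output format is not given here.

```python
def mask_value(value, policy, fixed_length=7):
    """Mask a value following the asterisk-based policies."""
    if policy == "same_length_chars":
        # Brackets count toward the total length; very short values get no brackets.
        if len(value) < 3:
            return "*" * len(value)
        return "[" + "*" * (len(value) - 2) + "]"
    if policy == "same_length_chars_without_brackets":
        return "*" * len(value)
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"policy not covered by this sketch: {policy}")

print(mask_value("Smith", "same_length_chars"))                   # [***]
print(mask_value("Jo", "same_length_chars"))                      # **
print(mask_value("Smith", "same_length_chars_without_brackets"))  # *****
print(mask_value("Smith", "fixed_length_chars"))                  # *******
```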

setMaxRandomDisplacementDays(days: int)#

Sets maximum number of days for random date displacement. Default is 1825.

Parameters:

days (int) – Maximum number of days for random date displacement.

setMode(mode: str)#

Sets the de-identification mode. Options: ['mask'|'obfuscate']

Parameters:

mode (str) – The de-identification mode. Options: ['mask'|'obfuscate']

abstract setObfuscateDate(value: bool)#

Not supported for FHIR De-Identification.

setObfuscateRefSource(source: str)#

Sets the source used to obfuscate the entities. This property does not apply to date entities. The values are the following:

  • custom: Takes the entities from the setCustomFakers function.

  • faker: Takes the entities from the Faker module.

  • both: Takes the entities from the setCustomFakers function and the Faker module randomly.

Parameters:

source (str) – The source used to obfuscate the entities. Default: faker.

setObfuscateZipByHipaa(value: bool)#

Sets whether to apply HIPAA Safe Harbor ZIP code obfuscation rules.

Behavior#

  • True:
    Apply HIPAA Safe Harbor rules for ZIP/ZIP+4 codes:
    1. Extract the first five digits from the input (accepting formats like “12345”, “12345-6789”, “123456789”, and other tolerant forms).

    2. If the first three-digit ZIP prefix is in the HIPAA restricted list (the 17 prefixes derived from 2000 Census data), the ZIP is suppressed to the canonical value “000**”.

    3. Otherwise, the ZIP is generalized to the first three digits followed by “**” (i.e. XXX**). The +4 portion will be masked with asterisks if present.

  • False:

    HIPAA-specific ZIP masking is not applied. Instead, the component’s default or user-defined ZIP obfuscation rules will be used.

Parameters:

value (bool) – If True, apply HIPAA Safe Harbor ZIP obfuscation rules. If False, skip HIPAA-specific rules and use the default/custom ZIP obfuscation.
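
The three steps of the True branch can be sketched as follows. This is a minimal illustration, not the component's code: hipaa_zip is a hypothetical helper, the prefix list is taken from the HHS de-identification guidance and should be checked against the current guidance before use, and the ZIP+4 portion is simply dropped because the masked output format for it is not specified here.

```python
import re

# The 17 restricted three-digit prefixes (2000 Census data); verify before relying on them.
RESTRICTED_PREFIXES = {
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
}

def hipaa_zip(raw):
    """Generalize a ZIP or ZIP+4 string to its five-character Safe Harbor form."""
    digits = re.sub(r"\D", "", raw)  # tolerate "12345-6789", "123456789", etc.
    if len(digits) < 5:
        raise ValueError(f"not enough digits for a ZIP code: {raw!r}")
    prefix = digits[:3]
    if prefix in RESTRICTED_PREFIXES:
        return "000**"  # restricted prefix: suppress entirely
    return prefix + "**"

print(hipaa_zip("12345"))       # 123**
print(hipaa_zip("12345-6789"))  # 123**
print(hipaa_zip("03601"))       # 000**
```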

setObfuscateZipKeepDigits(value: int)#

Sets the number of leading ZIP code digits to preserve when applying HIPAA-based ZIP code obfuscation.

This parameter is only effective when obfuscateZipByHipaa is enabled.

Behavior#

  • Preserves the first value digits of the ZIP code.

  • Masks all remaining digits— including any ZIP+4 portion—with asterisks (*).

  • Default: 3.

  • Allowed range: 0 to 5.

Examples
  • 12345 -> 123** (default, value = 3)

  • If value = 2, 12345 -> 12***

This setting overrides the default HIPAA Safe Harbor ZIP generalization pattern (XXX**) and allows clients to customize how many leading digits remain unmasked, enabling expert-determination–based deidentification flows.

Parameters:

value (int) – Number of ZIP digits to preserve before masking. Must be between 0 and 5 (inclusive).
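
The keep-digits rule reduces to a slice plus padding with asterisks. The helper below is a hypothetical sketch that only covers five-digit input; how the component formats a masked ZIP+4 portion is not specified here.

```python
def mask_zip_keep_digits(zip5, keep=3):
    """Keep the first `keep` digits of a five-digit ZIP and mask the rest."""
    if not 0 <= keep <= 5:
        raise ValueError("keep must be between 0 and 5 (inclusive)")
    if len(zip5) != 5 or not zip5.isdigit():
        raise ValueError("sketch expects exactly five digits")
    return zip5[:keep] + "*" * (5 - keep)

print(mask_zip_keep_digits("12345"))          # 123**
print(mask_zip_keep_digits("12345", keep=2))  # 12***
```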

setObfuscationEquivalents(equivalents)#

Sets variant-to-canonical entity mappings to ensure consistent obfuscation.

This function allows you to define equivalence rules for entity variants that should be obfuscated the same way. For example, the names “Alex” and “Alexander” will always be mapped to the same obfuscated value if they are linked to the same canonical form.

It accepts a list of string triplets, where each triplet defines:
  • variant: A non-standard, short, or alternative form of a value (e.g., “Alex”)

  • entityType: The type of the entity (e.g., “NAME”, “STATE”, “COUNTRY”)

  • canonical: The standardized form all variants map to (e.g., “Alexander”)

This is especially useful in de-identification tasks to ensure consistent replacement of semantically identical values. It also allows cross-variant normalization across different occurrences of sensitive data.

Notes

  • Both variant and entityType comparisons are case-insensitive. For example, “alex”, “Alex”, and “ALEX” are treated as the same variant.

Example:#

>>> equivalents = [
...     ["Alex", "NAME", "Alexander"],
...     ["Rob", "NAME", "Robert"],
...     ["CA", "STATE", "California"],
...     ["Calif.", "STATE", "California"]
... ]
>>> my_deid_transformer.setObfuscationEquivalents(equivalents)

Parameters:

equivalents (list) – List of [variant, entityType, canonical] triplets.

Raises:

ValueError – If any entry does not have exactly 3 elements.

Returns:

self
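
The triplet format and the case-insensitive matching can be pictured with a small lookup table. This is a sketch of the idea only; build_equivalents and canonical_form are hypothetical helpers, not part of the library.

```python
def build_equivalents(triplets):
    """Index [variant, entityType, canonical] triplets for case-insensitive lookup."""
    index = {}
    for entry in triplets:
        if len(entry) != 3:
            raise ValueError(f"expected [variant, entityType, canonical], got {entry!r}")
        variant, entity_type, canonical = entry
        index[(variant.lower(), entity_type.lower())] = canonical
    return index

def canonical_form(value, entity_type, index):
    """Return the canonical form; unregistered values are their own canonical form."""
    return index.get((value.lower(), entity_type.lower()), value)

index = build_equivalents([
    ["Alex", "NAME", "Alexander"],
    ["CA", "STATE", "California"],
    ["Calif.", "STATE", "California"],
])
print(canonical_form("ALEX", "name", index))     # Alexander
print(canonical_form("calif.", "STATE", index))  # California
print(canonical_form("Bob", "NAME", index))      # Bob
```

Because every variant resolves to one canonical string before a replacement is chosen, "Alex" and "Alexander" can share the same obfuscated value.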

setOutputCol(value: str)#

Sets output column name.

Parameters:

value (str) – Name of the Output Column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setParserType(parser: str)#

Sets the parser type to use for de-identification. Options: [‘JSON’|’XML’] Default: ‘JSON’

Parameters:

parser (str) – The parser type to use for de-identification.

abstract setRegion(value: str)#

Not supported for FHIR De-Identification.

abstract setReturnEntityMappings(value: bool)#

Not supported for FHIR De-Identification.

setSameLengthFormattedEntities(value: list)#

Sets the list of formatted entities for which obfuscation generates outputs of the same length as the originals.

Parameters:

value (List[str]) – List of formatted entities for which obfuscation generates outputs of the same length as the originals.

setSeed(s)#

Sets the seed used to select the entities in obfuscate mode. With a fixed seed, you can repeat an execution several times and obtain the same output.

Parameters:

s (int) – The seed to select the entities on obfuscate mode.
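
The effect of a fixed seed can be illustrated with Python's own random module. This sketch says nothing about the generator the component uses internally; pick_fakes is a hypothetical helper that shows only that seeding makes the selection repeatable.

```python
import random

def pick_fakes(values, candidates, seed):
    """Pick one candidate per input value using a seeded generator."""
    rng = random.Random(seed)
    return [rng.choice(candidates) for _ in values]

genders = ["female", "other"]  # same custom faker list as the class example
first = pick_fakes(["male", "male", "female"], genders, seed=88)
second = pick_fakes(["male", "male", "female"], genders, seed=88)
print(first == second)  # True
```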

setSelectiveObfuscateRefSource(source: dict)#

A dictionary of entity names to their obfuscation modes. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation source.

Example:#

>>> selective_sources = {
... 'PHONE': 'file',
... 'ADDRESS': 'both'
... }
>>> deid.setObfuscateRefSource('faker').setSelectiveObfuscateRefSource(selective_sources)
Parameters:

source (dict[str, str]) – A dictionary of entity names to their obfuscation modes. The keys are entity names and the values are the obfuscation sources.

setSelectiveObfuscationModes(value: dict)#
Sets the dictionary of modes to enable multi-mode de-identification.
  • ‘obfuscate’: Replace the values with random values.

  • ‘mask_same_length_chars’: Replace the values with asterisks of the same length minus two, plus brackets on both ends.

  • ‘mask_entity_labels’: Replace the values with the corresponding entity labels.

  • ‘mask_fixed_length_chars’: Replace the values with a fixed number of asterisks. The length can be set with “setFixedMaskLength()”.

  • ‘mask_entity_labels_without_brackets’: Replace the values with the corresponding entity labels, without brackets.

  • ‘mask_same_length_chars_without_brackets’: Replace the values with asterisks of the same length, without brackets.

  • ‘skip’: Skip the values (leave them intact).

Entities that are not listed in the dictionary are de-identified according to the mode parameter.

Example:#

>>> DeidAnnotator() \
...     .setMode('mask') \
...     .setSelectiveObfuscationModes({'obfuscate': ['PHONE', 'email'],
...                                   'mask_entity_labels': ['NAME', 'CITY'],
...                                   'skip': ['id']})

Parameters:

value (dict[str, list[str]]) – The dictionary of modes to enable multi-mode de-identification.
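
The fallback to the mode parameter can be sketched as a lookup over the dictionary. mode_for_entity is a hypothetical helper, and exact, case-sensitive matching of entity names is an assumption of this sketch.

```python
def mode_for_entity(entity, selective_modes, default_mode):
    """Return the mode listed for an entity, or the default when none is listed."""
    for mode, entities in selective_modes.items():
        if entity in entities:
            return mode
    return default_mode

selective_modes = {
    "obfuscate": ["PHONE", "email"],
    "mask_entity_labels": ["NAME", "CITY"],
    "skip": ["id"],
}

print(mode_for_entity("NAME", selective_modes, "mask"))  # mask_entity_labels
print(mode_for_entity("DATE", selective_modes, "mask"))  # mask
```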

setStaticObfuscationPairs(pairs: list)#

Sets the static obfuscation pairs used for de-identification. Each pair must contain exactly three elements: [original, entityType, fake].

Example:#

>>> pairs = [
...     ["John Doe", "PERSON", "Jane Smith"],
...     ["Los Angeles", "LOCATION", "New York City"],
...   ]
Parameters:

pairs (list) – List of static obfuscation pairs. Each pair must contain exactly three elements: [original, entityType, fake].

setUnnormalizedDateMode(mode: str)#

Sets the mode to use when a date does not match any known format. Options: [mask, obfuscate, skip]. Default: obfuscate.

Parameters:

mode (str) – The mode to use when a date does not match any known format.

abstract setUseShiftDays(value: bool)#

Not supported for FHIR De-Identification.

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.