DeIdentification

Companion object DeIdentification

class DeIdentification extends AnnotatorApproach[DeIdentificationModel] with DeIdentificationParams with DeidApproachParams with HandleExceptionParams with CheckLicense

Contains all the methods for training a DeIdentificationModel. This annotator can obfuscate or mask entities that contain personal information. Custom entities can be defined with a file of regex patterns passed to setRegexPatternsDictionary, where each line maps an entity to a regex.

DATE \d{4}
AID \d{6,7}
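
As a rough illustration of the file format above, the following sketch parses such lines and applies the patterns to a text. It is not the library's loader: the object and method names are hypothetical, and it assumes that the first run of whitespace separates the entity from its regex.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of Spark NLP.
object RegexDictionarySketch {
  // Parse lines of the form "<ENTITY> <regex>", splitting on the first run of whitespace (assumption).
  def parse(lines: Seq[String]): Seq[(String, Regex)] =
    lines.map(_.trim).filter(_.nonEmpty).flatMap { line =>
      line.split("\\s+", 2) match {
        case Array(entity, pattern) => Some(entity -> pattern.r)
        case _                      => None // skip malformed lines
      }
    }

  // Return (entity, matched text) for every match of every pattern, in rule order.
  def findEntities(text: String, rules: Seq[(String, Regex)]): Seq[(String, String)] =
    rules.flatMap { case (entity, regex) =>
      regex.findAllIn(text).toList.map(m => entity -> m)
    }
}
```

For example, `RegexDictionarySketch.findEntities("year 2079", rules)` with the two rules above yields a single DATE match.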

Additionally, obfuscation strings can be defined with DeidApproachParams.setObfuscateRefFile, where each line is a mapping of string to entity. The format and separator can be specified with DeidApproachParams.setRefFileFormat and DeidApproachParams.setRefSep.

Dr. Gregory House#DOCTOR
01010101#MEDICALRECORD
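
The reference-file layout above can be sketched as a simple parser that groups replacement strings by entity. This is only an illustration under the assumption that the last occurrence of the separator splits the value from the entity; the object and method names are hypothetical and do not come from the library.

```scala
// Hypothetical helper, not part of Spark NLP.
object ObfuscationRefSketch {
  // Parse "value<sep>ENTITY" lines into entity -> candidate replacement values.
  def parse(lines: Seq[String], sep: String = "#"): Map[String, Seq[String]] =
    lines.map(_.trim).filter(_.nonEmpty).flatMap { line =>
      val idx = line.lastIndexOf(sep) // assumption: the entity follows the last separator
      if (idx > 0) Some(line.substring(idx + sep.length).trim -> line.substring(0, idx).trim)
      else None // skip lines without a separator
    }.groupBy(_._1).map { case (entity, pairs) => entity -> pairs.map(_._2) }
}
```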

The configuration parameters for this annotator are defined in the trait DeIdentificationParams.

Exceptions thrown

java.security.NoSuchAlgorithmException If no Provider supports a SecureRandom implementation for specified algorithm name.

Note

If the mode is set to obfuscate, DeIdentification uses java.security.SecureRandom to generate fake data. You can select a generation algorithm by setting the system environment variable SPARK_NLP_JSL_SEED_ALGORITHM. The chosen algorithm may affect the generated fake data, performance, and potential blocking issues. For the standard RNG algorithm names, refer to the SecureRandom Number Generation Algorithms section of the Java Security Standard Algorithm Names documentation. The default algorithm is 'SHA1PRNG'.
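
The note above can be illustrated with plain JDK calls. The sketch below only shows how an algorithm name read from SPARK_NLP_JSL_SEED_ALGORITHM could be resolved and how an unsupported name surfaces as NoSuchAlgorithmException; how the annotator itself reads the variable is not shown in this page, and the object and method names are hypothetical.

```scala
import java.security.SecureRandom
import scala.util.Try

// Hypothetical helper, not part of Spark NLP.
object SeedAlgorithmSketch {
  val EnvVar = "SPARK_NLP_JSL_SEED_ALGORITHM"
  val DefaultAlgorithm = "SHA1PRNG" // documented default

  // Resolve the algorithm name from the environment, falling back to the default.
  def algorithmName(env: Map[String, String] = sys.env): String =
    env.getOrElse(EnvVar, DefaultAlgorithm)

  // SecureRandom.getInstance throws NoSuchAlgorithmException for unsupported names;
  // Try captures that as a Failure instead of propagating it.
  def secureRandom(name: String): Try[SecureRandom] =
    Try(SecureRandom.getInstance(name))
}
```

Checking the name this way before building a pipeline can help surface a misconfigured environment variable early.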

See also

DeIdentificationModel

DeIdentificationParams

DeidApproachParams

train

Ideally, this annotator works in conjunction with demographic named entity recognizers, which can be trained using TextMatchers, RegexMatchers, DateMatchers, NerCRFs, or NerDLs. The following is an example of a de-identification pipeline.

Example

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")
    .setUseAbbreviations(true)

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel
    .pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

Ner entities

val clinical_sensitive_entities = MedicalNerModel
    .pretrained("ner_deid_enriched", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val nerConverter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

Deidentification

val deIdentification = new DeIdentification()
    .setInputCols(Array("ner_chunk", "token", "sentence"))
    .setOutputCol("dei")
    // file with custom regex patterns for custom entities
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
    // file with custom obfuscator names for the entities
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
    .setRefFileFormat("csv")
    .setRefSep("#")
    .setMode("obfuscate")
    .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
    .setObfuscateDate(true)
    .setDateTag("DATE")
    .setDays(5)
    .setObfuscateRefSource("file")

Pipeline

val data = Seq(
  "# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
).toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  clinical_sensitive_entities,
  nerConverter,
  deIdentification
))
val result = pipeline.fit(data).transform(data)



Show Results

result.select("dei.result").show(truncate = false)
+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+

Linear Supertypes

CheckLicense, HandleExceptionParams, DeidApproachParams, DeIdentificationParams, MaskingParams, BaseDeidParams, HasFeatures, AnnotatorApproach[DeIdentificationModel], CanBeLazy, DefaultParamsWritable, MLWritable, HasOutputAnnotatorType, HasOutputAnnotationCol, HasInputAnnotationCols, Estimator[DeIdentificationModel], PipelineStage, Logging, Params, Serializable, Serializable, Identifiable, AnyRef, Any


Instance Constructors

new DeIdentification()
new DeIdentification(uid: String)
uid
a unique identifier for the instanced Annotator

Exceptions thrown
java.security.NoSuchAlgorithmException If no Provider supports a SecureRandom implementation for specified algorithm name.

Type Members

type AnnotatorType = String

Definition Classes
HasOutputAnnotatorType

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def $[T](param: Param[T]): T

Attributes
protected
Definition Classes
Params
def $$[T](feature: StructFeature[T]): T

Attributes
protected
Definition Classes
HasFeatures
def $$[K, V](feature: MapFeature[K, V]): Map[K, V]

Attributes
protected
Definition Classes
HasFeatures
def $$[T](feature: SetFeature[T]): Set[T]

Attributes
protected
Definition Classes
HasFeatures
def $$[T](feature: ArrayFeature[T]): Array[T]

Attributes
protected
Definition Classes
HasFeatures
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def _fit(dataset: Dataset[_], recursiveStages: Option[PipelineModel]): DeIdentificationModel

Attributes
protected
Definition Classes
AnnotatorApproach
val additionalDateFormats: StringArrayParam
Additional date formats to be considered during date obfuscation. This allows users to specify custom date formats in addition to the default dateFormats.

Definition Classes
BaseDeidParams
val ageGroups: StructFeature[Map[String, Array[Int]]]
A map of age groups used to obfuscate ages. This parameter is only active when the obfuscateByAgeGroups parameter is true. Ages that are not covered by the given ageGroups are still obfuscated according to ageRanges. The map should contain the age group name as the key and an array of two integers as the value: the first integer is the lower bound of the age group, and the second integer is the upper bound. The default age groups for the English language are:
```
Map(
"baby" -> Array(0, 1),
"toddler" -> Array(1, 4),
"child" -> Array(4, 13),
"teenager" -> Array(13, 20),
"adult" -> Array(20, 65),
"senior" -> Array(65, 200)
)
```
Definition Classes
DeIdentificationParams
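
A minimal sketch of how an age could be looked up in the default ageGroups listed above. The documentation does not say whether the bounds are inclusive, so this sketch assumes a half-open interval (lower bound inclusive, upper bound exclusive); the object and method names are hypothetical.

```scala
// Hypothetical helper, not part of Spark NLP.
object AgeGroupSketch {
  // Default English age groups from the documentation, kept in order.
  val defaultAgeGroups: Seq[(String, (Int, Int))] = Seq(
    "baby" -> ((0, 1)),
    "toddler" -> ((1, 4)),
    "child" -> ((4, 13)),
    "teenager" -> ((13, 20)),
    "adult" -> ((20, 65)),
    "senior" -> ((65, 200))
  )

  // Assumption: lower bound inclusive, upper bound exclusive.
  def groupFor(age: Int, groups: Seq[(String, (Int, Int))] = defaultAgeGroups): Option[String] =
    groups.collectFirst { case (name, (lo, hi)) if age >= lo && age < hi => name }
}
```

An age that matches no group returns None, which corresponds to the documented fallback to ageRanges.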
val ageRanges: IntArrayParam
List of integers specifying the limits of the age groups to preserve during obfuscation.

Definition Classes
BaseDeidParams
val ageRangesByHipaa: BooleanParam
A Boolean parameter indicating whether to obfuscate ages based on the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.
The HIPAA Privacy Rule mandates that ages of patients older than 90 years must be obfuscated, while ages of patients 90 years or younger can remain unchanged.
When true, age entities greater than 90 are obfuscated as per the HIPAA Privacy Rule and the others remain unchanged. When false, the ageRanges parameter is used instead.

Definition Classes
BaseDeidParams
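
The two age-related parameters above (ageRanges and ageRangesByHipaa) can be sketched as follows. This is an illustration only: the limits used in the test are made up, the half-open bucket convention is an assumption, and the way the library actually picks a replacement age is not described in this page.

```scala
import java.util.Random

// Hypothetical helper, not part of Spark NLP.
object AgeRangeSketch {
  // HIPAA mode: only ages above 90 need obfuscation.
  def needsObfuscationByHipaa(age: Int): Boolean = age > 90

  // ageRanges mode: find the [lo, hi) bucket (assumed convention) that contains the age.
  def bucketFor(age: Int, limits: Array[Int]): Option[(Int, Int)] =
    limits.sorted.sliding(2).collectFirst {
      case Array(lo, hi) if age >= lo && age < hi => (lo, hi)
    }

  // Pick a replacement age inside the same bucket, so the age group is preserved.
  def replacementIn(bucket: (Int, Int), rng: Random): Int =
    bucket._1 + rng.nextInt(bucket._2 - bucket._1)
}
```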
final def asInstanceOf[T0]: T0

Definition Classes
Any
def beforeTraining(spark: SparkSession): Unit

Definition Classes
AnnotatorApproach
val blackList: StringArrayParam
List of entities that will be ignored in the regex file. The rest will be processed. The default values are "IBAN","ZIP","NPI","DLN","PASSPORT","C_CARD","DEA","SSN", "IP", "DEA".

Definition Classes
DeIdentificationParams
val blackListEntities: StringArrayParam
List of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Defaults to an empty array.

Definition Classes
DeIdentificationParams
final def checkSchema(schema: StructType, inputAnnotatorType: String): Boolean

Attributes
protected
Definition Classes
HasInputAnnotationCols
def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit

Definition Classes
CheckLicense
def checkValidScope(scope: String): Unit

Definition Classes
CheckLicense
def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit

Definition Classes
CheckLicense
def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit

Definition Classes
CheckLicense
val chunkMatching: MapFeature[String, Double]
Performs entity chunk matching across rows or within groups in a DataFrame. Useful in de-identification pipelines where certain entity labels like "NAME" or "DATE" may be missing in some rows and need to be filled from other rows in the same group.

Definition Classes
DeIdentificationParams
Note
When applying the method across multiple rows, the usage of groupByCol parameter is required.
final def clear(param: Param[_]): DeIdentification.this.type

Definition Classes
Params
def clone(): AnyRef

Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws( ... ) @native()
val combineRegexPatterns: BooleanParam
If true, the loaded regex file and the default regex file are used together; if false, either the loaded regex file or the default regex file is used. When true, the default regex file is used regardless of the value of regexOverride. The default value is false.
lazy val combinedDateFormats: Array[String]

Attributes
protected
Definition Classes
BaseDeidParams
val consistentAcrossNameParts: BooleanParam
Parameter that indicates whether consistency should be enforced across different parts of a name (e.g., first name, middle name, last name). When set to true, the same transformation or obfuscation is applied consistently to all parts of the same name entity, even if those parts appear separately.
For example, if "John Smith" is obfuscated as "Liam Brown", then:
- When the full name "John Smith" appears, it will be replaced with "Liam Brown"
- When "John" or "Smith" appear individually, they will still be obfuscated as "Liam" and "Brown" respectively, ensuring consistency in name transformation.
Default: true
Definition Classes
BaseDeidParams
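
The consistentAcrossNameParts behaviour described above can be pictured as a per-part mapping derived from the full-name replacement. The sketch below is a simplified illustration that assumes the original and fake names have the same number of whitespace-separated parts; the object and method names are hypothetical and this is not the library's algorithm.

```scala
// Hypothetical helper, not part of Spark NLP.
object NamePartsSketch {
  // Pair each part of the original name with the corresponding part of the fake name.
  def partMapping(original: String, fake: String): Map[String, String] =
    original.split("\\s+").zip(fake.split("\\s+")).toMap

  // Replace every known part; unknown parts are left unchanged in this sketch.
  def replace(name: String, mapping: Map[String, String]): String =
    name.split("\\s+").map(part => mapping.getOrElse(part, part)).mkString(" ")
}
```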
val consistentObfuscation: BooleanParam
Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.

Definition Classes
DeIdentificationParams
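
For reference, the Levenshtein distance mentioned for consistentObfuscation can be computed with the standard dynamic-programming recurrence. The sketch below shows only the distance itself; how the annotator turns the distance into a similarity decision is not specified in this page, and the object name is hypothetical.

```scala
// Hypothetical helper, not part of Spark NLP.
object LevenshteinSketch {
  // Classic two-row dynamic programming: O(|a| * |b|) time, O(|b|) memory.
  def distance(a: String, b: String): Int = {
    var prev: Array[Int] = (0 to b.length).toArray // distances from the empty prefix of a
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // deleting i characters from a
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1), // insertion, deletion
          prev(j - 1) + cost                      // substitution or match
        )
      }
      prev = curr
    }
    prev(b.length)
  }
}
```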
final def copy(extra: ParamMap): Estimator[DeIdentificationModel]

Definition Classes
AnnotatorApproach → Estimator → PipelineStage → Params
def copyValues[T <: Params](to: T, extra: ParamMap): T

Attributes
protected
Definition Classes
Params
val countryObfuscation: BooleanParam
Whether to obfuscate country entities or not. If true, country entities will be obfuscated using the Faker module. If false, country entities will be skipped during obfuscation. Default: false

Definition Classes
BaseDeidParams
val dateEntities: StringArrayParam
List of date entities. Default: Array("DATE", "DOB", "DOD", "EFFDATE", "FISCAL_YEAR")

Definition Classes
BaseDeidParams
val dateFormats: StringArrayParam
Formats of the dates to displace.

Definition Classes
BaseDeidParams
val dateTag: Param[String]
Tag representing the NER entity for dates (default: DATE).

Definition Classes
DeIdentificationParams
val dateToYear: BooleanParam
true if dates must be converted to years, false otherwise

Definition Classes
DeIdentificationParams
val days: IntParam
Number of days to obfuscate the dates by displacement. If not provided, a random integer between 1 and 60 will be used.

Definition Classes
BaseDeidParams
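
Date displacement by a fixed number of days, as described for days and dateFormats, can be sketched with java.time. The sketch tries each configured format in order and shifts the first one that parses; this ordering rule and the object and method names are assumptions, not the library's implementation.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, not part of Spark NLP.
object DateShiftSketch {
  // Shift a date string by `days`, keeping the format it was written in.
  // Returns None when no format matches.
  def shift(date: String, days: Int, formats: Seq[String]): Option[String] =
    formats.iterator.map { pattern =>
      val fmt = DateTimeFormatter.ofPattern(pattern)
      Try(LocalDate.parse(date, fmt).plusDays(days.toLong).format(fmt)).toOption
    }.collectFirst { case Some(shifted) => shifted }
}
```

With the formats and displacement from the example pipeline (`Seq("MM/dd/yy", "yyyy-MM-dd")` and 5 days), `01/13/93` becomes `01/18/93` and `2079-11-09` becomes `2079-11-14`, matching the example output.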
final def defaultCopy[T <: Params](extra: ParamMap): T

Attributes
protected
Definition Classes
Params
val description: String

Definition Classes
DeIdentification → AnnotatorApproach
val doExceptionHandling: BooleanParam
If true, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next item. This comes with a performance penalty.

Definition Classes
HandleExceptionParams
val enableDefaultObfuscationEquivalents: BooleanParam
Whether to enable default obfuscation equivalents for common entities. This parameter allows the system to automatically include a set of predefined common English name equivalents. Default: false

Definition Classes
BaseDeidParams
val entityCasingModesPath: Param[String]
Path to the JSON dictionary file that contains the entity casing modes:
- 'lowercase': Converts all characters to lower case using the rules of the default locale.
- 'uppercase': Converts all characters to upper case using the rules of the default locale.
- 'capitalize': Converts the first character to upper case and converts the others to lower case.
- 'titlecase': Converts the first character in every token to upper case and converts the others to lower case.
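
The four casing modes listed for entityCasingModesPath can be sketched in plain Scala as follows. The structure of the JSON file itself is not described here, so the sketch covers only the string transformations; tokens are assumed to be separated by single spaces, and the object and method names are hypothetical.

```scala
// Hypothetical helper, not part of Spark NLP.
object CasingSketch {
  def applyCasing(text: String, mode: String): String = mode match {
    case "lowercase"  => text.toLowerCase
    case "uppercase"  => text.toUpperCase
    // First character upper case, everything else lower case.
    case "capitalize" => text.toLowerCase.capitalize
    // First character of every space-separated token upper case, the rest lower case.
    case "titlecase"  => text.split(" ").map(_.toLowerCase.capitalize).mkString(" ")
    case _            => text // unknown mode: leave the text unchanged in this sketch
  }
}
```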
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def explainParam(param: Param[_]): String

Definition Classes
Params
def explainParams(): String

Definition Classes
Params
final def extractParamMap(): ParamMap

Definition Classes
Params
final def extractParamMap(extra: ParamMap): ParamMap

Definition Classes
Params
val fakerLengthOffset: IntParam
Specifies how much length deviation is accepted in obfuscation when keepTextSizeForObfuscation is enabled. The value must be greater than 0. Default: 3.

Definition Classes
BaseDeidParams
val features: ArrayBuffer[Feature[_, _, _]]

Definition Classes
HasFeatures
def finalize(): Unit

Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def fit(dataset: Dataset[_]): DeIdentificationModel

Definition Classes
AnnotatorApproach → Estimator
def fit(dataset: Dataset[_], paramMaps: Seq[ParamMap]): Seq[DeIdentificationModel]

Definition Classes
Estimator
Annotations
@Since( "2.0.0" )
def fit(dataset: Dataset[_], paramMap: ParamMap): DeIdentificationModel

Definition Classes
Estimator
Annotations
@Since( "2.0.0" )
def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DeIdentificationModel

Definition Classes
Estimator
Annotations
@Since( "2.0.0" ) @varargs()
val fixedMaskLength: IntParam
Select the fixed mask length: this is the length of the masking sequence that will be used when the 'fixed_length_chars' masking policy is selected.

Definition Classes
MaskingParams
val genderAwareness: BooleanParam
Whether to use gender-aware names during obfuscation. This parameter affects only names. Setting it to true might decrease performance. Default: false

Definition Classes
BaseDeidParams
val geoConsistency: BooleanParam
Whether to enforce consistent obfuscation across geographical entities: state, city, street, zip and phone.

Functionality Overview
This parameter enables intelligent geographical entity obfuscation that maintains realistic relationships between different geographic components. When enabled, the system ensures that obfuscated addresses form coherent, valid combinations rather than random replacements.

Supported Entity Types
The following geographical entities are processed in priority order:
- state (Priority: 0): US state names
- city (Priority: 1): City names
- zip (Priority: 2): Zip codes
- street (Priority: 3): Street addresses
- phone (Priority: 4): Phone numbers

Language Requirement
IMPORTANT: Geographic consistency is only applied when:
- the geoConsistency parameter is set to true, AND
- the language parameter is set to en
For non-English configurations, this feature is automatically disabled regardless of the parameter setting.

Consistency Algorithm
When geographical entities come from the chunk columns:
1. Entity Grouping: All geographic entities are identified and grouped by type.
2. Fake Address Selection: A consistent set of fake US addresses is selected using hash-based deterministic selection to ensure reproducibility.
3. Priority-Based Mapping: Entities are mapped to fake addresses following the priority order (state → city → zip → street → phone).
4. Consistent Replacement: All entities of the same type within a document use the same fake address pool, maintaining geographical coherence.

Parameter Interactions
IMPORTANT: Enabling this parameter automatically disables:
- keepTextSizeForObfuscation: text size preservation is not maintained
- consistentObfuscation: standard consistency rules are overridden
- file-based fakers
This is necessary because geographic consistency requires specific fake address selection that may not preserve original text lengths or follow standard obfuscation patterns.

Default: false

Definition Classes
BaseDeidParams
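
The hash-based deterministic selection described for geoConsistency can be pictured with the sketch below. Everything in it is illustrative: the address pool is made up, the choice of the hashing key is an assumption, and the object, class and method names are hypothetical rather than taken from the library.

```scala
// Hypothetical helper, not part of Spark NLP.
object GeoConsistencySketch {
  final case class FakeAddress(state: String, city: String, zip: String, street: String, phone: String)

  // Made-up pool of coherent fake addresses, for illustration only.
  val pool: Vector[FakeAddress] = Vector(
    FakeAddress("Ohio", "Columbus", "43215", "100 Example St", "614-555-0100"),
    FakeAddress("Texas", "Austin", "78701", "200 Sample Ave", "512-555-0101")
  )

  // Deterministic pick: the same key always selects the same fake address.
  def pick(key: String): FakeAddress =
    pool(math.floorMod(key.hashCode, pool.size))

  // Map an entity type to the matching field of the selected address,
  // so all geographic entities are replaced from one coherent record.
  def replacement(entityType: String, address: FakeAddress): Option[String] =
    entityType.toLowerCase match {
      case "state"  => Some(address.state)
      case "city"   => Some(address.city)
      case "zip"    => Some(address.zip)
      case "street" => Some(address.street)
      case "phone"  => Some(address.phone)
      case _        => None
    }
}
```

Because every entity is drawn from the same selected record, the replaced state, city and zip stay mutually consistent, which is the property the parameter description emphasizes.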
def get[T](feature: StructFeature[T]): Option[T]

Attributes
protected
Definition Classes
HasFeatures
def get[K, V](feature: MapFeature[K, V]): Option[Map[K, V]]

Attributes
protected
Definition Classes
HasFeatures
def get[T](feature: SetFeature[T]): Option[Set[T]]

Attributes
protected
Definition Classes
HasFeatures
def get[T](feature: ArrayFeature[T]): Option[Array[T]]

Attributes
protected
Definition Classes
HasFeatures
final def get[T](param: Param[T]): Option[T]

Definition Classes
Params
def getAdditionalDateFormats: Array[String]
Gets the value of additionalDateFormats

Definition Classes
BaseDeidParams
def getAgeRanges: Array[Int]
Gets ageRanges param.

Definition Classes
BaseDeidParams
def getAgeRangesByHipaa: Boolean
Gets the value of ageRangesByHipaa.

Definition Classes
BaseDeidParams
def getBlackListEntities: Array[String]
Gets blackListEntities param

Definition Classes
DeIdentificationParams
def getChunkMatching: Map[String, Double]

Definition Classes
DeIdentificationParams
def getChunkMatchingAsStr: String

Definition Classes
DeIdentificationParams
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
Annotations
@native()
def getCombineRegexPatterns: Boolean
def getConsistentAcrossNameParts: Boolean
Gets the value of consistentAcrossNameParts.

Definition Classes
BaseDeidParams
def getConsistentObfuscation: Boolean

Definition Classes
DeIdentificationParams
def getCountryObfuscation: Boolean
Gets the value of countryObfuscation.

Definition Classes
BaseDeidParams
def getDateEntities: Array[String]
Gets dateEntities param.

Definition Classes
BaseDeidParams
def getDateFormats: Array[String]
Gets the value of dateFormats

Definition Classes
BaseDeidParams
def getDateTag: String

Definition Classes
DeIdentificationParams
def getDateToYear: Boolean

Definition Classes
DeIdentificationParams
def getDays: Int
Gets days param

Definition Classes
BaseDeidParams
final def getDefault[T](param: Param[T]): Option[T]

Definition Classes
Params
def getDefaultObfuscationEquivalents: Array[StaticObfuscationEntity]

Definition Classes
BaseDeidParams
def getDefaultObfuscationEquivalentsAsJava: Array[ArrayList[String]]

Definition Classes
BaseDeidParams
def getEnableDefaultObfuscationEquivalents: Boolean
Gets the value of enableDefaultObfuscationEquivalents.

Definition Classes
BaseDeidParams
def getEntityBasedObfuscationRefSource(entityClass: String): String

Attributes
protected
Definition Classes
BaseDeidParams
def getFakerLengthOffset: Int
Gets fakerLengthOffset param

Definition Classes
BaseDeidParams
def getFixedMaskLength: Int
Gets fixedMaskLength param.

Definition Classes
MaskingParams
def getGenderAwareness: Boolean
Gets genderAwareness param.

Definition Classes
BaseDeidParams
def getGeoConsistency: Boolean
Gets the value of geoConsistency.

Definition Classes
BaseDeidParams
def getGroupByCol: String
Gets groupByCol param

Definition Classes
DeIdentificationParams
def getIgnoreRegex: Boolean

Definition Classes
DeIdentificationParams
def getInputCols: Array[String]

Definition Classes
HasInputAnnotationCols
def getKeepMonth: Boolean
Gets keepMonth param

Definition Classes
BaseDeidParams
def getKeepTextSizeForObfuscation: Boolean
Gets keepTextSizeForObfuscation param

Definition Classes
BaseDeidParams
def getKeepYear: Boolean
Gets keepYear param

Definition Classes
BaseDeidParams
def getLanguage: String
Gets language param.

Definition Classes
BaseDeidParams
def getLazyAnnotator: Boolean

Definition Classes
CanBeLazy
def getMappingsColumn: String

Definition Classes
DeIdentificationParams
def getMaskStatus(entityClass: String): String

Attributes
protected
Definition Classes
MaskingParams
def getMaskingPolicy: String
Gets maskingPolicy param.

Definition Classes
MaskingParams
def getMetadataMaskingPolicy: String
Gets metadataMaskingPolicy param

Definition Classes
DeIdentificationParams
def getMinYear: Int

Definition Classes
DeIdentificationParams
def getMode: String
Gets mode param.

Definition Classes
BaseDeidParams
def getObfuscateByAgeGroups: Boolean
Gets obfuscateByAgeGroups param

Definition Classes
DeIdentificationParams
def getObfuscateDate: Boolean
Gets obfuscateDate param

Definition Classes
BaseDeidParams
def getObfuscateRefSource: String
Gets obfuscateRefSource param.

Definition Classes
BaseDeidParams
def getObfuscationEquivalents: Option[Array[StaticObfuscationEntity]]
Gets the value of obfuscationEquivalents.

Definition Classes
BaseDeidParams
def getObfuscationEquivalentsResource: ExternalResource
def getObfuscationStrategyOnException: String

Definition Classes
DeIdentificationParams
final def getOrDefault[T](param: Param[T]): T

Definition Classes
Params
final def getOutputCol: String

Definition Classes
HasOutputAnnotationCol
def getParam(paramName: String): Param[Any]

Definition Classes
Params
def getRegexOverride: Boolean

Definition Classes
DeIdentificationParams
def getRegexPatternsDictionaryAsJsonString: String
def getRegion: String
Gets region param.

Definition Classes
BaseDeidParams
def getReturnEntityMappings: Boolean

Definition Classes
DeIdentificationParams
def getSameEntityThreshold: Double

Definition Classes
DeIdentificationParams
def getSameLengthFormattedEntities(): Array[String]

Definition Classes
BaseDeidParams
def getSeed(): Int

Definition Classes
BaseDeidParams
def getSelectiveObfuscateRefSource: Map[String, String]
Gets selectiveObfuscateRefSource param.

Definition Classes
BaseDeidParams
def getSelectiveObfuscateRefSourceAsStr: String

Definition Classes
BaseDeidParams
def getSelectiveObfuscationModes: Option[Map[String, Array[String]]]
Gets selectiveObfuscationModes param.

Definition Classes
BaseDeidParams
def getStaticObfuscationPairs: Option[Array[StaticObfuscationEntity]]

Definition Classes
BaseDeidParams
def getStaticObfuscationPairsResource: ExternalResource

Definition Classes
DeidApproachParams
def getUnnormalizedDateMode: String
Gets unnormalizedDateMode param.

Definition Classes
BaseDeidParams
def getUseShiftDays: Boolean
Getter method of useShiftDays

Definition Classes
DeIdentificationParams → BaseDeidParams
def getValidAgeRanges: Array[Int]
Gets valid ageRanges whether ageRangesByHipaa is true or not.

Attributes
protected
Definition Classes
BaseDeidParams
def getZipCodeTag: String

Definition Classes
DeIdentificationParams
val groupByCol: Param[String]
The column name used to group the dataset. This parameter is used in conjunction with consistentObfuscation to ensure consistent obfuscation within each group. When groupByCol is set, the dataset is partitioned into groups based on the values of the specified column.
Default: "" (empty string, meaning no grouping)
- The column name must be a valid string in the input dataset.
- The column must be of StringType.
Definition Classes
DeIdentificationParams
Note
This functionality can change the order of the dataset, so it is recommended to use it with caution.
This functionality is not supported by LightPipeline.
final def hasDefault[T](param: Param[T]): Boolean

Definition Classes
Params
def hasParam(paramName: String): Boolean

Definition Classes
Params
def hashCode(): Int

Definition Classes
AnyRef → Any
Annotations
@native()
val ignoreRegex: BooleanParam
Selects whether to use the regex file loaded in the model. If true, the default regex file will not be used. The default value is false.

Definition Classes
DeIdentificationParams
def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean

Attributes
protected
Definition Classes
Logging
def initializeLogIfNecessary(isInterpreter: Boolean): Unit

Attributes
protected
Definition Classes
Logging
val inputAnnotatorTypes: Array[AnnotatorType]
Input annotator types: DOCUMENT, TOKEN, CHUNK

Definition Classes
DeIdentification → HasInputAnnotationCols
final val inputCols: StringArrayParam

Attributes
protected
Definition Classes
HasInputAnnotationCols
def isArabic: Boolean

Attributes
protected
Definition Classes
MaskingParams
final def isDefined(param: Param[_]): Boolean

Definition Classes
Params
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
val isRandomDateDisplacement: BooleanParam
Whether to use a random number of displacement days for date entities. The random number is based on DeIdentificationParams.seed. If true, random displacement days are used for date entities; if false, DeIdentificationParams.days is used. The default value is false.

Definition Classes
DeIdentificationParams
final def isSet(param: Param[_]): Boolean

Definition Classes
Params
def isTraceEnabled(): Boolean

Attributes
protected
Definition Classes
Logging
val keepMonth: BooleanParam
Whether to keep the month intact when obfuscating date entities. If true, the month will remain unchanged during the obfuscation process. If false, the month will be modified along with the year and day. Default: false.

Definition Classes
BaseDeidParams
val keepTextSizeForObfuscation: BooleanParam
Specifies whether the output should maintain the same character length as the input text. The output keeps the same length when a replacement of that length is available; otherwise the length might vary.

Definition Classes
BaseDeidParams
val keepYear: BooleanParam
Whether to keep the year intact when obfuscating date entities. If true, the year will remain unchanged during the obfuscation process. If false, the year will be modified along with the month and day. Default: false.

Definition Classes
BaseDeidParams
val language: Param[String]
The language used to select the regex file and some faker entities: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'

Definition Classes
BaseDeidParams
val lazyAnnotator: BooleanParam

Definition Classes
CanBeLazy
def loadThreeInputResource(er: ExternalResource): Array[StaticObfuscationEntity]

Attributes
protected
Definition Classes
DeidApproachParams
def log: Logger

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logName: String

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
val mappingsColumn: Param[String]
The mapping column that returns the annotation chunks with the fake entities.

Definition Classes
DeIdentificationParams
def maskEntity(wordToReplace: String, entityClass: String): String

Attributes
protected
Definition Classes
MaskingParams
def maskEntity(annotation: Annotation, entityClass: String): String

Attributes
protected
Definition Classes
MaskingParams
def maskEntityWithPolicy(wordToReplace: String, maskingPolicy: String, entityClass: String): String

Attributes
protected
Definition Classes
MaskingParams
def maskEntityWithPolicy(annotation: Annotation, maskingPolicy: String, entityClass: String): String

Attributes
protected
Definition Classes
MaskingParams
val maskingPolicy: Param[String]
Select the masking policy:
- 'entity_labels': Replace the values with their entity labels.
- 'same_length_chars': Replace the name with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 chars (like Jo, or 5), asterisks are used without brackets.
- 'fixed_length_chars': Replace the entity with a masking sequence composed of a fixed number of asterisks.
- 'entity_labels_without_brackets': Replace the values with their entity labels without brackets.
- 'same_length_chars_without_brackets': Replace the name with asterisks of the same length, without brackets.
- Default: 'entity_labels'
Definition Classes
MaskingParams
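As an illustration only, the policies listed above can be sketched in plain Scala. `MaskingPolicySketch`, its `mask` helper, the angle and square bracket characters, and the default fixed length of 4 are assumptions made for this sketch; it is not the library's implementation.

```scala
object MaskingPolicySketch {
  // Hypothetical helper: returns the masked form of `word` for a given policy.
  // The bracket characters and the default fixedLength are assumptions.
  def mask(word: String, entity: String, policy: String, fixedLength: Int = 4): String =
    policy match {
      case "entity_labels"                  => s"<$entity>"
      case "entity_labels_without_brackets" => entity
      case "same_length_chars" =>
        // Entities shorter than 3 chars get asterisks only, without brackets
        if (word.length < 3) "*" * word.length
        else "[" + "*" * (word.length - 2) + "]"
      case "same_length_chars_without_brackets" => "*" * word.length
      case "fixed_length_chars"                 => "*" * fixedLength
      case other =>
        throw new IllegalArgumentException(s"Unknown masking policy: $other")
    }
}
```

With these assumptions, `MaskingPolicySketch.mask("David", "PERSON", "same_length_chars")` yields a five-character result, so the masked text keeps the original length.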
val metadataMaskingPolicy: Param[String]
If specified, the metadata includes the masked form of the document. Select one of the following masking policies if you want to return the masked form in the metadata:
- 'entity_labels': Replace the values with their entity labels.
- 'same_length_chars': Replace the name with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 chars (like Jo, or 5), asterisks are used without brackets.
- 'fixed_length_chars': Replace the entity with a masking sequence composed of a fixed number of asterisks.
- 'entity_labels_without_brackets': Replace the values with their entity labels without brackets.
- 'same_length_chars_without_brackets': Replace the name with asterisks of the same length, without brackets.
- Default: ""
Definition Classes
DeIdentificationParams
val minYear: IntParam
Minimum year to use when converting date to year

Definition Classes
DeIdentificationParams
val mode: Param[String]
Mode for Anonymizer ['mask' or 'obfuscate']. Default: 'mask'
- Mask mode: The entities will be replaced by their entity types.
- Obfuscate mode: The entity is replaced by an obfuscator's term.
Definition Classes
BaseDeidParams
Example:
1. Given the following text: "David Hale visited EEUU a couple of years ago"
  Mask mode: "<PERSON> visited <COUNTRY> a couple of years ago"
  Obfuscate mode: "Bryan Johnson visited Japan a couple of years ago"
def msgHelper(schema: StructType): String

Attributes
protected
Definition Classes
HasInputAnnotationCols
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
Annotations
@native()
final def notifyAll(): Unit

Definition Classes
AnyRef
Annotations
@native()
val obfuscateByAgeGroups: BooleanParam
Whether to obfuscate ages based on age groups.
When true, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When false, the age ranges specified in the ageRanges parameter will be used to obfuscate ages. Default: false.

Definition Classes
DeIdentificationParams
val obfuscateDate: BooleanParam
When mode == "obfuscate", whether to obfuscate dates or not. This param helps with consistency by making dateFormats more visible. When set to true, make sure the dateFormats param fits your needs. If the value is true and obfuscation fails, unnormalizedDateMode is activated. When set to false, the date is masked to <DATE>. Default: false

Definition Classes
BaseDeidParams
val obfuscateRefFile: Param[String]
File with the terms to be used for Obfuscation

Definition Classes
DeidApproachParams
val obfuscateRefSource: Param[String]
The source used to obfuscate the entities. The values are the following:
- 'file': Takes the entities from the obfuscateRefFile.
- 'faker': Takes the entities from the Faker module.
- 'both': Takes the entities from the obfuscateRefFile and the Faker module randomly.

Definition Classes
BaseDeidParams
val obfuscationEquivalents: StructFeature[Array[StaticObfuscationEntity]]
Variant-to-canonical entity mappings to ensure consistent obfuscation.
This method allows you to define equivalence rules for entity variants that should be obfuscated the same way. For example, the names "Alex" and "Alexander" will always be mapped to the same obfuscated value if they are linked to the same canonical form.
It accepts an array of string triplets, where each triplet defines:
- variant: A non-standard, short, or alternative form of a value (e.g., "Alex")
- entityType: The type of the entity (e.g., "NAME", "STATE", "COUNTRY")
- canonical: The standardized form all variants map to (e.g., "Alexander")
variant and entityType comparisons are case-insensitive during processing.
This is especially useful in de-identification tasks to ensure consistent replacement of semantically identical values. It also allows cross-variant normalization across different occurrences of sensitive data.
Definition Classes
BaseDeidParams
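The variant-to-canonical idea described above can be sketched as a case-insensitive lookup. `ObfuscationEquivalentsSketch`, `buildLookup`, and `canonicalize` are hypothetical names used for illustration; how the library stores and applies the triplets internally is not shown in this page.

```scala
object ObfuscationEquivalentsSketch {
  // Builds a (variant, entityType) -> canonical lookup.
  // Keys are lower-cased so that comparisons are case-insensitive.
  def buildLookup(equivalents: Array[Array[String]]): Map[(String, String), String] =
    equivalents.map { entry =>
      // require throws IllegalArgumentException when the triplet is malformed
      require(entry.length == 3,
        s"Expected [variant, entityType, canonical], got ${entry.length} elements")
      (entry(0).toLowerCase, entry(1).toLowerCase) -> entry(2)
    }.toMap

  // Returns the canonical form when a rule exists, otherwise the value itself.
  def canonicalize(lookup: Map[(String, String), String],
                   value: String,
                   entityType: String): String =
    lookup.getOrElse((value.toLowerCase, entityType.toLowerCase), value)
}
```

Once "Alex" and "Alexander" resolve to the same canonical string, any obfuscator keyed on that string produces the same replacement for both.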
val obfuscationEquivalentsResource: ExternalResourceParam
Variant-to-canonical entity mappings to ensure consistent obfuscation.
This method allows you to define equivalence rules for entity variants that should be obfuscated the same way. For example, the names "Alex" and "Alexander" will always be mapped to the same obfuscated value if they are linked to the same canonical form.
It accepts an array of string triplets, where each triplet defines:
- variant: A non-standard, short, or alternative form of a value (e.g., "Alex")
- entityType: The type of the entity (e.g., "NAME", "STATE", "COUNTRY")
- canonical: The standardized form all variants map to (e.g., "Alexander")
This is especially useful in de-identification tasks to ensure consistent replacement of semantically identical values. It also allows cross-variant normalization across different occurrences of sensitive data.
val obfuscationStrategyOnException: Param[String]
The obfuscation strategy to be applied when an exception occurs.
The obfuscation strategy determines how obfuscation is handled in case of an exception. Four possible values are supported:
- "mask": The original chunk is replaced with a masking pattern.
- "default": The original chunk is replaced with a default faker.
- "skip": The original chunk is not replaced with any faker.
- "exception": Throws the exception.
The default obfuscation strategy is "default".
Definition Classes
DeIdentificationParams
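A minimal sketch of how the four strategies above differ, written in plain Scala. `ExceptionStrategySketch`, `obfuscateOrFallback`, the `defaultFake` argument, and the `<ENTITY>` masking pattern are assumptions for illustration, not the library's internals.

```scala
import scala.util.{Failure, Success, Try}

object ExceptionStrategySketch {
  // Runs `obfuscate` on the chunk and applies the chosen strategy if it fails.
  def obfuscateOrFallback(chunk: String, entity: String, strategy: String, defaultFake: String)(
      obfuscate: String => String): String =
    Try(obfuscate(chunk)) match {
      case Success(fake) => fake
      case Failure(error) =>
        strategy match {
          case "mask"      => s"<$entity>"  // replace with a masking pattern (assumed form)
          case "default"   => defaultFake   // replace with a default fake value
          case "skip"      => chunk         // leave the original chunk untouched
          case "exception" => throw error   // propagate the original exception
          case other =>
            throw new IllegalArgumentException(s"Unknown strategy: $other")
        }
    }
}
```

Note that "skip" leaves the original text in the output, so it trades privacy for robustness; "mask" and "default" always replace the chunk.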
def onTrained(model: DeIdentificationModel, spark: SparkSession): Unit

Definition Classes
AnnotatorApproach
val optionalInputAnnotatorTypes: Array[String]

Definition Classes
HasInputAnnotationCols
val outputAnnotatorType: AnnotatorType
Output annotator types: DOCUMENT

Definition Classes
DeIdentification → HasOutputAnnotatorType
val outputAsDocument: BooleanParam
Whether to return all sentences joined into a single document

Definition Classes
DeIdentificationParams
final val outputCol: Param[String]

Attributes
protected
Definition Classes
HasOutputAnnotationCol
lazy val params: Array[Param[_]]

Definition Classes
Params
lazy val randomDateFormat: String

Attributes
protected
Definition Classes
BaseDeidParams
val refFileFormat: Param[String]
Format of the reference file for obfuscation. The default value is "csv".

Definition Classes
DeidApproachParams
val refSep: Param[String]
Separator character for the csv reference file for obfuscation. The default value is "#".

Definition Classes
DeidApproachParams
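To make the reference file layout concrete, the following sketch parses lines such as `Dr. Gregory House#DOCTOR` into an entity-to-terms map. `RefFileSketch` and its helpers are hypothetical, and splitting on the last separator is an assumption of this sketch; the library's own reader may behave differently.

```scala
object RefFileSketch {
  // Splits one "term<sep>ENTITY" line at the last separator (assumption).
  def parseLine(line: String, sep: String = "#"): (String, String) = {
    val idx = line.lastIndexOf(sep)
    require(idx > 0, s"Line has no '$sep' separator after a term: $line")
    (line.substring(0, idx).trim, line.substring(idx + sep.length).trim)
  }

  // Groups the terms by entity label, skipping blank lines.
  def parse(lines: Seq[String], sep: String = "#"): Map[String, Seq[String]] =
    lines
      .filter(_.trim.nonEmpty)
      .map(parseLine(_, sep))
      .groupBy(_._2)
      .map { case (entity, pairs) => entity -> pairs.map(_._1) }
}
```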
val regexOverride: BooleanParam
If the value is true, regex entities are prioritized; if the value is false, NER entities are prioritized. The default value is false. If DeIdentification.combineRegexPatterns is true, this value has no effect.

Definition Classes
DeIdentificationParams
val regexPatternsDictionary: ExternalResourceParam
Dictionary with regular expression patterns that match some protected entity. If the dictionary is not set, the default regex file is used.
val regexPatternsDictionaryAsJsonString: Param[String]
Dictionary with regular expression patterns, given as JSON, that match some protected entity. If the dictionary is not set, the default regex file is used.
val region: Param[String]
With this property, you can select particular dateFormats. It is mainly used when obfuscating dates: it decides whether the first or the second part of a date such as 11/11/2023 is read as the day. The values are the following:
- 'eu' for European Union
- 'us' for USA
Definition Classes
BaseDeidParams
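The ambiguity that region resolves can be shown with `java.time`. The mapping of 'eu' to `dd/MM/yyyy` and 'us' to `MM/dd/yyyy` below is an assumption made for this sketch, and `RegionDateSketch` is a hypothetical name; the library's actual dateFormats per region are not listed in this page.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object RegionDateSketch {
  // Assumed day/month order per region, for illustration only
  private val patterns = Map("eu" -> "dd/MM/yyyy", "us" -> "MM/dd/yyyy")

  // Parses the same text differently depending on the region
  def parse(date: String, region: String): LocalDate =
    LocalDate.parse(date, DateTimeFormatter.ofPattern(patterns(region)))
}
```

Under these assumptions, "03/04/2023" is read as 3 April 2023 for 'eu' and as 4 March 2023 for 'us', so any day displacement applied afterwards starts from a different date.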
val returnEntityMappings: BooleanParam
With this property, you can select whether to return the mapping column.

Definition Classes
DeIdentificationParams
val sameEntityThreshold: DoubleParam
Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9). This threshold does not apply to date entities.

Definition Classes
DeIdentificationParams
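The documentation for setConsistentObfuscation states that similarity is based on the Levenshtein distance. The sketch below shows one common way to turn that distance into a score in [0.0, 1.0]; the normalization by the longer string's length and the names `SameEntitySketch`, `similarity`, and `sameEntity` are assumptions, not the library's formula.

```scala
object SameEntitySketch {
  // Classic dynamic-programming Levenshtein distance, keeping two rows in memory
  def levenshtein(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Assumed normalization: 1 - distance / length of the longer string
  def similarity(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / maxLen
  }

  def sameEntity(a: String, b: String, threshold: Double): Boolean =
    similarity(a, b) >= threshold
}
```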
val sameLengthFormattedEntities: StringArrayParam
List of formatted entities for which obfuscation generates outputs of the same length as the originals. The supported and default formatted entities are: "phone", "fax", "contact", "id", "idnum", "bioid", "medicalrecord", "zip", "vin", "ssn", "dln", "plate", "license", "IRS", "CFN", "account".

Definition Classes
BaseDeidParams
def save(path: String): Unit

Definition Classes
MLWritable
Annotations
@Since( "1.6.0" ) @throws( ... )
val seed: IntParam
The seed used to select the entities in obfuscate mode. With the seed, you can repeat an execution several times and get the same output.

Definition Classes
BaseDeidParams
val selectiveObfuscateRefSource: MapFeature[String, String]
A map of entity names to their obfuscation modes. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation source.
Definition Classes
BaseDeidParams
Example:
1. val selectiveSources = Map( "PHONE" -> "file", "EMAIL" -> "faker", "NAME" -> "faker", "ADDRESS" -> "both" )
val selectiveObfuscationModes: StructFeature[Map[String, Array[String]]]
The dictionary of modes to enable multi-mode deidentification.
- 'obfuscate': Replace the values with random values.
- 'mask_same_length_chars': Replace the name with asterisks of the same length minus two, plus brackets on both ends.
- 'mask_entity_labels': Replace the values with their entity labels.
- 'mask_fixed_length_chars': Replace the name with asterisks of a fixed length. You can also invoke "setFixedMaskLength()".
- 'mask_entity_labels_without_brackets': Replace the values with their entity labels without brackets.
- 'mask_same_length_chars_without_brackets': Replace the name with asterisks of the same length, without brackets.
- 'skip': Skip the entities (leave them intact).
Entities that are not given in the dictionary are deidentified according to setMode().
Definition Classes
BaseDeidParams
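A minimal sketch of the fallback rule described above. It assumes, based on the `Map[String, Array[String]]` type, that each key is a mode and each value lists the entities using that mode; that layout, the case-insensitive matching, and the name `SelectiveModesSketch` are assumptions for illustration.

```scala
object SelectiveModesSketch {
  // Finds the mode whose entity list contains `entity`;
  // falls back to the global mode when the entity is not listed.
  def modeFor(entity: String,
              selective: Map[String, Array[String]],
              defaultMode: String): String =
    selective.collectFirst {
      case (mode, entities) if entities.exists(_.equalsIgnoreCase(entity)) => mode
    }.getOrElse(defaultMode)
}
```

For example, with `Map("obfuscate" -> Array("PHONE"), "skip" -> Array("DATE"))` and a global mode of "mask", a NAME entity would be masked while PHONE would be obfuscated and DATE left intact.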
val selectiveObfuscationModesPath: Param[String]
Path to the JSON dictionary that contains the selective obfuscation modes.
def set[T](feature: StructFeature[T], value: T): DeIdentification.this.type

Attributes
protected
Definition Classes
HasFeatures
def set[K, V](feature: MapFeature[K, V], value: Map[K, V]): DeIdentification.this.type

Attributes
protected
Definition Classes
HasFeatures
def set[T](feature: SetFeature[T], value: Set[T]): DeIdentification.this.type

Attributes
protected
Definition Classes
HasFeatures
def set[T](feature: ArrayFeature[T], value: Array[T]): DeIdentification.this.type

Attributes
protected
Definition Classes
HasFeatures
final def set(paramPair: ParamPair[_]): DeIdentification.this.type

Attributes
protected
Definition Classes
Params
final def set(param: String, value: Any): DeIdentification.this.type

Attributes
protected
Definition Classes
Params
final def set[T](param: Param[T], value: T): DeIdentification.this.type

Definition Classes
Params
def setAdditionalDateFormats(formats: Array[String]): DeIdentification.this.type
Sets additionalDateFormats param

Definition Classes
BaseDeidParams
def setAgeGroups(value: Map[String, Array[Int]]): DeIdentification.this.type
Sets the age groups to obfuscate ages. For this parameter to be active, the obfuscateByAgeGroups parameter must be true. If the given ageGroups do not fully contain the ages, the ages continue to be obfuscated according to the ageRanges. The map should contain the age group name as the key and an array of two integers as the value. The first integer is the lower bound of the age group, and the second integer is the upper bound of the age group. Default age groups are as follows in the English language:
```
Map(
"baby" -> Array(0, 1),
"toddler" -> Array(1, 3),
"child" -> Array(3, 12),
"teenager" -> Array(12, 20),
"adult" -> Array(20, 65),
"senior" -> Array(65, 200)
)
```
Definition Classes
DeIdentificationParams
Exceptions thrown
IllegalArgumentException if the value is empty, contains negative values, or is not a pair of integers
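The validation rules and the group lookup described above can be sketched as follows. `AgeGroupsSketch` is a hypothetical helper, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption (the defaults share boundary values such as 1 and 3, and this page does not say which group owns them).

```scala
object AgeGroupsSketch {
  // Default English age groups as listed in the documentation above
  val defaultAgeGroups: Map[String, Array[Int]] = Map(
    "baby"     -> Array(0, 1),
    "toddler"  -> Array(1, 3),
    "child"    -> Array(3, 12),
    "teenager" -> Array(12, 20),
    "adult"    -> Array(20, 65),
    "senior"   -> Array(65, 200)
  )

  // Mirrors the documented failure cases; require throws IllegalArgumentException
  def validate(groups: Map[String, Array[Int]]): Unit = {
    require(groups.nonEmpty, "ageGroups must not be empty")
    groups.foreach { case (name, bounds) =>
      require(bounds.length == 2, s"'$name' must be a pair of integers")
      require(bounds.forall(_ >= 0), s"'$name' contains negative values")
    }
  }

  // Assumption: lower bound inclusive, upper bound exclusive.
  // None means the age is not covered and ageRanges would apply instead.
  def groupFor(age: Int, groups: Map[String, Array[Int]]): Option[String] =
    groups.collectFirst {
      case (name, Array(low, high)) if age >= low && age < high => name
    }
}
```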
def setAgeGroups(value: HashMap[String, ArrayList[Int]]): DeIdentification.this.type

Definition Classes
DeIdentificationParams
def setAgeRanges(mode: Array[Int]): DeIdentification.this.type
List of integers specifying limits of the age groups to preserve during obfuscation

Definition Classes
BaseDeidParams
def setAgeRangesByHipaa(value: Boolean): DeIdentification.this.type
Sets whether to obfuscate ages based on the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.
The HIPAA Privacy Rule mandates that ages of patients older than 90 years must be obfuscated, while ages of patients 90 years or younger can remain unchanged.
value
If true, age entities larger than 90 will be obfuscated as per the HIPAA Privacy Rule and the others will remain unchanged. If false, the ageRanges parameter applies. Default: false.

Definition Classes
BaseDeidParams
def setBlackList(list: Array[String]): DeIdentification.this.type
List of entities that will be ignored in the regex file. The rest will be processed. The default values are "IBAN","ZIP","NPI","DLN","PASSPORT","C_CARD","DEA","SSN", "IP", "DEA".

Definition Classes
DeIdentificationParams
def setBlackListEntities(value: Array[String]): DeIdentification.this.type
Sets the list of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Defaults to an empty array.

Definition Classes
DeIdentificationParams
def setChunkMatching(categories: HashMap[String, Double]): DeIdentification.this.type

Definition Classes
DeIdentificationParams
def setChunkMatching(value: Map[String, Double]): DeIdentification.this.type
Performs entity chunk matching across rows or within groups in a DataFrame. Useful in de-identification pipelines where certain entity labels like "NAME" or "DATE" may be missing in some rows and need to be filled from other rows in the same group.
Notes:
- When applying the method across multiple rows, the usage of groupByCol parameter is required.
Definition Classes
DeIdentificationParams
def setCombineRegexPatterns(s: Boolean): DeIdentification.this.type
If the value is true, both the loaded regex file and the default regex file are used together; if the value is false, either the loaded regex file or the default regex file is used. The default value is false. If the value is true, the default regex file is used regardless of the value of regexOverride.
def setConsistentAcrossNameParts(value: Boolean): DeIdentification.this.type
Sets the value of consistentAcrossNameParts.
value
Boolean flag to enforce consistency across name parts
returns
this instance

Definition Classes
BaseDeidParams
def setConsistentObfuscation(s: Boolean): DeIdentification.this.type
Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.

Definition Classes
DeIdentificationParams
def setCountryObfuscation(value: Boolean): DeIdentification.this.type
Sets whether to obfuscate country entities or not. If true, country entities will be obfuscated using the Faker module. If false, country entities will be skipped during obfuscation. Default: false

Definition Classes
BaseDeidParams
def setDateEntities(value: Array[String]): DeIdentification.this.type
Sets the value of dateEntities. Default: Array("DATE", "DOB", "DOD", "EFFDATE", "FISCAL_YEAR")

Definition Classes
BaseDeidParams
def setDateFormats(s: Array[String]): DeIdentification.this.type
Format of dates to displace

Definition Classes
BaseDeidParams
def setDateTag(s: String): DeIdentification.this.type
Tag representing the NER entity for dates (default: DATE)

Definition Classes
DeIdentificationParams
def setDateToYear(s: Boolean): DeIdentification.this.type
true if dates must be converted to years, false otherwise

Definition Classes
DeIdentificationParams
def setDays(k: Int): DeIdentification.this.type
Number of days to obfuscate the dates by displacement. If not provided, a random integer between 1 and 60 will be used.

Definition Classes
BaseDeidParams
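Date displacement as described above can be sketched with `java.time`. `DateDisplacementSketch` and its helpers are hypothetical; in particular, drawing the random offset from `java.util.Random` with a seed is an assumption made to keep the example reproducible, not a statement about the library's generator.

```scala
import java.time.LocalDate
import java.util.Random

object DateDisplacementSketch {
  // Uses the configured number of days when present,
  // otherwise a seeded random integer between 1 and 60 (inclusive).
  def displacementDays(days: Option[Int], seed: Long): Int =
    days.getOrElse(new Random(seed).nextInt(60) + 1)

  // Shifts the date forward by the given number of days
  def displace(date: LocalDate, days: Int): LocalDate =
    date.plusDays(days.toLong)
}
```

Applying the same offset to every date in a document preserves the intervals between events, which is usually the point of displacing rather than randomizing each date independently.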
def setDefault[T](feature: StructFeature[T], value: () ⇒ T): DeIdentification.this.type

Attributes
protected
Definition Classes
HasFeatures
def setDefault[K, V](feature: MapFeature[K, V], value: () ⇒ Map[K, V]): DeIdentification.this.type

Attributes
protected
Definition Classes
HasFeatures
def setDefault[T](feature: SetFeature[T], value: () ⇒ Set[T]): DeIdentification.this.type

Attributes
protected
Definition Classes
HasFeatures
def setDefault[T](feature: ArrayFeature[T], value: () ⇒ Array[T]): DeIdentification.this.type

Attributes
protected
Definition Classes
HasFeatures
final def setDefault(paramPairs: ParamPair[_]*): DeIdentification.this.type

Attributes
protected
Definition Classes
Params
final def setDefault[T](param: Param[T], value: T): DeIdentification.this.type

Attributes
protected[org.apache.spark.ml]
Definition Classes
Params
def setDoExceptionHandling(value: Boolean): DeIdentification.this.type
If true, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next one. This comes with a performance penalty.

Definition Classes
HandleExceptionParams
def setEnableDefaultObfuscationEquivalents(value: Boolean): DeIdentification.this.type
Sets whether to enable default obfuscation equivalents for common entities. This parameter allows the system to automatically include a set of predefined common English name equivalents. Default: false

Definition Classes
BaseDeidParams
def setEntityCasingModesPath(path: String): DeIdentification.this.type
Path to the JSON dictionary that contains the entity casing modes.
- 'lowercase': Converts all characters to lower case using the rules of the default locale.
- 'uppercase': Converts all characters to upper case using the rules of the default locale.
- 'capitalize': Converts the first character to upper case and converts others to lower case.
- 'titlecase': Converts the first character in every token to upper case and converts others to lower case.
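The four casing modes can be illustrated in plain Scala. `EntityCasingSketch` is a hypothetical helper, and splitting tokens on single spaces is an assumption of this sketch; the structure of the JSON file itself is not described in this page and is not shown here.

```scala
object EntityCasingSketch {
  // Applies one of the documented casing modes to a replacement string
  def applyCasing(text: String, mode: String): String = mode match {
    case "lowercase"  => text.toLowerCase
    case "uppercase"  => text.toUpperCase
    case "capitalize" => text.toLowerCase.capitalize
    case "titlecase" =>
      // Tokens are assumed to be separated by single spaces
      text.split(" ").map(_.toLowerCase.capitalize).mkString(" ")
    case other =>
      throw new IllegalArgumentException(s"Unknown casing mode: $other")
  }
}
```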
def setFakerLengthOffset(value: Int): DeIdentification.this.type
Sets fakerLengthOffset param

Definition Classes
BaseDeidParams
def setFixedMaskLength(value: Int): DeIdentification.this.type
Sets the value of fixedMaskLength. This is the length of the masking sequence that will be used when the 'fixed_length_chars' masking policy is selected.

Definition Classes
MaskingParams
def setGenderAwareness(value: Boolean): DeIdentification.this.type
Whether to use gender-aware names during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: false

Definition Classes
BaseDeidParams
def setGeoConsistency(value: Boolean): DeIdentification.this.type
Sets the value of geoConsistency. When set to true, it enables consistent obfuscation across geographical entities such as state, city, street, zip, and phone.

Definition Classes
BaseDeidParams
def setGroupByCol(value: String): DeIdentification.this.type
Sets groupByCol param to group the dataset. This parameter is used in conjunction with consistentObfuscation to ensure consistent obfuscation within each group.

Definition Classes
DeIdentificationParams
Note
This functionality can change order of the dataset, so it is recommended to use it with caution.
This functionality cannot be supported by LightPipeline.
def setIgnoreRegex(s: Boolean): DeIdentification.this.type
Select if you want to use the regex file loaded in the model. If true, the default regex file will not be used. The default value is false.

Definition Classes
DeIdentificationParams
final def setInputCols(value: String*): DeIdentification.this.type

Definition Classes
HasInputAnnotationCols
def setInputCols(value: Array[String]): DeIdentification.this.type

Definition Classes
HasInputAnnotationCols
def setIsRandomDateDisplacement(s: Boolean): DeIdentification.this.type
Whether to use a random number of displacement days for date entities. The random number is based on DeIdentificationParams.seed. If true, random displacement days are used for date entities; if false, DeIdentificationParams.days is used. The default value is false.

Definition Classes
DeIdentificationParams
def setKeepMonth(value: Boolean): DeIdentification.this.type
Sets whether to keep the month intact when obfuscating date entities. If true, the month will remain unchanged during the obfuscation process. If false, the month will be modified along with the year and day. Default: false.

Definition Classes
BaseDeidParams
def setKeepTextSizeForObfuscation(value: Boolean): DeIdentification.this.type
Sets keepTextSizeForObfuscation param

Definition Classes
BaseDeidParams
def setKeepYear(value: Boolean): DeIdentification.this.type
Sets whether to keep the year intact when obfuscating date entities. If true, the year will remain unchanged during the obfuscation process. If false, the year will be modified along with the month and day. Default: false.

Definition Classes
BaseDeidParams
def setLanguage(s: String): DeIdentification.this.type
The language used to select the regex file and some faker entities: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'

Definition Classes
BaseDeidParams
def setLazyAnnotator(value: Boolean): DeIdentification.this.type

Definition Classes
CanBeLazy
def setMappingsColumn(s: String): DeIdentification.this.type
The mapping column that returns the annotation chunks with the fake entities.

Definition Classes
DeIdentificationParams
def setMaskingPolicy(value: String): DeIdentification.this.type
Select the masking policy:
- 'entity_labels': Replace the values with their entity labels.
- 'same_length_chars': Replace the name with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 chars (like Jo, or 5), asterisks are used without brackets.
- 'fixed_length_chars': Replace the entity with a masking sequence composed of a fixed number of asterisks.
- 'entity_labels_without_brackets': Replace the values with their entity labels without brackets.
- 'same_length_chars_without_brackets': Replace the name with asterisks of the same length, without brackets.
- Default: 'entity_labels'
Definition Classes
MaskingParams
def setMetadataMaskingPolicy(value: String): DeIdentification.this.type
If specified, the metadata includes the masked form of the document. Select one of the following masking policies if you want to return the masked form in the metadata:
- 'entity_labels': Replace the values with their entity labels.
- 'same_length_chars': Replace the name with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 chars (like Jo, or 5), asterisks are used without brackets.
- 'fixed_length_chars': Replace the entity with a masking sequence composed of a fixed number of asterisks.
- 'entity_labels_without_brackets': Replace the values with their entity labels without brackets.
- 'same_length_chars_without_brackets': Replace the name with asterisks of the same length, without brackets.
- Default: ""
Definition Classes
DeIdentificationParams
def setMinYear(s: Int): DeIdentification.this.type
Minimum year to use when converting date to year

Definition Classes
DeIdentificationParams
def setMode(m: String): DeIdentification.this.type
Mode for Anonymizer ['mask'|'obfuscate']. Default: 'mask'
- Mask mode: The entities will be replaced by their entity types.
- Obfuscate mode: The entity is replaced by an obfuscator's term.
Definition Classes
BaseDeidParams
Example:
1. Given the following text: "David Hale visited EEUU a couple of years ago"
  Mask mode: "<PERSON> visited <COUNTRY> a couple of years ago"
  Obfuscate mode: "Bryan Johnson visited Japan a couple of years ago"
def setObfuscateByAgeGroups(value: Boolean): DeIdentification.this.type
Sets whether to obfuscate ages based on age groups.
When true, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When false, the age ranges specified in the ageRanges parameter will be used to obfuscate ages. Default: false.

Definition Classes
DeIdentificationParams
def setObfuscateDate(s: Boolean): DeIdentification.this.type
When mode == "obfuscate", whether to obfuscate dates or not. This param helps with consistency by making dateFormats more visible. When set to true, make sure the dateFormats param fits your needs. If the value is true and obfuscation fails, unnormalizedDateMode is activated. When set to false, the date is masked to <DATE>. Default: false

Definition Classes
BaseDeidParams
def setObfuscateRefFile(f: String): DeIdentification.this.type
File with the terms to be used for Obfuscation

Definition Classes
DeidApproachParams
def setObfuscateRefSource(s: String): DeIdentification.this.type
The source used to obfuscate the entities. The values are the following:
- 'file': Takes the fakes from the obfuscateRefFile.
- 'faker': Takes the fakes from the Faker module.
- 'both': Takes the fakes from the obfuscateRefFile and the Faker module randomly.

Definition Classes
BaseDeidParams
def setObfuscationEquivalents(equivalents: ArrayList[ArrayList[String]]): DeIdentification.this.type

Definition Classes
BaseDeidParams
def setObfuscationEquivalents(equivalents: Array[Array[String]]): DeIdentification.this.type
Sets variant-to-canonical entity mappings to ensure consistent obfuscation.
This method allows you to define equivalence rules for entity variants that should be obfuscated the same way. For example, the names "Alex" and "Alexander" will always be mapped to the same obfuscated value if they are linked to the same canonical form.
It accepts an array of string triplets, where each triplet defines:
- variant: A non-standard, short, or alternative form of a value (e.g., "Alex")
- entityType: The type of the entity (e.g., "NAME", "STATE", "COUNTRY")
- canonical: The standardized form all variants map to (e.g., "Alexander")
variant and entityType comparisons are case-insensitive during processing.
This is especially useful in de-identification tasks to ensure consistent replacement of semantically identical values. It also allows cross-variant normalization across different occurrences of sensitive data.
Example
```
val equivalents = Array(
  Array("Alex", "NAME", "Alexander"),
  Array("Rob", "NAME", "Robert"),
  Array("CA", "STATE", "California"),
  Array("Calif.", "STATE", "California")
)

myDeidTransformer.setObfuscationEquivalents(equivalents)
```
equivalents
Array of [variant, entityType, canonical] entries.

Definition Classes
BaseDeidParams
Exceptions thrown
IllegalArgumentException if any entry does not have exactly 3 elements.
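The triplet format and the case-insensitive matching described above can be sketched in plain Scala. This is only an illustration under assumptions: `EquivalentsSketch`, `buildLookup`, and `canonicalOf` are hypothetical names that are not part of the library, and the library's internal lookup may work differently.

```scala
object EquivalentsSketch {
  // Hypothetical helper: validates the triplets and builds a lookup keyed by
  // (lower-cased variant, lower-cased entityType) that points at the canonical form.
  def buildLookup(equivalents: Array[Array[String]]): Map[(String, String), String] = {
    equivalents.foreach { entry =>
      // require throws IllegalArgumentException when the condition is false
      require(entry.length == 3,
        s"Each entry must have exactly 3 elements, got ${entry.length}")
    }
    equivalents.map { case Array(variant, entityType, canonical) =>
      (variant.toLowerCase, entityType.toLowerCase) -> canonical
    }.toMap
  }

  // Returns the canonical form for a value, or the value itself when no rule matches.
  def canonicalOf(lookup: Map[(String, String), String],
                  value: String,
                  entityType: String): String =
    lookup.getOrElse((value.toLowerCase, entityType.toLowerCase), value)
}
```

With such a lookup, "Alex" and "alex" both resolve to "Alexander" before a fake value is chosen, which is what keeps the replacement consistent across variants.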
def setObfuscationEquivalents(equivalents: Array[StaticObfuscationEntity]): DeIdentification.this.type
Sets obfuscationEquivalents param.

Definition Classes
BaseDeidParams
def setObfuscationEquivalentsResource(path: String, readAs: Format, options: Map[String, String] = ...): DeIdentification.this.type
Sets the obfuscation equivalents resource using a path and readAs format. The resource should contain three columns: variant, entityType, and canonical. The delimiter for the columns can be specified in the options.
path
The path to the resource.
readAs
The format to read the resource (e.g., TEXT, SPARK).
options
Additional options for reading the resource. The "format" and "delimiter" options are required.
def setObfuscationEquivalentsResource(value: ExternalResource): DeIdentification.this.type
Sets the obfuscation equivalents resource. The resource should contain three columns: variant, entityType, and canonical. The delimiter for the columns can be specified in the options.
def setObfuscationStrategyOnException(value: String): DeIdentification.this.type
Sets the obfuscation strategy to be applied when an exception occurs.
The obfuscation strategy determines how obfuscation is handled in case of an exception. Four possible values are supported:
- "mask": The original chunk is replaced with a masking pattern.
- "default": The original chunk is replaced with a default faker.
- "skip": The original chunk is not replaced with any faker.
- "exception": Throws the exception.
The default obfuscation strategy is "default".
Definition Classes
DeIdentificationParams
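The four strategies can be pictured as a fallback around the step that generates the fake value. The sketch below is a plain-Scala illustration, not the library's implementation: `ExceptionStrategySketch`, the `faker` function, the `<ENTITY>` masking pattern, and the `FAKE_` placeholder are all assumptions made for the example.

```scala
import scala.util.{Failure, Success, Try}

object ExceptionStrategySketch {
  // `faker` stands in for whatever generates the fake value; it may throw.
  def obfuscate(chunk: String, entity: String, strategy: String)
               (faker: String => String): String =
    Try(faker(chunk)) match {
      case Success(fake) => fake
      case Failure(error) =>
        strategy match {
          case "mask"      => s"<$entity>"     // assumed masking pattern
          case "default"   => s"FAKE_$entity"  // placeholder for a default faker
          case "skip"      => chunk            // leave the original chunk untouched
          case "exception" => throw error      // propagate the original failure
          case other =>
            throw new IllegalArgumentException(s"Unknown strategy: $other")
        }
    }
}
```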
def setOutputAsDocument(mode: Boolean): DeIdentification.this.type
Whether to return all sentences joined into a single document

Definition Classes
DeIdentificationParams
final def setOutputCol(value: String): DeIdentification.this.type

Definition Classes
HasOutputAnnotationCol
def setRefFileFormat(f: String): DeIdentification.this.type
Format of the reference file used for obfuscation.

Definition Classes
DeidApproachParams
def setRefSep(f: String): DeIdentification.this.type
Separator character for the csv reference file used for obfuscation. The default value is "#".

Definition Classes
DeidApproachParams
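To make the reference-file layout concrete, here is a minimal plain-Scala sketch that reads lines such as `Dr. Gregory House#DOCTOR` and groups the terms by entity. `RefFileSketch` and `parse` are hypothetical names and this is not how the library itself loads the file; it only illustrates the term, separator, entity layout.

```scala
import java.util.regex.Pattern

object RefFileSketch {
  // Parses "<term><sep><ENTITY>" lines into entity -> terms.
  // String.split takes a regex, so the separator is quoted to be matched literally.
  def parse(lines: Seq[String], sep: String = "#"): Map[String, Seq[String]] =
    lines
      .map(_.trim)
      .filter(_.nonEmpty)
      .map { line =>
        val parts = line.split(Pattern.quote(sep), 2)
        require(parts.length == 2, s"Expected '<term>${sep}<ENTITY>' but got: $line")
        (parts(1).trim, parts(0).trim)
      }
      .groupBy(_._1)
      .map { case (entity, pairs) => entity -> pairs.map(_._2) }
}
```

Quoting the separator matters when it is a regex metacharacter such as `|` or `.`.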
def setRegexOverride(s: Boolean): DeIdentification.this.type
If the value is true, regex entities take priority; if the value is false, NER entities take priority. The default value is false. If DeIdentification.combineRegexPatterns is true, this value has no effect.

Definition Classes
DeIdentificationParams
def setRegexPatternsDictionary(path: String, readAs: Format = ReadAs.TEXT, options: Map[String, String] = Map()): DeIdentification.this.type
Dictionary with regular expression patterns that match some protected entity. When the field is not set, a default regex file is used.
path
the string path where the file is located.
readAs
Format of the reader
options
options to apply to the reader.
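The dictionary lines map an entity to a regex, as in `DATE \d{4}`. The sketch below shows one way to parse such lines and group the patterns per entity in plain Scala. It assumes a single space separates the entity from the pattern; `RegexDictionarySketch`, `parseLine`, and `groupByEntity` are hypothetical names. The result has the same shape as the return type of transformRegexPatternsDictionary, but this is not that method's implementation.

```scala
object RegexDictionarySketch {
  // Splits a line at the first space: entity label first, then the pattern.
  // A line without a space fails with a MatchError.
  def parseLine(line: String): (String, String) = {
    val Array(entity, pattern) = line.trim.split(" ", 2)
    (entity, pattern)
  }

  // Collects every pattern that belongs to the same entity, keeping input order.
  def groupByEntity(pairs: Array[(String, String)]): Map[String, Array[String]] =
    pairs.groupBy(_._1).map { case (entity, entries) => entity -> entries.map(_._2) }
}
```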
def setRegexPatternsDictionary(path: ExternalResource): DeIdentification.this.type
Dictionary with regular expression patterns that match some protected entity. When the field is not set, a default regex file is used.
path
the external resource where the file is located.

See also
ExternalResource
def setRegexPatternsDictionaryAsJsonString(json: String): DeIdentification.this.type
Dictionary with regular expression patterns, given as JSON, that match some protected entity. When the field is not set, a default regex file is used.
json
the regex patterns in JSON format
def setRegion(s: String): DeIdentification.this.type
Selects the dateFormats to use. This property is mainly used when obfuscating dates: it decides whether the first or the second part of a date such as 11/11/2023 is the day. The values are the following:
- 'eu' for European Union
- 'us' for USA
Definition Classes
BaseDeidParams
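The effect of the region on an ambiguous date can be illustrated with `java.time`. This is a sketch under assumptions, not the library's date handling: `RegionSketch` is a hypothetical name and the two patterns are chosen only to show the day-first versus month-first reading.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object RegionSketch {
  // Reads a slash-separated date day-first for 'eu' and month-first for 'us'.
  def parse(date: String, region: String): LocalDate = {
    val pattern = region match {
      case "eu"  => "dd/MM/yyyy"
      case "us"  => "MM/dd/yyyy"
      case other => throw new IllegalArgumentException(s"Unsupported region: $other")
    }
    LocalDate.parse(date, DateTimeFormatter.ofPattern(pattern))
  }
}
```

The string 03/04/2023 is read as 3 April under 'eu' and as 4 March under 'us', which is why the region matters before any date is shifted or replaced.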
def setReturnEntityMappings(s: Boolean): DeIdentification.this.type
Whether to return the mapping column.

Definition Classes
DeIdentificationParams
def setSameEntityThreshold(s: Double): DeIdentification.this.type
Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9). This threshold does not apply to date entities.

Definition Classes
DeIdentificationParams
def setSameLengthFormattedEntities(entities: Array[String]): DeIdentification.this.type
List of formatted entities for which obfuscation generates outputs of the same length as the originals. The supported and default formatted entities are: PHONE, FAX, CONTACT, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE, IRS, CFN, ACCOUNT.

Definition Classes
BaseDeidParams
def setSeed(s: Int): DeIdentification.this.type
The seed used to select the entities in obfuscate mode. With the same seed, you can repeat an execution several times and get the same output.

Definition Classes
BaseDeidParams
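The reproducibility that a seed gives can be shown with a small plain-Scala sketch. It uses `scala.util.Random` as a stand-in generator and a made-up list of fake names; `SeedSketch` and `pick` are hypothetical and do not reflect how the annotator selects its fakes.

```scala
import scala.util.Random

object SeedSketch {
  // Illustrative pool of fake values (not taken from the library).
  private val fakeNames = Vector("Alice Moore", "Brian Cole", "Carla Diaz", "David Kim")

  // A generator created from the same seed yields the same sequence of picks.
  def pick(seed: Int, count: Int): Seq[String] = {
    val rng = new Random(seed.toLong)
    Seq.fill(count)(fakeNames(rng.nextInt(fakeNames.length)))
  }
}
```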
def setSelectiveObfuscateRefSource(value: HashMap[String, String]): DeIdentification.this.type

Definition Classes
BaseDeidParams
def setSelectiveObfuscateRefSource(value: Map[String, String]): DeIdentification.this.type
Sets the value of selectiveObfuscateRefSource. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation method. The values can be:
- 'file': Takes the fakes from the file.
- 'faker': Takes the fakes from the embedded faker module.
- 'both': Takes the fakes from the file and the faker module.
Definition Classes
BaseDeidParams
Example:
val modes = Map( "PHONE" -> "file", "EMAIL" -> "faker", "NAME" -> "faker", "ADDRESS" -> "both" )
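The fallback rule for entities that are missing from the map can be sketched as a simple lookup with a default. `RefSourceSketch` and `sourceFor` are hypothetical names used only for illustration.

```scala
object RefSourceSketch {
  // Picks the per-entity source when one is configured,
  // otherwise falls back to the global obfuscateRefSource value.
  def sourceFor(entity: String,
                selective: Map[String, String],
                obfuscateRefSource: String): String =
    selective.getOrElse(entity, obfuscateRefSource)
}
```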
def setSelectiveObfuscationModes(value: HashMap[String, List[String]]): DeIdentification.this.type

Definition Classes
BaseDeidParams
def setSelectiveObfuscationModes(value: Map[String, Array[String]]): DeIdentification.this.type
Sets the value of selectiveObfuscationModes, the dictionary of modes that enables multi-mode deidentification. Each key is a mode and each value is the array of entities that mode applies to. The supported modes are:
- 'obfuscate': Replace the values with random values.
- 'mask_same_length_chars': Replace the value with asterisks of the same length minus two, plus brackets on both ends.
- 'mask_entity_labels': Replace the values with the entity label.
- 'mask_fixed_length_chars': Replace the value with asterisks of fixed length. You should also invoke setFixedMaskLength().
- 'mask_entity_labels_without_brackets': Replace the values with the entity label without brackets.
- 'mask_same_length_chars_without_brackets': Replace the value with asterisks of the same length, without brackets.
- 'skip': Skip the entities (leave them intact).
Entities not listed in the dictionary are deidentified according to setMode().
Example:
```
deidAnnotator
.setMode("mask")
.setSelectiveObfuscationModes(Map(
    "OBFUSCATE" -> Array("PHONE", "email"),
    "mask_entity_labels" -> Array("NAME", "CITY"),
    "skip" -> Array("id", "idnum"),
    "mask_same_length_chars" -> Array("fax"),
    "mask_fixed_length_chars" -> Array("zip")
))
.setFixedMaskLength(4)
```
Definition Classes
BaseDeidParams
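The masking modes listed above can be pictured with a short plain-Scala sketch. It is an illustration under assumptions, not the library's code: `MaskSketch` is a hypothetical name, the square brackets and angle brackets are assumed delimiters, and the 'obfuscate' mode is left out because it needs a fake-value generator.

```scala
object MaskSketch {
  // Applies one masking mode to a chunk; fixedLength plays the role of setFixedMaskLength.
  def mask(chunk: String, entity: String, mode: String, fixedLength: Int = 4): String =
    mode match {
      case "mask_same_length_chars" =>
        // two brackets plus (length - 2) asterisks keeps the original length
        "[" + "*" * math.max(chunk.length - 2, 0) + "]"
      case "mask_same_length_chars_without_brackets" => "*" * chunk.length
      case "mask_fixed_length_chars"                 => "*" * fixedLength
      case "mask_entity_labels"                      => s"<$entity>"
      case "mask_entity_labels_without_brackets"     => entity
      case "skip"                                    => chunk
      case other =>
        throw new IllegalArgumentException(s"Unsupported mode in this sketch: $other")
    }
}
```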
def setSelectiveObfuscationModesPath(path: String): DeIdentification.this.type
Path to the JSON file that contains the selective obfuscation modes.
def setStaticObfuscationPairs(pairs: ArrayList[ArrayList[String]]): DeIdentification.this.type

Definition Classes
BaseDeidParams
def setStaticObfuscationPairs(pairs: Array[StaticObfuscationEntity]): DeIdentification.this.type

Definition Classes
BaseDeidParams
def setStaticObfuscationPairs(pairs: Array[Array[String]]): DeIdentification.this.type
Sets the static obfuscation pairs. Each pair must contain exactly three elements: [original, entityType, fake].
pairs
An array of arrays containing the static obfuscation pairs.

Definition Classes
BaseDeidParams
def setStaticObfuscationPairsResource(path: String, readAs: Format, options: Map[String, String] = ...): DeIdentification.this.type
Sets the static obfuscation pairs resource using a path and readAs format. The resource should contain three columns: original, entity, and fake. The delimiter for the columns can be specified in the options.
path
The path to the resource.
readAs
The format to read the resource (e.g., TEXT, SPARK).
options
Additional options for reading the resource. The "format" and "delimiter" options are required.

Definition Classes
DeidApproachParams
def setStaticObfuscationPairsResource(value: ExternalResource): DeIdentification.this.type
Sets the static obfuscation pairs resource. The resource should contain three columns: original, entity, and fake. The delimiter for the columns can be specified in the options. The options must include a 'delimiter' key.

Definition Classes
DeidApproachParams
def setUnnormalizedDateMode(mode: String): DeIdentification.this.type
The mode to use if the date is not formatted. Options: [mask, obfuscate, skip] Default: obfuscate

Definition Classes
BaseDeidParams
def setUseShiftDays(s: Boolean): DeIdentification.this.type
Sets the value of useShiftDays. Whether to use the random shift day when the document has this in its metadata. DocumentHashCoder can create 'dateshift' based on the document. Default: false

Definition Classes
DeIdentificationParams → BaseDeidParams
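The idea of shifting dates by a per-document value can be sketched with `java.time`. The sketch assumes the shift is stored as an integer number of days under the 'dateshift' metadata key mentioned above; `ShiftDaysSketch` and `shift` are hypothetical names, and the real annotator may read and apply the value differently.

```scala
import java.time.LocalDate
import scala.util.Try

object ShiftDaysSketch {
  // Shifts the date by the 'dateshift' value when the metadata carries a valid
  // integer; otherwise returns the date unchanged.
  def shift(date: LocalDate, metadata: Map[String, String]): LocalDate =
    metadata.get("dateshift").flatMap(v => Try(v.trim.toInt).toOption) match {
      case Some(days) => date.plusDays(days.toLong)
      case None       => date
    }
}
```

Because every date in a document moves by the same number of days, the intervals between events are preserved.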
def setZipCodeTag(s: String): DeIdentification.this.type

Definition Classes
DeIdentificationParams
val staticObfuscationPairs: StructFeature[Array[StaticObfuscationEntity]]
A resource containing static obfuscation pairs. Each pair should contain three elements: original, entity type, and fake.

Definition Classes
BaseDeidParams
val staticObfuscationPairsResource: ExternalResourceParam
Resource containing static obfuscation pairs. The resource should contain three columns: original, entity, and fake. The delimiter for the columns can be specified in the options.

Definition Classes
DeidApproachParams
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
Identifiable → AnyRef → Any
def train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): DeIdentificationModel
Returns the DeIdentificationModel transformer that can be used to transform input datasets.
The dataset provided to the fit method should have one chunk per row and contain the following columns: Document, Tokens, Chunks
This method is called inside the AnnotatorApproach's fit method
dataset
a Dataset containing ChunkTokens, ChunkEmbeddings, ClassifierLabel, ResolverLabel, [ResolverNormalized]
recursivePipeline
an instance of PipelineModel
returns
a trained DeIdentificationModel

Definition Classes
DeIdentification → AnnotatorApproach
def transformRegexPatternsDictionary(regexPatternsDictionary: Array[(String, String)]): Map[String, Array[String]]
final def transformSchema(schema: StructType): StructType

Definition Classes
AnnotatorApproach → PipelineStage
def transformSchema(schema: StructType, logging: Boolean): StructType

Attributes
protected
Definition Classes
PipelineStage
Annotations
@DeveloperApi()
val uid: String

Definition Classes
DeIdentification → Identifiable
val unnormalizedDateMode: Param[String]
The mode to use if the date is not formatted. Options: [mask, obfuscate, skip] Default: obfuscate

Definition Classes
BaseDeidParams
val useShifDays: BooleanParam
Use shift days: whether to use the random shift day when the document has this in its metadata. Default: false

Definition Classes
DeIdentificationParams
val useShiftDays: BooleanParam
Whether to use the random shift day when the document has this in its metadata. DocumentHashCoder can create 'dateshift' based on the document. Default: false

Definition Classes
BaseDeidParams
def validate(schema: StructType): Boolean

Attributes
protected
Definition Classes
AnnotatorApproach
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... ) @native()
def write: MLWriter

Definition Classes
DefaultParamsWritable → MLWritable
val zipCodeTag: Param[String]

Definition Classes
DeIdentificationParams

Deprecated Value Members

def setUseShiftDayse(s: Boolean): DeIdentification.this.type

Definition Classes
DeIdentificationParams
Annotations
@deprecated
Deprecated
deprecated because of typo
