Spark NLP 6.0.4 ScalaDoc - com.johnsnowlabs.nlp.annotators.deid.DeIdentificationModel

final def !=(arg0: Any): Boolean

Definition Classes: AnyRef → Any

final def ##(): Int

Definition Classes: AnyRef → Any

final def $[T](param: Param[T]): T

Attributes: protected
Definition Classes: Params

def $$[T](feature: StructFeature[T]): T

Attributes: protected
Definition Classes: HasFeatures

def $$[K, V](feature: MapFeature[K, V]): Map[K, V]

Attributes: protected
Definition Classes: HasFeatures

def $$[T](feature: SetFeature[T]): Set[T]

Attributes: protected
Definition Classes: HasFeatures

def $$[T](feature: ArrayFeature[T]): Array[T]

Attributes: protected
Definition Classes: HasFeatures

final def ==(arg0: Any): Boolean

Definition Classes: AnyRef → Any

val GEOGRAPHIC_ENTITIES_PRIORITY: Map[String, Int]

Attributes: protected
Definition Classes: DeidModelParams

val GEO_METADATA_KEY: String

Attributes: protected
Definition Classes: DeidModelParams

def _transform(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): DataFrame

Definition Classes: DeIdentificationModel → AnnotatorModel

val additionalDateFormats: StringArrayParam

Additional date formats to be considered during date obfuscation. This allows users to specify custom date formats in addition to the default dateFormats.

Definition Classes: BaseDeidParams

def afterAnnotate(dataset: DataFrame): DataFrame

Definition Classes: DeIdentificationModel → AnnotatorModel

val ageGroups: StructFeature[Map[String, Array[Int]]]

A map of age groups to obfuscate ages. For this parameter to be active, the obfuscateByAgeGroups parameter must be true. If the given ageGroups do not fully contain the ages, the ages continue to be obfuscated according to the ageRanges. The map should contain the age group name as the key and an array of two integers as the value. The first integer is the lower bound of the age group, and the second integer is the upper bound of the age group. Default age groups are as follows in the English language:

Map(
"baby" -> Array(0, 1),
"toddler" -> Array(1, 4),
"child" -> Array(4, 13),
"teenager" -> Array(13, 20),
"adult" -> Array(20, 65),
"senior" -> Array(65, 200)
)

Definition Classes: DeIdentificationParams
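As a rough illustration of how an age could be resolved against the ageGroups map above, the following sketch looks up the group whose bounds contain a given age. The object and method names are hypothetical, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption made so that groups sharing a boundary (such as 1) do not overlap; the library may resolve boundaries differently.

```scala
object AgeGroupSketch {
  // Default age groups as listed in the documentation above.
  val defaultAgeGroups: Map[String, Array[Int]] = Map(
    "baby" -> Array(0, 1),
    "toddler" -> Array(1, 4),
    "child" -> Array(4, 13),
    "teenager" -> Array(13, 20),
    "adult" -> Array(20, 65),
    "senior" -> Array(65, 200)
  )

  // Assumption: lower bound inclusive, upper bound exclusive.
  // Returns None when no group contains the age; per the documentation,
  // such ages would then fall back to ageRanges.
  def groupFor(
      age: Int,
      groups: Map[String, Array[Int]] = defaultAgeGroups): Option[String] =
    groups.collectFirst {
      case (name, Array(lower, upper)) if age >= lower && age < upper => name
    }
}
```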

val ageRanges: IntArrayParam

List of integers specifying limits of the age groups to preserve during obfuscation

Definition Classes: BaseDeidParams

val ageRangesByHipaa: BooleanParam

A Boolean variable indicating whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.

The HIPAA Privacy Rule mandates that ages of patients older than 90 years must be obfuscated, while ages of patients 90 years or younger can remain unchanged.

When true, age entities greater than 90 are obfuscated per the HIPAA Privacy Rule and the others remain unchanged. When false, the ageRanges parameter applies.

Definition Classes: BaseDeidParams

val allTerms: MapFeature[String, List[String]]

Dictionary containing all terms used later by the anonymization function.

def annotate(annotations: Seq[Annotation]): Seq[Annotation]

annotations: The annotations per row needed to obfuscate the document. Annotations should be of type DOCUMENT, TOKEN, and CHUNK. Each TOKEN or CHUNK annotation carries a sentence number in its metadata that must correspond to one of the DOCUMENT annotations. If a TOKEN or CHUNK has a sentence number greater than the sentence numbers of the DOCUMENT annotations, the annotator should throw an exception.
returns: The DOCUMENT annotations, masked or obfuscated.

Definition Classes: DeIdentificationModel → HasSimpleAnnotate

final def asInstanceOf[T0]: T0

Definition Classes: Any

def beforeAnnotate(dataset: Dataset[_]): Dataset[_]

Definition Classes: DeIdentificationModel → AnnotatorModel

val blackList: StringArrayParam

List of entities that will be ignored in the regex file. The rest will be processed. The default values are "IBAN", "ZIP", "NPI", "DLN", "PASSPORT", "C_CARD", "DEA", "SSN", "IP".

Definition Classes: DeIdentificationParams

val blackListEntities: StringArrayParam

List of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Defaults to an empty array.

Definition Classes: DeIdentificationParams

final def checkSchema(schema: StructType, inputAnnotatorType: String): Boolean

Attributes: protected
Definition Classes: HasInputAnnotationCols

def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit

Definition Classes: CheckLicense

def checkValidScope(scope: String): Unit

Definition Classes: CheckLicense

def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit

Definition Classes: CheckLicense

def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit

Definition Classes: CheckLicense

val chunkMatching: MapFeature[String, Double]

Performs entity chunk matching across rows or within groups in a DataFrame. Useful in de-identification pipelines where certain entity labels like "NAME" or "DATE" may be missing in some rows and need to be filled from other rows in the same group.

Definition Classes: DeIdentificationParams
Note: When applying the method across multiple rows, the usage of groupByCol parameter is required.

final def clear(param: Param[_]): DeIdentificationModel.this.type

Definition Classes: Params

def clone(): AnyRef

Attributes: protected[lang]
Definition Classes: AnyRef
Annotations: @throws( ... ) @native()

lazy val combinedDateFormats: Array[String]

Attributes: protected
Definition Classes: BaseDeidParams

val consistentAcrossNameParts: BooleanParam

Param that indicates whether consistency should be enforced across different parts of a name (e.g., first name, middle name, last name). When set to true, the same transformation or obfuscation will be applied consistently to all parts of the same name entity, even if those parts appear separately.

For example, if "John Smith" is obfuscated as "Liam Brown", then:

When the full name "John Smith" appears, it will be replaced with "Liam Brown"
When "John" or "Smith" appear individually, they will still be obfuscated as "Liam" and "Brown" respectively, ensuring consistency in name transformation.

Default: true

Definition Classes: BaseDeidParams
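The consistentAcrossNameParts behaviour described above can be pictured as a small lookup table built from a full-name replacement. This is only a sketch of the idea, not the library's implementation: the object and method names are hypothetical, and it assumes the original and fake names split into the same number of whitespace-separated parts.

```scala
object NamePartsSketch {
  // Derive per-part replacements from a full-name replacement.
  // Assumption: both names have the same number of parts; otherwise
  // no part-level mapping is derived.
  def partMemory(original: String, fake: String): Map[String, String] = {
    val origParts = original.trim.split("\\s+")
    val fakeParts = fake.trim.split("\\s+")
    if (origParts.length == fakeParts.length) origParts.zip(fakeParts).toMap
    else Map.empty[String, String]
  }

  // Replace every known part of a mention; unknown parts are left as-is.
  def replaceMention(mention: String, parts: Map[String, String]): String =
    mention.trim.split("\\s+").map(p => parts.getOrElse(p, p)).mkString(" ")
}
```

With the mapping built from "John Smith" and "Liam Brown", the full name and each individual part resolve to matching replacements.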

val consistentObfuscation: BooleanParam

Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.

Definition Classes: DeIdentificationParams
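For readers unfamiliar with the Levenshtein distance mentioned for consistentObfuscation, here is a minimal sketch of the classic dynamic-programming computation. The normalized similarity helper is an assumption for illustration only; the documentation does not state how the library turns the distance into a similarity score or compares it with a threshold.

```scala
object LevenshteinSketch {
  // Minimum number of single-character insertions, deletions and
  // substitutions needed to turn a into b.
  def distance(a: String, b: String): Int = {
    // prev(j) is the distance between the first i-1 chars of a
    // and the first j chars of b.
    var prev = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1),
          prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Hypothetical normalization to [0, 1]; 1.0 means identical strings.
  def similarity(a: String, b: String): Double = {
    val longest = math.max(a.length, b.length)
    if (longest == 0) 1.0 else 1.0 - distance(a, b).toDouble / longest
  }
}
```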

def copy(extra: ParamMap): DeIdentificationModel

Definition Classes: RawAnnotator → Model → Transformer → PipelineStage → Params

def copyValues[T <: Params](to: T, extra: ParamMap): T

Attributes: protected
Definition Classes: Params

val countryObfuscation: BooleanParam

Whether to obfuscate country entities or not. If true, country entities will be obfuscated using the Faker module. If false, country entities will be skipped during obfuscation. Default: false

Definition Classes: BaseDeidParams

def createAnonymizeAnnotation(anonymizeSentence: (Sentence, Seq[Annotation]), offset: Int, idx: Int, spacesLength: Int): Annotation

Takes an anonymized sentence and creates a proper Annotation from it.

anonymizeSentence: a sentence, which is anonymized
idx: an index of the sentence
returns: a proper Annotation instance

val dateEntities: StringArrayParam

List of date entities. Default: Array("DATE", "DOB", "DOD", "EFFDATE", "FISCAL_YEAR")

Definition Classes: BaseDeidParams

val dateFormats: StringArrayParam

Format of dates to displace

Definition Classes: BaseDeidParams

val dateTag: Param[String]

Tag of the NER entity that represents dates (default: DATE)

Definition Classes: DeIdentificationParams

val dateToYear: BooleanParam

true if dates must be converted to years, false otherwise

Definition Classes: DeIdentificationParams

val days: IntParam

Number of days to obfuscate the dates by displacement. If not provided, a random integer between 1 and 60 will be used.

Definition Classes: BaseDeidParams
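Date displacement as described for the days parameter amounts to shifting each date by a fixed number of days. The sketch below uses java.time to show the idea; the object and method names are hypothetical, the date pattern is supplied by the caller, and treating the 1 to 60 range as inclusive on both ends is an assumption.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Random

object DateShiftSketch {
  // If no displacement is configured, draw one from 1 to 60
  // (assumed inclusive on both ends).
  def resolveDays(days: Option[Int], rng: Random): Int =
    days.getOrElse(rng.nextInt(60) + 1)

  // Parse the date with the given pattern, shift it, and print it
  // back in the same format.
  def shift(date: String, pattern: String, days: Int): String = {
    val fmt = DateTimeFormatter.ofPattern(pattern)
    LocalDate.parse(date, fmt).plusDays(days.toLong).format(fmt)
  }
}
```

Keeping the output in the input's own format is a design choice of this sketch, so a shifted date still reads naturally in the surrounding text.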

final def defaultCopy[T <: Params](extra: ParamMap): T

Attributes: protected
Definition Classes: Params

def dfAnnotate: UserDefinedFunction

Definition Classes: HasSimpleAnnotate

val doExceptionHandling: BooleanParam

If true, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next item. This comes with a performance penalty.

Definition Classes: HandleExceptionParams

val enableDefaultObfuscationEquivalents: BooleanParam

Whether to enable default obfuscation equivalents for common entities. This parameter allows the system to automatically include a set of predefined common English name equivalents. Default: false

Definition Classes: BaseDeidParams

val entityCasingModes: StructFeature[Map[String, Array[String]]]

Dictionary with entity casing modes that match some entities:

'lowercase': Converts all characters to lower case using the rules of the default locale.
'uppercase': Converts all characters to upper case using the rules of the default locale.
'capitalize': Converts the first character to upper case and converts others to lower case.
'titlecase': Converts the first character in every token to upper case and converts others to lower case.
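The four casing modes of entityCasingModes can be sketched in plain Scala as follows. This is an illustrative reading of the descriptions, not the library's code: the object and method names are hypothetical, tokens are assumed to be separated by single spaces, and unknown modes leave the text unchanged.

```scala
object CasingSketch {
  // First character upper case, remaining characters lower case.
  private def capitalizeToken(t: String): String =
    if (t.isEmpty) t
    else t.substring(0, 1).toUpperCase + t.substring(1).toLowerCase

  def applyCasing(text: String, mode: String): String = mode match {
    case "lowercase"  => text.toLowerCase
    case "uppercase"  => text.toUpperCase
    case "capitalize" => capitalizeToken(text)
    // Assumption: tokens are separated by single spaces.
    case "titlecase"  => text.split(" ").map(capitalizeToken).mkString(" ")
    case _            => text
  }
}
```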

final def eq(arg0: AnyRef): Boolean

Definition Classes: AnyRef

def equals(arg0: Any): Boolean

Definition Classes: AnyRef → Any

def explainParam(param: Param[_]): String

Definition Classes: Params

def explainParams(): String

Definition Classes: Params

def extraValidate(structType: StructType): Boolean

Attributes: protected
Definition Classes: RawAnnotator

def extraValidateMsg: String

Attributes: protected
Definition Classes: RawAnnotator

final def extractParamMap(): ParamMap

Definition Classes: Params

final def extractParamMap(extra: ParamMap): ParamMap

Definition Classes: Params

val fakerLengthOffset: IntParam

It specifies how much length deviation is accepted in obfuscation, with keepTextSizeForObfuscation enabled. Value must be greater than 0. Default is 3.

Definition Classes: BaseDeidParams
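One way to read fakerLengthOffset is as a tolerance on the length difference between the original word and a candidate replacement. The sketch below filters candidates by that tolerance; the names are hypothetical, and falling back to the full candidate list when nothing fits is an assumption rather than documented behaviour.

```scala
object LengthOffsetSketch {
  // Keep only fakes whose length differs from the original by at most
  // `offset` characters. Assumption: if none qualify, all candidates
  // stay eligible, so the output length may vary.
  def candidatesWithin(
      fakes: Seq[String],
      wordToReplace: String,
      offset: Int): Seq[String] = {
    val close = fakes.filter(f => math.abs(f.length - wordToReplace.length) <= offset)
    if (close.nonEmpty) close else fakes
  }
}
```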

val features: ArrayBuffer[Feature[_, _, _]]

Definition Classes: HasFeatures

def finalize(): Unit

Attributes: protected[lang]
Definition Classes: AnyRef
Annotations: @throws( classOf[java.lang.Throwable] )

val fixedMaskLength: IntParam

Select the fixed mask length: this is the length of the masking sequence that will be used when the 'fixed_length_chars' masking policy is selected.

Definition Classes: MaskingParams

val genderAwareness: BooleanParam

Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: false

Definition Classes: BaseDeidParams

def generateFakeBySameLength(wordToReplace: String, entity: String): String

Obfuscates digits to new digits and letters to new letters; other characters remain the same.

Definition Classes: DeidModelParams
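The character-class-preserving replacement described for generateFakeBySameLength can be sketched as below. This is not the library's implementation: the object name, the explicit seed parameter, and the restriction of replacements to ASCII digits and letters are all assumptions made to keep the example small and deterministic.

```scala
import scala.util.Random

object SameLengthSketch {
  // Replace each digit with a random digit and each letter with a random
  // ASCII letter of the same case; every other character is kept.
  def fakeBySameLength(word: String, seed: Long): String = {
    val rng = new Random(seed)
    word.map {
      case c if c.isDigit  => ('0' + rng.nextInt(10)).toChar
      case c if c.isUpper  => ('A' + rng.nextInt(26)).toChar
      case c if c.isLetter => ('a' + rng.nextInt(26)).toChar
      case c               => c
    }
  }
}
```

Because the random generator is seeded, the same input and seed always yield the same replacement, which keeps repeated runs reproducible.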

def generateFakeBySameLengthUsingHash(wordToReplace: String, entity: String): String

Attributes: protected
Definition Classes: DeidModelParams

val geoConsistency: BooleanParam

Whether to enforce consistent obfuscation across geographical entities: state, city, street, zip and phone.

Functionality overview: This parameter enables intelligent geographical entity obfuscation that maintains realistic relationships between different geographic components. When enabled, the system ensures that obfuscated addresses form coherent, valid combinations rather than random replacements.

Supported entity types, processed in priority order:

state (Priority: 0): US state names
city (Priority: 1): City names
zip (Priority: 2): Zip codes
street (Priority: 3): Street addresses
phone (Priority: 4): Phone numbers

Language requirement (IMPORTANT): Geographic consistency is only applied when both of the following hold:

The geoConsistency parameter is set to true.
The language parameter is set to en.

For non-English configurations, this feature is automatically disabled regardless of the parameter setting.

Consistency algorithm, applied when geographical entities come from the chunk columns:

1. Entity Grouping: All geographic entities are identified and grouped by type.
2. Fake Address Selection: A consistent set of fake US addresses is selected using hash-based deterministic selection to ensure reproducibility.
3. Priority-Based Mapping: Entities are mapped to fake addresses following the priority order (state → city → zip → street → phone).
4. Consistent Replacement: All entities of the same type within a document use the same fake address pool, maintaining geographical coherence.

Parameter interactions (IMPORTANT): Enabling this parameter automatically disables:

keepTextSizeForObfuscation: Text size preservation is not maintained.
consistentObfuscation: Standard consistency rules are overridden.
File-based fakers.

This is necessary because geographic consistency requires specific fake address selection that may not preserve original text lengths or follow standard obfuscation patterns.

Default: false

Definition Classes: BaseDeidParams
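Two ideas from the geoConsistency description, priority ordering and hash-based deterministic selection, can be sketched in a few lines. The priorities come from the documentation above; everything else (the object name, keying the selection on a string plus a seed, and the use of String.hashCode) is an assumption for illustration and not the library's actual selection scheme.

```scala
object GeoConsistencySketch {
  // Priorities as listed in the geoConsistency documentation.
  val priority: Map[String, Int] =
    Map("state" -> 0, "city" -> 1, "zip" -> 2, "street" -> 3, "phone" -> 4)

  // Keep only geographic labels and order them by priority.
  def orderByPriority(labels: Seq[String]): Seq[String] =
    labels
      .filter(l => priority.contains(l.toLowerCase))
      .sortBy(l => priority(l.toLowerCase))

  // Hypothetical deterministic choice of an index into a pool of fake
  // addresses: the same key and seed always select the same entry.
  def pickIndex(poolSize: Int, key: String, seed: Int): Int =
    Math.floorMod(key.hashCode + seed, poolSize)
}
```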

def get[K, V](feature: MapFeature[K, V]): Option[Map[K, V]]

Definition Classes: DeIdentificationModel → HasFeatures

def get[T](feature: StructFeature[T]): Option[T]

Attributes: protected
Definition Classes: HasFeatures

def get[T](feature: SetFeature[T]): Option[Set[T]]

Attributes: protected
Definition Classes: HasFeatures

def get[T](feature: ArrayFeature[T]): Option[Array[T]]

Attributes: protected
Definition Classes: HasFeatures

final def get[T](param: Param[T]): Option[T]

Definition Classes: Params

def getAdditionalDateFormats: Array[String]

Gets the value of additionalDateFormats

Definition Classes: BaseDeidParams

def getAgeRanges: Array[Int]

Gets ageRanges param.

Definition Classes: BaseDeidParams

def getAgeRangesByHipaa: Boolean

Gets the value of ageRangesByHipaa.

Definition Classes: BaseDeidParams

def getAllTerms: Map[String, List[String]]

Dictionary containing all terms used later by the anonymization function.

def getAnonymizeSentence(sentence: Sentence, protectedEntities: Seq[Annotation], dateTag: String = "DATE", wholeDocumentDate: Option[Int] = None, zipCodeTag: String = "ZIP", entityMemory: Map[String, String], namePartsMemory: Map[String, String]): (String, Seq[Annotation])

Main point of interest. This method projects the sentence into its anonymized form and is called for each sentence in the input collection of Annotations.

sentence: a sentence, which we want to anonymize
protectedEntities: a sequence of Entities which we want to anonymize
dateTag: a String which represents the value with which we replace dates
returns: a String, which represents an anonymized sentence

def getBlackListEntities: Array[String]

Gets blackListEntities param

Definition Classes: DeIdentificationParams

def getChunkMatching: Map[String, Double]

Definition Classes: DeIdentificationParams

def getChunkMatchingAsStr: String

Definition Classes: DeIdentificationParams

final def getClass(): Class[_]

Definition Classes: AnyRef → Any
Annotations: @native()

def getConsistentAcrossNameParts: Boolean

Gets the value of consistentAcrossNameParts.

Definition Classes: BaseDeidParams

def getConsistentObfuscation: Boolean

Definition Classes: DeIdentificationParams

def getCountryObfuscation: Boolean

Gets the value of countryObfuscation.

Definition Classes: BaseDeidParams

def getDateEntities: Array[String]

Gets dateEntities param.

Definition Classes: BaseDeidParams

def getDateFormats: Array[String]

Gets the value of dateFormats

Definition Classes: BaseDeidParams

def getDateTag: String

Definition Classes: DeIdentificationParams

def getDateToYear: Boolean

Definition Classes: DeIdentificationParams

def getDays: Int

Gets days param

Definition Classes: BaseDeidParams

final def getDefault[T](param: Param[T]): Option[T]

Definition Classes: Params

def getDefaultObfuscationEquivalents: Array[StaticObfuscationEntity]

Definition Classes: BaseDeidParams

def getDefaultObfuscationEquivalentsAsJava: Array[ArrayList[String]]

Definition Classes: BaseDeidParams

def getEnableDefaultObfuscationEquivalents: Boolean

Gets the value of enableDefaultObfuscationEquivalents.

Definition Classes: BaseDeidParams

def getEntitiesBySentence(chunks: Seq[Annotation], sentenceCount: Int): Seq[Seq[Annotation]]

Attributes: protected
Definition Classes: DeidModelParams

def getEntityBasedObfuscationRefSource(entityClass: String): String

Attributes: protected
Definition Classes: BaseDeidParams

def getEntityCasingModes: Option[Map[String, Array[String]]]

def getEntityField(annotation: Annotation): String

Attributes: protected
Definition Classes: DeidModelParams

def getExternalFakers(entityClass: String, customFakers: Map[String, List[String]], wordToReplace: String): List[String]

Attributes: protected
Definition Classes: DeidModelParams

def getFakeByHashcode(fakes: Seq[String], wordToReplace: String, entity: String, seed: Int): String

Attributes: protected
Definition Classes: DeidModelParams

def getFakeWithSameSize(fakes: Seq[String], wordToReplace: String, entity: String, lengthDeviation: Int, seed: Int): String

Attributes: protected
Definition Classes: DeidModelParams

def getFakerLengthOffset: Int

Gets fakerLengthOffset param

Definition Classes: BaseDeidParams

def getFakersEntity(entity: String, result: String): Seq[String]

Definition Classes: DeidModelParams

def getFixedMaskLength: Int

Gets fixedMaskLength param.

Definition Classes: MaskingParams

def getGenderAwareness: Boolean

Gets genderAwareness param.

Definition Classes: BaseDeidParams

def getGeoConsistency: Boolean

Gets the value of geoConsistency.

Definition Classes: BaseDeidParams

def getGroupByCol: String

Gets groupByCol param

Definition Classes: DeIdentificationParams

def getIgnoreRegex: Boolean

Definition Classes: DeIdentificationParams

def getInputCols: Array[String]

Definition Classes: HasInputAnnotationCols

def getKeepMonth: Boolean

Gets keepMonth param

Definition Classes: BaseDeidParams

def getKeepTextSizeForObfuscation: Boolean

Gets keepTextSizeForObfuscation param

Definition Classes: BaseDeidParams

def getKeepYear: Boolean

Gets keepYear param

Definition Classes: BaseDeidParams

def getLanguage: String

Gets language param.

Definition Classes: BaseDeidParams

def getLazyAnnotator: Boolean

Definition Classes: CanBeLazy

def getMappingsColumn: String

Definition Classes: DeIdentificationParams

def getMaskStatus(entityClass: String): String

Attributes: protected
Definition Classes: MaskingParams

def getMaskingPolicy: String

Gets maskingPolicy param.

Definition Classes: MaskingParams

def getMaxSentence(annotations: Seq[Annotation]): Int

Attributes: protected
Definition Classes: DeidModelParams

def getMetadataMaskingPolicy: String

Gets metadataMaskingPolicy param

Definition Classes: DeIdentificationParams

def getMinYear: Int

Definition Classes: DeIdentificationParams

def getMode: String

Gets mode param.

Definition Classes: BaseDeidParams

def getNearTokens(tokenizedSentence: Seq[IndexedToken], count: Int, ngrams: Int = 2): (String, String)

def getNerEntitiesBySentence(annotations: Seq[Annotation], sentenceCount: Int): Seq[Seq[Annotation]]

Returns the NER Annotations for each Annotation instance in the input Sequence

annotations: a Sequence of Annotation instances
returns: a Sequence of Sequence[Annotation], each inner Sequence holds the NER Annotations for one sentence

def getObfuscateByAgeGroups: Boolean

Gets obfuscateByAgeGroups param

Definition Classes: DeIdentificationParams

def getObfuscateDate: Boolean

Gets obfuscateDate param

Definition Classes: BaseDeidParams

def getObfuscateRefSource: String

Gets obfuscateRefSource param.

Definition Classes: BaseDeidParams

def getObfuscationEquivalents: Option[Array[StaticObfuscationEntity]]

Gets the value of obfuscationEquivalents.

Definition Classes: BaseDeidParams

def getObfuscationStrategyOnException: String

Definition Classes: DeIdentificationParams

final def getOrDefault[T](param: Param[T]): T

Definition Classes: Params

final def getOutputCol: String

Definition Classes: HasOutputAnnotationCol

def getParam(paramName: String): Param[Any]

Definition Classes: Params

def getRegexEntities(tokensSentences: Seq[IndexedToken], idx: Int): Seq[Annotation]

Returns the Regex Annotations for each IndexedToken in the input Sequence

tokensSentences: a Sequence of IndexedToken instances
returns: a Sequence of Annotation, each Annotation represents Regex Entity

def getRegexEntities(): Array[String]

def getRegexOverride: Boolean

Definition Classes: DeIdentificationParams

def getRegexPatternsDictionary: Map[String, Array[String]]

Dictionary with regular expression patterns that match some protected entity.

def getRegion: String

Gets region param.

Definition Classes: BaseDeidParams

def getReturnEntityMappings: Boolean

Definition Classes: DeIdentificationParams

def getSameEntityThreshold: Double

Definition Classes: DeIdentificationParams

def getSameLengthFormattedEntities(): Array[String]

Definition Classes: BaseDeidParams

def getSeed(): Int

Definition Classes: BaseDeidParams

def getSelectiveObfuscateRefSource: Map[String, String]

Gets selectiveObfuscateRefSource param.

Definition Classes: BaseDeidParams

def getSelectiveObfuscateRefSourceAsStr: String

Definition Classes: BaseDeidParams

def getSelectiveObfuscationModes: Option[Map[String, Array[String]]]

Gets selectiveObfuscationModes param.

Definition Classes: BaseDeidParams

def getSentences(annotations: Seq[Annotation]): Seq[Sentence]

Returns the content of each sentence inside the input sequence

annotations: a Sequence of Annotation instances, to return content from
returns: a Sequence of Sentence

def getShiftDaysFromSentences(sentences: Seq[Annotation]): Option[Int]

Attributes: protected
Definition Classes: DeidModelParams

def getStaticObfuscationFakes(entityClass: String, wordToReplace: String): Option[Seq[String]]

Attributes: protected
Definition Classes: DeidModelParams

def getStaticObfuscationPairs: Option[Array[StaticObfuscationEntity]]

Definition Classes: BaseDeidParams

def getTokensBySentence(annotations: Seq[Annotation]): Seq[Seq[IndexedToken]]

Returns the tokens for each Annotation instance in the input Sequence

annotations: a Sequence of Annotation instances
returns: a Sequence of Sequence[IndexedToken], each Sequence represents tokens from each input Annotation

def getUnnormalizedDateMode: String

Gets unnormalizedDateMode param.

Definition Classes: BaseDeidParams

def getUseShiftDays: Boolean

Getter method of useShiftDays

Definition Classes: DeIdentificationParams → BaseDeidParams

def getValidAgeRanges: Array[Int]

Gets valid ageRanges whether ageRangesByHipaa is true or not.

Attributes: protected
Definition Classes: BaseDeidParams

def getZipCodeTag: String

Definition Classes: DeIdentificationParams

val groupByCol: Param[String]

The column name used to group the dataset. This parameter is used in conjunction with consistentObfuscation to ensure consistent obfuscation within each group. When groupByCol is set, the dataset is partitioned into groups based on the values of the specified column.

Default: "" (empty string, meaning no grouping)

The column name must be a valid string in the input dataset.
The column must be of StringType.

Definition Classes: DeIdentificationParams
Note: This functionality can change the order of the dataset, so use it with caution.
Note: This functionality is not supported by LightPipeline.

def handleCasing(originalFake: String, wordToReplace: String): String

Attributes: protected
Definition Classes: DeidModelParams

def handleGeographicConsistency(protectedEntities: Seq[Seq[Annotation]]): Seq[Seq[Annotation]]

Attributes: protected
Definition Classes: DeidModelParams

def handleObfuscationEquivalents(sentenceBaseAnnotations: Seq[Seq[Annotation]]): Seq[Seq[Annotation]]

Attributes: protected
Definition Classes: DeidModelParams

final def hasDefault[T](param: Param[T]): Boolean

Definition Classes: Params

def hasParam(paramName: String): Boolean

Definition Classes: Params

def hasParent: Boolean

Definition Classes: Model

def hashCode(): Int

Definition Classes: AnyRef → Any
Annotations: @native()

val ignoreRegex: BooleanParam

Select whether to use the regex file loaded in the model. If true, the default regex file will not be used. The default value is false.

Definition Classes: DeIdentificationParams

val inExceptionMode: Boolean

Attributes: protected
Definition Classes: HasSafeAnnotate

def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean

Attributes: protected
Definition Classes: Logging

def initializeLogIfNecessary(isInterpreter: Boolean): Unit

Attributes: protected
Definition Classes: Logging

val inputAnnotatorTypes: Array[AnnotatorType]

Input annotator type: DOCUMENT, TOKEN, CHUNK

Definition Classes: DeIdentificationModel → HasInputAnnotationCols

final val inputCols: StringArrayParam

Attributes: protected
Definition Classes: HasInputAnnotationCols

def isArabic: Boolean

Attributes: protected
Definition Classes: MaskingParams

final def isDefined(param: Param[_]): Boolean

Definition Classes: Params

def isEmptyString(value: String): Boolean

Attributes: protected
Definition Classes: DeidModelParams

def isGeoEntity(annotation: Annotation): Boolean

Attributes: protected
Definition Classes: DeidModelParams

def isGeoObfuscationEnabled: Boolean

Attributes: protected
Definition Classes: DeidModelParams

final def isInstanceOf[T0]: Boolean

Definition Classes: Any

def isObfuscateDate(entityClass: String): Boolean

Attributes: protected
Definition Classes: DeidModelParams

val isRandomDateDisplacement: BooleanParam

Whether to use a random number of displacement days for date entities. The random number is based on DeIdentificationParams.seed. If true, a random displacement is used; if false, DeIdentificationParams.days is used. The default value is false.

Definition Classes: DeIdentificationParams

def isRegexMatch(nerTokens: (String, String), token: String, regexPatterns: Array[String]): Boolean

Returns a Boolean flag indicating whether the token matches at least one pattern from the array.

token: a token of interest to check for the match
regexPatterns: an Array of String to check against the token
returns: a Boolean flag, representing whether the token matches at least one of the regexPatterns

final def isSet(param: Param[_]): Boolean

Definition Classes: Params

def isTraceEnabled(): Boolean

Attributes: protected
Definition Classes: Logging

val keepMonth: BooleanParam

Whether to keep the month intact when obfuscating date entities. If true, the month will remain unchanged during the obfuscation process. If false, the month will be modified along with the year and day. Default: false.

Definition Classes: BaseDeidParams

val keepTextSizeForObfuscation: BooleanParam

Specifies whether the output should keep the same character length as the input text. The output length stays the same if a replacement of the same length is available; otherwise the length may vary.

Definition Classes: BaseDeidParams

val keepYear: BooleanParam

Whether to keep the year intact when obfuscating date entities. If true, the year will remain unchanged during the obfuscation process. If false, the year will be modified along with the month and day. Default: false.

Definition Classes: BaseDeidParams

val language: Param[String]

The language used to select the regex file and some faker entities: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'

Definition Classes: BaseDeidParams

val lazyAnnotator: BooleanParam

Definition Classes: CanBeLazy

implicit lazy val locale: Locale

Attributes: protected
Definition Classes: DeidModelParams

def log: Logger

Attributes: protected
Definition Classes: Logging

def logDebug(msg: ⇒ String, throwable: Throwable): Unit

Attributes: protected
Definition Classes: Logging

def logDebug(msg: ⇒ String): Unit

Attributes: protected
Definition Classes: Logging

def logError(msg: ⇒ String, throwable: Throwable): Unit

Attributes: protected
Definition Classes: Logging

def logError(msg: ⇒ String): Unit

Attributes: protected
Definition Classes: Logging

def logInfo(msg: ⇒ String, throwable: Throwable): Unit

Attributes: protected
Definition Classes: Logging

def logInfo(msg: ⇒ String): Unit

Attributes: protected
Definition Classes: Logging

def logName: String

Attributes: protected
Definition Classes: Logging

def logTrace(msg: ⇒ String, throwable: Throwable): Unit

Attributes: protected
Definition Classes: Logging

def logTrace(msg: ⇒ String): Unit

Attributes: protected
Definition Classes: Logging

def logWarning(msg: ⇒ String, throwable: Throwable): Unit

Attributes: protected
Definition Classes: Logging

def logWarning(msg: ⇒ String): Unit

Attributes: protected
Definition Classes: Logging

val mappingsColumn: Param[String]

This is the mapping column that will return the Annotations chunks with the fake entities

Definition Classes: DeIdentificationParams

def maskEntity(wordToReplace: String, entityClass: String): String

Attributes: protected
Definition Classes: MaskingParams

def maskEntity(annotation: Annotation, entityClass: String): String

Attributes: protected
Definition Classes: MaskingParams

def maskEntityWithPolicy(wordToReplace: String, maskingPolicy: String, entityClass: String): String

Attributes: protected
Definition Classes: MaskingParams

def maskEntityWithPolicy(annotation: Annotation, maskingPolicy: String, entityClass: String): String

Attributes: protected
Definition Classes: MaskingParams

val maskingPolicy: Param[String]

Select the masking policy:

'entity_labels': Replace the values with the entity value.
'same_length_chars': Replace the name with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 characters (like Jo, or 5), asterisks are used without brackets.
'fixed_length_chars': Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
'entity_labels_without_brackets': Replace the values with the entity value without brackets.
'same_length_chars_without_brackets': Replace the name with asterisks of the same length, without brackets.
Default: 'entity_labels'

Definition Classes: MaskingParams
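
The five policies can be illustrated with a small standalone sketch. It does not call the library: MaskingPolicySketch, its mask function, and the fixedMaskLength default of 4 are hypothetical names and values chosen for illustration, and the use of square brackets for 'same_length_chars' is an assumption (the source only says "brackets").

```scala
// Illustrative sketch only; not the library's implementation.
object MaskingPolicySketch {
  // text: the detected chunk; entityClass: its label, e.g. "NAME".
  def mask(text: String, entityClass: String, policy: String, fixedMaskLength: Int = 4): String =
    policy match {
      case "entity_labels"                  => s"<$entityClass>"
      case "entity_labels_without_brackets" => entityClass
      case "same_length_chars" =>
        // Entities shorter than 3 characters get plain asterisks, no brackets.
        if (text.length < 3) "*" * text.length
        else "[" + "*" * (text.length - 2) + "]" // bracket style is an assumption
      case "same_length_chars_without_brackets" => "*" * text.length
      case "fixed_length_chars"                 => "*" * fixedMaskLength
      case other => throw new IllegalArgumentException(s"Unknown masking policy: $other")
    }
}
```

Under these assumptions, masking "John" as a NAME yields the same output length for both same-length policies, while 'fixed_length_chars' ignores the input length entirely.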

def mergeEntities(nerEntities: Seq[Annotation], regexEntities: Seq[Annotation], regexOverride: Boolean = false): Seq[Annotation]

Returns a combined Sequence of Annotations, cleaned from duplicates

nerEntities: a sequence of NER Entities to combine
regexEntities: a sequence of Regex Entities to combine
returns: a Sequence of Annotation, which is result of a merge without duplicates
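
A rough standalone sketch of such a merge follows. The source does not define what counts as a duplicate, so this sketch assumes that two chunks with overlapping character spans are duplicates and that regexOverride decides which side wins. Chunk and MergeEntitiesSketch are hypothetical stand-ins, not library types.

```scala
// Illustrative sketch only; the real method operates on Annotation objects.
object MergeEntitiesSketch {
  final case class Chunk(begin: Int, end: Int, label: String, text: String)

  // Assumption: overlapping spans are treated as duplicates.
  private def overlaps(a: Chunk, b: Chunk): Boolean =
    a.begin <= b.end && b.begin <= a.end

  def merge(ner: Seq[Chunk], regex: Seq[Chunk], regexOverride: Boolean = false): Seq[Chunk] = {
    // The preferred side is kept whole; the other side only fills the gaps.
    val (preferred, secondary) = if (regexOverride) (regex, ner) else (ner, regex)
    val kept = secondary.filterNot(s => preferred.exists(p => overlaps(p, s)))
    (preferred ++ kept).distinct.sortBy(_.begin)
  }
}
```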

val metadataMaskingPolicy: Param[String]

If specified, the metadata includes the masked form of the document.

If specified, the metadata includes the masked form of the document. Select one of the following masking policies to return the masked form in the metadata:

'entity_labels': Replace the values with the entity label.
'same_length_chars': Replace the entity with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 characters (like Jo, or 5), asterisks are used without brackets.
'fixed_length_chars': Replace the entity with a masking sequence composed of a fixed number of asterisks.
'entity_labels_without_brackets': Replace the values with the entity label without brackets.
'same_length_chars_without_brackets': Replace the entity with asterisks of the same length, without brackets.
Default: ""

Definition Classes: DeIdentificationParams

val minYear: IntParam

Minimum year to use when converting date to year

Definition Classes: DeIdentificationParams

val mode: Param[String]

Mode for Anonymizer ['mask' or 'obfuscate'].

Mode for Anonymizer ['mask' or 'obfuscate']. Default: 'mask'

Mask mode: The entities will be replaced by their entity types.
Obfuscate mode: The entity is replaced by an obfuscator's term.

Definition Classes: BaseDeidParams

def msgHelper(schema: StructType): String

Attributes: protected
Definition Classes: HasInputAnnotationCols

val nameEntities: Seq[String]

Attributes: protected
Definition Classes: DeidModelParams

final def ne(arg0: AnyRef): Boolean

Definition Classes: AnyRef

final def notify(): Unit

Definition Classes: AnyRef
Annotations: @native()

final def notifyAll(): Unit

Definition Classes: AnyRef
Annotations: @native()

val obfuscateByAgeGroups: BooleanParam

Whether to obfuscate ages based on age groups.

When true, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When false, the age ranges specified in the ageRanges parameter will be used to obfuscate ages. Default: false.

Definition Classes: DeIdentificationParams

val obfuscateDate: BooleanParam

When mode=="obfuscate" whether to obfuscate dates or not.

When mode=="obfuscate", whether to obfuscate dates or not. This param helps with consistency by making dateFormats more visible. When set to true, make sure the dateFormats param fits your needs. If the value is true and obfuscation fails, unnormalizedDateMode is activated. When set to false, the date is masked to <DATE>. Default: false

Definition Classes: BaseDeidParams

def obfuscateNameEntity(originalName: String, keepTextSize: Boolean, lengthDeviation: Int, namePartsMemory: Map[String, String]): String

Attributes: protected
Definition Classes: DeidModelParams

val obfuscateRefSource: Param[String]

The source of obfuscation to obfuscate the entities.

The source of obfuscation to obfuscate the entities. The values are the following:

'file': Takes the entities from the obfuscatorRefFile.
'faker': Takes the entities from the Faker module.
'both': Takes the entities from the obfuscatorRefFile and the Faker module randomly.

Definition Classes: BaseDeidParams

val obfuscationEquivalents: StructFeature[Array[StaticObfuscationEntity]]

Variant-to-canonical entity mappings that ensure consistent obfuscation.

This parameter lets you define equivalence rules for entity variants that should be obfuscated the same way. For example, the names "Alex" and "Alexander" will always be mapped to the same obfuscated value if they are linked to the same canonical form.

It accepts an array of string triplets, where each triplet defines:

variant: A non-standard, short, or alternative form of a value (e.g., "Alex")
entityType: The type of the entity (e.g., "NAME", "STATE", "COUNTRY")
canonical: The standardized form all variants map to (e.g., "Alexander")

variant and entityType comparisons are case-insensitive during processing.

This is especially useful in de-identification tasks to ensure consistent replacement of semantically identical values. It also allows cross-variant normalization across different occurrences of sensitive data.

Definition Classes: BaseDeidParams
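
The triplet format and the case-insensitive matching can be sketched without the library as a simple lookup table. EquivalentsSketch, buildLookup, and canonicalize are hypothetical names used only for this illustration; how the library stores or applies the mappings internally is not stated in the source.

```scala
// Illustrative sketch only; not the library's implementation.
object EquivalentsSketch {
  // Builds a (variant, entityType) -> canonical lookup with lower-cased keys.
  def buildLookup(equivalents: Array[Array[String]]): Map[(String, String), String] = {
    // require throws IllegalArgumentException on malformed entries.
    require(equivalents.forall(_.length == 3),
      "each entry must be [variant, entityType, canonical]")
    equivalents.map { case Array(variant, entityType, canonical) =>
      (variant.toLowerCase, entityType.toLowerCase) -> canonical
    }.toMap
  }

  // Returns the canonical form if a rule matches, otherwise the value unchanged.
  def canonicalize(lookup: Map[(String, String), String], value: String, entityType: String): String =
    lookup.getOrElse((value.toLowerCase, entityType.toLowerCase), value)
}
```

Once every variant resolves to its canonical form, choosing the fake value from the canonical form is enough to make "Alex" and "Alexander" receive the same replacement.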

val obfuscationStrategyOnException: Param[String]

The obfuscation strategy to be applied when an exception occurs.

The obfuscation strategy determines how obfuscation is handled in case of an exception. Four possible values are supported:

"mask": The original chunk is replaced with a masking pattern.
"default": The original chunk is replaced with a default faker.
"skip": The original chunk is not replaced with any faker.
"exception": Throws the exception.

The default obfuscation strategy is "default".

Definition Classes: DeIdentificationParams
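
The four strategies can be pictured as a fallback around a faker call. The sketch below is standalone and hypothetical: ExceptionStrategySketch, the faker function argument, and the defaultFake value are illustrative, and the exact masking pattern and default faker used by the library are not specified in the source (the angle-bracket label is an assumption).

```scala
import scala.util.{Failure, Success, Try}

// Illustrative sketch only; not the library's implementation.
object ExceptionStrategySketch {
  def obfuscate(chunk: String, entityClass: String, strategy: String, defaultFake: String)
               (faker: String => String): String =
    Try(faker(chunk)) match {
      case Success(fake) => fake
      case Failure(error) =>
        strategy match {
          case "mask"      => s"<$entityClass>" // assumed masking pattern
          case "default"   => defaultFake       // stand-in for a default faker
          case "skip"      => chunk             // original text is left intact
          case "exception" => throw error       // propagate the failure
          case other       => throw new IllegalArgumentException(s"Unknown strategy: $other")
        }
    }
}
```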

def onWrite(path: String, spark: SparkSession): Unit

Attributes: protected
Definition Classes: ParamsAndFeaturesWritable

val optionalInputAnnotatorTypes: Array[String]

Definition Classes: HasInputAnnotationCols

val outputAnnotatorType: AnnotatorType

Output annotator types: DOCUMENT

Definition Classes: DeIdentificationModel → HasOutputAnnotatorType

val outputAsDocument: BooleanParam

Whether to return all sentences joined into a single document

Definition Classes: DeIdentificationParams

final val outputCol: Param[String]

Attributes: protected
Definition Classes: HasOutputAnnotationCol

lazy val params: Array[Param[_]]

Definition Classes: Params

var parent: Estimator[DeIdentificationModel]

Definition Classes: Model

lazy val randomDateFormat: String

Attributes: protected
Definition Classes: BaseDeidParams

val regexEntities: StringArrayParam

val regexOverride: BooleanParam

If the value is true, prioritize the regex entities; if the value is false, prioritize the ner.

If the value is true, prioritize the regex entities; if the value is false, prioritize the ner. The default value is false. If DeIdentification.combineRegexPatterns is true, this value has no effect.

Definition Classes: DeIdentificationParams

val regexPatternsDictionary: MapFeature[String, Array[String]]

dictionary with regular expression patterns that match some protected entity

val region: Param[String]

With this property, you can select particular dateFormats.

With this property, you can select particular dateFormats. This property is especially used when obfuscating dates. You can decide whether the first part of 11/11/2023 is a day or the second part is a day when obfuscating dates.

The values are as follows:
'eu' for European Union
'us' for USA

Definition Classes: BaseDeidParams
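
Why the region matters is easiest to see with a date whose day and month differ. The sketch below uses only java.time; RegionDateSketch and the two patterns it maps the regions to are assumptions for illustration, not the library's actual dateFormats.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Illustrative sketch only; the library's real format lists may differ.
object RegionDateSketch {
  def parse(text: String, region: String): LocalDate = {
    val pattern = region match {
      case "eu"  => "dd/MM/yyyy" // day first
      case "us"  => "MM/dd/yyyy" // month first
      case other => throw new IllegalArgumentException(s"Unknown region: $other")
    }
    LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern))
  }
}
```

With these patterns, "03/04/2023" is read as 3 April 2023 for 'eu' and as 4 March 2023 for 'us', so any day displacement applied afterwards starts from a different date.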

val returnEntityMappings: BooleanParam

Whether to return the mapping column.

Definition Classes: DeIdentificationParams

def safeAnnotate(annotations: Seq[Annotation]): Seq[Annotation]

A protected method designed to safely annotate a sequence of Annotation objects by handling exceptions.

annotations: A sequence of Annotation.
returns: A sequence of Annotation objects after processing, potentially containing error annotations.

Attributes: protected
Definition Classes: HasSafeAnnotate

val sameEntityThreshold: DoubleParam

Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9). This threshold does not apply to date entities.

Definition Classes: DeIdentificationParams

val sameLengthFormattedEntities: StringArrayParam

List of formatted entities to generate the same length outputs as original ones during obfuscation.

List of formatted entities to generate the same length outputs as original ones during obfuscation. The supported and default formatted entities are: "phone", "fax", "contact", "id", "idnum", "bioid", "medicalrecord", "zip", "vin", "ssn", "dln", "plate", "license", "IRS", "CFN", "account".

Definition Classes: BaseDeidParams

def save(path: String): Unit

Definition Classes: MLWritable
Annotations: @Since( "1.6.0" ) @throws( ... )

val seed: IntParam

Seed used to select the entities in obfuscate mode.

Seed used to select the entities in obfuscate mode. With a fixed seed, you can replay an execution several times and get the same output.

Definition Classes: BaseDeidParams

def selectFakeFromAllFakes(wordToReplace: String, entityClass: String, maskedEntity: String, allFakes: Seq[String]): String

Attributes: protected
Definition Classes: DeidModelParams

val selectiveObfuscateRefSource: MapFeature[String, String]

A map of entity names to their obfuscation modes.

A map of entity names to their obfuscation modes. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation source.

Definition Classes: BaseDeidParams

val selectiveObfuscationModes: StructFeature[Map[String, Array[String]]]

The dictionary of modes to enable multi-mode deidentification.

'obfuscate': Replace the values with random values.
'mask_same_length_chars': Replace the entity with asterisks of the same length minus two, plus brackets on both ends.
'mask_entity_labels': Replace the values with the entity label.
'mask_fixed_length_chars': Replace the entity with a fixed number of asterisks. You can also invoke "setFixedMaskLength()".
'mask_entity_labels_without_brackets': Replace the values with the entity label without brackets.
'mask_same_length_chars_without_brackets': Replace the entity with asterisks of the same length, without brackets.
'skip': Skip the entities (leave them intact).

Entities that are not listed in the dictionary are de-identified according to setMode().

Definition Classes: BaseDeidParams
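
How a mode might be resolved for a given entity can be sketched as a lookup with a fallback. SelectiveModesSketch and modeFor are hypothetical names; the case-insensitive matching of both mode keys and entity names is an assumption made for this illustration.

```scala
// Illustrative sketch only; not the library's implementation.
object SelectiveModesSketch {
  // selective maps a mode name to the entity labels it applies to.
  def modeFor(entity: String, selective: Map[String, Array[String]], defaultMode: String): String =
    selective.collectFirst {
      // Assumption: mode keys and entity names are compared case-insensitively.
      case (mode, entities) if entities.exists(_.equalsIgnoreCase(entity)) => mode.toLowerCase
    }.getOrElse(defaultMode) // entities not listed fall back to the global mode
}
```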

def set[T](feature: StructFeature[T], value: T): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: HasFeatures

def set[K, V](feature: MapFeature[K, V], value: Map[K, V]): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: HasFeatures

def set[T](feature: SetFeature[T], value: Set[T]): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: HasFeatures

def set[T](feature: ArrayFeature[T], value: Array[T]): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: HasFeatures

final def set(paramPair: ParamPair[_]): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: Params

final def set(param: String, value: Any): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: Params

final def set[T](param: Param[T], value: T): DeIdentificationModel.this.type

Definition Classes: Params

def setAdditionalDateFormats(formats: Array[String]): DeIdentificationModel.this.type

Sets additionalDateFormats param

Definition Classes: BaseDeidParams

def setAgeGroups(value: Map[String, Array[Int]]): DeIdentificationModel.this.type

Sets the age groups to obfuscate ages.

Sets the age groups to obfuscate ages. For this parameter to be active, the obfuscateByAgeGroups parameter must be true. If the given ageGroups do not fully contain the ages, the ages continue to be obfuscated according to the ageRanges. The map should contain the age group name as the key and an array of two integers as the value. The first integer is the lower bound of the age group, and the second integer is the upper bound of the age group. Default age groups are as follows in the English language:

Map(
"baby" -> Array(0, 1),
"toddler" -> Array(1, 3),
"child" -> Array(3, 12),
"teenager" -> Array(12, 20),
"adult" -> Array(20, 65),
"senior" -> Array(65, 200)
)

Definition Classes: DeIdentificationParams
Exceptions thrown: IllegalArgumentException if the value is empty, contains negative values, or is not a pair of integers
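
A standalone sketch of looking up an age in the default groups follows. AgeGroupSketch and groupFor are hypothetical names. Because adjacent default groups share a boundary (for example 1 belongs to both "baby" and "toddler"), the sketch assumes an inclusive lower bound and an exclusive upper bound; the source does not say how the library resolves shared boundaries.

```scala
// Illustrative sketch only; not the library's implementation.
object AgeGroupSketch {
  // A Seq keeps the groups in the documented order.
  val defaultAgeGroups: Seq[(String, Array[Int])] = Seq(
    "baby"     -> Array(0, 1),
    "toddler"  -> Array(1, 3),
    "child"    -> Array(3, 12),
    "teenager" -> Array(12, 20),
    "adult"    -> Array(20, 65),
    "senior"   -> Array(65, 200)
  )

  // Assumption: lower bound inclusive, upper bound exclusive.
  def groupFor(age: Int, groups: Seq[(String, Array[Int])] = defaultAgeGroups): Option[String] =
    groups.collectFirst { case (name, Array(lo, hi)) if age >= lo && age < hi => name }
}
```

An age that falls outside every group returns None here, which corresponds to the documented fallback to ageRanges.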

def setAgeGroups(value: HashMap[String, ArrayList[Int]]): DeIdentificationModel.this.type

Definition Classes: DeIdentificationParams

def setAgeRanges(mode: Array[Int]): DeIdentificationModel.this.type

List of integers specifying limits of the age groups to preserve during obfuscation

Definition Classes: BaseDeidParams

def setAgeRangesByHipaa(value: Boolean): DeIdentificationModel.this.type

Sets whether to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.

The HIPAA Privacy Rule mandates that ages from patients older than 90 years must be obfuscated, while age for patients 90 years or younger can remain unchanged.

value: If true, age entities larger than 90 will be obfuscated as per HIPAA Privacy Rule, the others will remain unchanged. If false, ageRanges parameter is valid. Default: false.

Definition Classes: BaseDeidParams

def setAllTerms(value: Map[String, List[String]]): DeIdentificationModel.this.type

def setBlackList(list: Array[String]): DeIdentificationModel.this.type

List of entities that will be ignored in the regex file.

List of entities that will be ignored in the regex file. The rest will be processed. The default values are "IBAN", "ZIP", "NPI", "DLN", "PASSPORT", "C_CARD", "DEA", "SSN", "IP".

Definition Classes: DeIdentificationParams

def setBlackListEntities(value: Array[String]): DeIdentificationModel.this.type

Sets the list of entities coming from NER or regex rules that will be ignored for masking or obfuscation.

Sets the list of entities coming from NER or regex rules that will be ignored for masking or obfuscation. The remaining entities will be processed. Defaults to an empty array.

Definition Classes: DeIdentificationParams

def setChunkMatching(categories: HashMap[String, Double]): DeIdentificationModel.this.type

Definition Classes: DeIdentificationParams

def setChunkMatching(value: Map[String, Double]): DeIdentificationModel.this.type

Performs entity chunk matching across rows or within groups in a DataFrame.

Performs entity chunk matching across rows or within groups in a DataFrame. Useful in de-identification pipelines where certain entity labels like "NAME" or "DATE" may be missing in some rows and need to be filled from other rows in the same group.

Notes:

When applying the method across multiple rows, the usage of groupByCol parameter is required.

Definition Classes: DeIdentificationParams

def setConsistentAcrossNameParts(value: Boolean): DeIdentificationModel.this.type

Sets the value of consistentAcrossNameParts.

value: Boolean flag to enforce consistency across name parts
returns: this instance

Definition Classes: BaseDeidParams

def setConsistentObfuscation(s: Boolean): DeIdentificationModel.this.type

Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.

Definition Classes: DeIdentificationParams
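
To make the similarity check concrete, here is a standalone sketch that normalizes the Levenshtein distance into a [0.0-1.0] score and compares it with the 0.9 default documented for sameEntityThreshold. SimilaritySketch and its functions are hypothetical, and dividing by the longer string's length is an assumed normalization; the source does not specify the formula.

```scala
// Illustrative sketch only; not the library's implementation.
object SimilaritySketch {
  // Classic dynamic-programming edit distance, keeping one previous row.
  def levenshtein(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
  def similarity(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / maxLen
  }

  def sameEntity(a: String, b: String, threshold: Double = 0.9): Boolean =
    similarity(a, b) >= threshold
}
```

Under this normalization, "Jonathan Smith" and "Jonathan Smyth" differ by one edit over 14 characters, which clears the 0.9 threshold, while "John" and "Jane" do not.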

def setCountryObfuscation(value: Boolean): DeIdentificationModel.this.type

Sets whether to obfuscate country entities or not.

Sets whether to obfuscate country entities or not. If true, country entities will be obfuscated using the Faker module. If false, country entities will be skipped during obfuscation. Default: false

Definition Classes: BaseDeidParams

def setDateEntities(value: Array[String]): DeIdentificationModel.this.type

Sets the value of dateEntities.

Sets the value of dateEntities. Default: Array("DATE", "DOB", "DOD", "EFFDATE", "FISCAL_YEAR")

Definition Classes: BaseDeidParams

def setDateFormats(s: Array[String]): DeIdentificationModel.this.type

Format of dates to displace

Definition Classes: BaseDeidParams

def setDateTag(s: String): DeIdentificationModel.this.type

Tag representing the date NER entity (default: DATE)

Definition Classes: DeIdentificationParams

def setDateToYear(s: Boolean): DeIdentificationModel.this.type

true if dates must be converted to years, false otherwise

Definition Classes: DeIdentificationParams

def setDays(k: Int): DeIdentificationModel.this.type

Number of days to obfuscate the dates by displacement.

Number of days to obfuscate the dates by displacement. If not provided, a random integer between 1 and 60 will be used.

Definition Classes: BaseDeidParams
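
Displacement by a number of days can be sketched with java.time. DateDisplacementSketch, the single pattern argument, and the seed default are hypothetical; the sketch only mirrors the documented rule that a random integer between 1 and 60 is used when no value is given.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Random

// Illustrative sketch only; not the library's implementation.
object DateDisplacementSketch {
  def displace(text: String, pattern: String, days: Option[Int], seed: Long = 42L): String = {
    val fmt = DateTimeFormatter.ofPattern(pattern)
    // nextInt(60) is in 0..59, so the shift is in 1..60 when days is not provided.
    val shift = days.getOrElse(1 + new Random(seed).nextInt(60))
    // The shifted date is rendered in the same format it was read in.
    LocalDate.parse(text, fmt).plusDays(shift.toLong).format(fmt)
  }
}
```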

def setDefault[T](feature: StructFeature[T], value: () ⇒ T): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: HasFeatures

def setDefault[K, V](feature: MapFeature[K, V], value: () ⇒ Map[K, V]): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: HasFeatures

def setDefault[T](feature: SetFeature[T], value: () ⇒ Set[T]): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: HasFeatures

def setDefault[T](feature: ArrayFeature[T], value: () ⇒ Array[T]): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: HasFeatures

final def setDefault(paramPairs: ParamPair[_]*): DeIdentificationModel.this.type

Attributes: protected
Definition Classes: Params

final def setDefault[T](param: Param[T], value: T): DeIdentificationModel.this.type

Attributes: protected[org.apache.spark.ml]
Definition Classes: Params

def setDoExceptionHandling(value: Boolean): DeIdentificationModel.this.type

If true, exceptions are handled.

If true, exceptions are handled. If exception-causing data is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next item. This comes with a performance penalty.

Definition Classes: HandleExceptionParams

def setEnableDefaultObfuscationEquivalents(value: Boolean): DeIdentificationModel.this.type

Sets whether to enable default obfuscation equivalents for common entities.

Sets whether to enable default obfuscation equivalents for common entities. This parameter allows the system to automatically include a set of predefined common English name equivalents. Default: false

Definition Classes: BaseDeidParams

def setEntityCasingModes(value: Map[String, Array[String]]): DeIdentificationModel.this.type

Set dictionary with entity casing modes that match some entities.

Set dictionary with entity casing modes that match some entities.

'lowercase': Converts all characters to lower case using the rules of the default locale.
'uppercase': Converts all characters to upper case using the rules of the default locale.
'capitalize': Converts the first character to upper case and converts others to lower case.
'titlecase': Converts the first character in every token to upper case and converts others to lower case.
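
The four casing modes can be sketched with plain string operations. CasingSketch and applyCasing are hypothetical names, and treating a single space as the token separator for 'titlecase' is an assumption.

```scala
// Illustrative sketch only; not the library's implementation.
object CasingSketch {
  def applyCasing(text: String, mode: String): String = mode match {
    case "lowercase"  => text.toLowerCase // default-locale rules
    case "uppercase"  => text.toUpperCase // default-locale rules
    case "capitalize" => text.toLowerCase.capitalize
    case "titlecase"  =>
      // Assumption: tokens are separated by single spaces.
      text.split(" ").map(_.toLowerCase.capitalize).mkString(" ")
    case other => throw new IllegalArgumentException(s"Unknown casing mode: $other")
  }
}
```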

def setFakerLengthOffset(value: Int): DeIdentificationModel.this.type

Sets fakerLengthOffset param

Definition Classes: BaseDeidParams

def setFixedMaskLength(value: Int): DeIdentificationModel.this.type

Sets the value of fixedMaskLength.

Sets the value of fixedMaskLength. This is the length of the masking sequence that will be used when the 'fixed_length_chars' masking policy is selected.

Definition Classes: MaskingParams

def setGenderAwareness(value: Boolean): DeIdentificationModel.this.type

Whether to use gender-aware names or not during obfuscation.

Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: false

Definition Classes: BaseDeidParams

def setGeoConsistency(value: Boolean): DeIdentificationModel.this.type

Sets the value of geoConsistency.

Sets the value of geoConsistency. When set to true, it enables consistent obfuscation across geographical entities such as state, city, street, zip, and phone.

Definition Classes: BaseDeidParams

def setGroupByCol(value: String): DeIdentificationModel.this.type

Sets groupByCol param to group the dataset.

Sets groupByCol param to group the dataset. This parameter is used in conjunction with consistentObfuscation to ensure consistent obfuscation within each group.

Definition Classes: DeIdentificationParams
Note: This functionality can change the order of the dataset, so it is recommended to use it with caution.
Note: This functionality is not supported by LightPipeline.

def setIgnoreRegex(s: Boolean): DeIdentificationModel.this.type

Select whether to use the regex file loaded in the model.

Select whether to use the regex file loaded in the model. If true, the default regex file will not be used. The default value is false.

Definition Classes: DeIdentificationParams

final def setInputCols(value: String*): DeIdentificationModel.this.type

Definition Classes: HasInputAnnotationCols

def setInputCols(value: Array[String]): DeIdentificationModel.this.type

Definition Classes: HasInputAnnotationCols

def setIsRandomDateDisplacement(s: Boolean): DeIdentificationModel.this.type

Whether to use a random number of displacement days for date entities. The random number is based on DeIdentificationParams.seed. If true, random displacement days are used for date entities; if false, DeIdentificationParams.days is used. The default value is false.

Definition Classes: DeIdentificationParams

def setKeepMonth(value: Boolean): DeIdentificationModel.this.type

Sets whether to keep the month intact when obfuscating date entities.

Sets whether to keep the month intact when obfuscating date entities. If true, the month will remain unchanged during the obfuscation process. If false, the month will be modified along with the year and day. Default: false.

Definition Classes: BaseDeidParams

def setKeepTextSizeForObfuscation(value: Boolean): DeIdentificationModel.this.type

Sets keepTextSizeForObfuscation param

Definition Classes: BaseDeidParams

def setKeepYear(value: Boolean): DeIdentificationModel.this.type

Sets whether to keep the year intact when obfuscating date entities.

Sets whether to keep the year intact when obfuscating date entities. If true, the year will remain unchanged during the obfuscation process. If false, the year will be modified along with the month and day. Default: false.

Definition Classes: BaseDeidParams

def setLanguage(s: String): DeIdentificationModel.this.type

The language used to select the regex file and some faker entities.

The language used to select the regex file and some faker entities. 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'

Definition Classes: BaseDeidParams

def setLazyAnnotator(value: Boolean): DeIdentificationModel.this.type

Definition Classes: CanBeLazy

def setMappingsColumn(s: String): DeIdentificationModel.this.type

Name of the mapping column that returns the annotation chunks together with their fake entities.

Definition Classes: DeIdentificationParams

def setMaskingPolicy(value: String): DeIdentificationModel.this.type

Select the masking policy:

'entity_labels': Replace the values with the entity label.
'same_length_chars': Replace the entity with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 characters (like Jo, or 5), asterisks are used without brackets.
'fixed_length_chars': Replace the entity with a masking sequence composed of a fixed number of asterisks.
'entity_labels_without_brackets': Replace the values with the entity label without brackets.
'same_length_chars_without_brackets': Replace the entity with asterisks of the same length, without brackets.
Default: 'entity_labels'

Definition Classes: MaskingParams

def setMetadataMaskingPolicy(value: String): DeIdentificationModel.this.type

If specified, the metadata includes the masked form of the document.

If specified, the metadata includes the masked form of the document. Select one of the following masking policies to return the masked form in the metadata:

'entity_labels': Replace the values with the entity label.
'same_length_chars': Replace the entity with asterisks of the same length minus two, plus brackets on both ends. If the entity is shorter than 3 characters (like Jo, or 5), asterisks are used without brackets.
'fixed_length_chars': Replace the entity with a masking sequence composed of a fixed number of asterisks.
'entity_labels_without_brackets': Replace the values with the entity label without brackets.
'same_length_chars_without_brackets': Replace the entity with asterisks of the same length, without brackets.
Default: ""

Definition Classes: DeIdentificationParams

def setMinYear(s: Int): DeIdentificationModel.this.type

Minimum year to use when converting date to year

Definition Classes: DeIdentificationParams

def setMode(m: String): DeIdentificationModel.this.type

Mode for Anonymizer ['mask'|'obfuscate'].

Mode for Anonymizer ['mask'|'obfuscate']. Default: 'mask'

Mask mode: The entities will be replaced by their entity types.
Obfuscate mode: The entity is replaced by an obfuscator's term.

Definition Classes: BaseDeidParams

def setObfuscateByAgeGroups(value: Boolean): DeIdentificationModel.this.type

Sets whether to obfuscate ages based on age groups.

When true, the age groups specified in the ageGroups parameter will be used to obfuscate ages. When false, the age ranges specified in the ageRanges parameter will be used to obfuscate ages. Default: false.

Definition Classes: DeIdentificationParams

def setObfuscateDate(s: Boolean): DeIdentificationModel.this.type

When mode=="obfuscate" whether to obfuscate dates or not.

When mode=="obfuscate", whether to obfuscate dates or not. This param helps with consistency by making dateFormats more visible. When set to true, make sure the dateFormats param fits your needs. If the value is true and obfuscation fails, unnormalizedDateMode is activated. When set to false, the date is masked to <DATE>. Default: false

Definition Classes: BaseDeidParams

def setObfuscateRefSource(s: String): DeIdentificationModel.this.type

The source of obfuscation to obfuscate the entities.

The source of obfuscation to obfuscate the entities. The values are the following:

'file': Takes the fakes from the obfuscatorRefFile.
'faker': Takes the fakes from the Faker module.
'both': Takes the fakes from the obfuscatorRefFile and the Faker module randomly.

Definition Classes: BaseDeidParams

def setObfuscationEquivalents(equivalents: ArrayList[ArrayList[String]]): DeIdentificationModel.this.type

Definition Classes: BaseDeidParams

def setObfuscationEquivalents(equivalents: Array[Array[String]]): DeIdentificationModel.this.type

Sets variant-to-canonical entity mappings to ensure consistent obfuscation.

This method allows you to define equivalence rules for entity variants that should be obfuscated the same way. For example, the names "Alex" and "Alexander" will always be mapped to the same obfuscated value if they are linked to the same canonical form.

It accepts an array of string triplets, where each triplet defines:

variant: A non-standard, short, or alternative form of a value (e.g., "Alex")
entityType: The type of the entity (e.g., "NAME", "STATE", "COUNTRY")
canonical: The standardized form all variants map to (e.g., "Alexander")

variant and entityType comparisons are case-insensitive during processing.

This is especially useful in de-identification tasks to ensure consistent replacement of semantically identical values. It also allows cross-variant normalization across different occurrences of sensitive data.

Example

val equivalents = Array(
  Array("Alex", "NAME", "Alexander"),
  Array("Rob", "NAME", "Robert"),
  Array("CA", "STATE", "California"),
  Array("Calif.", "STATE", "California")
)

myDeidTransformer.setObfuscationEquivalents(equivalents)

equivalents: Array of [variant, entityType, canonical] entries.

Definition Classes: BaseDeidParams
Exceptions thrown: IllegalArgumentException if any entry does not have exactly 3 elements.

def setObfuscationEquivalents(equivalents: Array[StaticObfuscationEntity]): DeIdentificationModel.this.type

Sets obfuscationEquivalents param.

Definition Classes: BaseDeidParams

def setObfuscationStrategyOnException(value: String): DeIdentificationModel.this.type

Sets the obfuscation strategy to be applied when an exception occurs.

The obfuscation strategy determines how obfuscation is handled in case of an exception. Four possible values are supported:

"mask": The original chunk is replaced with a masking pattern.
"default": The original chunk is replaced with a default faker.
"skip": The original chunk is not replaced with any faker.
"exception": Throws the exception.

The default obfuscation strategy is "default".

Definition Classes: DeIdentificationParams

def setOutputAsDocument(mode: Boolean): DeIdentificationModel.this.type

Whether to return all sentences joined into a single document

Definition Classes: DeIdentificationParams

final def setOutputCol(value: String): DeIdentificationModel.this.type

Definition Classes: HasOutputAnnotationCol

def setParent(parent: Estimator[DeIdentificationModel]): DeIdentificationModel

Definition Classes: Model

def setRegexOverride(s: Boolean): DeIdentificationModel.this.type

If the value is true, prioritize the regex entities; if the value is false, prioritize the ner.

If the value is true, prioritize the regex entities; if the value is false, prioritize the ner. The default value is false. If DeIdentification.combineRegexPatterns is true, this value has no effect.

Definition Classes: DeIdentificationParams

def setRegexPatternsDictionary(value: Map[String, Array[String]]): DeIdentificationModel.this.type

dictionary with regular expression patterns that match some protected entity

def setRegion(s: String): DeIdentificationModel.this.type

With this property, you can select particular dateFormats.

With this property, you can select particular dateFormats. This property is especially used when obfuscating dates. You can decide whether the first part of 11/11/2023 is a day or the second part is a day when obfuscating dates. The values are as follows:

'eu' for European Union
'us' for USA

Definition Classes: BaseDeidParams

def setReturnEntityMappings(s: Boolean): DeIdentificationModel.this.type

Whether to return the mapping column.

Definition Classes: DeIdentificationParams

def setSameEntityThreshold(s: Double): DeIdentificationModel.this.type

Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9). This threshold does not apply to date entities.

Definition Classes: DeIdentificationParams

def setSameLengthFormattedEntities(entities: Array[String]): DeIdentificationModel.this.type

List of formatted entities to generate the same length outputs as original ones during obfuscation.

List of formatted entities to generate the same length outputs as original ones during obfuscation. The supported and default formatted entities are: PHONE, FAX, CONTACT, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE, IRS, CFN, ACCOUNT.

Definition Classes: BaseDeidParams

def setSeed(s: Int): DeIdentificationModel.this.type

Seed used to select the entities in obfuscate mode.

Seed used to select the entities in obfuscate mode. With a fixed seed, you can replay an execution several times and get the same output.

Definition Classes: BaseDeidParams

def setSelectiveObfuscateRefSource(value: HashMap[String, String]): DeIdentificationModel.this.type

Definition Classes: BaseDeidParams

def setSelectiveObfuscateRefSource(value: Map[String, String]): DeIdentificationModel.this.type

Sets the value of selectiveObfuscateRefSource.

Sets the value of selectiveObfuscateRefSource. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation method. The values can be:

'file': Takes the fakes from the file.
'faker': Takes the fakes from the embedded faker module.
'both': Takes the fakes from the file and the faker module.

Definition Classes: BaseDeidParams
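
The fallback rule for setSelectiveObfuscateRefSource can be sketched as a plain map lookup. This is an illustration only: RefSourceSketch and resolveSource are hypothetical names, and exact-match (case-sensitive) entity keys are an assumption, not documented behavior.

```scala
object RefSourceSketch {
  // Source names listed in the description of selectiveObfuscateRefSource.
  val allowed: Set[String] = Set("file", "faker", "both")

  // Returns the per-entity source if one is configured, otherwise the
  // global default (the role played by the obfuscateRefSource param).
  def resolveSource(
      entity: String,
      selective: Map[String, String],
      default: String): String = {
    val source = selective.getOrElse(entity, default)
    require(allowed.contains(source), s"Unsupported obfuscation source: $source")
    source
  }
}
```

With a map of `"PHONE" -> "file"` and a default of `"faker"`, PHONE resolves to the file source and every other entity resolves to the default.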

def setSelectiveObfuscationModes(value: HashMap[String, List[String]]): DeIdentificationModel.this.type

Definition Classes: BaseDeidParams

def setSelectiveObfuscationModes(value: Map[String, Array[String]]): DeIdentificationModel.this.type

Sets the value of selectiveObfuscationModes, the dictionary of modes that enables multi-mode deidentification. The keys are modes and the values are the entities each mode applies to. Supported modes:

'obfuscate': Replace the values with random values.
'mask_same_length_chars': Replace the values with asterisks of the same length minus two, plus a bracket on each end.
'mask_entity_labels': Replace the values with the entity label.
'mask_fixed_length_chars': Replace the values with a fixed number of asterisks. You should also invoke setFixedMaskLength().
'mask_entity_labels_without_brackets': Replace the values with the entity label, without brackets.
'mask_same_length_chars_without_brackets': Replace the values with asterisks of the same length, without brackets.
'skip': Leave the entities intact.

Entities not listed in the dictionary are deidentified according to setMode().

Example:

deidAnnotator
  .setMode("mask") // applies to entities not listed below
  .setSelectiveObfuscationModes(Map(
    "obfuscate" -> Array("PHONE", "EMAIL"),
    "mask_entity_labels" -> Array("NAME", "CITY"),
    "skip" -> Array("ID", "IDNUM"),
    "mask_same_length_chars" -> Array("FAX"),
    "mask_fixed_length_chars" -> Array("ZIP")
  ))
  .setFixedMaskLength(4) // length used by mask_fixed_length_chars

Definition Classes: BaseDeidParams
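
To make the masking modes of setSelectiveObfuscationModes concrete, here is a minimal library-free sketch of what each one produces for a single entity. It is not the model's code: MaskSketch and mask are hypothetical names, the angle brackets around labels and the square brackets around asterisks are assumptions about the bracket style, and 'obfuscate' is left out because it needs a source of fake values.

```scala
object MaskSketch {
  // Applies one masking mode to a single entity; `fixedLength` plays the
  // role of the value passed to setFixedMaskLength.
  def mask(text: String, label: String, mode: String, fixedLength: Int = 4): String =
    mode match {
      case "mask_same_length_chars" =>
        // Two brackets plus (length - 2) asterisks keep the original length.
        "[" + "*" * math.max(0, text.length - 2) + "]"
      case "mask_same_length_chars_without_brackets" => "*" * text.length
      case "mask_fixed_length_chars" => "*" * fixedLength
      case "mask_entity_labels" => s"<$label>"
      case "mask_entity_labels_without_brackets" => label
      case "skip" => text
      case other =>
        throw new IllegalArgumentException(s"Mode not covered by this sketch: $other")
    }
}
```

For the ten-character input "John Smith" with label NAME, mask_same_length_chars gives a ten-character result made of eight asterisks inside brackets, while mask_fixed_length_chars always gives four asterisks with the default length.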

def setStaticObfuscationPairs(pairs: ArrayList[ArrayList[String]]): DeIdentificationModel.this.type

Definition Classes: BaseDeidParams

def setStaticObfuscationPairs(pairs: Array[StaticObfuscationEntity]): DeIdentificationModel.this.type

Definition Classes: BaseDeidParams

def setStaticObfuscationPairs(pairs: Array[Array[String]]): DeIdentificationModel.this.type

Sets the static obfuscation pairs. Each pair must contain exactly three elements: [original, entityType, fake].

pairs: An array of arrays containing the static obfuscation pairs.

Definition Classes: BaseDeidParams
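
The three-element shape expected by setStaticObfuscationPairs(pairs: Array[Array[String]]) can be checked before the pairs are passed in. The sketch below is only an illustration: StaticPairsSketch and StaticPair are hypothetical and are not the library's StaticObfuscationEntity, and keying the lookup by (original, entityType) is an assumption about how such pairs would be used.

```scala
object StaticPairsSketch {
  // Hypothetical holder for one triple.
  final case class StaticPair(original: String, entityType: String, fake: String)

  // Rejects any inner array that does not have exactly three elements.
  def parse(pairs: Array[Array[String]]): Array[StaticPair] = pairs.map {
    case Array(original, entityType, fake) => StaticPair(original, entityType, fake)
    case other =>
      throw new IllegalArgumentException(
        s"Expected [original, entityType, fake], got ${other.length} element(s)")
  }

  // Builds a lookup from (original, entityType) to the fixed fake value.
  def lookup(pairs: Array[StaticPair]): Map[(String, String), String] =
    pairs.map(p => (p.original, p.entityType) -> p.fake).toMap
}
```

Validating up front makes a malformed entry fail with a clear message instead of surfacing later during annotation.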

def setUnnormalizedDateMode(mode: String): DeIdentificationModel.this.type

The mode to use when a date cannot be normalized to a known format. Options: [mask, obfuscate, skip]. Default: obfuscate.

Definition Classes: BaseDeidParams

def setUseShiftDays(s: Boolean): DeIdentificationModel.this.type

Sets the value of useShiftDays: whether to use the random shift day when the document has one in its metadata. DocumentHashCoder can create a 'dateshift' value based on the document. Default: false.

Definition Classes: DeIdentificationParams → BaseDeidParams
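
The idea behind useShiftDays can be sketched with java.time: when the document metadata carries a 'dateshift' value, that value drives the shift. This is an illustration, not the model's code: DateShiftSketch and shift are hypothetical names, and the yyyy-MM-dd format, the integer encoding of the metadata value, and the fallback to a caller-supplied default are all assumptions.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateShiftSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  // Shifts a date by the number of days found under the 'dateshift' metadata
  // key, or by `defaultDays` when the document has no such entry.
  def shift(date: String, metadata: Map[String, String], defaultDays: Int): String = {
    val days = metadata.get("dateshift").map(_.toInt).getOrElse(defaultDays)
    LocalDate.parse(date, fmt).plusDays(days.toLong).format(fmt)
  }
}
```

Because the shift is read per document, all dates within one document move by the same number of days, so the intervals between them are preserved.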

def setZipCodeTag(s: String): DeIdentificationModel.this.type

Definition Classes: DeIdentificationParams

def shouldUseConsistentNameParts(entityClass: String): Boolean

Attributes: protected
Definition Classes: DeidModelParams

val staticObfuscationPairs: StructFeature[Array[StaticObfuscationEntity]]

A resource containing static obfuscation pairs. Each pair should contain three elements: original, entity type, and fake.

Definition Classes: BaseDeidParams

final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes: AnyRef

def toString(): String

Definition Classes: Identifiable → AnyRef → Any

final def transform(dataset: Dataset[_]): DataFrame

Definition Classes: AnnotatorModel → Transformer

def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame

Definition Classes: Transformer
Annotations: @Since( "2.0.0" )

def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame

Definition Classes: Transformer
Annotations: @Since( "2.0.0" ) @varargs()

final def transformSchema(schema: StructType): StructType

Definition Classes: RawAnnotator → PipelineStage

def transformSchema(schema: StructType, logging: Boolean): StructType

Attributes: protected
Definition Classes: PipelineStage
Annotations: @DeveloperApi()

def udfDocuments: UserDefinedFunction

def udfProtectedEntities: UserDefinedFunction

val uid: String

Definition Classes: DeIdentificationModel → Identifiable

val unnormalizedDateMode: Param[String]

The mode to use when a date cannot be normalized to a known format. Options: [mask, obfuscate, skip]. Default: obfuscate.

Definition Classes: BaseDeidParams

val useShifDays: BooleanParam

Whether to use the random shift day when the document has one in its metadata. Default: false.

Definition Classes: DeIdentificationParams

val useShiftDays: BooleanParam

Whether to use the random shift day when the document has one in its metadata. DocumentHashCoder can create a 'dateshift' value based on the document. Default: false.

Definition Classes: BaseDeidParams

def validate(schema: StructType): Boolean

Attributes: protected
Definition Classes: RawAnnotator

final def wait(): Unit

Definition Classes: AnyRef
Annotations: @throws( ... )

final def wait(arg0: Long, arg1: Int): Unit

Definition Classes: AnyRef
Annotations: @throws( ... )

final def wait(arg0: Long): Unit

Definition Classes: AnyRef
Annotations: @throws( ... ) @native()

def wrapColumn(col: Column): Column

def wrapColumnMetadata(col: Column): Column

Attributes: protected
Definition Classes: RawAnnotator

def write: MLWriter

Definition Classes: ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable

val zipCodeTag: Param[String]

Definition Classes: DeIdentificationParams

class DeIdentificationModel extends AnnotatorModel[DeIdentificationModel] with DeIdentificationParams with DeidModelParams with HasSimpleAnnotate[DeIdentificationModel] with HandleExceptionParams with HasSafeAnnotate[DeIdentificationModel] with CheckLicense
