com.johnsnowlabs.nlp.annotators.deid

StructuredDeidentification

case class StructuredDeidentification(columnsMap: Map[String, String], seedMap: Map[String, Int] = Collections.emptyMap(), obfuscateRefFile: String = "", obfuscateRefSource: String = "both", days: Int = 0, useRandomDateDisplacement: Boolean = false, dateFormats: List[String] = ..., language: String = Language.English, idColumn: String = "", region: String = "", keepYear: Boolean = false, keepMonth: Boolean = false, unnormalizedDateMode: String = "obfuscate", keepTextSizeForObfuscation: Boolean = false, fakerLengthOffset: Int = 3, genderAwareness: Boolean = false, ageRangesByHipaa: Boolean = false) extends Product with Serializable

Utility class that helps to obfuscate tabular data.

columnsMap

A map that assigns an entity to each column to obfuscate. Each key is the name of a column in the dataframe and each value is the entity for that column. The default entities are:

| Entity | Description |
|---|---|
| location | A general location |
| location-other | A location that is not a country, street, hospital, city or state |
| street | A street |
| hospital | The name of a hospital |
| city | A city |
| state | A state |
| zip | A zip code |
| country | A country |
| contact | The contact information of a person |
| username | A username |
| phone | A phone number |
| fax | A fax number |
| url | An internet URL |
| email | The email address of a person |
| profession | The profession of a person |
| name | The name of a person |
| doctor | The name of a doctor |
| patient | The name of a patient |
| id | A general id number |
| bioid | A biometric identifier |
| age | The age of something or someone |
| organization | The name of an organization or company |
| healthplan | The id that identifies a health plan |
| medicalrecord | The identifier of a medical record |
| device | The id that identifies a device |
| date | A general date |
| ssn | A Social Security Number |
| ip | An Internet Protocol (IP) address |
| passport | A passport number |
| dln | A Driver's License Number |
| npi | A National Provider Identifier |
| c_card | A credit card number |
| iban | An International Bank Account Number |
| dea | A Drug Enforcement Administration (DEA) number |

If the entity is not present in this list, the value will be masked.

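As a minimal sketch of the shape of columnsMap, the snippet below maps hypothetical column names to entities and lists the columns whose entity falls outside the table above. It reads the closing note as "entities outside the list are masked", which is an assumption, and it does not call the library itself. The object and column names are illustrative only.

```scala
object ColumnsMapSketch {
  // Entity names copied from the list above.
  val knownEntities: Set[String] = Set(
    "location", "location-other", "street", "hospital", "city", "state", "zip",
    "country", "contact", "username", "phone", "fax", "url", "email",
    "profession", "name", "doctor", "patient", "id", "bioid", "age",
    "organization", "healthplan", "medicalrecord", "device", "date", "ssn",
    "ip", "passport", "dln", "npi", "c_card", "iban", "dea")

  // Hypothetical dataframe columns (keys) mapped to entities (values).
  val columnsMap: Map[String, String] = Map(
    "PATIENT_NAME" -> "patient",
    "DOB"          -> "date",
    "PHONE_NUMBER" -> "phone",
    "NOTES"        -> "free-text" // not one of the listed entities
  )

  // Columns whose entity is not in the list (assumed to be masked rather than obfuscated).
  def unlistedColumns(m: Map[String, String]): Set[String] =
    m.collect { case (column, entity) if !knownEntities(entity) => column }.toSet

  def main(args: Array[String]): Unit =
    println(unlistedColumns(columnsMap)) // prints Set(NOTES)
}
```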
seedMap

Allows you to set a seed for each column that you want to obfuscate. The seed is used to randomly select the replacement entities in obfuscation mode. Providing the same seed reproduces the same mapping across runs.
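The effect of a fixed seed can be illustrated with plain scala.util.Random. This is a sketch of the general idea of seeded selection, not the library's implementation; the replacement pool and the obfuscate helper are hypothetical.

```scala
import scala.util.Random

object SeedSketch {
  // Hypothetical pool of replacement values for a "city" column.
  val fakeCities: Vector[String] = Vector("Springfield", "Riverton", "Lakeside", "Fairview")

  // Draws one replacement per input value from a generator built from the seed.
  def obfuscate(values: Seq[String], seed: Int): Seq[String] = {
    val rng = new Random(seed)
    values.map(_ => fakeCities(rng.nextInt(fakeCities.length)))
  }

  def main(args: Array[String]): Unit = {
    val first  = obfuscate(Seq("Boston", "Austin"), seed = 42)
    val second = obfuscate(Seq("Boston", "Austin"), seed = 42)
    println(first == second) // prints true: the same seed repeats the same draws
  }
}
```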

obfuscateRefFile

An optional parameter that lets you supply your own terms for obfuscation. In the file, the key is the entity and the value is the terms that will be used to obfuscate it.

obfuscateRefSource

The source of the values used to obfuscate the entities. It does not apply to date entities. The allowed values are the following:

  • 'file': Takes the entities from the obfuscateRefFile
  • 'faker': Takes the entities from the Faker module
  • 'both': Takes the entities randomly from the obfuscateRefFile and the Faker module

days

Number of days by which to displace dates during obfuscation. If not provided, a random integer between 1 and 60 will be used.

useRandomDateDisplacement

Whether to use a random number of displacement days for date entities. If true, a random displacement is used; otherwise the days parameter is used.
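Date displacement itself is simple arithmetic, which the following sketch shows with java.time.LocalDate. It only illustrates the choice between a fixed and a random shift as described for days and useRandomDateDisplacement; the helper names are hypothetical and the library's internal logic may differ.

```scala
import java.time.LocalDate
import scala.util.Random

object DateDisplacementSketch {
  // Shifts a date by a fixed number of days (the role played by `days`).
  def displace(date: LocalDate, days: Int): LocalDate = date.plusDays(days.toLong)

  // Draws a displacement between 1 and 60 inclusive, the documented fallback range.
  def randomDays(rng: Random): Int = 1 + rng.nextInt(60)

  // Uses a random shift when useRandom is true, otherwise the fixed `days` value.
  def shift(date: LocalDate, days: Int, useRandom: Boolean, rng: Random): LocalDate =
    if (useRandom) displace(date, randomDays(rng)) else displace(date, days)

  def main(args: Array[String]): Unit = {
    val date = LocalDate.parse("2020-01-15")
    println(shift(date, days = 5, useRandom = false, rng = new Random(1))) // prints 2020-01-20
  }
}
```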

dateFormats

Formats of the dates to displace.

language

The language used to select faker entities. The allowed values are the following:

  • 'en' (English)
  • 'de' (German)
  • 'es' (Spanish)
  • 'fr' (French)
  • 'ar' (Arabic)
  • 'ro' (Romanian)

Default: 'en'.

idColumn

The column that contains the id of the row. If provided, data will be obfuscated consistently by idColumn, especially date entities.

region

Selects region-specific date formats. This property is used mainly when obfuscating dates. The allowed values are 'eu' for the European Union and 'us' for the USA.

keepYear

Whether to keep the year intact when obfuscating date entities. If true, the year will remain unchanged during the obfuscation process.

keepMonth

Whether to keep the month intact when obfuscating date entities. If true, the month will remain unchanged during the obfuscation process.

unnormalizedDateMode

The mode to use when a date cannot be normalized. The allowed values are 'mask', 'obfuscate' and 'skip'. Default: 'obfuscate'.

keepTextSizeForObfuscation

Whether the output should maintain the same character length as the input text.

fakerLengthOffset

Specifies how much length deviation is accepted during obfuscation when keepTextSizeForObfuscation is enabled. The value must be greater than 0. Default: 3.
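One way to read the length constraint is as a filter on candidate replacements: keep only those whose length differs from the original by at most the offset. The sketch below assumes that reading; withinOffset and the candidate names are hypothetical and not part of the library.

```scala
object LengthOffsetSketch {
  // Keeps the candidates whose length is within `offset` characters of the original.
  def withinOffset(original: String, candidates: Seq[String], offset: Int = 3): Seq[String] = {
    require(offset > 0, "offset must be greater than 0")
    candidates.filter(c => math.abs(c.length - original.length) <= offset)
  }

  def main(args: Array[String]): Unit = {
    val candidates = Seq("Al", "Michael", "Christopher", "Maximilianus")
    // "Jonathan" has 8 characters, so lengths 5 to 11 are accepted.
    println(withinOffset("Jonathan", candidates)) // prints List(Michael, Christopher)
  }
}
```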

genderAwareness

Whether to use gender-aware names during obfuscation. This parameter affects only names. Setting it to true might decrease performance.

ageRangesByHipaa

Whether to obfuscate ages according to the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.
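For context, the HIPAA Safe Harbor method allows ages over 89 to be aggregated into a single category of 90 or older. The sketch below shows that rule in isolation; the "90+" label and the generalize helper are assumptions, and the exact output the library produces for this option is not specified here.

```scala
object HipaaAgeSketch {
  // Ages of 90 and above collapse into one category; younger ages are left unchanged.
  def generalize(age: Int): String =
    if (age >= 90) "90+" else age.toString

  def main(args: Array[String]): Unit = {
    println(generalize(45)) // prints 45
    println(generalize(93)) // prints 90+
  }
}
```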

Linear Supertypes
Serializable, Serializable, Product, Equals, AnyRef, Any

Instance Constructors

  1. new StructuredDeidentification(columnsMap: Map[String, String], seedMap: Map[String, Int] = Collections.emptyMap(), obfuscateRefFile: String = "", obfuscateRefSource: String = "both", days: Int = 0, useRandomDateDisplacement: Boolean = false, dateFormats: List[String] = ..., language: String = Language.English, idColumn: String = "", region: String = "", keepYear: Boolean = false, keepMonth: Boolean = false, unnormalizedDateMode: String = "obfuscate", keepTextSizeForObfuscation: Boolean = false, fakerLengthOffset: Int = 3, genderAwareness: Boolean = false, ageRangesByHipaa: Boolean = false)

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. val ageRangesByHipaa: Boolean
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  7. val columnsMap: Map[String, String]
  8. val dateFormats: List[String]
  9. val days: Int
  10. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  11. val fakerLengthOffset: Int
  12. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  13. val genderAwareness: Boolean
  14. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  15. val idColumn: String
  16. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  17. val keepMonth: Boolean
  18. val keepTextSizeForObfuscation: Boolean
  19. val keepYear: Boolean
  20. val language: String
  21. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  22. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  23. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  24. def obfuscateColumns(dataFrame: DataFrame, outputAsArray: Boolean = true, overwrite: Boolean = true, suffix: String = "_obfuscated"): DataFrame

    Obfuscates the columns of a dataframe. Each column is obfuscated according to the entity assigned to it in columnsMap.

    dataFrame

    The dataframe to obfuscate.

    outputAsArray

    If true, each output column is an array of strings; otherwise it is a string.

    overwrite

    If true, the output columns overwrite the input columns; otherwise new columns are created with the suffix.

    suffix

    The suffix to add to the output columns if overwrite is false.
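    The interplay of overwrite and suffix can be sketched as a small naming rule. This only mirrors the parameter descriptions above, assuming the suffix is appended to the input column name; outputName is a hypothetical helper, not a method of the class.

```scala
object OutputColumnSketch {
  // Name of the column that receives the obfuscated values for a given input column.
  def outputName(input: String, overwrite: Boolean = true, suffix: String = "_obfuscated"): String =
    if (overwrite) input else input + suffix

  def main(args: Array[String]): Unit = {
    println(outputName("NAME"))                    // prints NAME
    println(outputName("NAME", overwrite = false)) // prints NAME_obfuscated
  }
}
```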

  25. val obfuscateRefFile: String
  26. val obfuscateRefSource: String
  27. val region: String
  28. val seedMap: Map[String, Int]
  29. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  30. val unnormalizedDateMode: String
  31. val useRandomDateDisplacement: Boolean
  32. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  33. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  34. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
