package deid


Type Members

  1. class DeIdentification extends AnnotatorApproach[DeIdentificationModel] with DeIdentificationParams with CheckLicense

    Contains all the methods for training a DeIdentificationModel. This module can obfuscate or mask the entities that contain personal information. The entities can be defined with a file of regex patterns passed to setRegexPatternsDictionary, where each line maps an entity to a regex.

    DATE \d{4}
    AID \d{6,7}

    Additionally, obfuscation strings can be defined with setObfuscateRefFile, where each line maps a string to an entity. The format and separator can be specified with setRefFileFormat and setRefSep.

    Dr. Gregory House#DOCTOR
    01010101#MEDICALRECORD
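
    The two file formats above can be read with a few lines of plain Scala. This is a minimal sketch, not the library's own parser: the helper names parseRegexLine and parseRefLine are hypothetical, and it assumes that a regex dictionary line separates the entity from the pattern at the first run of whitespace and that a reference line uses the separator given to setRefSep ("#" here).

    ```scala
    import scala.util.matching.Regex

    object DeidFileFormats {
      // Hypothetical helper: splits "ENTITY <regex>" at the first run of whitespace.
      def parseRegexLine(line: String): Option[(String, Regex)] =
        line.trim.split("\\s+", 2) match {
          case Array(entity, pattern) => Some(entity -> pattern.r)
          case _                      => None // blank line or missing pattern
        }

      // Hypothetical helper: splits "replacement<sep>ENTITY" at the last separator
      // and returns (entity, replacement).
      def parseRefLine(line: String, sep: String = "#"): Option[(String, String)] = {
        val idx = line.lastIndexOf(sep)
        if (idx <= 0) None
        else Some(line.substring(idx + sep.length).trim -> line.substring(0, idx).trim)
      }

      def main(args: Array[String]): Unit = {
        val regexes = Seq("DATE \\d{4}", "AID \\d{6,7}").flatMap(parseRegexLine(_)).toMap
        val refs = Seq("Dr. Gregory House#DOCTOR", "01010101#MEDICALRECORD")
          .flatMap(parseRefLine(_)).toMap

        println(regexes("AID").findFirstIn("# 7194334 Date")) // Some(7194334)
        println(refs("DOCTOR"))                               // Dr. Gregory House
      }
    }
    ```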

    The configuration params for that module are in trait DeIdentificationParams.

    See also

    DeIdentificationModel

    DeIdentificationParams

    train

    Ideally, this annotator works together with demographic named entity recognizers, which can be trained with TextMatchers, RegexMatchers, DateMatchers, NerCRFs or NerDLs. The example below shows a pipeline for deidentification.

    Example

    val documentAssembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
        .setInputCols(Array("document"))
        .setOutputCol("sentence")
        .setUseAbbreviations(true)
    
    val tokenizer = new Tokenizer()
        .setInputCols(Array("sentence"))
        .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel
        .pretrained("embeddings_clinical", "en", "clinical/models")
        .setInputCols(Array("sentence", "token"))
        .setOutputCol("embeddings")

    NER entities

    val clinical_sensitive_entities = MedicalNerModel
        .pretrained("ner_deid_enriched", "en", "clinical/models")
        .setInputCols(Array("sentence", "token", "embeddings"))
        .setOutputCol("ner")
    
    // The output column must match the "ner_chunk" input of DeIdentification below
    val nerConverter = new NerConverter()
        .setInputCols(Array("sentence", "token", "ner"))
        .setOutputCol("ner_chunk")

    Deidentification

    val deIdentification = new DeIdentification()
        .setInputCols(Array("ner_chunk", "token", "sentence"))
        .setOutputCol("dei")
        // file with custom regex patterns for custom entities
        .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
        // file with custom obfuscator names for the entities
        .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
        .setRefFileFormat("csv")
        .setRefSep("#")
        .setMode("obfuscate")
        .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
        .setObfuscateDate(true)
        .setDateTag("DATE")
        .setDays(5)
        .setObfuscateRefSource("file")

    Pipeline

    // Assumes an active SparkSession named spark; its implicits provide toDF
    import spark.implicits._
    import org.apache.spark.ml.Pipeline
    
    val data = Seq(
      "# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
    ).toDF("text")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      clinical_sensitive_entities,
      nerConverter,
      deIdentification
    ))
    val result = pipeline.fit(data).transform(data)

    Show Results

    result.select("dei.result").show(truncate = false)
    +--------------------------------------------------------------------------------------------------+
    |result                                                                                            |
    +--------------------------------------------------------------------------------------------------+
    |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
    +--------------------------------------------------------------------------------------------------+
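
    The date shift visible in the result (01/13/93 becomes 01/18/93 with setDays(5)) can be reproduced with java.time. This is only a sketch of displacement by a fixed number of days, not the annotator's implementation: the object name DateShiftSketch and the helper shift are hypothetical, and it assumes the configured formats are tried in order and that the first matching format is reused for the output.

    ```scala
    import java.time.LocalDate
    import java.time.format.DateTimeFormatter
    import scala.util.Try

    object DateShiftSketch {
      // Tries each pattern in order; the first pattern that parses the date
      // is also used to format the shifted date. Returns None if none matches.
      def shift(date: String, days: Int, patterns: Seq[String]): Option[String] =
        patterns.iterator
          .map(p => DateTimeFormatter.ofPattern(p))
          .map(fmt => Try(LocalDate.parse(date, fmt).plusDays(days).format(fmt)).toOption)
          .collectFirst { case Some(shifted) => shifted }

      def main(args: Array[String]): Unit = {
        val formats = Seq("MM/dd/yy", "yyyy-MM-dd")
        println(shift("01/13/93", 5, formats))   // Some(01/18/93)
        println(shift("2079-11-09", 5, formats)) // Some(2079-11-14)
      }
    }
    ```

    Because plusDays works on calendar dates, a shift that crosses a month or year boundary still yields a valid date in the original format.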
  2. class DeIdentificationModel extends AnnotatorModel[DeIdentificationModel] with DeIdentificationParams with HasSimpleAnnotate[DeIdentificationModel] with CheckLicense

    Contains all the parameters to transform a dataset with three input annotations of types DOCUMENT, TOKEN and CHUNK into its deidentified version, by either masking or obfuscating the given CHUNKs.

    To create a configured DeIdentificationModel, please see the example of DeIdentification.
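
    Masking, as opposed to obfuscation, replaces each chunk with its entity label, as with <AGE> in the DeIdentification example. The idea can be sketched in plain Scala as follows. This is an illustration only: the Chunk case class is a hypothetical stand-in for a CHUNK annotation, and it assumes character offsets with an inclusive end and chunks that do not overlap.

    ```scala
    object MaskSketch {
      // Hypothetical stand-in for a CHUNK annotation: character span
      // [begin, end] (end inclusive) plus the entity label.
      final case class Chunk(begin: Int, end: Int, entity: String)

      // Replaces every chunk span with "<ENTITY>", working from right to left
      // so that the offsets of the chunks still to be processed stay valid.
      def mask(text: String, chunks: Seq[Chunk]): String =
        chunks.sortBy(-_.begin).foldLeft(text) { (acc, c) =>
          acc.substring(0, c.begin) + s"<${c.entity}>" + acc.substring(c.end + 1)
        }

      def main(args: Array[String]): Unit = {
        val text = "PCP : Oliveira , 25 years-old"
        val chunks = Seq(Chunk(6, 13, "DOCTOR"), Chunk(17, 18, "AGE"))
        println(mask(text, chunks)) // PCP : <DOCTOR> , <AGE> years-old
      }
    }
    ```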

    See also

    DeIdentification to train your own model

  3. trait DeIdentificationParams extends Params

    A trait that contains all the params that are common between DeIdentificationModel and DeIdentification annotators

    See also

    DeIdentification

    DeIdentificationModel

  4. case class MySentnece(content: String, start: Int, end: Int, index: Int, originalIndex: Int) extends Product with Serializable
  5. class ObfuscatorAnnotatorApproach extends AnnotatorApproach[ObfuscatorAnnotatorModel] with ObfuscatorParams
  6. class ObfuscatorAnnotatorModel extends AnnotatorModel[ObfuscatorAnnotatorModel] with ObfuscatorParams with HasSimpleAnnotate[ObfuscatorAnnotatorModel]
  7. trait ObfuscatorParams extends Params
    Attributes
    protected
  8. class ReIdentification extends AnnotatorModel[DeIdentificationModel] with HasSimpleAnnotate[DeIdentificationModel] with CheckLicense

    Reidentifies entities that were obfuscated by DeIdentification. This annotator requires the outputs of the deidentification as input: the input columns must be the deidentified document and the deidentification mappings column set with DeIdentification.setMappingsColumn. To see how the entities are deidentified, please refer to the example of that class.

    Example

    Define the reidentification stage and transform the deidentified documents

    val reidentification = new ReIdentification()
      .setInputCols("dei", "protectedEntities")
      .setOutputCol("reid")
      .transform(result)

    Show results

    result.select("dei.result").show(truncate = false)
    +--------------------------------------------------------------------------------------------------+
    |result                                                                                            |
    +--------------------------------------------------------------------------------------------------+
    |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
    +--------------------------------------------------------------------------------------------------+
    
    reidentification.selectExpr("explode(reid.result)").show(false)
    +-----------------------------------------------------------------------------------+
    |col                                                                                |
    +-----------------------------------------------------------------------------------+
    |# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.|
    +-----------------------------------------------------------------------------------+
    See also

    DeIdentification for deidentification of entities

  9. trait ReadablePretrainedDeId extends ParamsAndFeaturesReadable[DeIdentificationModel] with HasPretrained[DeIdentificationModel]
  10. trait ReadsFeatures extends ParamsAndFeaturesReadable[DeIdentificationModel]
  11. case class SentenceMaxException(message: String = "", cause: Throwable = None.orNull) extends Exception with Product with Serializable
  12. case class StructuredDeidentification(columnsMap: Map[String, String], seedMap: Map[String, Int] = Collections.emptyMap(), obfuscateRefFile: String = "", obfuscateRefSource: String = "both", days: Int = 0, useRandomDateDisplacement: Boolean = false, dateFormats: List[String] = ...) extends Product with Serializable

    Utility class that helps to obfuscate tabular data.

    columnsMap

    A map that selects the columns to deidentify and the entity of each column. The key of the map is the column name in the DataFrame and the value is the entity for that column. The default entities are:

    | Entity         | Description                                                                 |
    |----------------|-----------------------------------------------------------------------------|
    | location       | A general location                                                          |
    | location-other | A location that is not a country, street, hospital, city or state           |
    | street         | A street                                                                    |
    | hospital       | The name of a hospital                                                      |
    | city           | A city                                                                      |
    | state          | A state                                                                     |
    | zip            | The zip code                                                                |
    | country        | A country                                                                   |
    | contact        | The contact of a person                                                     |
    | username       | A username                                                                  |
    | phone          | A phone number                                                              |
    | fax            | A fax number                                                                |
    | url            | An internet URL                                                             |
    | email          | The email of a person                                                       |
    | profession     | The profession of a person                                                  |
    | name           | The name of a person                                                        |
    | doctor         | The name of a doctor                                                        |
    | patient        | The name of the patient                                                     |
    | id             | A general ID number                                                         |
    | bioid          | A system to screen for protein interactions as they occur in living cells   |
    | age            | The age of something or someone                                             |
    | organization   | The name of an organization or company                                      |
    | healthplan     | The ID that identifies the health plan                                      |
    | medicalrecord  | The identification of a medical record                                      |
    | device         | The ID that identifies a device                                             |
    | date           | A general date                                                              |

    If the entity is not present, the column will be masked.

    seedMap

    Allows adding a seed for each column that you want to obfuscate. The seed is used to randomly select the replacement entities in obfuscation mode. Providing the same seed reproduces the same mapping across runs.

    obfuscateRefFile

    An optional parameter that lets you add your own terms to be used for obfuscation. In the file, the key is the entity and the value is the term that will be used in the obfuscation.

    obfuscateRefSource

    The source of the replacement values used to obfuscate the entities. It does not apply to date entities. The allowed values are:

    'file': takes the entities from the obfuscateRefFile
    'faker': takes the entities from the Faker module
    'both': takes the entities from the obfuscateRefFile and the Faker module at random

    days

    Number of days by which dates are displaced for obfuscation. If not provided, a random integer between 1 and 60 will be used.

    useRandomDateDisplacement

    Whether to use a random number of displacement days for date entities. If true, a random displacement is used; if false, the value of days is used.

    dateFormats

    Formats of the dates to displace.
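
    The effect of seedMap, where the same seed reproduces the same mapping, can be illustrated with a seeded random generator. This is a sketch of the general idea under stated assumptions, not the class's actual selection logic: the object SeededObfuscationSketch, the obfuscate helper and the replacement pool are all hypothetical, and the way the seed is combined with the value is an arbitrary choice made for the illustration.

    ```scala
    import scala.util.Random

    object SeededObfuscationSketch {
      // Hypothetical replacement pool; in the library the terms would come
      // from obfuscateRefFile or from the faker source.
      val doctors: Vector[String] =
        Vector("Dr. Gregory House", "Dr. Lisa Cuddy", "Dr. James Wilson")

      // Seeds the generator from the column seed and the original value, so the
      // same (value, seed) pair always selects the same replacement.
      def obfuscate(value: String, seed: Int, pool: Vector[String]): String = {
        val rng = new Random(seed.toLong * 31 + value.hashCode)
        pool(rng.nextInt(pool.size))
      }

      def main(args: Array[String]): Unit = {
        val first = obfuscate("Oliveira", 23, doctors)
        val second = obfuscate("Oliveira", 23, doctors)
        println(first == second) // true: same seed and value give the same replacement
      }
    }
    ```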

  13. case class TextToDocumentColumns(columns: List[String]) extends Product with Serializable
    Attributes
    protected

Value Members

  1. val securerandom: SecureRandom
  2. object DeIdentification extends DefaultParamsReadable[DeIdentification] with Serializable
  3. object DeIdentificationModel extends ReadablePretrainedDeId with ReadsFeatures with Serializable
  4. object DefaultRegex
    Attributes
    protected
  5. object Language
  6. object Obfuscator
    Attributes
    protected
  7. object ObfuscatorAnnotatorApproach extends DefaultParamsReadable[ObfuscatorAnnotatorApproach] with Serializable
  8. object ObfuscatorParams extends DefaultParamsReadable[DeIdentification] with Serializable
