Packages

package deid


Type Members

  1. class DeIdentification extends AnnotatorApproach[DeIdentificationModel] with DeIdentificationParams with CheckLicense

    Contains all the methods for training a DeIdentificationModel. This annotator can obfuscate or mask entities that contain personal information. Custom entities can be defined with a file of regex patterns passed to setRegexPatternsDictionary, where each line maps an entity to a regex.

    DATE \d{4}
    AID \d{6,7}
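
    A minimal sketch of how lines in this format could be read into entity/regex pairs. It is not the library's parser: the split on the first space is an assumption based on the two sample lines, and dictionaryLines and patterns are illustrative names.

```scala
import scala.util.matching.Regex

// Sample dictionary lines copied from the text above.
val dictionaryLines = Seq("""DATE \d{4}""", """AID \d{6,7}""")

// Assumption: the entity name ends at the first space and the
// remainder of the line is the regex pattern.
val patterns: Map[String, Regex] = dictionaryLines.map { line =>
  val Array(entity, pattern) = line.split(" ", 2)
  entity -> pattern.r
}.toMap

// Look for a candidate AID in a fragment of the sample sentence used later on.
val aid = patterns("AID").findFirstIn("# 7194334 Date : 01/13/93")
```

    Note that a short pattern such as \d{4} also matches inside longer digit runs, so custom patterns usually need to be as specific as the data allows.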

    Additionally, obfuscation strings can be defined with setObfuscateRefFile, where each line maps a string to an entity. The format and separator can be specified with setRefFileFormat and setRefSep.

    Dr. Gregory House#DOCTOR
    01010101#MEDICALRECORD
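
    The string-to-entity layout can be sketched as a small lookup table. This only illustrates the file format with "#" as the separator; refLines, sep and replacements are hypothetical names, and the library's own loading logic may differ.

```scala
// Sample reference lines copied from the text above.
val refLines = Seq("Dr. Gregory House#DOCTOR", "01010101#MEDICALRECORD")
val sep = "#"

// Group the replacement strings by entity. The split uses the last
// separator so that the entity is always the final field.
// Assumption: every line contains the separator.
val replacements: Map[String, Seq[String]] = refLines
  .map { line =>
    val idx = line.lastIndexOf(sep)
    (line.substring(idx + sep.length), line.substring(0, idx))
  }
  .groupBy(_._1)
  .map { case (entity, pairs) => entity -> pairs.map(_._2) }
```

    With this layout, several lines can share one entity, which gives the obfuscation step more than one candidate string to choose from.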

    The configuration parameters for this annotator are defined in the trait DeIdentificationParams.

    See also

    DeIdentificationModel

    DeIdentificationParams

    train

    Ideally, this annotator works in conjunction with demographic named entity recognizers, which can be trained with TextMatchers, RegexMatchers, DateMatchers, NerCRFs or NerDLs. An example pipeline for deidentification follows.

    Example

    val documentAssembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
        .setInputCols(Array("document"))
        .setOutputCol("sentence")
        .setUseAbbreviations(true)
    
    val tokenizer = new Tokenizer()
        .setInputCols(Array("sentence"))
        .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel
        .pretrained("embeddings_clinical", "en", "clinical/models")
        .setInputCols(Array("sentence", "token"))
        .setOutputCol("embeddings")

    Ner entities

    val clinical_sensitive_entities = MedicalNerModel
        .pretrained("ner_deid_enriched", "en", "clinical/models")
        .setInputCols(Array("sentence", "token", "embeddings"))
        .setOutputCol("ner")
    
    val nerConverter = new NerConverter()
        .setInputCols(Array("sentence", "token", "ner"))
        // must match the "ner_chunk" input column of the DeIdentification stage
        .setOutputCol("ner_chunk")

    Deidentification

    val deIdentification = new DeIdentification()
        .setInputCols(Array("ner_chunk", "token", "sentence"))
        .setOutputCol("dei")
        // file with custom regex patterns for custom entities
        .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
        // file with custom obfuscator names for the entities
        .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
        .setRefFileFormat("csv")
        .setRefSep("#")
        .setMode("obfuscate")
        .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
        .setObfuscateDate(true)
        .setDateTag("DATE")
        .setDays(5)
        .setObfuscateRefSource("file")

    Pipeline

    val data = Seq(
      "# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
    ).toDF("text")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      clinical_sensitive_entities,
      nerConverter,
      deIdentification
    ))
    val result = pipeline.fit(data).transform(data)
    
    result.select("dei.result").show(truncate = false)

    Show Results

    result.select("dei.result").show(truncate = false)
    +--------------------------------------------------------------------------------------------------+
    |result                                                                                            |
    +--------------------------------------------------------------------------------------------------+
    |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
    +--------------------------------------------------------------------------------------------------+
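
    In the sample output above, both dates appear five days later than in the input, which is consistent with setDays(5) and the two patterns passed to setDateFormats. A minimal sketch of that kind of shift using java.time follows. It is not the library's implementation, and shiftDate is a hypothetical helper.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// The same patterns that the example passes to setDateFormats.
val formats = Seq("MM/dd/yy", "yyyy-MM-dd").map(p => DateTimeFormatter.ofPattern(p))

// Tries each format in order and shifts the first successful parse.
// The result is written back in the format that matched.
def shiftDate(text: String, days: Long): Option[String] =
  formats.iterator
    .map(fmt => Try(LocalDate.parse(text, fmt).plusDays(days).format(fmt)).toOption)
    .collectFirst { case Some(shifted) => shifted }

val shortDate = shiftDate("01/13/93", 5)   // two-digit year is kept on output
val isoDate = shiftDate("2079-11-09", 5)
```

    Strings that match none of the formats yield None, so a caller can leave them untouched.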
  2. class DeIdentificationModel extends AnnotatorModel[DeIdentificationModel] with DeIdentificationParams with HasSimpleAnnotate[DeIdentificationModel] with CheckLicense

    Contains all the parameters to transform a dataset with three input annotations of types DOCUMENT, TOKEN and CHUNK into its deidentified version, by either masking or obfuscating the given chunks.

    To create a configured DeIdentificationModel, see the example for DeIdentification.
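
    The difference between masking and obfuscating a chunk can be sketched in plain Scala. This is an illustration under stated assumptions, not the model's code: Chunk, chunkOf and deidentify are hypothetical, offsets are end-exclusive, and the DOCTOR label for "Oliveira" is assumed for the sake of the example.

```scala
// Hypothetical stand-in for a CHUNK annotation: [begin, end) offsets plus entity label.
final case class Chunk(begin: Int, end: Int, entity: String)

// Hypothetical helper: builds a chunk from the first occurrence of `surface` in `text`.
def chunkOf(text: String, surface: String, entity: String): Chunk = {
  val begin = text.indexOf(surface)
  Chunk(begin, begin + surface.length, entity)
}

// Replaces each chunk with its obfuscation string when one is available,
// otherwise with a mask of the form <ENTITY>. Chunks are applied from
// right to left so that earlier offsets stay valid.
def deidentify(text: String, chunks: Seq[Chunk], replacements: Map[String, String]): String =
  chunks.sortBy(_.begin).foldRight(text) { (chunk, acc) =>
    val substitute = replacements.getOrElse(chunk.entity, s"<${chunk.entity}>")
    acc.substring(0, chunk.begin) + substitute + acc.substring(chunk.end)
  }

val text = "PCP : Oliveira , 25 years-old"
val chunks = Seq(chunkOf(text, "Oliveira", "DOCTOR"), chunkOf(text, "25", "AGE"))

val masked = deidentify(text, chunks, Map.empty)
val obfuscated = deidentify(text, chunks, Map("DOCTOR" -> "Dr. Gregory House"))
```

    Here obfuscated mirrors the fragment "PCP : Dr. Gregory House , <AGE> years-old" from the DeIdentification example output, where the age has no replacement string and is masked instead.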

    See also

    DeIdentification to train your own model

  3. trait DeIdentificationParams extends Params

    A trait that contains all the params that are common to the DeIdentificationModel and DeIdentification annotators.

    See also

    DeIdentification

    DeIdentificationModel

  4. case class MySentnece(content: String, start: Int, end: Int, index: Int, originalIndex: Int) extends Product with Serializable
  5. class ObfuscatorAnnotatorApproach extends AnnotatorApproach[ObfuscatorAnnotatorModel] with ObfuscatorParams
  6. class ObfuscatorAnnotatorModel extends AnnotatorModel[ObfuscatorAnnotatorModel] with ObfuscatorParams with HasSimpleAnnotate[ObfuscatorAnnotatorModel]
  7. trait ObfuscatorParams extends Params
    Attributes
    protected
  8. class ReIdentification extends AnnotatorModel[DeIdentificationModel] with HasSimpleAnnotate[DeIdentificationModel] with CheckLicense

    Reidentifies entities obfuscated by DeIdentification. This annotator requires the outputs of the deidentification as input: the input columns must be the deidentified document and the deidentification mappings set with DeIdentification.setMappingsColumn. To see how the entities are deidentified, refer to the example of that class.

    Example

    Define the reidentification stage and transform the deidentified documents

    val reidentification = new ReIdentification()
      .setInputCols("dei", "protectedEntities")
      .setOutputCol("reid")
      .transform(result)

    Show results

    result.select("dei.result").show(truncate = false)
    +--------------------------------------------------------------------------------------------------+
    |result                                                                                            |
    +--------------------------------------------------------------------------------------------------+
    |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
    +--------------------------------------------------------------------------------------------------+
    
    reidentification.selectExpr("explode(reid.result)").show(false)
    +-----------------------------------------------------------------------------------+
    |col                                                                                |
    +-----------------------------------------------------------------------------------+
    |# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.|
    +-----------------------------------------------------------------------------------+
    See also

    DeIdentification for deidentification of entities

  9. trait ReadablePretrainedDeId extends ParamsAndFeaturesReadable[DeIdentificationModel] with HasPretrained[DeIdentificationModel]
  10. case class SentenceMaxException(message: String = "", cause: Throwable = None.orNull) extends Exception with Product with Serializable
  11. case class StructuredDeid(conllFilePath: String, regexPatternsFilePath: String, obfuscateRefFilePath: String) extends Product with Serializable

    Utility class that helps to create a pipeline to obfuscate tabular data.

    conllFilePath

    File used to train the DeIdentification annotator.

    regexPatternsFilePath

    The regex file used to deidentify the tabular data.

    obfuscateRefFilePath

    The file containing the terms used to obfuscate the data.

  12. case class StructuredDeidentification(columnsMap: Map[String, String], seedMap: Map[String, Int] = Collections.emptyMap(), obfuscateRefFile: String = "", obfuscateRefSource: String = "both") extends Product with Serializable

    Utility class that helps to obfuscate tabular data.

    columnsMap

    A map that associates each column with the entity it contains. The key of the map is the column name in the DataFrame and the value is the entity for that column. The default entities are:

    Entity          Description
    location-other  A location that is not a country, street, hospital, city or state
    street          A street
    hospital        The name of a hospital
    city            A city
    state           A state
    zip             The zip code
    country         A country
    contact         The contact of a person
    username        A username
    phone           A phone number
    fax             A fax number
    url             An internet URL
    email           The email address of a person
    profession      The profession of a person
    name            The name of a person
    doctor          The name of a doctor
    patient         The name of the patient
    id              A general ID number
    bioid           A system to screen for protein interactions as they occur in living cells
    age             The age of something or someone
    organization    The name of an organization or company
    healthplan      The ID that identifies the health plan
    medicalrecord   The identification of a medical record
    device          The ID that identifies a device
    date            A general date

    If it is not present, it will be masked.

    seedMap

    Allows adding a seed for each column that you want to obfuscate. The seed is used to randomly select the entities used in obfuscation mode. Providing the same seed reproduces the same mapping across runs.
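
    Why a fixed seed gives a repeatable mapping can be sketched with scala.util.Random. This is a generic illustration, not the library's selection logic: candidates, obfuscateColumn and the sample rows are hypothetical.

```scala
import scala.util.Random

// Hypothetical pool of replacement values for one column.
val candidates = Vector("VALUE_A", "VALUE_B", "VALUE_C", "VALUE_D")

// Picks one replacement per input row. The choices depend only on
// the seed and the number of rows, so the same seed repeats them.
def obfuscateColumn(rows: Seq[String], seed: Int): Seq[String] = {
  val rng = new Random(seed)
  rows.map(_ => candidates(rng.nextInt(candidates.length)))
}

val rows = Seq("Oliveira", "Smith", "Jones") // hypothetical column values
val first = obfuscateColumn(rows, seed = 23)
val second = obfuscateColumn(rows, seed = 23)
```

    Both calls return the same sequence because each starts from a generator initialised with the same seed.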

    obfuscateRefFile

    An optional parameter that lets you add your own terms to be used for obfuscation. In the file, the key is the entity and the value is the terms that will be used in the obfuscation.

    obfuscateRefSource

    The source of the values used to obfuscate the entities. It does not apply to date entities. The allowed values are:

    'file': takes the entities from the obfuscateRefFile
    'faker': takes the entities from the Faker module
    'both': takes the entities from the obfuscateRefFile and the Faker module randomly

  13. case class TextToDocumentColumns(columns: List[String]) extends Product with Serializable
    Attributes
    protected

Value Members

  1. object DeIdentification extends DefaultParamsReadable[DeIdentification] with Serializable
  2. object DeIdentificationModel extends ReadablePretrainedDeId with Serializable
  3. object DefaultRegex
    Attributes
    protected
  4. object Obfuscator
    Attributes
    protected
  5. object ObfuscatorAnnotatorApproach extends DefaultParamsReadable[ObfuscatorAnnotatorApproach] with Serializable
  6. object ObfuscatorParams extends DefaultParamsReadable[DeIdentification] with Serializable
