package deid

Type Members

  1. trait BaseDeidParams extends Params

    A trait that contains all the params that are common in DeIdentificationParams and ObfuscatorParams.

    See also

    DeIdentificationParams

    ObfuscatorParams

    DeidModelParams

  2. class DateChunkObfuscator extends AnnotatorModel[DateChunkObfuscator] with HasSimpleAnnotate[DateChunkObfuscator] with CheckLicense
  3. class DeIdentification extends AnnotatorApproach[DeIdentificationModel] with DeIdentificationParams with DeidApproachParams with HandleExceptionParams with CheckLicense

    Contains all the methods for training a DeIdentificationModel. This annotator can obfuscate or mask entities that contain personal information. Custom entities can be defined with a file of regex patterns passed to setRegexPatternsDictionary, where each line maps an entity to a regex.

    DATE \d{4}
    AID \d{6,7}
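
    To make the file format concrete, here is a minimal sketch of how such lines could be read into entity/regex pairs and applied to a text. It is not the library's implementation: the object RegexDictionarySketch and its methods are hypothetical names, and the sketch assumes the entity and the regex are separated by whitespace, as in the two sample lines above.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of the deid package.
object RegexDictionarySketch {
  // Parse lines of the form "<ENTITY> <regex>", splitting on the first run of whitespace
  // (assumed separator, inferred from the sample lines).
  def parse(lines: Seq[String]): Seq[(String, Regex)] =
    lines.map(_.trim).filter(_.nonEmpty).flatMap { line =>
      line.split("\\s+", 2) match {
        case Array(entity, pattern) => Some(entity -> pattern.r)
        case _                      => None // skip lines that carry no pattern
      }
    }

  // Return (entity, matchedText) for every match of every pattern in the text.
  def findEntities(text: String, dictionary: Seq[(String, Regex)]): Seq[(String, String)] =
    for {
      (entity, regex) <- dictionary
      matched <- regex.findAllIn(text).toList
    } yield entity -> matched
}
```

    For example, RegexDictionarySketch.parse(Seq("DATE \\d{4}", "AID \\d{6,7}")) yields two pairs, and findEntities then reports every substring matched by either pattern together with its entity.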

    Additionally, obfuscation strings can be defined with DeidApproachParams.setObfuscateRefFile, where each line maps a string to an entity. The format and separator can be specified with DeidApproachParams.setRefFileFormat and DeidApproachParams.setRefSep.

    Dr. Gregory House#DOCTOR
    01010101#MEDICALRECORD
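
    The reference file layout can be illustrated with a small sketch that groups replacement strings by entity. This is an assumption-laden illustration rather than the library's parser: ObfuscateRefFileSketch is a hypothetical name, and the sketch assumes that the entity is the text after the last separator on each line.

```scala
// Hypothetical helper, not part of the deid package.
object ObfuscateRefFileSketch {
  // Parse "<replacement><sep><ENTITY>" lines and group the replacements by entity.
  // The last occurrence of the separator is used, so the replacement text itself
  // may contain the separator (an assumption of this sketch).
  def parse(lines: Seq[String], sep: String = "#"): Map[String, Seq[String]] =
    lines.map(_.trim).filter(_.nonEmpty).flatMap { line =>
      val idx = line.lastIndexOf(sep)
      if (idx <= 0) None // no separator, or nothing before it
      else Some(line.substring(idx + sep.length).trim -> line.substring(0, idx).trim)
    }.groupBy(_._1).map { case (entity, pairs) => entity -> pairs.map(_._2) }
}
```

    With the two sample lines above, the result maps DOCTOR to "Dr. Gregory House" and MEDICALRECORD to "01010101".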

    The configuration params for this annotator are in the trait DeIdentificationParams.

    Exceptions thrown

    java.security.NoSuchAlgorithmException if no Provider supports a SecureRandom implementation for the specified algorithm name.

    Note

    If the mode is set to obfuscate, DeIdentification uses java.security.SecureRandom to generate fake data. You can select the generation algorithm by setting the system environment variable SPARK_NLP_JSL_SEED_ALGORITHM. The chosen algorithm may affect the generated fake data, performance, and potential blocking behavior. For the standard RNG algorithm names, refer to the SecureRandom Number Generation Algorithms section of the Java Security Standard Algorithm Names documentation. The default algorithm is 'SHA1PRNG'.
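
    The note above can be sketched with the standard JDK API. The environment variable name and the default come from the note; the object SeedAlgorithmSketch and its methods are hypothetical, and how the library itself reads the variable is not shown here.

```scala
import java.security.SecureRandom

// Hypothetical helper, not part of the deid package.
object SeedAlgorithmSketch {
  val DefaultAlgorithm = "SHA1PRNG"

  // Resolve the algorithm name from an environment map, falling back to the documented default.
  def algorithmName(env: Map[String, String] = sys.env): String =
    env.getOrElse("SPARK_NLP_JSL_SEED_ALGORITHM", DefaultAlgorithm)

  // Throws java.security.NoSuchAlgorithmException if no provider supports the name.
  def secureRandom(env: Map[String, String] = sys.env): SecureRandom =
    SecureRandom.getInstance(algorithmName(env))
}
```

    Passing the environment as a parameter keeps the lookup easy to exercise without changing the process environment.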

    See also

    DeIdentificationModel

    DeIdentificationParams

    DeidApproachParams

    train

    Ideally, this annotator works in conjunction with demographic named entity recognizers, which can be built using TextMatchers, RegexMatchers, DateMatchers, NerCRFs or NerDLs. The following is an example of a deidentification pipeline.

    Example

    val documentAssembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
        .setInputCols(Array("document"))
        .setOutputCol("sentence")
        .setUseAbbreviations(true)
    
    val tokenizer = new Tokenizer()
        .setInputCols(Array("sentence"))
        .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel
        .pretrained("embeddings_clinical", "en", "clinical/models")
        .setInputCols(Array("sentence", "token"))
        .setOutputCol("embeddings")

    Ner entities

    val clinical_sensitive_entities = MedicalNerModel
        .pretrained("ner_deid_enriched", "en", "clinical/models")
        .setInputCols(Array("sentence", "token", "embeddings"))
        .setOutputCol("ner")
    
    val nerConverter = new NerConverter()
        .setInputCols(Array("sentence", "token", "ner"))
        .setOutputCol("ner_chunk")

    Deidentification

    val deIdentification = new DeIdentification()
        .setInputCols(Array("ner_chunk", "token", "sentence"))
        .setOutputCol("dei")
        // file with custom regex patterns for custom entities
        .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
        // file with custom obfuscator names for the entities
        .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
        .setRefFileFormat("csv")
        .setRefSep("#")
        .setMode("obfuscate")
        .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
        .setObfuscateDate(true)
        .setDateTag("DATE")
        .setDays(5)
        .setObfuscateRefSource("file")

    Pipeline

    val data = Seq(
      "# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
    ).toDF("text")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      clinical_sensitive_entities,
      nerConverter,
      deIdentification
    ))
    val result = pipeline.fit(data).transform(data)

    Show Results

    result.select("dei.result").show(truncate = false)
    +--------------------------------------------------------------------------------------------------+
    |result                                                                                            |
    +--------------------------------------------------------------------------------------------------+
    |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
    +--------------------------------------------------------------------------------------------------+
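
    In the result above, both dates moved forward by the 5 days configured with setDays, each in its own format. A minimal sketch of that kind of displacement, using only java.time, might look as follows. DateShiftSketch is a hypothetical name, and the sketch assumes that the first format that parses the date is also used to print it; the library's actual date handling may differ.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, not part of the deid package.
object DateShiftSketch {
  // Try each format in order; shift the date with the first format that parses it
  // and print the result back in that same format. Returns None if nothing parses.
  def shift(date: String, days: Int, formats: Seq[String]): Option[String] =
    formats.flatMap { pattern =>
      val fmt = DateTimeFormatter.ofPattern(pattern)
      Try(LocalDate.parse(date, fmt)).toOption.map(_.plusDays(days.toLong).format(fmt))
    }.headOption
}
```

    With the formats from the example, DateShiftSketch.shift("01/13/93", 5, Seq("MM/dd/yy", "yyyy-MM-dd")) gives Some("01/18/93") and the same call on "2079-11-09" gives Some("2079-11-14").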
  4. class DeIdentificationModel extends AnnotatorModel[DeIdentificationModel] with DeIdentificationParams with DeidModelParams with HasSimpleAnnotate[DeIdentificationModel] with HandleExceptionParams with HasSafeAnnotate[DeIdentificationModel] with CheckLicense

    Contains all the parameters to transform a dataset with three input annotations of types DOCUMENT, TOKEN and CHUNK into its deidentified version, by either masking or obfuscating the given CHUNKs.

    To create a configured DeIdentificationModel, please see the example of DeIdentification.

    See also

    BaseDeidParams to see params

    DeIdentificationParams to see params

    DeidModelParams to see params

    DeIdentification to train your own model

  5. trait DeIdentificationParams extends BaseDeidParams

    A trait that contains all the params that are common between DeIdentificationModel and DeIdentification annotators.

    See also

    DeIdentification

    DeIdentificationModel

    BaseDeidParams

  6. trait DeidApproachParams extends Params

    A trait that contains all the params that are common to DeIdentification, NameChunkObfuscatorApproach and ObfuscatorAnnotatorApproach.

    See also

    DeIdentification

    ObfuscatorAnnotatorApproach

    NameChunkObfuscatorApproach

  7. trait DeidModelParams extends BaseDeidParams

    A trait that contains all the params that are common in DeIdentificationModel and ObfuscatorAnnotatorModel.

    See also

    DeIdentificationModel

    ObfuscatorAnnotatorModel

    BaseDeidParams to see params

  8. class DocumentHashCoder extends Model[DocumentHashCoder] with RawAnnotator[DocumentHashCoder]
  9. case class MySentnece(content: String, start: Int, end: Int, index: Int, originalIndex: Int) extends Product with Serializable
  10. class NameChunkObfuscator extends AnnotatorModel[NameChunkObfuscator] with HasSimpleAnnotate[NameChunkObfuscator] with NameChunkObfuscatorParams with CheckLicense

    Contains all the parameters to transform a dataset with an input annotation of type CHUNK into its obfuscated version by obfuscating the given CHUNKs. The model obfuscates the given names and leaves everything else unchanged.

    To create a configured NameChunkObfuscator, please see the example of NameChunkObfuscatorApproach.

    See also

    NameChunkObfuscatorParams to see params

    NameChunkObfuscatorApproach to train your own model

  11. class NameChunkObfuscatorApproach extends AnnotatorApproach[NameChunkObfuscator] with NameChunkObfuscatorParams with DeidApproachParams with CheckLicense

    Contains all the methods for training a NameChunkObfuscator model. This annotator can replace name entities with consistent fakers. Additionally, obfuscation names can be defined with setObfuscateRefFile, where each line maps a name to an entity. The format and separator can be specified with setRefFileFormat and setRefSep.

    George#NAME
    Taylor#NAME

    The configuration params for this annotator are in the trait NameChunkObfuscatorParams.

    See also

    NameChunkObfuscator

    NameChunkObfuscatorParams

    DeidApproachParams

    See the Spark NLP Workshop for more examples of usage.

    Example

    val data = Seq("John Davies is a 62 y.o. patient admitted." +
      "He was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.")
      .toDF("text")
    
    val documentAssembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
        .setInputCols("sentence")
        .setOutputCol("token")
    
    val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
        .setInputCols(Array("sentence", "token"))
        .setOutputCol("embeddings")
    
    val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
        .setInputCols(Array("sentence", "token", "embeddings"))
        .setOutputCol("ner")
    
    val ner_converter_name = new NerConverterInternal()
        .setInputCols(Array("sentence", "token", "ner"))
        .setOutputCol("ner_chunk")

    NameChunkObfuscatorApproach

    val nameChunkObfuscator = new NameChunkObfuscatorApproach()
        .setInputCols("ner_chunk")
        .setOutputCol("replacement")
        .setRefFileFormat("csv")
        .setObfuscateRefFile("obfuscator_names.txt")
        .setRefSep("#")
        .setObfuscateRefSource("both")
        .setLanguage("en")
    
    val replacer_name = new Replacer()
        .setInputCols("replacement", "sentence")
        .setOutputCol("obfuscated_document_name")
        .setUseReplacement(true)

    Pipeline

    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter_name,
      nameChunkObfuscator,
      replacer_name
    ))
    
    val result = pipeline.fit(data).transform(data)
    result.select("text").show(false)
    // The Replacer stage writes to "obfuscated_document_name"
    result.select("obfuscated_document_name.result").show(false)
    +-----------------------------------------------------------------------------------------------------------------------------------+
    |text                                                                                                                               |
    +-----------------------------------------------------------------------------------------------------------------------------------+
    |John Davies is a 62 y.o. patient admitted.He was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.|
    +-----------------------------------------------------------------------------------------------------------------------------------+
    
    +-------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                               |
    +-------------------------------------------------------------------------------------------------------------------------------------+
    |[Charlestine is a 62 y.o. patient admitted.He was seen by attending physician Dr. Lowery and was scheduled for emergency assessment.]|
    +-------------------------------------------------------------------------------------------------------------------------------------+
  12. trait NameChunkObfuscatorParams extends Params

    A trait that contains all the params that are common between the NameChunkObfuscatorApproach and NameChunkObfuscator annotators.

    Attributes
    protected
    See also

    NameChunkObfuscatorApproach

    NameChunkObfuscator

  13. class ObfuscatorAnnotatorApproach extends AnnotatorApproach[ObfuscatorAnnotatorModel] with DeidApproachParams with ObfuscatorParams
  14. class ObfuscatorAnnotatorModel extends AnnotatorModel[ObfuscatorAnnotatorModel] with ObfuscatorParams with DeidModelParams with HasSimpleAnnotate[ObfuscatorAnnotatorModel]
  15. trait ObfuscatorParams extends BaseDeidParams

    A trait that contains all the params that are common to ObfuscatorAnnotatorModel and ObfuscatorAnnotatorApproach.

    Attributes
    protected
    See also

    ObfuscatorAnnotatorModel

    ObfuscatorAnnotatorApproach

    BaseDeidParams

  16. class ReIdentification extends AnnotatorModel[DeIdentificationModel] with HasSimpleAnnotate[DeIdentificationModel] with CheckLicense

    Reidentifies entities that were obfuscated by DeIdentification. This annotator requires the outputs of the deidentification stage as input. The input columns need to be the deidentified document and the deidentification mappings set with DeIdentification.setMappingsColumn. To see how the entities are deidentified, please refer to the example of that class.

    Example

    Define the reidentification stage and transform the deidentified documents

    val reidentification = new ReIdentification()
      .setInputCols("dei", "protectedEntities")
      .setOutputCol("reid")
      .transform(result)

    Show results

    result.select("dei.result").show(truncate = false)
    +--------------------------------------------------------------------------------------------------+
    |result                                                                                            |
    +--------------------------------------------------------------------------------------------------+
    |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
    +--------------------------------------------------------------------------------------------------+
    
    reidentification.selectExpr("explode(reid.result)").show(false)
    +-----------------------------------------------------------------------------------+
    |col                                                                                |
    +-----------------------------------------------------------------------------------+
    |# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.|
    +-----------------------------------------------------------------------------------+
    See also

    DeIdentification for deidentification of entities

  17. trait ReadablePretrainedDeId extends ParamsAndFeaturesReadable[DeIdentificationModel] with HasPretrained[DeIdentificationModel]
  18. trait ReadsFeatures extends ParamsAndFeaturesReadable[DeIdentificationModel]
  19. class Replacer extends AnnotatorModel[NerQuestionGenerator] with HasSimpleAnnotate[NerQuestionGenerator] with CheckLicense

    Replaces entities in the original text with new ones.

    This class replaces entities in the original text with the ones obtained from, for example, DeIdentificationModel or DateNormalizer.

    Example

    val documentAssembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
        .setInputCols("sentence")
        .setOutputCol("token")
    
    val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
        .setInputCols(Array("sentence", "token"))
        .setOutputCol("embeddings")
    
    val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
        .setInputCols(Array("sentence", "token", "embeddings"))
        .setOutputCol("ner")
    
    val ner_converter_name = new NerConverterInternal()
        .setInputCols(Array("sentence", "token", "ner"))
        .setOutputCol("ner_chunk")
    
    val nameChunkObfuscator = new NameChunkObfuscatorApproach()
        .setInputCols("ner_chunk")
        .setOutputCol("replacement")
        .setRefFileFormat("csv")
        .setObfuscateRefFile("names_test.txt")
        .setRefSep("#")
    
    val replacer_name = new Replacer()
        .setInputCols("replacement", "sentence")
        .setOutputCol("obfuscated_document_name")
        .setUseReplacement(true)
    
    val nlpPipeline = new Pipeline().setStages(Array(
        documentAssembler,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter_name,
        nameChunkObfuscator,
        replacer_name
    ))
    
    // toDF on a local Seq needs the session implicits
    import spark.implicits._
    val empty_data = Seq("").toDF("text")
    val model_chunck_obfuscator = nlpPipeline.fit(empty_data)
    val sample_text = "John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment."
    val lmodel = new LightPipeline(model_chunck_obfuscator)
    // annotate returns the result strings of each output column
    val res = lmodel.annotate(sample_text)
    println("Original text.  : " + res("sentence").head)
    println("Obfuscated text : " + res("obfuscated_document_name").head)
    Original text.  :  John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.
    Obfuscated text :  Fitzpatrick is a <AGE> y.o. patient admitted. Mr. Bowman was seen by attending physician Dr. Acosta and was scheduled for emergency assessment.
  20. case class SentenceMaxException(message: String = "", cause: Throwable = None.orNull) extends Exception with Product with Serializable
  21. case class StructuredDeidentification(columnsMap: Map[String, String], seedMap: Map[String, Int] = Collections.emptyMap(), obfuscateRefFile: String = "", obfuscateRefSource: String = "both", days: Int = 0, useRandomDateDisplacement: Boolean = false, dateFormats: List[String] = ..., language: String = Language.English, idColumn: String = "") extends Product with Serializable

    Utility class that helps to obfuscate tabular data.

    columnsMap

    A map that associates each column to obfuscate with its entity. The key of the map is the column in the dataframe and the value is the entity for that column. The default entities are:

    | Entity | Description |
    | location | A general location |
    | location-other | A location that is not a country, street, hospital, city or state |
    | street | A street |
    | hospital | The name of a hospital |
    | city | A city |
    | state | A state |
    | zip | The zip code |
    | country | A country |
    | contact | The contact of one person |
    | username | A username |
    | phone | A phone number |
    | fax | The fax number |
    | url | An internet URL |
    | email | The email of one person |
    | profession | The profession of one person |
    | name | The name of one person |
    | doctor | The name of a doctor |
    | patient | The name of the patient |
    | id | A general ID number |
    | bioid | Is a system to screen for protein interactions as they occur in living cells |
    | age | The age of something or someone |
    | organization | The name of an organization or company |
    | healthplan | The ID that identifies the health plan |
    | medicalrecord | The identification of a medical record |
    | device | The ID that identifies a device |
    | date | A general date |
    | ssn | A Social Security Number |
    | ip | An Internet Protocol (IP) address |
    | passport | A random passport |
    | dln | A Driver's License Number |
    | npi | A National Provider Identifier |
    | c_card | A credit card number |
    | iban | An International Bank Account Number |
    | dea | A Drug Enforcement Administration (DEA) number |

    If the entity is not present, it will be masked.
    seedMap

    Allows adding a seed to each column that you want to obfuscate. The seed is used to randomly select the entities used in obfuscation mode. By providing the same seed, you can replicate the same mapping multiple times.
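
    The reproducibility that a seed provides can be illustrated with a short sketch: a generator initialised with the same seed makes the same sequence of choices. This only illustrates the idea under stated assumptions; SeededPickSketch is a hypothetical name, and the way the library draws its replacements is not documented here.

```scala
import scala.util.Random

// Hypothetical helper, not part of the deid package.
object SeededPickSketch {
  // Pick one replacement per input value using a generator initialised with the given seed.
  // The same seed, values and candidates always produce the same picks.
  def pick(values: Seq[String], candidates: IndexedSeq[String], seed: Int): Seq[String] = {
    val rng = new Random(seed.toLong)
    values.map(_ => candidates(rng.nextInt(candidates.length)))
  }
}
```

    Calling pick twice with the same seed returns identical sequences, while a different seed will generally return a different one.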

    obfuscateRefFile

    An optional parameter that allows you to add your own terms to be used for obfuscation. The file contains the entity as the key and the terms to be used in the obfuscation as the value.

    obfuscateRefSource

    The source used to obfuscate the entities. It does not apply to date entities. The allowed values are the following:

    'file': Takes the entities from the obfuscateRefFile
    'faker': Takes the entities from the Faker module
    'both': Takes the entities from the obfuscateRefFile and the Faker module randomly

    days

    Number of days by which dates are displaced during obfuscation. If not provided, a random integer between 1 and 60 will be used.

    useRandomDateDisplacement

    Whether to use a random number of displacement days for date entities. If true, a random displacement is used; if false, the value of days is used.
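
    How days and useRandomDateDisplacement interact can be sketched as a single choice. This is an illustration only: DisplacementSketch is a hypothetical name, and applying the 1 to 60 range from the description of days to the random case is an assumption of this sketch, not confirmed library behaviour.

```scala
import scala.util.Random

// Hypothetical helper, not part of the deid package.
object DisplacementSketch {
  // A random displacement in [1, 60] when requested (assumed range),
  // otherwise the fixed number of days.
  def displacementDays(days: Int, useRandomDateDisplacement: Boolean, rng: Random = new Random()): Int =
    if (useRandomDateDisplacement) 1 + rng.nextInt(60) else days
}
```

    Passing a seeded Random as the last argument makes the random case repeatable.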

    dateFormats

    Formats of the dates to displace.

    language

    The language used to select faker entities. The allowed values are: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'.

    idColumn

    The column that contains the id of the row. If provided, data will be obfuscated consistently by idColumn, especially date entities.

  22. case class TextToDocumentColumns(columns: List[String]) extends Product with Serializable
    Attributes
    protected

Value Members

  1. val randomAlgorithm: String
  2. val securerandom: SecureRandom
  3. object DeIdentification extends DefaultParamsReadable[DeIdentification] with Serializable
  4. object DeIdentificationModel extends ReadablePretrainedDeId with ReadsFeatures with Serializable
  5. object DefaultRegex
    Attributes
    protected
  6. object DocumentHashCoder extends DefaultParamsReadable[DocumentHashCoder] with Serializable
  7. object Language
  8. object Obfuscator
    Attributes
    protected
  9. object ObfuscatorAnnotatorApproach extends DefaultParamsReadable[ObfuscatorAnnotatorApproach] with Serializable
  10. object ObfuscatorParams extends DefaultParamsReadable[DeIdentification] with Serializable
  11. object Replacer extends DefaultParamsReadable[Replacer] with Serializable
