package deid
Type Members
-
trait
BaseDeidParams extends Params
A trait that contains all the params that are common in DeIdentificationParams and ObfuscatorParams.
- class DateChunkObfuscator extends AnnotatorModel[DateChunkObfuscator] with HasSimpleAnnotate[DateChunkObfuscator] with CheckLicense
-
class
DeIdentification extends AnnotatorApproach[DeIdentificationModel] with DeIdentificationParams with DeidApproachParams with HandleExceptionParams with CheckLicense
Contains all the methods for training a DeIdentificationModel. This module can obfuscate or mask entities that contain personal information. Custom entities can be defined with a file of regex patterns via setRegexPatternsDictionary, where each line maps an entity to a regex.
DATE \d{4}
AID \d{6,7}
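The dictionary format above can be illustrated with a minimal sketch in plain Scala. It assumes that the entity and its pattern are separated by the first space on each line, as in the sample lines. `RegexDictionarySketch`, `parse`, and `findEntities` are hypothetical names for illustration only and are not part of the library.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of the deid package.
object RegexDictionarySketch {
  // Parses lines of the form "<ENTITY> <regex>", splitting on the first space only.
  // Lines without a space are skipped.
  def parse(lines: Seq[String]): Seq[(String, Regex)] =
    lines.map(_.trim).filter(_.nonEmpty).flatMap { line =>
      line.split(" ", 2) match {
        case Array(entity, pattern) => Some((entity, pattern.r))
        case _                      => None
      }
    }

  // Returns (entity, matchedText) for every match of every pattern, in dictionary order.
  def findEntities(text: String, dict: Seq[(String, Regex)]): Seq[(String, String)] =
    dict.flatMap { case (entity, regex) =>
      regex.findAllIn(text).toList.map(m => (entity, m))
    }
}
```

Splitting on the first space only keeps patterns that themselves contain spaces intact.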
Additionally, obfuscation strings can be defined with DeidApproachParams.setObfuscateRefFile, where each line maps a string to an entity. The format and separator can be specified with DeidApproachParams.setRefFileFormat and DeidApproachParams.setRefSep.
Dr. Gregory House#DOCTOR
01010101#MEDICALRECORD
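As a rough illustration of the reference file layout, the sketch below groups replacement strings by entity in plain Scala. It assumes one `<string><separator><ENTITY>` pair per line with `#` as the separator, as in the sample above. `ObfuscateRefFileSketch` is a hypothetical name and does not reflect how the library reads the file.

```scala
// Hypothetical helper, not part of the deid package.
object ObfuscateRefFileSketch {
  // Groups replacement strings by entity. The last occurrence of the separator
  // splits the line, so the replacement string itself may contain the separator.
  def parse(lines: Seq[String], sep: String = "#"): Map[String, Seq[String]] =
    lines.map(_.trim).filter(_.nonEmpty).flatMap { line =>
      val idx = line.lastIndexOf(sep)
      if (idx <= 0) None // no separator, or empty replacement string
      else Some((line.substring(idx + sep.length), line.substring(0, idx)))
    }.groupBy(_._1).map { case (entity, pairs) => entity -> pairs.map(_._2) }
}
```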
The configuration params for that module are in trait DeIdentificationParams.
- Exceptions thrown
java.security.NoSuchAlgorithmException
If no Provider supports a SecureRandom implementation for specified algorithm name.
- Note
If the mode is set to obfuscate, the DeIdentification uses java.security.SecureRandom for generating fake data. You can select a generation algorithm by configuring the system environment variable SPARK_NLP_JSL_SEED_ALGORITHM. The chosen algorithm may impact the generation of fake data, performance, and potential blocking issues. For information about standard RNG algorithm names, refer to the SecureRandom section in the Number Generation Algorithm. The default algorithm is 'SHA1PRNG'.
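The note above can be sketched with the standard `java.security.SecureRandom` API. This is only an assumption about how the environment variable might be resolved, not the library's actual code. `SeedAlgorithmSketch` and its members are hypothetical names.

```scala
import java.security.{NoSuchAlgorithmException, SecureRandom}

// Hypothetical helper, not part of the deid package.
object SeedAlgorithmSketch {
  val EnvVar = "SPARK_NLP_JSL_SEED_ALGORITHM"
  val DefaultAlgorithm = "SHA1PRNG"

  // Resolves the algorithm name from an environment map, falling back to the default.
  def algorithmName(env: Map[String, String] = sys.env): String =
    env.getOrElse(EnvVar, DefaultAlgorithm)

  // SecureRandom.getInstance throws NoSuchAlgorithmException when no provider
  // supports the requested name; the error is surfaced here as a Left.
  def secureRandom(algorithm: String): Either[String, SecureRandom] =
    try Right(SecureRandom.getInstance(algorithm))
    catch { case e: NoSuchAlgorithmException => Left(e.getMessage) }
}
```

Returning an `Either` is a choice made for this sketch so the failure case can be inspected without a try/catch at the call site.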
- See also
train
Ideally, this annotator works in conjunction with demographic named entity recognizers, which can be trained using TextMatchers, RegexMatchers, DateMatchers, NerCRFs, or NerDLs.
Example of a pipeline for deidentification.
Example
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")
  .setUseAbbreviations(true)

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
Ner entities
val clinical_sensitive_entities = MedicalNerModel.pretrained("ner_deid_enriched", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
Deidentification
val deIdentification = new DeIdentification()
  .setInputCols(Array("ner_chunk", "token", "sentence"))
  .setOutputCol("dei")
  // file with custom regex patterns for custom entities
  .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
  // file with custom obfuscator names for the entities
  .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
  .setRefFileFormat("csv")
  .setRefSep("#")
  .setMode("obfuscate")
  .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
  .setObfuscateDate(true)
  .setDateTag("DATE")
  .setDays(5)
  .setObfuscateRefSource("file")
Pipeline
val data = Seq(
  "# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
).toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  clinical_sensitive_entities,
  nerConverter,
  deIdentification
))

val result = pipeline.fit(data).transform(data)
result.select("dei.result").show(truncate = false)
Show Results
result.select("dei.result").show(truncate = false)
+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+
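The date displacement visible in the result (01/13/93 becomes 01/18/93 with setDays(5)) can be approximated with `java.time` alone. The sketch below is an illustration under the assumption that each configured format is tried in order and that the shifted date is written back in the format that matched. `DateShiftSketch` is a hypothetical name and this is not the library's implementation.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, not part of the deid package.
object DateShiftSketch {
  // Tries each pattern in order. The first pattern that parses the input is also
  // used to format the shifted date, so the output keeps the input's layout.
  // Returns None when no pattern matches.
  def shift(date: String, days: Int, patterns: Seq[String]): Option[String] =
    patterns.iterator
      .map { p =>
        val fmt = DateTimeFormatter.ofPattern(p)
        Try(LocalDate.parse(date, fmt)).toOption.map(_.plusDays(days).format(fmt))
      }
      .collectFirst { case Some(shifted) => shifted }
}
```

With the formats from the example, `DateShiftSketch.shift("2079-11-09", 5, Seq("MM/dd/yy", "yyyy-MM-dd"))` yields `Some("2079-11-14")`.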
-
class
DeIdentificationModel extends AnnotatorModel[DeIdentificationModel] with DeIdentificationParams with DeidModelParams with HasSimpleAnnotate[DeIdentificationModel] with HandleExceptionParams with HasSafeAnnotate[DeIdentificationModel] with CheckLicense
Contains all the parameters to transform a dataset with three input annotations of types DOCUMENT, TOKEN, and CHUNK into its deidentified version by either masking or obfuscating the given chunks.
Additionally, an optional DOCUMENT column can be given to calculate the boundaries of the joined sentences if outputAsDocument is true.
To create a configured DeIdentificationModel, please see the example of DeIdentification.
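The difference between masking and obfuscating chunks can be sketched in plain Scala. The example assumes chunks carry inclusive begin and end character offsets and an entity label, and that a masked chunk is written as the entity name in angle brackets (as `<AGE>` appears in the DeIdentification example output). `MaskVsObfuscateSketch` and `Chunk` are hypothetical names, not library types.

```scala
// Hypothetical helper, not part of the deid package.
object MaskVsObfuscateSketch {
  // A detected chunk: inclusive begin/end character offsets and an entity label.
  final case class Chunk(begin: Int, end: Int, entity: String)

  // Rewrites the text from right to left so earlier offsets stay valid
  // after each replacement changes the length of the string.
  def replaceChunks(text: String, chunks: Seq[Chunk])(replacement: Chunk => String): String =
    chunks.sortBy(-_.begin).foldLeft(text) { (acc, c) =>
      acc.substring(0, c.begin) + replacement(c) + acc.substring(c.end + 1)
    }

  // Masking: each chunk becomes its entity label in angle brackets.
  def mask(text: String, chunks: Seq[Chunk]): String =
    replaceChunks(text, chunks)(c => s"<${c.entity}>")

  // Obfuscation: each chunk becomes a fake value looked up by entity;
  // entities without a fake value fall back to the mask.
  def obfuscate(text: String, chunks: Seq[Chunk], fakes: Map[String, String]): String =
    replaceChunks(text, chunks)(c => fakes.getOrElse(c.entity, s"<${c.entity}>"))
}
```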
- See also
BaseDeidParams to see params
DeIdentificationParams to see params
DeidModelParams to see params
DeIdentification to train your own model
-
trait
DeIdentificationParams extends BaseDeidParams with HasFeatures
A trait that contains all the params that are common between DeIdentificationModel and DeIdentification annotators.
-
trait
DeidApproachParams extends Params
A trait that contains all the params that are common in DeIdentification, NameChunkObfuscatorApproach, and ObfuscatorAnnotatorApproach.
-
trait
DeidModelParams extends BaseDeidParams
A trait that contains all the params that are common in DeIdentificationModel and ObfuscatorAnnotatorModel.
- See also
BaseDeidParams to see params
- class DocumentHashCoder extends Model[DocumentHashCoder] with RawAnnotator[DocumentHashCoder]
-
class
LightDeIdentification extends AnnotatorModel[LightDeIdentification] with HasSimpleAnnotate[LightDeIdentification] with DeidModelParams with LightDeIdentificationParams with CheckLicense
LightDeIdentification is a lightweight version of DeIdentification. It replaces sensitive information in a text with masked values or obfuscated fakers. It is designed to work with healthcare data and can de-identify patient names, dates, doctor names, hospital names, and other types of sensitive information.
Additionally, it supports millions of embedded fakers; if desired, custom external fakers can be set with LightDeIdentificationParams.setCustomFakers.
It also supports multiple languages, such as English, Spanish, French, German, and Arabic, and it can apply several de-identification modes at the same time with LightDeIdentificationParams.setSelectiveObfuscationModes.
Example:
val documentAssembler = new DocumentAssembler()
  .setInputCol("text").setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document")).setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")).setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token")).setOutputCol("embeddings")

val clinical_sensitive_entities = MedicalNerModel.pretrained("ner_deid_enriched", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner")).setOutputCol("chunk")

val deIdentification = new LightDeIdentification()
  .setInputCols(Array("chunk", "sentence")).setOutputCol("dei")
  .setMode("obfuscate")
  .setObfuscateDate(true)
  .setDays(5)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  clinical_sensitive_entities,
  nerConverter,
  deIdentification
))

import spark.implicits._

val data = Seq("""
  |Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson Ora.
  | MR # 7194334 Date: 01/13/93. PCP: Oliveira, 25 years-old, Record date: 2079-11-09.
  |Cocke County Baptist Hospital, 0295 Keats Street, Phone 55-555-5555.""".stripMargin
).toDF("text")

val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(dei) as result").show(truncate = false)
Results:
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|{document, 0, 69, Record date: 2093-01-18, Chestine Spore, M.D., Name: Sallyanne Havers., {sentence -> 0, originalIndex -> 2}, []}                |
|{document, 70, 97, MR # 8469629 Date: 01/18/93., {sentence -> 1, originalIndex -> 71}, []}                                                        |
|{document, 98, 156, PCP: Derrill Center, 38 years-old, Record date: 2079-11-14., {sentence -> 2, originalIndex -> 100}, []}                       |
|{document, 157, 237, SELECT SPECIALTY HOSPITAL - DALLAS (GARLAND), 101 Hospital Rd, Phone 52-841-3244., {sentence -> 3, originalIndex -> 155}, []}|
+--------------------------------------------------------------------------------------------------------------------------------------------------+
- Exceptions thrown
java.security.NoSuchAlgorithmException
If no Provider supports a SecureRandom implementation for specified algorithm name. See DeidModelParams and LightDeIdentificationParams for more information and parameters.
- Note
If the mode is set to obfuscate, the LightDeIdentification uses java.security.SecureRandom for generating fake data. You can select a generation algorithm by configuring the system environment variable SPARK_NLP_JSL_SEED_ALGORITHM. The chosen algorithm may impact the generation of fake data, performance, and potential blocking issues. For information about standard RNG algorithm names, refer to the SecureRandom section in the Number Generation Algorithm. The default algorithm is 'SHA1PRNG'.
- See also
-
trait
LightDeIdentificationParams extends Params
A trait that contains params that LightDeIdentification has.
- See also
- case class MySentnece(content: String, start: Int, end: Int, index: Int, originalIndex: Int) extends Product with Serializable
-
class
NameChunkObfuscator extends AnnotatorModel[NameChunkObfuscator] with HasSimpleAnnotate[NameChunkObfuscator] with NameChunkObfuscatorParams with CheckLicense
Contains all the parameters to transform a dataset with an input annotation of type CHUNK into its obfuscated version by obfuscating the given chunks. The model obfuscates the given names and leaves the other chunks unchanged.
To create a configured NameChunkObfuscator, please see the example of NameChunkObfuscatorApproach.
- See also
NameChunkObfuscatorParams to see params
NameChunkObfuscatorApproach to train your own model
-
class
NameChunkObfuscatorApproach extends AnnotatorApproach[NameChunkObfuscator] with NameChunkObfuscatorParams with DeidApproachParams with CheckLicense
Contains all the methods for training a NameChunkObfuscator model. This module can replace name entities with consistent fakers. Additionally, obfuscation names can be defined with setObfuscateRefFile, where each line maps a name to its entity. The format and separator can be specified with setRefFileFormat and setRefSep.
George#NAME
Taylor#NAME
The configuration params for that module are in trait NameChunkObfuscatorParams.
- See also
DeidApproachParams
See Spark NLP Workshop for more examples of usage.
Example
val data = Seq("John Davies is a 62 y.o. patient admitted." +
  "He was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.")
  .toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter_name = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
NameChunkObfuscatorApproach
val nameChunkObfuscator = new NameChunkObfuscatorApproach()
  .setInputCols("ner_chunk")
  .setOutputCol("replacement")
  .setRefFileFormat("csv")
  .setObfuscateRefFile("obfuscator_names.txt")
  .setRefSep("#")
  .setObfuscateRefSource("both")
  .setLanguage("en")

val replacer_name = new Replacer()
  .setInputCols("replacement", "sentence")
  .setOutputCol("obfuscated_document_name")
  .setUseReplacement(true)
Pipeline
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  word_embeddings,
  clinical_ner,
  ner_converter_name,
  nameChunkObfuscator,
  replacer_name
))

val result = pipeline.fit(data).transform(data)

result.select("text").show(false)
// The Replacer stage writes to "obfuscated_document_name", so that column is selected here.
result.select("obfuscated_document_name.result").show(false)

+-----------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                               |
+-----------------------------------------------------------------------------------------------------------------------------------+
|John Davies is a 62 y.o. patient admitted.He was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.|
+-----------------------------------------------------------------------------------------------------------------------------------+

+-------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------+
|[Charlestine is a 62 y.o. patient admitted.He was seen by attending physician Dr. Lowery and was scheduled for emergency assessment.]|
+-------------------------------------------------------------------------------------------------------------------------------------+
-
trait
NameChunkObfuscatorParams extends Params
A trait that contains all the params that are common between NameChunkObfuscatorApproach and NameChunkObfuscator annotators
- Attributes
- protected
- See also
- class ObfuscatorAnnotatorApproach extends AnnotatorApproach[ObfuscatorAnnotatorModel] with DeidApproachParams with ObfuscatorParams
- class ObfuscatorAnnotatorModel extends AnnotatorModel[ObfuscatorAnnotatorModel] with ObfuscatorParams with DeidModelParams with HasSimpleAnnotate[ObfuscatorAnnotatorModel]
-
trait
ObfuscatorParams extends BaseDeidParams
A trait that contains all the params that are common in ObfuscatorAnnotatorModel and ObfuscatorAnnotatorApproach
- Attributes
- protected
- See also
-
class
ReIdentification extends AnnotatorModel[DeIdentificationModel] with HasSimpleAnnotate[DeIdentificationModel] with CheckLicense
Reidentifies entities obfuscated by DeIdentification. This annotator requires the outputs of the deidentification as input. Input columns need to be the deidentified document and the deidentification mappings set with DeIdentification.setMappingsColumn. To see how the entities are deidentified, please refer to the example of that class.
Example
Define the reidentification stage and transform the deidentified documents
val reideintification = new ReIdentification()
  .setInputCols("dei", "protectedEntities")
  .setOutputCol("reid")
  .transform(result)
Show results
result.select("dei.result").show(truncate = false)
+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+

reideintification.selectExpr("explode(reid.result)").show(false)
+-----------------------------------------------------------------------------------+
|col                                                                                |
+-----------------------------------------------------------------------------------+
|# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.|
+-----------------------------------------------------------------------------------+
- See also
DeIdentification for deidentification of entities
- trait ReadablePretrainedDeId extends ParamsAndFeaturesReadable[DeIdentificationModel] with HasPretrained[DeIdentificationModel]
- trait ReadsFeatures extends ParamsAndFeaturesReadable[DeIdentificationModel]
-
class
Replacer extends AnnotatorModel[NerQuestionGenerator] with HasSimpleAnnotate[NerQuestionGenerator] with CheckLicense
Replaces entities in the original text with new ones.
This class replaces entities in the original text with the ones obtained with, for example, DeIdentificationModel or DateNormalizer.
Example
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter_name = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val nameChunkObfuscator = new NameChunkObfuscatorApproach()
  .setInputCols("ner_chunk")
  .setOutputCol("replacement")
  .setRefFileFormat("csv")
  .setObfuscateRefFile("names_test.txt")
  .setRefSep("#")

val replacer_name = new Replacer()
  .setInputCols("replacement", "sentence")
  .setOutputCol("obfuscated_document_name")
  .setUseReplacement(true)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  word_embeddings,
  clinical_ner,
  ner_converter_name,
  nameChunkObfuscator,
  replacer_name
))

// Fit on an empty DataFrame to obtain a PipelineModel
val empty_data = Seq("").toDF("text")
val model_chunck_obfuscator = nlpPipeline.fit(empty_data)

val sample_text = "John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment."

val lmodel = new LightPipeline(model_chunck_obfuscator)
// annotate returns the result strings of each output column
val res = lmodel.annotate(sample_text)

println("Original text.  : " + res("sentence").head)
println("Obfuscated text : " + res("obfuscated_document_name").head)

Original text.  : John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.
Obfuscated text : Fitzpatrick is a <AGE> y.o. patient admitted. Mr. Bowman was seen by attending physician Dr. Acosta and was scheduled for emergency assessment.
- case class SentenceMaxException(message: String = "", cause: Throwable = None.orNull) extends Exception with Product with Serializable
-
case class
StructuredDeidentification(columnsMap: Map[String, String], seedMap: Map[String, Int] = Collections.emptyMap(), obfuscateRefFile: String = "", obfuscateRefSource: String = "both", days: Int = 0, useRandomDateDisplacement: Boolean = false, dateFormats: List[String] = ..., language: String = Language.English, idColumn: String = "") extends Product with Serializable
Utility class that helps to obfuscate tabular data.
- columnsMap
A map that assigns an entity to each column. The key of the map is the column in the dataframe and the value of the map is the entity for that column. The default entities are:
- |Entity| Description|
- |location| A general location|
- |location-other| A location that is not a country, street, hospital, city or state|
- |street| A street|
- |hospital| The name of a hospital|
- |city| A city|
- |state| A state|
- |zip| The zip code|
- |country| A country|
- |contact| The contact of one person|
- |username| A username|
- |phone| A phone number|
- |fax| A fax number|
- |url| An internet URL|
- |email| The email of one person|
- |profession| The profession of one person|
- |name| The name of one person|
- |doctor| The name of a doctor|
- |patient| The name of the patient|
- |id| A general ID number|
- |bioid| Is a system to screen for protein interactions as they occur in living cells|
- |age| The age of something or someone|
- |organization| The name of an organization or company|
- |healthplan| The ID that identifies the health plan|
- |medicalrecord| The identification of a medical record|
- |device| The ID that identifies a device|
- |date| A general date|
- |ssn| A Social Security Number|
- |ip| An Internet Protocol address|
- |passport| A random passport|
- |dln| A Driver's License Number|
- |npi| A National Provider Identifier|
- |c_card| A credit card number|
- |iban| An International Bank Account Number|
- |dea| A Drug Enforcement Administration number|
If the entity is not present, it will be masked.
- seedMap
Allows adding a seed for each column that you want to obfuscate. The seed is used to randomly select the entities used in obfuscation mode. By providing the same seed, you can replicate the same mapping multiple times.
- obfuscateRefFile
An optional parameter that lets you add your own terms to be used for obfuscation. The file contains the entity as the key and the terms to be used in the obfuscation as the value.
- obfuscateRefSource
The source used to obfuscate the entities. It does not apply to date entities. The allowed values are the following:
'file': Takes the entities from the obfuscateRefFile
'faker': Takes the entities from the Faker module
'both': Takes the entities from the obfuscateRefFile and the Faker module randomly.
- days
Number of days to obfuscate the dates by displacement. If not provided, a random integer between 1 and 60 will be used.
- useRandomDateDisplacement
Whether to use a random number of displacement days for date entities. If true, a random displacement is used; if false, the value of days is used.
- dateFormats
Formats of the dates to displace.
- language
The language used to select faker entities. The values are the following: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'
- idColumn
The column that contains the ID of the row. If provided, data, especially date entities, is obfuscated consistently by idColumn.
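The role of a per-column seed described under seedMap can be illustrated with a small sketch in plain Scala: seeding the random generator makes the choice of replacement values reproducible. This is an assumption-laden illustration using `scala.util.Random`, whereas the library itself is documented to use `java.security.SecureRandom`. `SeededObfuscationSketch` and `mapping` are hypothetical names.

```scala
import scala.util.Random

// Hypothetical helper, not part of the deid package.
object SeededObfuscationSketch {
  // Picks one replacement per distinct value with a Random seeded for the column.
  // Distinct values are sorted first so the mapping does not depend on row order.
  def mapping(values: Seq[String], candidates: Seq[String], seed: Int): Map[String, String] = {
    require(candidates.nonEmpty, "at least one candidate replacement is needed")
    val rng = new Random(seed)
    values.distinct.sorted.map(v => v -> candidates(rng.nextInt(candidates.length))).toMap
  }
}
```

Calling `mapping` twice with the same values, candidates, and seed returns the same map, which is the property the seedMap description relies on.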
-
case class
TextToDocumentColumns(columns: List[String]) extends Product with Serializable
- Attributes
- protected
Value Members
- val randomAlgorithm: String
- val securerandom: SecureRandom
- object DeIdentification extends DefaultParamsReadable[DeIdentification] with Serializable
- object DeIdentificationModel extends ReadablePretrainedDeId with ReadsFeatures with Serializable
-
object
DefaultRegex
- Attributes
- protected
- object DocumentHashCoder extends DefaultParamsReadable[DocumentHashCoder] with Serializable
- object Language
-
object
LightDeIdentification extends ParamsAndFeaturesReadable[LightDeIdentification] with Serializable
This is the companion object of LightDeIdentification. Please refer to that class for the documentation.
-
object
Obfuscator
- Attributes
- protected
- object ObfuscatorAnnotatorApproach extends DefaultParamsReadable[ObfuscatorAnnotatorApproach] with Serializable
- object ObfuscatorParams extends DefaultParamsReadable[DeIdentification] with Serializable
- object Replacer extends ParamsAndFeaturesReadable[Replacer] with Serializable