Description
Deidentify (Large) is a deidentification model. It identifies instances of protected health information (PHI) in text documents and can either obfuscate them (e.g., replacing names with different, fake names) or mask them (e.g., replacing “2020-06-04” with “<DATE>”). This model is useful for maintaining HIPAA compliance when working with text documents that contain protected health information.
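The difference between the two modes can be illustrated with a small plain-Python sketch. This is a toy illustration only, not the model's implementation: the character spans in `entities`, the `fake_values` table, and the `deidentify` helper are all hypothetical, whereas the real model locates entities with a clinical NER stage and generates its own replacement values.

```python
# Toy illustration only: entity spans are hard-coded here, whereas the real
# model finds them with a clinical NER stage.
text = "Record date : 2093-01-13 , David Hale , M.D ."

# (start, end, label) character spans; hypothetical values for this sketch.
entities = [(14, 24, "DATE"), (27, 37, "DOCTOR")]

# Hypothetical replacement values standing in for generated fake data.
fake_values = {"DATE": "2093-01-18", "DOCTOR": "Gregory Kaiser"}


def deidentify(text, entities, mode):
    """Replace each span with its label (mask) or a fake value (obfuscate)."""
    if mode not in ("mask", "obfuscate"):
        raise ValueError(f"unknown mode: {mode}")
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(entities, reverse=True):
        replacement = f"<{label}>" if mode == "mask" else fake_values[label]
        out = out[:start] + replacement + out[end:]
    return out


masked = deidentify(text, entities, "mask")
obfuscated = deidentify(text, entities, "obfuscate")
print(masked)
print(obfuscated)
```

Masking keeps the entity type visible to downstream readers, while obfuscation keeps the text looking natural; which one fits depends on how the deidentified documents will be used.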
Predicted Entities
PHONE, PATIENT, COUNTRY, USERNAME, LOCATION-OTHER, DATE, ID, DOCTOR, HOSPITAL, IDNUM, AGE, MEDICALRECORD, CITY, FAX, ZIP, HEALTHPLAN, PROFESSION, BIOID, URL, EMAIL, STATE, ORGANIZATION, STREET, DEVICE
How to use
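# Python
from pyspark.ml import Pipeline
from sparknlp_jsl.annotator import DeIdentificationModel

# The pipeline stages (documentAssembler through ner_converter) are assumed to be defined beforehand.
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter])
text = '''
A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street
'''
data = spark.createDataFrame([[text]]).toDF("text")
# Fit the pipeline, then annotate the text with sentences, tokens, and NER chunks.
model = nlpPipeline.fit(data)
result = model.transform(data)
# Load the pretrained deidentification model in obfuscation mode.
obfuscation = DeIdentificationModel.pretrained("deidentify_enriched_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")
obfuscated_text = obfuscation.transform(result)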
// Scala
import org.apache.spark.ml.Pipeline
import spark.implicits._

val nlpPipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_sensitive_entities, nerConverter, de_identification))
val data = Seq("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
val obfuscation = DeIdentificationModel.pretrained("deidentify_enriched_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "ner_chunk"))
  .setOutputCol("obfuscated")
  .setMode("obfuscate")
val obfuscatedText = obfuscation.transform(result)
Results
| | sentence | deidentified |
| 0 | A . | A . |
| 1 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-18 , DR. Gregory Kaiser , M.D . |
| 2 | , Name : Hendrickson , Ora MR . | , Name : Joel Vasquez MR . |
| 3 | # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . | # 67696 Date : 01/18/93 PCP : DR. Jennifer Eaton , 25 years-old , Record date : 2079-11-14 . |
| 4 | Cocke County Baptist Hospital . | San Leandro Hospital – San Leandro . |
| 5 | 0295 Keats Street | 3744 Retreat Avenue |
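In the sample output above, every date appears moved forward by the same five days while keeping its original format. The sketch below reproduces that behaviour in plain Python. The `shift_date` helper and the `FORMATS` list are hypothetical names introduced for illustration, and this is not necessarily how the model computes the shift internally.

```python
from datetime import datetime, timedelta

# Date formats seen in the example text; this list is an assumption for the sketch.
FORMATS = ["%Y-%m-%d", "%m/%d/%y"]


def shift_date(text, days):
    """Shift a date string by `days`, keeping its original format."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # not this format, try the next one
        return (parsed + timedelta(days=days)).strftime(fmt)
    return text  # leave unrecognized strings untouched


originals = ["2093-01-13", "01/13/93", "2079-11-09"]
shifted = [shift_date(d, 5) for d in originals]
print(shifted)
```

Applying one offset to every date in a document keeps the intervals between events intact, which can matter for clinical timelines.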
Model Information
Model Name: | deidentify_enriched_clinical |
Compatibility: | Spark NLP 2.7.2+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, ner_chunk] |
Output Labels: | [deidentified] |
Language: | en |