Deidentify (Enriched)

Description

Deidentify (Enriched) is a deidentification model. It identifies instances of protected health information (PHI) in text documents and can either obfuscate them (e.g., replacing names with different, fake names) or mask them (e.g., replacing “2020-06-04” with “<DATE>”). This makes it useful for maintaining HIPAA compliance when working with text documents that contain PHI.
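The difference between the two modes can be sketched in plain Python. This is only an illustration of the idea, not the model's implementation: the `deidentify` and `span_of` helpers, the hand-written spans, and the fake values are all hypothetical, and masking is assumed here to mean substituting a `<LABEL>` placeholder.

```python
from typing import Dict, List, Optional, Tuple

# (start, end, label) character span; in practice an NER stage would produce these.
Span = Tuple[int, int, str]


def span_of(text: str, value: str, label: str) -> Span:
    """Locate the first occurrence of `value` in `text` and label it (hypothetical helper)."""
    start = text.index(value)
    return (start, start + len(value), label)


def deidentify(text: str, spans: List[Span], mode: str = "mask",
               fakes: Optional[Dict[str, str]] = None) -> str:
    """Replace each labelled span with <LABEL> (mask) or a fake value (obfuscate)."""
    if mode not in ("mask", "obfuscate"):
        raise ValueError("mode must be 'mask' or 'obfuscate'")
    fakes = fakes or {}
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        placeholder = "<" + label + ">"
        if mode == "mask":
            pieces.append(placeholder)
        else:
            # Fall back to the placeholder when no fake value exists for the label.
            pieces.append(fakes.get(label, placeholder))
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)


text = "Record date : 2020-06-04 , David Hale , M.D ."
spans = [span_of(text, "2020-06-04", "DATE"), span_of(text, "David Hale", "DOCTOR")]

masked = deidentify(text, spans, mode="mask")
obfuscated = deidentify(text, spans, mode="obfuscate",
                        fakes={"DATE": "2020-06-09", "DOCTOR": "Gregory Kaiser"})
print(masked)
print(obfuscated)
```

Masking keeps the entity type visible but discards the value, while obfuscation keeps the text looking natural, which may matter when the output feeds downstream tools.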

Predicted Entities

  • PHONE
  • PATIENT
  • COUNTRY
  • USERNAME
  • LOCATION-OTHER
  • DATE
  • ID
  • DOCTOR
  • HOSPITAL
  • IDNUM
  • AGE
  • MEDICALRECORD
  • CITY
  • FAX
  • ZIP
  • HEALTHPLAN
  • PROFESSION
  • BIOID
  • URL
  • EMAIL
  • STATE
  • ORGANIZATION
  • STREET
  • DEVICE

How to use

# Python: the pipeline stages are assumed to be defined beforehand.
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter])

text ='''
A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street
'''
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

obfuscation = DeIdentificationModel.pretrained("deidentify_enriched_clinical", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("deidentified") \
      .setMode("obfuscate")

obfuscated_text = obfuscation.transform(result)

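// Scala: the pipeline stages are assumed to be defined beforehand; the deidentification stage is applied separately below.
val nlpPipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_sensitive_entities, nerConverter))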

val data = Seq("A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street").toDF("text")
val result = nlpPipeline.fit(data).transform(data)

val obfuscation = DeIdentificationModel.pretrained("deidentify_enriched_clinical", "en", "clinical/models")
        .setInputCols(Array("sentence", "token", "ner_chunk"))
        .setOutputCol("deidentified")
        .setMode("obfuscate")

val obfuscatedText = obfuscation.transform(result)

Results

	sentence	deidentified
0	A .	A .
1	Record date : 2093-01-13 , David Hale , M.D .	Record date : 2093-01-18 , DR. Gregory Kaiser , M.D .
2	, Name : Hendrickson , Ora MR .	, Name : Joel Vasquez MR .
3	# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .	# 67696 Date : 01/18/93 PCP : DR. Jennifer Eaton , 25 years-old , Record date : 2079-11-14 .
4	Cocke County Baptist Hospital .	San Leandro Hospital – San Leandro .
5	0295 Keats Street	3744 Retreat Avenue
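
In the sample output above, each date appears to move forward by the same five days while keeping its original format (2093-01-13 becomes 2093-01-18, 01/13/93 becomes 01/18/93). A minimal sketch of that kind of consistent date shifting, in plain Python, might look like the following. The `shift_date` helper and the format strings are hypothetical and are not taken from the model's implementation.

```python
from datetime import datetime, timedelta


def shift_date(value: str, days: int, fmt: str) -> str:
    """Shift a date string by `days`, returning it in the same format (hypothetical helper)."""
    shifted = datetime.strptime(value, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


# Dates from the sample record, paired with the format each one uses.
samples = [
    ("2093-01-13", "%Y-%m-%d"),
    ("01/13/93", "%m/%d/%y"),
    ("2079-11-09", "%Y-%m-%d"),
]

shifted = [shift_date(value, 5, fmt) for value, fmt in samples]
print(shifted)
```

Applying one offset to every date in a document preserves the intervals between events, which is often what matters clinically, while hiding the true dates.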

Model Information

Model Name: deidentify_enriched_clinical
Compatibility: Spark NLP 2.7.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token, ner_chunk]
Output Labels: [deidentified]
Language: en