Detect PHI in text (enriched-biobert)

Description

Detect sensitive information in text for de-identification using pretrained NER model.

Predicted Entities

DOCTOR, PHONE, COUNTRY, MEDICALRECORD, STREET, CITY, PROFESSION, PATIENT, IDNUM, BIOID, HEALTHPLAN, HOSPITAL, USERNAME, OTHER, AGE, FAX, EMAIL, DATE, STATE, ZIP, URL, ORGANIZATION, DEVICE

Live Demo Open in Colab Download

How to use


...
embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")  .setInputCols(["sentence", "token"])  .setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_enriched_biobert", "en", "clinical/models")   .setInputCols(["sentence", "token", "embeddings"])   .setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))

...
val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_deid_enriched_biobert", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val result = pipeline.fit(Seq.empty[""].toDS.toDF("text")).transform(data)

Model Information

Model Name: ner_deid_enriched_biobert
Compatibility: Spark NLP for Healthcare 3.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en