Pretrained Zero-Shot PHI Detection for Deidentification (Zero-shot - Medium - Subentity - Docwise)

Description

Zero-shot Named Entity Recognition (NER) enables the identification of entities in text with minimal effort. By leveraging pre-trained language models and contextual understanding, zero-shot NER extends entity recognition capabilities to new domains and languages. While the model card includes default labels as examples, it is important to highlight that users are not limited to these labels. The model is designed to support any set of entity labels, allowing users to adapt it to their specific use cases. For best results, it is recommended to use labels that are conceptually similar to the provided defaults.

Predicted Entities

DATE, PATIENT, COUNTRY, PROFESSION, AGE, CITY, STATE, DOCTOR, HOSPITAL, IDNUM, ORGANIZATION, PHONE, STREET, ZIP

Copy S3 URI

How to use


document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

labels = ["AGE", "CITY", "COUNTRY", "DATE", "DOCTOR", "HOSPITAL", "IDNUM", "ORGANIZATION", 
          "PATIENT", "PHONE", "PROFESSION", "STATE", "STREET", "ZIP"]

pretrained_zero_shot_ner = PretrainedZeroShotNER().pretrained("zeroshot_ner_deid_subentity_docwise_medium", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("ner")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

ner_converter = NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")


pipeline = Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])

data = spark.createDataFrame([["""Emily Davis, a 34-year-old woman, Dr. Michael Johnson cares wit her, at CarePlus Clinic, located at 456 Elm Street, NewYork, NY has recommended starting insulin therapy. She has an appointment scheduled for March 15, 2024."""]]).toDF("text")

result = pipeline.fit(data).transform(data)


val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

labels = Array("AGE", "CITY", "COUNTRY", "DATE", "DOCTOR", "HOSPITAL", "IDNUM", "ORGANIZATION", 
          "PATIENT", "PHONE", "PROFESSION", "STATE", "STREET", "ZIP")

val pretrained_zero_shot_ner = PretrainedZeroShotNER().pretrained("zeroshot_ner_deid_subentity_docwise_medium", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setPredictionThreshold(0.5)
    .setLabels(labels)

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")


val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
))

val data = Seq([["""Emily Davis, a 34-year-old woman, Dr. Michael Johnson cares wit her, at CarePlus Clinic, located at 456 Elm Street, NewYork, NY has recommended starting insulin therapy. She has an appointment scheduled for March 15, 2024."""]]).toDF("text")

val result = pipeline.fit(data).transform(data)

Results


+---------------+-----+---+---------+----------+
|chunk          |begin|end|ner_label|confidence|
+---------------+-----+---+---------+----------+
|Emily Davis    |1    |11 |PATIENT  |0.99571335|
|34-year-old    |16   |26 |AGE      |0.99942374|
|Michael Johnson|39   |53 |DOCTOR   |0.99911386|
|CarePlus Clinic|73   |87 |HOSPITAL |0.87801534|
|456 Elm Street |101  |114|STREET   |0.99828523|
|NewYork        |117  |123|CITY     |0.9850912 |
|NY             |126  |127|STATE    |0.9824218 |
|March 15, 2024 |208  |221|DATE     |0.9994097 |
+---------------+-----+---+---------+----------+

Model Information

Model Name: zeroshot_ner_deid_subentity_docwise_medium
Compatibility: Healthcare NLP 5.5.1+
License: Licensed
Edition: Official
Language: en
Size: 711.7 MB

Benchmarking

       label  precision    recall  f1-score   support
         AGE     0.8287    0.9367    0.8794      1074
        CITY     0.7973    0.9067    0.8485       525
     COUNTRY     0.8430    0.8146    0.8286       178
        DATE     0.9867    0.9675    0.9770      7995
      DOCTOR     0.9698    0.9125    0.9403      5134
    HOSPITAL     0.8397    0.8379    0.8388      2276
       IDNUM     0.8720    0.5780    0.6952       955
           O     0.9965    0.9959    0.9962    315085
ORGANIZATION     0.5846    0.6032    0.5938       189
     PATIENT     0.8419    0.9547    0.8947      2364
       PHONE     0.6293    0.8455    0.7216       492
  PROFESSION     0.7300    0.9028    0.8072       494
       STATE     0.6954    0.9073    0.7874       302
      STREET     0.8969    0.9636    0.9291       605
         ZIP     0.8879    1.0000    0.9406       198
    accuracy                         0.9903    337866
   macro avg     0.8267    0.8751    0.8452    337866
weighted avg     0.9908    0.9903    0.9904    337866