Pretrained Zero-Shot Named Entity Recognition (zeroshot_ner_oncology_medium)

Description

Zero-shot Named Entity Recognition (NER) enables the identification of entities in text with minimal effort. By leveraging pre-trained language models and contextual understanding, zero-shot NER extends entity recognition capabilities to new domains and languages. While the model card includes default labels as examples, it is important to highlight that users are not limited to these labels.

The model is designed to support any set of entity labels, allowing users to adapt it to their specific use cases. For best results, it is recommended to use labels that are conceptually similar to the provided defaults.

Predicted Entities

Adenopathy, Age, Biomarker, Biomarker_Result, Body_Part, Cancer_Dx, Cancer_Surgery,
Cycle_Count, Cycle_Day, Date, Death_Entit, Directio, Dosage, Duration, Frequency,
Gender, Grade, Histological_Type, Imaging_Test, Invasion, Metastasis, Oncogene, Pathology_Test,
Race_Ethnicity, Radiation_Dose, Relative_Date, Response_To_Treatment, Route, Smoking_Status,
Staging, Therapy, Tumor_Finding, Tumor_Size

Copy S3 URI

How to use


document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

labels = ["Adenopathy", "Age","Biomarker","Biomarker_Result","Body_Part","Cancer_Dx","Cancer_Surgery",
    "Cycle_Count","Cycle_Day","Date","Death_Entit","Directio","Dosage","Duration","Frequency",
    "Gender","Grade","Histological_Type","Imaging_Test","Invasion","Metastasis","Oncogene","Pathology_Test",
    "Race_Ethnicity","Radiation_Dose","Relative_Date","Response_To_Treatment","Route","Smoking_Status",
    "Staging","Therapy","Tumor_Finding","Tumor_Size"]

pretrained_zero_shot_ner = PretrainedZeroShotNER().pretrained("zeroshot_ner_oncology_medium", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("ner")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

ner_converter = NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")


pipeline = Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])

data = spark.createDataFrame([["""Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis."""]]).toDF("text")

result = pipeline.fit(data).transform(data)


val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

labels = Array("Adenopathy", "Age","Biomarker","Biomarker_Result","Body_Part","Cancer_Dx","Cancer_Surgery",
    "Cycle_Count","Cycle_Day","Date","Death_Entit","Directio","Dosage","Duration","Frequency",
    "Gender","Grade","Histological_Type","Imaging_Test","Invasion","Metastasis","Oncogene","Pathology_Test",
    "Race_Ethnicity","Radiation_Dose","Relative_Date","Response_To_Treatment","Route","Smoking_Status",
    "Staging","Therapy","Tumor_Finding","Tumor_Size")

val pretrained_zero_shot_ner = PretrainedZeroShotNER().pretrained("zeroshot_ner_oncology_medium", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setPredictionThreshold(0.5)
    .setLabels(labels)

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")


val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
))

val data = Seq([["""Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis."""]]).toDF("text")

val result = pipeline.fit(data).transform(data)

Results


+-------------+-----+---+-----------------+----------+
|chunk        |begin|end|ner_label        |confidence|
+-------------+-----+---+-----------------+----------+
|Two years ago|1    |13 |Relative_Date    |0.9153258 |
|tumor        |45   |49 |Tumor_Finding    |0.98980695|
|her          |54   |56 |Gender           |0.99849236|
|left         |58   |61 |Direction        |0.99010885|
|breast       |63   |68 |Body_Part        |0.97540295|
|adenopathies |74   |85 |Adenopathy       |0.83176845|
|She          |88   |90 |Gender           |0.9997961 |
|invasive     |111  |118|Invasion         |0.93775606|
|ductal       |120  |125|Histological_Type|0.90716   |
|carcinoma    |127  |135|Cancer_Dx        |0.946235  |
|Last week    |138  |146|Relative_Date    |0.8142577 |
|she          |148  |150|Gender           |0.99979   |
|lung         |177  |180|Body_Part        |0.98785883|
|metastasis   |182  |191|Metastasis       |0.99683565|
+-------------+-----+---+-----------------+----------+

Model Information

Model Name: zeroshot_ner_oncology_medium
Compatibility: Healthcare NLP 5.5.1+
License: Licensed
Edition: Official
Language: en
Size: 711.2 MB

Benchmarking

                label  precision    recall  f1-score   support
           Adenopathy     0.4527    0.7283    0.5583        92
                  Age     0.9583    0.9851    0.9715      1074
            Biomarker     0.6589    0.8191    0.7303      1752
     Biomarker_Result     0.4782    0.8450    0.6107      1632
            Body_Part     0.6761    0.8969    0.7710      3540
            Cancer_Dx     0.8778    0.7397    0.8029      1360
       Cancer_Surgery     0.6859    0.7143    0.6998       749
          Cycle_Count     0.8182    0.8229    0.8205       350
            Cycle_Day     0.6464    0.7126    0.6779       254
                 Date     0.9743    0.9870    0.9806       921
         Death_Entity     0.9189    0.9444    0.9315        36
            Direction     0.7552    0.8412    0.7958       957
               Dosage     0.7772    0.8164    0.7963      1111
             Duration     0.6831    0.8566    0.7601       760
            Frequency     0.7599    0.7310    0.7451       394
               Gender     0.9807    0.9876    0.9841      1286
                Grade     0.5390    0.6434    0.5866       258
    Histological_Type     0.5192    0.6239    0.5668       476
         Imaging_Test     0.8460    0.8503    0.8482      2145
             Invasion     0.4538    0.8674    0.5958       181
           Metastasis     0.9441    0.8579    0.8989       394
             Oncogene     0.6986    0.5322    0.6041       466
       Pathology_Test     0.7308    0.7082    0.7193      1100
       Race_Ethnicity     0.8889    0.9492    0.9180        59
       Radiation_Dose     0.5897    0.8156    0.6845       141
        Relative_Date     0.8698    0.6558    0.7478      1284
Response_To_Treatment     0.4634    0.6412    0.5380       641
                Route     0.5722    0.7055    0.6319       146
       Smoking_Status     0.9200    0.8070    0.8598        57
              Staging     0.5579    0.7130    0.6260       223
              Therapy     0.7379    0.8733    0.7999      2012
        Tumor_Finding     0.8662    0.8634    0.8648      1252
           Tumor_Size     0.7529    0.9859    0.8538      1066
             accuracy          -         -    0.8744     94819
            macro avg     0.7354    0.8063    0.7618     94819
         weighted avg     0.8907    0.8744    0.8792     94819