Pretrained Zero-Shot Named Entity Recognition (zeroshot_ner_oncology_biomarker_large)

Description

Zero-shot Named Entity Recognition (NER) lets users identify entities in text without training a new model for each label set. By leveraging pre-trained language models and contextual understanding, zero-shot NER extends entity recognition to new domains and languages. The labels listed in this model card are defaults given as examples; users are not limited to them.

The model is designed to support any set of entity labels, so users can adapt it to their specific use cases. For best results, use labels that are conceptually similar to the provided defaults.

Predicted Entities

Biomarker, Biomarker_Result

How to use


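# Assumes an active Spark session named `spark` with Spark NLP and Healthcare NLP loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer
from sparknlp_jsl.annotator import PretrainedZeroShotNER, NerConverterInternal

document_assembler = DocumentAssembler()\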
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

labels = ['Biomarker', 'Biomarker_Result'] # You can change the entities
pretrained_zero_shot_ner = PretrainedZeroShotNER().pretrained("zeroshot_ner_oncology_biomarker_large", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("ner")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

ner_converter = NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")

pipeline = Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])

data = spark.createDataFrame([["""The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87% ."""]]).toDF("text")

result = pipeline.fit(data).transform(data)


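# Same pipeline using the johnsnowlabs library; assumes an active Spark session named `spark`.
from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\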
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

labels = ['Biomarker', 'Biomarker_Result'] # You can change the entities
pretrained_zero_shot_ner = medical.PretrainedZeroShotNER().pretrained("zeroshot_ner_oncology_biomarker_large", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("ner")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])

data = spark.createDataFrame([["""The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87% ."""]]).toDF("text")

result = pipeline.fit(data).transform(data)


val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val labels = Array("Biomarker", "Biomarker_Result") // You can change the entities
val pretrained_zero_shot_ner = PretrainedZeroShotNER.pretrained("zeroshot_ner_oncology_biomarker_large", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setPredictionThreshold(0.5)
    .setLabels(labels)

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")


val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
))

val data = Seq("""The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87% .""").toDF("text")

val result = pipeline.fit(data).transform(data)

Results


+----------------------------------------+-----+---+----------------+----------+
|chunk                                   |begin|end|ner_label       |confidence|
+----------------------------------------+-----+---+----------------+----------+
|negative                                |71   |78 |Biomarker_Result|0.9885888 |
|CK7                                     |84   |86 |Biomarker       |0.9961201 |
|synaptophysin                           |89   |101|Biomarker       |0.9968708 |
|Syn                                     |104  |106|Biomarker       |0.7719608 |
|chromogranin A                          |110  |123|Biomarker       |0.9940532 |
|Muc5AC                                  |132  |137|Biomarker       |0.99384034|
|human epidermal growth factor receptor-2|140  |179|Biomarker       |0.9571454 |
|HER2                                    |182  |185|Biomarker       |0.8589217 |
|Muc6                                    |193  |196|Biomarker       |0.9970079 |
|positive                                |199  |206|Biomarker_Result|0.9965706 |
|CK20                                    |212  |215|Biomarker       |0.9980994 |
|Muc1                                    |218  |221|Biomarker       |0.9856312 |
|Muc2                                    |224  |227|Biomarker       |0.9621688 |
|E-cadherin                              |230  |239|Biomarker       |0.98347497|
|p53                                     |246  |248|Biomarker       |0.99129874|
|Ki-67 index                             |255  |265|Biomarker       |0.7583356 |
|87%                                     |277  |279|Biomarker_Result|0.94230235|
+----------------------------------------+-----+---+----------------+----------+
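
The confidence column can be used to post-filter chunks more strictly than the 0.5 prediction threshold set in the pipeline. The following is a minimal sketch in plain Python, assuming the rows have already been collected from the ner_chunk column into tuples. The helper name filter_by_confidence and the 0.8 cutoff are illustrative choices, not part of the library.

```python
# Rows copied from the Results table: (chunk, ner_label, confidence).
rows = [
    ("negative", "Biomarker_Result", 0.9885888),
    ("CK7", "Biomarker", 0.9961201),
    ("synaptophysin", "Biomarker", 0.9968708),
    ("Syn", "Biomarker", 0.7719608),
    ("chromogranin A", "Biomarker", 0.9940532),
    ("Muc5AC", "Biomarker", 0.99384034),
    ("human epidermal growth factor receptor-2", "Biomarker", 0.9571454),
    ("HER2", "Biomarker", 0.8589217),
    ("Muc6", "Biomarker", 0.9970079),
    ("positive", "Biomarker_Result", 0.9965706),
    ("CK20", "Biomarker", 0.9980994),
    ("Muc1", "Biomarker", 0.9856312),
    ("Muc2", "Biomarker", 0.9621688),
    ("E-cadherin", "Biomarker", 0.98347497),
    ("p53", "Biomarker", 0.99129874),
    ("Ki-67 index", "Biomarker", 0.7583356),
    ("87%", "Biomarker_Result", 0.94230235),
]


def filter_by_confidence(rows, min_confidence):
    """Keep only chunks whose confidence is at least min_confidence (hypothetical helper)."""
    return [row for row in rows if row[2] >= min_confidence]


# Illustrative cutoff; choose a value that suits the downstream use case.
kept = filter_by_confidence(rows, 0.8)
dropped = [row[0] for row in rows if row not in kept]
print(dropped)
```

Raising the cutoff trades recall for precision, so it is worth checking which chunks fall out before fixing a value.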

Model Information

Model Name: zeroshot_ner_oncology_biomarker_large
Compatibility: Healthcare NLP 5.5.1+
License: Licensed
Edition: Official
Language: en
Size: 1.6 GB

Benchmarking

           label  precision    recall  f1-score   support
       Biomarker     0.9056    0.9210    0.9133      2178
Biomarker_Result     0.8310    0.8969    0.8627      1464
        accuracy          -         -    0.9436     13697
       macro avg     0.9022    0.9244    0.9128     13697
    weighted avg     0.9449    0.9436    0.9441     13697
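
As a quick sanity check on the per-label rows, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values in the table. The function name f1_score is illustrative, and a small difference in the last digit is expected because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) as listed in the Benchmarking table.
reported = {
    "Biomarker": (0.9056, 0.9210, 0.9133),
    "Biomarker_Result": (0.8310, 0.8969, 0.8627),
}

for label, (p, r, f1) in reported.items():
    print(f"{label}: recomputed={f1_score(p, r):.4f} reported={f1:.4f}")
```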