Extract Oncology Tests

Description

This model extracts mentions of tests from oncology texts, including pathology tests and imaging tests.

Definitions of Predicted Entities:

  • Biomarker: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category.
  • Biomarker_Result: Terms or values identified as the result of a biomarker.
  • Imaging_Test: Imaging tests mentioned in texts, such as “chest CT scan”.
  • Oncogene: Mentions of genes that are implicated in the etiology of cancer.
  • Pathology_Test: Mentions of biopsies or tests that use tissue samples.

Predicted Entities

Biomarker, Biomarker_Result, Imaging_Test, Oncogene, Pathology_Test


How to use

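from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes `spark` is an active Spark session with Healthcare NLP loaded.
document_assembler = DocumentAssembler()\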
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_test_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

data = spark.createDataFrame([["A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative."]]).toDF("text")

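result = pipeline.fit(data).transform(data)

// Scala equivalent of the Python pipeline above.
// MedicalNerModel requires the licensed Healthcare NLP library on the classpath.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val document_assembler = new DocumentAssembler()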
    .setInputCol("text")
    .setOutputCol("document")
    
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")
    
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")
    
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
    
val ner = MedicalNerModel.pretrained("ner_oncology_test_wip", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

        
val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner,
    ner_converter))

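// Brings .toDS into scope for local collections; assumes `spark` is the active SparkSession.
import spark.implicits._

val data = Seq("A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.").toDS.toDF("text")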

val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("en.med_ner.oncology_test_wip").predict("""A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.""")

Results

| chunk                     | ner_label      |
|:--------------------------|:---------------|
| biopsy                    | Pathology_Test |
| ultrasound                | Imaging_Test   |
| chest computed tomography | Imaging_Test   |
| CT                        | Imaging_Test   |

Model Information

Model Name: ner_oncology_test_wip
Compatibility: Healthcare NLP 4.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 858.2 KB

References

In-house annotated oncology case reports.

Benchmarking

           label     tp    fp    fn  total  precision  recall   f1
    Imaging_Test 1518.0 156.0 191.0 1709.0       0.91    0.89 0.90
Biomarker_Result  861.0 145.0 245.0 1106.0       0.86    0.78 0.82
  Pathology_Test  600.0 105.0 209.0  809.0       0.85    0.74 0.79
       Biomarker  917.0 166.0 194.0 1111.0       0.85    0.83 0.84
        Oncogene  274.0  84.0  83.0  357.0       0.77    0.77 0.77
       macro_avg 4170.0 656.0 922.0 5092.0       0.85    0.80 0.82
       micro_avg    NaN   NaN   NaN    NaN       0.86    0.82 0.84
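The precision, recall, and f1 columns follow from the tp, fp, and fn counts. The sketch below recomputes them under the standard definitions (precision = tp / (tp + fp), recall = tp / (tp + fn), f1 as their harmonic mean). It assumes that the macro average is the unweighted mean of the per-label scores and that the micro average is computed from the pooled counts, which are the totals printed in the macro_avg row. The helper name `prf` is illustrative.

```python
# (tp, fp, fn) per label, copied from the benchmarking table.
counts = {
    "Imaging_Test": (1518, 156, 191),
    "Biomarker_Result": (861, 145, 245),
    "Pathology_Test": (600, 105, 209),
    "Biomarker": (917, 166, 194),
    "Oncogene": (274, 84, 83),
}


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_label = {label: prf(*c) for label, c in counts.items()}

# Macro average: unweighted mean of the per-label scores.
n = len(per_label)
macro = tuple(sum(scores[i] for scores in per_label.values()) / n for i in range(3))

# Micro average: pool the counts across labels, then score once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = prf(tp, fp, fn)

for label, scores in per_label.items():
    print(label, [round(x, 2) for x in scores])
print("macro_avg", [round(x, 2) for x in macro])
print("micro_avg", [round(x, 2) for x in micro])
```

Rounded to two decimals, these values agree with the table. The micro average weights each label by its frequency, so it sits closer to the scores of the larger classes such as Imaging_Test.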