Extract Oncology Tests (LangTest)

Description

This model extracts mentions of tests from oncology texts, including pathology tests and imaging tests. It is a version of the ner_oncology_test model augmented with the langtest library.

Definitions of Predicted Entities:

Biomarker: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category.
Biomarker_Result: Terms or values that are identified as the result of biomarkers.
Imaging_Test: Imaging tests mentioned in texts, such as “chest CT scan”.
Oncogene: Mentions of genes that are implicated in the etiology of cancer.
Pathology_Test: Mentions of biopsies or tests that use tissue samples.

test_type	before fail_count	after fail_count	before pass_count	after pass_count	minimum pass_rate	before pass_rate	after pass_rate
add_ocr_typo	235	213	1937	1959	80%	89%	90%
add_typo	101	103	2058	2057	80%	95%	95%
number_to_word	87	82	832	837	80%	91%	91%
swap_entities	149	126	492	505	80%	77%	80%
titlecase	488	184	1761	2065	80%	78%	92%
uppercase	645	185	1612	2072	80%	71%	92%
weighted average	1470	680	6755	7536	80%	82.13%	91.72%
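
The pass rates in the table appear to follow pass_count / (pass_count + fail_count), rounded to the nearest percent. The sketch below recomputes three rows from the counts above under that assumption. The helper name `pass_rate` is illustrative and is not part of langtest.

```python
def pass_rate(fail_count: int, pass_count: int) -> float:
    """Fraction of test cases that passed."""
    return pass_count / (fail_count + pass_count)

# (before_fail, after_fail, before_pass, after_pass), copied from the table
rows = {
    "swap_entities": (149, 126, 492, 505),
    "titlecase": (488, 184, 1761, 2065),
    "uppercase": (645, 185, 1612, 2072),
}

MIN_PASS_RATE = 0.80  # the "minimum pass_rate" column

for test_type, (b_fail, a_fail, b_pass, a_pass) in rows.items():
    before = pass_rate(b_fail, b_pass)
    after = pass_rate(a_fail, a_pass)
    print(f"{test_type}: {before:.0%} -> {after:.0%}, "
          f"meets minimum after augmentation: {after >= MIN_PASS_RATE}")
```

The same formula applied to the counts in the last row gives the 82.13% and 91.72% figures reported there.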

Predicted Entities

Biomarker, Biomarker_Result, Imaging_Test, Oncogene, Pathology_Test

How to use

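from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes a Spark session named `spark` with Healthcare NLP is already running
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")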

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_test_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

data = spark.createDataFrame([["A biopsy was conducted using an ultrasound guided thick needle. His chest computed tomography (CT scan) was negative."]]).toDF("text")

result = pipeline.fit(data).transform(data)

// Scala version of the same pipeline
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")
    
val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")
    
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
    
val ner = MedicalNerModel.pretrained("ner_oncology_test_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

        
val pipeline = new Pipeline().setStages(Array(document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter))    

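// toDS requires the session implicits (assumes a SparkSession named `spark`)
import spark.implicits._

val data = Seq("A biopsy was conducted using an ultrasound guided thick needle. His chest computed tomography (CT scan) was negative.").toDS.toDF("text")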

val result = pipeline.fit(data).transform(data)

Results

+-------------------------+--------------+
|chunk                    |ner_label     |
+-------------------------+--------------+
|biopsy                   |Pathology_Test|
|ultrasound               |Imaging_Test  |
|chest computed tomography|Imaging_Test  |
|CT scan                  |Imaging_Test  |
+-------------------------+--------------+
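
A table like the one above can be built from the `ner_chunk` column. In Spark NLP, each chunk annotation carries its text in `result` and its label under the `entity` key of `metadata`. The sketch below assumes that layout and uses plain dictionaries as stand-ins for collected annotations, so it runs without a Spark session. The function name `chunk_rows` and the sample data are illustrative.

```python
from typing import Iterable, List, Mapping, Tuple

def chunk_rows(annotations: Iterable[Mapping]) -> List[Tuple[str, str]]:
    """Turn chunk annotations into (chunk, ner_label) pairs."""
    return [(ann["result"], ann["metadata"]["entity"]) for ann in annotations]

# Stand-in for the annotations held in one row of the ner_chunk column
sample = [
    {"result": "biopsy", "metadata": {"entity": "Pathology_Test"}},
    {"result": "ultrasound", "metadata": {"entity": "Imaging_Test"}},
    {"result": "chest computed tomography", "metadata": {"entity": "Imaging_Test"}},
    {"result": "CT scan", "metadata": {"entity": "Imaging_Test"}},
]

for chunk, label in chunk_rows(sample):
    print(f"{chunk:<26}{label}")
```

With a real DataFrame, the same function should work on each `row["ner_chunk"]` returned by `result.select("ner_chunk").collect()`, since pyspark `Row` objects support access by field name.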

Model Information

Model Name:	ner_oncology_test_langtest
Compatibility:	Healthcare NLP 5.1.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	14.8 MB

References

In-house annotated oncology case reports.

Benchmarking

label               precision  recall  f1-score  support 
B-Pathology_Test    0.78       0.80    0.79      254     
I-Pathology_Test    0.83       0.75    0.79      263     
B-Imaging_Test      0.85       0.89    0.87      414     
I-Imaging_Test      0.78       0.91    0.84      633     
B-Biomarker_Result  0.82       0.80    0.81      289     
I-Biomarker_Result  0.83       0.82    0.83      374     
B-Biomarker         0.83       0.81    0.82      374     
I-Biomarker         0.85       0.70    0.77      375     
B-Oncogene          0.79       0.86    0.82      100     
I-Oncogene          0.81       0.82    0.81      184     
micro-avg           0.82       0.82    0.82      3260    
macro-avg           0.82       0.81    0.81      3260    
weighted-avg        0.82       0.82    0.82      3260    
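
The `B-` and `I-` prefixes in the benchmark come from IOB-style tagging: `B-` marks the first token of an entity and `I-` a continuation. The benchmark therefore scores token-level tags, while the Results section lists whole chunks. The sketch below illustrates how such tags can be grouped into chunks. It is a simplified stand-in for a converter stage such as `NerConverter`, not its actual implementation, and the tags are hand-written for the example sentence rather than model output.

```python
from typing import List, Tuple

def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []
    label = None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag with the same label extends the open chunk
        if prefix == "I" and current and entity == label:
            current.append(token)
            continue
        # Anything else closes the open chunk, if there is one
        if current:
            chunks.append((" ".join(current), label))
            current, label = [], None
        # B- starts a new chunk; a stray I- is treated the same way (lenient)
        if prefix in ("B", "I"):
            current, label = [token], entity
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["His", "chest", "computed", "tomography", "(", "CT", "scan", ")",
          "was", "negative", "."]
# Hand-written tags for illustration
tags = ["O", "B-Imaging_Test", "I-Imaging_Test", "I-Imaging_Test", "O",
        "B-Imaging_Test", "I-Imaging_Test", "O", "O", "O", "O"]

print(merge_iob(tokens, tags))
```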
