Description
This model extracts mentions of tests from oncology texts, including pathology tests and imaging tests. It is the ner_oncology_test model augmented with the langtest library.
Definitions of Predicted Entities:
Biomarker: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category.
Biomarker_Result: Terms or values that are identified as the result of biomarkers.
Imaging_Test: Imaging tests mentioned in texts, such as “chest CT scan”.
Oncogene: Mentions of genes that are implicated in the etiology of cancer.
Pathology_Test: Mentions of biopsies or tests that use tissue samples.
The table below summarizes the langtest robustness tests, comparing fail counts and pass rates before and after augmentation:

test_type | before fail_count | after fail_count | before pass_count | after pass_count | minimum pass_rate | before pass_rate | after pass_rate |
---|---|---|---|---|---|---|---|
add_ocr_typo | 235 | 213 | 1937 | 1959 | 80% | 89% | 90% |
add_typo | 101 | 103 | 2058 | 2057 | 80% | 95% | 95% |
number_to_word | 87 | 82 | 832 | 837 | 80% | 91% | 91% |
swap_entities | 149 | 126 | 492 | 505 | 80% | 77% | 80% |
titlecase | 488 | 184 | 1761 | 2065 | 80% | 78% | 92% |
uppercase | 645 | 185 | 1612 | 2072 | 80% | 71% | 92% |
weighted average | 1470 | 680 | 6755 | 7536 | 80% | 82.13% | 91.72% |
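The figures above follow the standard langtest workflow: a test harness is built around the base model, perturbed test cases are generated for each test type, and pass rates are computed against a minimum threshold. The sketch below illustrates that workflow in Python; the data path and configuration values are illustrative assumptions, not the exact setup used to produce this model.

from langtest import Harness

# Build a harness around the base NER model (the CoNLL test file path is hypothetical)
harness = Harness(
    task="ner",
    model={"model": "ner_oncology_test", "hub": "johnsnowlabs"},
    data={"data_source": "path/to/oncology_test.conll"},
)

# Configure a subset of the robustness tests listed in the table above
harness.configure({
    "tests": {
        "defaults": {"min_pass_rate": 0.80},
        "robustness": {
            "add_ocr_typo": {"min_pass_rate": 0.80},
            "uppercase": {"min_pass_rate": 0.80},
            "titlecase": {"min_pass_rate": 0.80},
        },
    }
})

# Generate perturbed test cases, run them, and report pass rates per test type
harness.generate().run().report()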
Predicted Entities
Biomarker, Biomarker_Result, Imaging_Test, Oncogene, Pathology_Test
How to use
# Spark NLP for Healthcare imports (assumes a Spark session started with sparknlp_jsl)
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_oncology_test_langtest", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter])
data = spark.createDataFrame([["A biopsy was conducted using an ultrasound guided thick needle. His chest computed tomography (CT scan) was negative."]]).toDF("text")
result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_oncology_test_langtest", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter))
val data = Seq("A biopsy was conducted using an ultrasound guided thick needle. His chest computed tomography (CT scan) was negative.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
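For quick inference on raw strings without building a Spark DataFrame, the fitted pipeline can be wrapped in a LightPipeline. A minimal Python sketch, assuming the pipeline and data objects defined in the Python example above; the sample sentence is only illustrative.

from sparknlp.base import LightPipeline

# Wrap the fitted pipeline for fast, in-memory annotation of plain strings
light_model = LightPipeline(pipeline.fit(data))

annotations = light_model.fullAnnotate(
    "Her breast ultrasound was suspicious, so a core needle biopsy was performed."
)[0]

# Each detected chunk carries its text and the predicted entity label
for chunk in annotations["ner_chunk"]:
    print(chunk.result, "->", chunk.metadata["entity"])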
Results
+-------------------------+--------------+
|chunk |ner_label |
+-------------------------+--------------+
|biopsy |Pathology_Test|
|ultrasound |Imaging_Test |
|chest computed tomography|Imaging_Test |
|CT scan |Imaging_Test |
+-------------------------+--------------+
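The chunk/label pairs in the table above can be pulled from the result DataFrame with standard Spark SQL functions. A minimal sketch, assuming the result DataFrame produced by the Python pipeline shown earlier:

from pyspark.sql import functions as F

# Explode the ner_chunk annotations and keep each chunk's text and entity label
result.select(F.explode("ner_chunk").alias("chunk")) \
    .select(F.col("chunk.result").alias("chunk"),
            F.col("chunk.metadata")["entity"].alias("ner_label")) \
    .show(truncate=False)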
Model Information
Model Name: ner_oncology_test_langtest
Compatibility: Healthcare NLP 5.1.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.8 MB
References
In-house annotated oncology case reports.
Benchmarking
label precision recall f1-score support
B-Pathology_Test 0.78 0.80 0.79 254
I-Pathology_Test 0.83 0.75 0.79 263
B-Imaging_Test 0.85 0.89 0.87 414
I-Imaging_Test 0.78 0.91 0.84 633
B-Biomarker_Result 0.82 0.80 0.81 289
I-Biomarker_Result 0.83 0.82 0.83 374
B-Biomarker 0.83 0.81 0.82 374
I-Biomarker 0.85 0.70 0.77 375
B-Oncogene 0.79 0.86 0.82 100
I-Oncogene 0.81 0.82 0.81 184
micro-avg 0.82 0.82 0.82 3260
macro-avg 0.82 0.81 0.81 3260
weighted-avg 0.82 0.82 0.82 3260