Extract Entities Related to TNM Staging (LangTest)

Description

This model extracts staging information and mentions related to tumors, lymph nodes, and metastases. It is the version of ner_oncology_tnm model augmented with langtest library.

Definitions of Predicted Entities:

  • Cancer_Dx: Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction.
  • Lymph_Node: Mentions of lymph nodes and pathological findings of the lymph nodes.
  • Lymph_Node_Modifier: Words that refer to a lymph node being abnormal (such as “enlargement”).
  • Metastasis: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions.
  • Staging: Mentions of cancer stage such as “stage 2b” or “T2N1M0”. It also includes words such as “in situ”, “early-stage” or “advanced”.
  • Tumor: All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”).
  • Tumor_Description: Information related to tumor characteristics, such as size, presence of invasion, grade, and hystological type.
test_type before fail_count after fail_count before pass_count after pass_count minimum pass_rate before pass_rate after pass_rate
add_typo 249 209 4089 4146 70% 94% 95%
lowercase 273 238 4243 4278 70% 94% 95%
swap_entities 362 342 1192 1211 70% 77% 78%
titlecase 985 712 3517 3790 70% 78% 84%
uppercase 1779 378 2736 4137 70% 61% 92%
weighted average 3648 1879 15777 17562 70% 81.22% 90.33%

Predicted Entities

Cancer_Dx, Lymph_Node, Lymph_Node_Modifier, Metastasis, Staging, Tumor, Tumor_Description

Copy S3 URI

How to use

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")                

ner = MedicalNerModel.pretrained("ner_oncology_tnm_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

data = spark.createDataFrame([["Chest computed tomography (CT) showed pulmonary lesions in the posterior segment of the right upper lobe, and peripheral lung cancer with multiple pulmonary metastases. Multiple metastases of the thoracic vertebrae, sternum, and ribs were considered, which were similar to previous CT images."]]).toDF("text")

result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")
    
val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")
    
val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")                
    
val ner = MedicalNerModel.pretrained("ner_oncology_tnm_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

        
val pipeline = new Pipeline().setStages(Array(document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter))    

val data = Seq("Chest computed tomography (CT) showed pulmonary lesions in the posterior segment of the right upper lobe, and peripheral lung cancer with multiple pulmonary metastases. Multiple metastases of the thoracic vertebrae, sternum, and ribs were considered, which were similar to previous CT images.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+----------------------+----------+
|chunk                 |ner_label |
+----------------------+----------+
|lesions               |Tumor     |
|peripheral lung cancer|Cancer_Dx |
|metastases            |Metastasis|
|metastases            |Metastasis|
+----------------------+----------+

Model Information

Model Name: ner_oncology_tnm_langtest
Compatibility: Healthcare NLP 5.1.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.8 MB

References

In-house annotated oncology case reports.

Benchmarking

label                precision  recall  f1-score  support 
Cancer_Dx            0.88       0.86    0.87      721     
Lymph_Node           0.78       0.77    0.77      271     
Lymph_Node_Modifier  0.79       0.76    0.77      45      
Metastasis           0.92       0.92    0.92      321     
Staging              0.88       0.80    0.84      123     
Tumor                0.90       0.86    0.88      1018    
Tumor_Description    0.70       0.68    0.69      1060    
micro-avg            0.83       0.80    0.81      3559    
macro-avg            0.84       0.81    0.82      3559    
weighted-avg         0.83       0.80    0.81      3559