Extract Entities Related to TNM Staging

Description

This model extracts staging information and mentions related to tumors, lymph nodes and metastases. Tumor_Description is used to extract characteristics from tumors such as size, histological type or presence of invasion. Lymph_Node_Modifier is used to extract modifiers that refer to an abnormal lymph node (such as “enlarged”).

Definitions of Predicted Entities:

  • Cancer_Dx: Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction.
  • Lymph_Node: Mentions of lymph nodes and pathological findings of the lymph nodes.
  • Lymph_Node_Modifier: Words that refer to a lymph node being abnormal (such as “enlargement”).
  • Metastasis: Terms that indicate metastatic disease. Anatomical references are not included in these extractions.
  • Staging: Mentions of cancer stage such as “stage 2b” or “T2N1M0”. It also includes words such as “in situ”, “early-stage” or “advanced”.
  • Tumor: All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”).
  • Tumor_Description: Information related to tumor characteristics, such as size, presence of invasion, grade and histological type.
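A Staging chunk such as “T2N1M0” packs three categories into one string, so downstream code may want to split it. The following is a minimal post-processing sketch, not part of the model: the helper name `parse_tnm` and the regular expression are assumptions made for illustration, and the pattern covers only the common T/N/M categories.

```python
import re
from typing import Optional

# Illustrative pattern only (an assumption, not taken from the model):
# optional c/p/y prefix, then T, N and M categories with optional subcategory letters.
TNM_PATTERN = re.compile(
    r"\b(?:[cpy]{0,2})"
    r"T(?P<T>is|x|[0-4][a-d]?)"
    r"N(?P<N>x|[0-3][a-c]?)"
    r"M(?P<M>x|[01][a-c]?)\b",
    re.IGNORECASE,
)

def parse_tnm(chunk: str) -> Optional[dict]:
    """Return the T, N and M categories found in a Staging chunk, or None."""
    match = TNM_PATTERN.search(chunk)
    if match is None:
        return None
    return match.groupdict()

print(parse_tnm("T2N1M1 stage IV"))
print(parse_tnm("stage 2b"))
```

Chunks without a TNM code, such as “stage 2b” or “early-stage”, return `None` and would need separate handling.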

Predicted Entities

Cancer_Dx, Lymph_Node, Lymph_Node_Modifier, Metastasis, Staging, Tumor, Tumor_Description


How to use

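# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active, licensed Spark session named `spark`.

document_assembler = DocumentAssembler()\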
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_tnm_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

data = spark.createDataFrame([["The final diagnosis was metastatic breast carcinoma, and the TNM classification was T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2."]]).toDF("text")

result = pipeline.fit(data).transform(data)

// Scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")
    
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")
    
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
    
val ner = MedicalNerModel.pretrained("ner_oncology_tnm_wip", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler,
                                              sentence_detector,
                                              tokenizer,
                                              word_embeddings,
                                              ner,
                                              ner_converter))

// toDS needs the session's implicits in scope.
import spark.implicits._

val data = Seq("The final diagnosis was metastatic breast carcinoma, and the TNM classification was T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

# NLU
import nlu
nlu.load("en.med_ner.oncology_tnm_wip").predict("""The final diagnosis was metastatic breast carcinoma, and the TNM classification was T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.""")

Results

| chunk              | ner_label         |
|:-------------------|:------------------|
| metastatic         | Metastasis        |
| breast carcinoma   | Cancer_Dx         |
| T2N1M1 stage IV    | Staging           |
| histological grade | Tumor_Description |
| 4 cm               | Tumor_Description |
| tumor              | Tumor             |
| grade 2            | Tumor_Description |
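
The table above pairs each extracted chunk with its label. As a rough sketch of how such rows could be built, the snippet below assumes each `ner_chunk` annotation exposes the chunk text as `result` and the label under the `"entity"` key of `metadata`. `FakeAnnotation` and `chunks_to_rows` are hypothetical names used so the example runs without Spark; with a fitted pipeline the annotations could instead come from something like `LightPipeline(model).fullAnnotate(text)[0]["ner_chunk"]`.

```python
from collections import namedtuple

# Stand-in for Spark NLP annotation objects (assumption: only the two
# attributes used here, `result` and `metadata`, are modelled).
FakeAnnotation = namedtuple("FakeAnnotation", ["result", "metadata"])

def chunks_to_rows(annotations):
    """Turn ner_chunk annotations into (chunk, ner_label) pairs."""
    return [(a.result, a.metadata["entity"]) for a in annotations]

# Sample annotations mirroring the Results table.
sample = [
    FakeAnnotation("metastatic", {"entity": "Metastasis"}),
    FakeAnnotation("breast carcinoma", {"entity": "Cancer_Dx"}),
    FakeAnnotation("T2N1M1 stage IV", {"entity": "Staging"}),
    FakeAnnotation("histological grade", {"entity": "Tumor_Description"}),
    FakeAnnotation("4 cm", {"entity": "Tumor_Description"}),
    FakeAnnotation("tumor", {"entity": "Tumor"}),
    FakeAnnotation("grade 2", {"entity": "Tumor_Description"}),
]

rows = chunks_to_rows(sample)
for chunk, label in rows:
    print(f"{chunk:<20}{label}")
```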

Model Information

Model Name: ner_oncology_tnm_wip
Compatibility: Healthcare NLP 4.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 858.6 KB

References

In-house annotated oncology case reports.

Benchmarking

              label     tp    fp    fn  total  precision  recall   f1
         Lymph_Node  410.0  31.0 100.0  510.0       0.93    0.80 0.86
            Staging  166.0  15.0  50.0  216.0       0.92    0.77 0.84
Lymph_Node_Modifier   19.0   1.0  12.0   31.0       0.95    0.61 0.75
  Tumor_Description 1996.0 537.0 385.0 2381.0       0.79    0.84 0.81
              Tumor  834.0  48.0 108.0  942.0       0.95    0.89 0.91
         Metastasis  273.0  16.0  16.0  289.0       0.94    0.94 0.94
          Cancer_Dx  949.0  44.0 117.0 1066.0       0.96    0.89 0.92
          macro_avg 4647.0 692.0 788.0 5435.0       0.92    0.82 0.86
          micro_avg    NaN   NaN   NaN    NaN       0.88    0.86 0.86
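
To make the columns above easier to read, here is a small sketch that recomputes precision, recall and F1 from the tp, fp and fn counts in the table, using the standard definitions (the `total` column equals tp + fn). The variable and function names are chosen for illustration, and the averaging shown is the usual macro (mean of per-label scores) and micro (scores from pooled counts) convention, which may not match exactly how the published averages were produced.

```python
# (tp, fp, fn) per label, copied from the benchmark table.
counts = {
    "Lymph_Node": (410, 31, 100),
    "Staging": (166, 15, 50),
    "Lymph_Node_Modifier": (19, 1, 12),
    "Tumor_Description": (1996, 537, 385),
    "Tumor": (834, 48, 108),
    "Metastasis": (273, 16, 16),
    "Cancer_Dx": (949, 44, 117),
}

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# Per-label (precision, recall, f1).
per_label = {
    label: (precision(tp, fp), recall(tp, fn), f1(tp, fp, fn))
    for label, (tp, fp, fn) in counts.items()
}

# Macro average: unweighted mean of the per-label scores.
macro = tuple(
    sum(scores[i] for scores in per_label.values()) / len(per_label)
    for i in range(3)
)

# Micro average: scores computed from the pooled counts.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = (precision(tp, fp), recall(tp, fn), f1(tp, fp, fn))

print("macro:", [round(v, 2) for v in macro])
print("micro:", [round(v, 2) for v in micro])
```

With these counts the macro averages round to the reported 0.92, 0.82 and 0.86. The pooled micro precision comes out at about 0.87 rather than the reported 0.88, so the published micro row may have been computed or rounded differently.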