Extract Entities Related to TNM Staging

Description

This model extracts staging information and mentions related to tumors, lymph nodes and metastases.

Definitions of Predicted Entities:

  • Cancer_Dx: Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction.
  • Lymph_Node: Mentions of lymph nodes and pathological findings of the lymph nodes.
  • Lymph_Node_Modifier: Words that refer to a lymph node being abnormal (such as “enlargement”).
  • Metastasis: Terms that indicate metastatic disease. Anatomical references are not included in these extractions.
  • Staging: Mentions of cancer stage such as “stage 2b” or “T2N1M0”. It also includes words such as “in situ”, “early-stage” or “advanced”.
  • Tumor: All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”).
  • Tumor_Description: Information related to tumor characteristics, such as size, presence of invasion, grade and histological type.

Predicted Entities

Cancer_Dx, Lymph_Node, Lymph_Node_Modifier, Metastasis, Staging, Tumor, Tumor_Description


How to use

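from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark NLP for Healthcare session bound to `spark`.
document_assembler = DocumentAssembler()\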
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_tnm", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

data = spark.createDataFrame([["The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2."]]).toDF("text")

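result = pipeline.fit(data).transform(data)

import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLModel
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverter}

val document_assembler = new DocumentAssembler()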
    .setInputCol("text")
    .setOutputCol("document")
    
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")
    
val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")
    
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
    
val ner = MedicalNerModel.pretrained("ner_oncology_tnm", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter))

// Needed for .toDS; assumes an active SparkSession named spark.
import spark.implicits._

val data = Seq("The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("en.med_ner.oncology_tnm").predict("""The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.""")

Results

| chunk            | ner_label         |
|:-----------------|:------------------|
| metastatic       | Metastasis        |
| breast carcinoma | Cancer_Dx         |
| T2N1M1 stage IV  | Staging           |
| 4 cm             | Tumor_Description |
| tumor            | Tumor             |
| grade 2          | Tumor_Description |
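
Downstream code often needs these predictions grouped by entity type rather than as a flat table. The sketch below assumes each `ner_chunk` annotation has already been collected into a plain dictionary with a `result` field (the chunk text) and a `metadata` field whose `entity` key holds the label. The helper names `chunks_to_rows` and `group_by_label` and the sample annotations are illustrative only, not part of the library.

```python
from collections import defaultdict


def chunks_to_rows(annotations):
    """Return (chunk, label) pairs from collected ner_chunk annotations."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations]


def group_by_label(rows):
    """Map each label to its predicted chunks, preserving document order."""
    grouped = defaultdict(list)
    for chunk, label in rows:
        grouped[label].append(chunk)
    return dict(grouped)


# Hypothetical annotations mirroring the results table above.
sample = [
    {"result": "metastatic", "metadata": {"entity": "Metastasis"}},
    {"result": "breast carcinoma", "metadata": {"entity": "Cancer_Dx"}},
    {"result": "T2N1M1 stage IV", "metadata": {"entity": "Staging"}},
    {"result": "4 cm", "metadata": {"entity": "Tumor_Description"}},
    {"result": "tumor", "metadata": {"entity": "Tumor"}},
    {"result": "grade 2", "metadata": {"entity": "Tumor_Description"}},
]

rows = chunks_to_rows(sample)
by_label = group_by_label(rows)
print(by_label["Tumor_Description"])  # ['4 cm', 'grade 2']
```

With a real pipeline output, the input would come from collecting the `ner_chunk` column; the row-to-dictionary conversion is left out here because it depends on how the DataFrame is collected.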

Model Information

Model Name: ner_oncology_tnm
Compatibility: Spark NLP for Healthcare 4.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 34.2 MB
Dependencies: embeddings_clinical

References

In-house annotated oncology case reports.

Benchmarking

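| label               |   tp |  fp |  fn | total | precision | recall |   f1 |
|:--------------------|-----:|----:|----:|------:|----------:|-------:|-----:|
| Lymph_Node          |  570 |  77 |  77 |   647 |      0.88 |   0.88 | 0.88 |
| Staging             |  232 |  22 |  26 |   258 |      0.91 |   0.90 | 0.91 |
| Lymph_Node_Modifier |   30 |   5 |   5 |    35 |      0.86 |   0.86 | 0.86 |
| Tumor_Description   | 2651 | 581 | 490 |  3141 |      0.82 |   0.84 | 0.83 |
| Tumor               | 1116 |  72 | 141 |  1257 |      0.94 |   0.89 | 0.91 |
| Metastasis          |  358 |  15 |  12 |   370 |      0.96 |   0.97 | 0.96 |
| Cancer_Dx           | 1302 |  87 |  92 |  1394 |      0.94 |   0.93 | 0.94 |
| macro_avg           | 6259 | 859 | 843 |  7102 |      0.90 |   0.90 | 0.90 |
| micro_avg           | 6259 | 859 | 843 |  7102 |      0.88 |   0.88 | 0.88 |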
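
The precision, recall and f1 columns follow from the tp, fp and fn counts (the total column equals tp + fn), and the two average rows differ only in how they aggregate. The sketch below recomputes them with the standard definitions, using counts copied from the table; the function and variable names are illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (tp, fp, fn) per label, copied from the benchmark table.
counts = {
    "Lymph_Node": (570, 77, 77),
    "Staging": (232, 22, 26),
    "Lymph_Node_Modifier": (30, 5, 5),
    "Tumor_Description": (2651, 581, 490),
    "Tumor": (1116, 72, 141),
    "Metastasis": (358, 15, 12),
    "Cancer_Dx": (1302, 87, 92),
}

per_label = {label: prf(*c) for label, c in counts.items()}

# Micro average: pool the counts across labels, then compute the metrics once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = prf(tp, fp, fn)

# Macro average: compute the metrics per label, then take the unweighted mean.
macro = tuple(
    sum(m[i] for m in per_label.values()) / len(per_label) for i in range(3)
)

print(tuple(round(x, 2) for x in micro))  # (0.88, 0.88, 0.88)
print(tuple(round(x, 2) for x in macro))  # (0.9, 0.9, 0.9)
```

The micro average sits below the macro average here because Tumor_Description, the largest and lowest-scoring label, dominates the pooled counts.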