Extract Entities Related to TNM Staging (LangTest)

Description

This model extracts staging information and mentions related to tumors, lymph nodes, and metastases. It is a version of the ner_oncology_tnm model augmented with the langtest library.

Definitions of Predicted Entities:

Cancer_Dx: Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction.
Lymph_Node: Mentions of lymph nodes and pathological findings of the lymph nodes.
Lymph_Node_Modifier: Words that refer to a lymph node being abnormal (such as “enlargement”).
Metastasis: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions.
Staging: Mentions of cancer stage such as “stage 2b” or “T2N1M0”. It also includes words such as “in situ”, “early-stage” or “advanced”.
Tumor: All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”).
Tumor_Description: Information related to tumor characteristics, such as size, presence of invasion, grade, and histological type.

Robustness test results before and after augmentation with langtest:

test_type	before fail_count	after fail_count	before pass_count	after pass_count	minimum pass_rate	before pass_rate	after pass_rate
add_typo	249	209	4089	4146	70%	94%	95%
lowercase	273	238	4243	4278	70%	94%	95%
swap_entities	362	342	1192	1211	70%	77%	78%
titlecase	985	712	3517	3790	70%	78%	84%
uppercase	1779	378	2736	4137	70%	61%	92%
weighted average	3648	1879	15777	17562	70%	81.22%	90.33%
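
The pass rates in the table follow from the counts as pass_count / (pass_count + fail_count). The short sketch below recomputes them from the numbers above. The "weighted average" row appears to pool the counts across all tests, which is what the `overall` helper assumes; the names `results`, `pass_rate`, and `overall` are illustrative and not part of langtest.

```python
# (fail_count, pass_count) before and after augmentation, copied from the table.
results = {
    "add_typo":      {"before": (249, 4089),  "after": (209, 4146)},
    "lowercase":     {"before": (273, 4243),  "after": (238, 4278)},
    "swap_entities": {"before": (362, 1192),  "after": (342, 1211)},
    "titlecase":     {"before": (985, 3517),  "after": (712, 3790)},
    "uppercase":     {"before": (1779, 2736), "after": (378, 4137)},
}
MIN_PASS_RATE = 0.70  # minimum pass_rate column


def pass_rate(fail_count, pass_count):
    """Fraction of test cases that passed."""
    return pass_count / (fail_count + pass_count)


def overall(stage):
    """Pooled pass rate over all test types (assumed meaning of 'weighted average')."""
    fails = sum(r[stage][0] for r in results.values())
    passes = sum(r[stage][1] for r in results.values())
    return pass_rate(fails, passes)


for name, r in results.items():
    before = pass_rate(*r["before"])
    after = pass_rate(*r["after"])
    status = "ok" if after >= MIN_PASS_RATE else "below minimum"
    print(f"{name:14s} {before:6.1%} -> {after:6.1%} ({status})")

print(f"overall        {overall('before'):.2%} -> {overall('after'):.2%}")
```

Read this way, uppercase is the only test that falls below the 70% minimum before augmentation, and it is also where the augmented model gains the most.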

Predicted Entities

Cancer_Dx, Lymph_Node, Lymph_Node_Modifier, Metastasis, Staging, Tumor, Tumor_Description


How to use

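# Python
# Assumes an active `spark` session with Spark NLP and Healthcare NLP loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")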

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_tnm_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

data = spark.createDataFrame([["Chest computed tomography (CT) showed pulmonary lesions in the posterior segment of the right upper lobe, and peripheral lung cancer with multiple pulmonary metastases. Multiple metastases of the thoracic vertebrae, sternum, and ribs were considered, which were similar to previous CT images."]]).toDF("text")

result = pipeline.fit(data).transform(data)

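// Scala
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
// MedicalNerModel is provided by the licensed Healthcare NLP library.

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")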
    
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")
    
val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")
    
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
    
val ner = MedicalNerModel.pretrained("ner_oncology_tnm_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

        
val pipeline = new Pipeline().setStages(Array(document_assembler,
                                              sentence_detector,
                                              tokenizer,
                                              word_embeddings,
                                              ner,
                                              ner_converter))

// toDS needs the session's implicit encoders in scope.
import spark.implicits._

val data = Seq("Chest computed tomography (CT) showed pulmonary lesions in the posterior segment of the right upper lobe, and peripheral lung cancer with multiple pulmonary metastases. Multiple metastases of the thoracic vertebrae, sternum, and ribs were considered, which were similar to previous CT images.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+----------------------+----------+
|chunk                 |ner_label |
+----------------------+----------+
|lesions               |Tumor     |
|peripheral lung cancer|Cancer_Dx |
|metastases            |Metastasis|
|metastases            |Metastasis|
+----------------------+----------+
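
Once the chunk/label pairs have been collected from the ner_chunk column, they are often easier to work with when grouped by label. The plain-Python sketch below does that for the four rows in the table above; `rows` and `group_by_label` are illustrative names, and the pairs are typed in by hand rather than read from the Spark DataFrame.

```python
from collections import defaultdict

# (chunk, ner_label) pairs copied from the results table.
rows = [
    ("lesions", "Tumor"),
    ("peripheral lung cancer", "Cancer_Dx"),
    ("metastases", "Metastasis"),
    ("metastases", "Metastasis"),
]


def group_by_label(pairs):
    """Collect chunks under their predicted label, keeping document order."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)


chunks_by_label = group_by_label(rows)
for label, chunks in chunks_by_label.items():
    print(f"{label}: {chunks}")
```

The two Metastasis chunks come from the two sentences of the sample text. Both are the bare word "metastases", which matches the definition above: anatomical references such as "pulmonary" are not part of a Metastasis extraction, while "peripheral lung cancer" keeps its anatomical reference as a Cancer_Dx chunk.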

Model Information

Model Name:	ner_oncology_tnm_langtest
Compatibility:	Healthcare NLP 5.1.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	14.8 MB

References

In-house annotated oncology case reports.

Benchmarking

label                precision  recall  f1-score  support 
Cancer_Dx            0.88       0.86    0.87      721     
Lymph_Node           0.78       0.77    0.77      271     
Lymph_Node_Modifier  0.79       0.76    0.77      45      
Metastasis           0.92       0.92    0.92      321     
Staging              0.88       0.80    0.84      123     
Tumor                0.90       0.86    0.88      1018    
Tumor_Description    0.70       0.68    0.69      1060    
micro-avg            0.83       0.80    0.81      3559    
macro-avg            0.84       0.81    0.82      3559    
weighted-avg         0.83       0.80    0.81      3559    
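
As a reading aid, the sketch below recomputes F1 for each label as 2PR / (P + R) and derives the macro and weighted averages from the per-label rows above. It works from the rounded table values, so it only reproduces the figures to two decimals. The micro average is left out because it needs raw true/false positive counts, which the table does not give. The names `rows` and `f1` are illustrative.

```python
# label: (precision, recall, f1, support), copied from the benchmarking table.
rows = {
    "Cancer_Dx":           (0.88, 0.86, 0.87, 721),
    "Lymph_Node":          (0.78, 0.77, 0.77, 271),
    "Lymph_Node_Modifier": (0.79, 0.76, 0.77, 45),
    "Metastasis":          (0.92, 0.92, 0.92, 321),
    "Staging":             (0.88, 0.80, 0.84, 123),
    "Tumor":               (0.90, 0.86, 0.88, 1018),
    "Tumor_Description":   (0.70, 0.68, 0.69, 1060),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(support for *_, support in rows.values())

# Macro average: every label counts equally.
macro_f1 = sum(score for _, _, score, _ in rows.values()) / len(rows)

# Weighted average: every label counts in proportion to its support.
weighted_f1 = sum(score * support for _, _, score, support in rows.values()) / total_support

for label, (p, r, reported, _) in rows.items():
    print(f"{label:20s} recomputed f1 = {f1(p, r):.2f} (reported {reported:.2f})")

print(f"macro f1    = {macro_f1:.2f}")
print(f"weighted f1 = {weighted_f1:.2f}")
```

The weighted average sits below the macro average because Tumor_Description, the weakest label, also has the largest support.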
