Detect Entities Related to Cancer Diagnosis (LangTest)

Description

This model extracts entities related to cancer diagnosis, such as Metastasis, Histological_Type, or Invasion. It is a version of the ner_oncology_diagnosis model augmented with the langtest library.

Definitions of Predicted Entities:

Adenopathy: Mentions of pathological findings of the lymph nodes.
Cancer_Dx: Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction.
Cancer_Score: Clinical or imaging scores that are specific for cancer settings (e.g. “BI-RADS” or “Allred score”).
Grade: All pathological grading of tumors (e.g. “grade 1”) or degrees of cellular differentiation (e.g. “well-differentiated”).
Histological_Type: Histological variants or cancer subtypes, such as “papillary”, “clear cell” or “medullary”.
Invasion: Mentions that refer to tumor invasion, such as “invasion” or “involvement”. Metastases or lymph node involvement are excluded from this category.
Metastasis: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions.
Pathology_Result: Biopsy findings from the pathology report that are not covered by another entity (e.g. “malignant ductal cells”).
Performance_Status: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. “ECOG performance status of 4”).
Staging: Mentions of cancer stage such as “stage 2b” or “T2N1M0”. It also includes words such as “in situ”, “early-stage” or “advanced”.
Tumor_Finding: All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”).
Tumor_Size: Size of the tumor, including numerical value and unit of measurement (e.g. “3 cm”).

test_type	before fail_count	after fail_count	before pass_count	after pass_count	minimum pass_rate	before pass_rate	after pass_rate
add_punctuation	4	3	183	184	60%	98%	98%
swap_entities	272	276	1241	1252	60%	82%	82%
titlecase	910	834	3592	3668	80%	80%	81%
uppercase	1768	606	2747	3909	80%	61%	87%
weighted average	2954	1719	7763	9013	70%	72.44%	83.98%
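The pass rates in the table follow from the fail and pass counts. The sketch below recomputes them in plain Python, assuming that a pass rate is pass_count / (fail_count + pass_count) and that the "weighted average" row pools the counts of all four tests. The dictionary layout and the helper name `pass_rate` are illustrative and are not part of langtest.

```python
# Counts copied from the table above, as (fail_count, pass_count) pairs.
# The data layout and helper name are illustrative, not a langtest API.
results = {
    "add_punctuation": {"before": (4, 183), "after": (3, 184), "min_pass_rate": 0.60},
    "swap_entities": {"before": (272, 1241), "after": (276, 1252), "min_pass_rate": 0.60},
    "titlecase": {"before": (910, 3592), "after": (834, 3668), "min_pass_rate": 0.80},
    "uppercase": {"before": (1768, 2747), "after": (606, 3909), "min_pass_rate": 0.80},
}


def pass_rate(fail_count, pass_count):
    """Fraction of test cases that passed."""
    return pass_count / (fail_count + pass_count)


def pooled_pass_rate(phase):
    """Pass rate over the summed counts of all test types for 'before' or 'after'."""
    total_fail = sum(row[phase][0] for row in results.values())
    total_pass = sum(row[phase][1] for row in results.values())
    return pass_rate(total_fail, total_pass)


for test_type, row in results.items():
    before = pass_rate(*row["before"])
    after = pass_rate(*row["after"])
    status = "meets" if after >= row["min_pass_rate"] else "misses"
    print(f"{test_type}: {before:.0%} -> {after:.0%} "
          f"({status} the {row['min_pass_rate']:.0%} minimum)")

print(f"pooled: {pooled_pass_rate('before'):.2%} -> {pooled_pass_rate('after'):.2%}")
```

Under these assumptions the pooled rates match the "weighted average" row, and the largest gain comes from the uppercase test.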

Predicted Entities

Adenopathy, Cancer_Dx, Cancer_Score, Grade, Histological_Type, Invasion, Metastasis, Pathology_Result, Performance_Status, Staging, Tumor_Finding, Tumor_Size

How to use

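from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark session named `spark` with the licensed Healthcare NLP library loaded.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")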

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_diagnosis_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

data = spark.createDataFrame([["Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis."]]).toDF("text")

result = pipeline.fit(data).transform(data)

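import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLModel
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverter}

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")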
    
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")
    
val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")
    
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
    
val ner = MedicalNerModel.pretrained("ner_oncology_diagnosis_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

        
val pipeline = new Pipeline().setStages(Array(document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter))    

// Brings .toDS into scope for Seq; assumes an active SparkSession named `spark`.
import spark.implicits._

val data = Seq("Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+------------+-----------------+
|chunk       |ner_label        |
+------------+-----------------+
|tumor       |Tumor_Finding    |
|adenopathies|Adenopathy       |
|invasive    |Histological_Type|
|ductal      |Histological_Type|
|carcinoma   |Cancer_Dx        |
|metastasis  |Metastasis       |
+------------+-----------------+
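Downstream code often needs the extracted chunks grouped by entity label rather than as a flat table. The sketch below does this in plain Python, using the rows of the table above as hard-coded input; in a real pipeline these pairs would be collected from the ner_chunk column. The helper name `group_by_label` is hypothetical.

```python
from collections import defaultdict

# (chunk, ner_label) pairs copied from the Results table above.
rows = [
    ("tumor", "Tumor_Finding"),
    ("adenopathies", "Adenopathy"),
    ("invasive", "Histological_Type"),
    ("ductal", "Histological_Type"),
    ("carcinoma", "Cancer_Dx"),
    ("metastasis", "Metastasis"),
]


def group_by_label(pairs):
    """Collect chunks under their entity label, preserving document order."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)


grouped = group_by_label(rows)
for label, chunks in grouped.items():
    print(f"{label}: {', '.join(chunks)}")
```

In this example "invasive" and "ductal" are returned as two separate Histological_Type chunks, so both end up under the same label.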

Model Information

Model Name:	ner_oncology_diagnosis_langtest
Compatibility:	Healthcare NLP 5.1.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	14.8 MB

References

In-house annotated oncology case reports.

Benchmarking

label               precision  recall  f1-score  support 
Adenopathy          0.65       0.82    0.73      34      
Cancer_Dx           0.85       0.95    0.90      721     
Cancer_Score        0.86       0.64    0.73      28      
Grade               0.57       0.73    0.64      82      
Histological_Type   0.73       0.87    0.79      242     
Invasion            0.79       0.94    0.86      172     
Metastasis          0.82       0.98    0.89      321     
Pathology_Result    0.44       0.70    0.54      348     
Performance_Status  0.33       0.94    0.49      31      
Staging             0.90       0.94    0.92      123     
Tumor_Finding       0.87       0.94    0.90      1018    
Tumor_Size          0.68       0.93    0.78      216     
micro-avg           0.75       0.91    0.82      3336    
macro-avg           0.71       0.87    0.77      3336    
weighted-avg        0.77       0.91    0.83      3336    
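The f1-score column can be cross-checked against precision and recall, assuming the usual definition of F1 as their harmonic mean and a macro average that is the unweighted mean over labels. The sketch below recomputes a few values from the rounded figures in the table, so small rounding differences are possible. The helper names are illustrative.

```python
# (precision, recall, support) per label, copied from the table above.
metrics = {
    "Adenopathy": (0.65, 0.82, 34),
    "Cancer_Dx": (0.85, 0.95, 721),
    "Cancer_Score": (0.86, 0.64, 28),
    "Grade": (0.57, 0.73, 82),
    "Histological_Type": (0.73, 0.87, 242),
    "Invasion": (0.79, 0.94, 172),
    "Metastasis": (0.82, 0.98, 321),
    "Pathology_Result": (0.44, 0.70, 348),
    "Performance_Status": (0.33, 0.94, 31),
    "Staging": (0.90, 0.94, 123),
    "Tumor_Finding": (0.87, 0.94, 1018),
    "Tumor_Size": (0.68, 0.93, 216),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_precision(table):
    """Unweighted mean of per-label precision."""
    return sum(p for p, _, _ in table.values()) / len(table)


total_support = sum(s for _, _, s in metrics.values())

for label, (p, r, s) in metrics.items():
    print(f"{label:<20}{f1_score(p, r):.2f}  (support {s})")
print(f"macro precision: {macro_precision(metrics):.2f}, total support: {total_support}")
```

The low F1 for Performance_Status and Pathology_Result comes from low precision paired with high recall: the model finds most true mentions of these entities but also produces many false positives.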
