Description
This model is trained to extract biomarkers, therapies, oncological concepts, and other general concepts from text. It is a version of the ner_biomarker model augmented with the langtest library. The table below compares robustness test results before and after augmentation.
test_type | before fail_count | after fail_count | before pass_count | after pass_count | minimum pass_rate | before pass_rate | after pass_rate |
---|---|---|---|---|---|---|---|
add_ocr_typo | 1409 | 589 | 2209 | 3029 | 70% | 61% | 84% |
add_typo | 698 | 647 | 2937 | 2966 | 70% | 81% | 82% |
lowercase | 1913 | 606 | 1791 | 3098 | 70% | 48% | 84% |
titlecase | 2541 | 953 | 1227 | 2815 | 70% | 33% | 75% |
uppercase | 3515 | 1111 | 245 | 2649 | 70% | 7% | 70% |
weighted average | 10076 | 3906 | 8409 | 14557 | 70% | 45.49% | 78.84% |
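The pass rates in the table can be reproduced from the fail and pass counts. The following is a minimal sketch, assuming that pass rate is simply pass_count / (fail_count + pass_count) and that the last row pools the counts of all test types; the names `pass_rate` and `pooled_pass_rate` are illustrative and not part of langtest.

```python
# Counts copied from the robustness table above: (fail_count, pass_count).
BEFORE = {
    "add_ocr_typo": (1409, 2209),
    "add_typo": (698, 2937),
    "lowercase": (1913, 1791),
    "titlecase": (2541, 1227),
    "uppercase": (3515, 245),
}
AFTER = {
    "add_ocr_typo": (589, 3029),
    "add_typo": (647, 2966),
    "lowercase": (606, 3098),
    "titlecase": (953, 2815),
    "uppercase": (1111, 2649),
}
MINIMUM_PASS_RATE = 0.70


def pass_rate(fail_count, pass_count):
    """Fraction of perturbed samples that passed (assumed definition)."""
    return pass_count / (fail_count + pass_count)


def pooled_pass_rate(counts):
    """Pass rate over all test types pooled together."""
    total_fail = sum(fail for fail, _ in counts.values())
    total_pass = sum(passed for _, passed in counts.values())
    return pass_rate(total_fail, total_pass)


for test_type in BEFORE:
    before = pass_rate(*BEFORE[test_type])
    after = pass_rate(*AFTER[test_type])
    status = "meets" if after >= MINIMUM_PASS_RATE else "below"
    print(f"{test_type:<13} {before:6.1%} -> {after:6.1%} ({status} minimum)")

print(f"{'pooled':<13} {pooled_pass_rate(BEFORE):6.2%} -> {pooled_pass_rate(AFTER):6.2%}")
```

Under these assumptions every test type reaches the 70% minimum after augmentation, with uppercase passing only narrowly.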
Predicted Entities
Oncogenes, Tumor_Finding, UnspecificTherapy, Ethnicity, Age, ResponseToTreatment, Biomarker, HormonalTherapy, Staging, Drug, CancerDx, Radiotherapy, CancerSurgery, TargetedTherapy, PerformanceStatus, CancerModifier, Radiological_Test_Result, Biomarker_Measurement, Metastasis, Radiological_Test, Chemotherapy, Test, Dosage, Test_Result, Immunotherapy, Date, Gender, Prognostic_Biomarkers, Duration, Predictive_Biomarkers
How to use
# Python
# Assumes a licensed Healthcare NLP session is already available as `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Wrap raw text into document annotations
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_biomarker_langtest", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = model.transform(spark.createDataFrame([["Immunohistochemistry (IHC) showed that the primary metaplastic SCC was ER negative, PR negative, HER-2 positive (3+), EGFR (1+), CK5/6 (2+), 34βE12 (2+), P63 positive (the P63 positive cells were located around the cancer nest), P53 negative, SMMHC positive (1+), E-cadherin (2+), and Ki-67 was positive in 95% tumor cells."]], ["text"]))

// Scala
// Assumes an active SparkSession named `spark` with Healthcare NLP on the classpath.
// The import paths below assume the standard Spark NLP / Healthcare NLP package layout.
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_biomarker_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))

val data = Seq("""Immunohistochemistry (IHC) showed that the primary metaplastic SCC was ER negative, PR negative, HER-2 positive (3+), EGFR (1+), CK5/6 (2+), 34βE12 (2+), P63 positive (the P63 positive cells were located around the cancer nest), P53 negative, SMMHC positive (1+), E-cadherin (2+), and Ki-67 was positive in 95% tumor cells.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
Results
+--------------------+---------------------+
|chunk               |ner_label            |
+--------------------+---------------------+
|Immunohistochemistry|Test                 |
|IHC                 |Test                 |
|metaplastic         |CancerModifier       |
|SCC                 |CancerDx             |
|ER                  |Biomarker            |
|negative            |Biomarker_Measurement|
|PR                  |Biomarker            |
|negative            |Biomarker_Measurement|
|HER-2               |Biomarker            |
|positive            |Biomarker_Measurement|
|3+                  |Biomarker_Measurement|
|EGFR                |Oncogenes            |
|1+                  |Biomarker_Measurement|
|CK5/6               |Biomarker            |
|2+                  |Biomarker_Measurement|
|34βE12              |Biomarker            |
|2+                  |Biomarker_Measurement|
|P63                 |Biomarker            |
|positive            |Biomarker_Measurement|
|P63                 |Biomarker            |
|positive            |Biomarker_Measurement|
|cancer              |CancerDx             |
|P53                 |Biomarker            |
|negative            |Biomarker_Measurement|
|SMMHC               |Biomarker            |
|positive            |Biomarker_Measurement|
|1+                  |Biomarker_Measurement|
|E-cadherin          |Biomarker            |
|2+                  |Biomarker_Measurement|
|Ki-67               |Biomarker            |
|positive            |Biomarker_Measurement|
|95%                 |Biomarker_Measurement|
|tumor               |Tumor_Finding        |
+--------------------+---------------------+
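The table above pairs each extracted chunk with its label. The following is a minimal sketch of building such pairs from the `ner_chunk` annotations, assuming each annotation exposes a `result` string and a `metadata` mapping with an `"entity"` key. The helper `chunk_label_pairs` is hypothetical, and stand-in objects are used here so the snippet runs without Spark.

```python
from collections import Counter
from types import SimpleNamespace


def chunk_label_pairs(annotations):
    """Return (chunk_text, entity_label) pairs from ner_chunk annotations.

    Assumes each annotation has `.result` and `.metadata["entity"]`.
    """
    return [(a.result, a.metadata["entity"]) for a in annotations]


# Stand-in annotations mirroring the first rows of the results table.
# With a live pipeline, the list would instead come from something like
# result.select("ner_chunk").first()["ner_chunk"] (an assumption about usage).
sample = [
    SimpleNamespace(result="Immunohistochemistry", metadata={"entity": "Test"}),
    SimpleNamespace(result="IHC", metadata={"entity": "Test"}),
    SimpleNamespace(result="metaplastic", metadata={"entity": "CancerModifier"}),
    SimpleNamespace(result="SCC", metadata={"entity": "CancerDx"}),
    SimpleNamespace(result="ER", metadata={"entity": "Biomarker"}),
    SimpleNamespace(result="negative", metadata={"entity": "Biomarker_Measurement"}),
]

pairs = chunk_label_pairs(sample)
label_counts = Counter(label for _, label in pairs)

for chunk, label in pairs:
    print(f"{chunk:<20} {label}")
```

Counting labels this way gives a quick view of which entity types dominate a document.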
Model Information
Model Name: | ner_biomarker_langtest |
Compatibility: | Healthcare NLP 5.1.1+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 14.8 MB |
References
Trained on data sampled from MIMIC-III and annotated in-house.
Benchmarking
label | precision | recall | f1-score | support |
---|---|---|---|---|
Age | 0.81 | 0.93 | 0.87 | 46 |
Biomarker | 0.80 | 0.84 | 0.82 | 3491 |
Biomarker_Measurement | 0.77 | 0.76 | 0.76 | 733 |
CancerDx | 0.92 | 0.93 | 0.93 | 2873 |
CancerModifier | 0.77 | 0.84 | 0.80 | 1065 |
CancerSurgery | 0.84 | 0.88 | 0.86 | 282 |
Chemotherapy | 0.93 | 0.93 | 0.93 | 665 |
Date | 0.86 | 0.74 | 0.80 | 136 |
Dosage | 0.62 | 0.40 | 0.48 | 133 |
Drug | 0.69 | 0.65 | 0.67 | 37 |
Duration | 0.49 | 0.35 | 0.41 | 66 |
Ethnicity | 0.67 | 1.00 | 0.80 | 6 |
Gender | 0.76 | 0.86 | 0.80 | 84 |
HormonalTherapy | 0.84 | 0.84 | 0.84 | 135 |
Immunotherapy | 0.91 | 0.91 | 0.91 | 317 |
Metastasis | 0.93 | 0.91 | 0.92 | 568 |
Oncogenes | 0.66 | 0.56 | 0.60 | 670 |
PerformanceStatus | 0.50 | 0.43 | 0.46 | 7 |
Predictive_Biomarkers | 0.00 | 0.00 | 0.00 | 1 |
Prognostic_Biomarkers | 0.41 | 0.20 | 0.27 | 64 |
Radiological_Test | 0.65 | 0.64 | 0.65 | 101 |
Radiological_Test_Result | 0.14 | 0.08 | 0.10 | 13 |
Radiotherapy | 0.93 | 0.93 | 0.93 | 177 |
ResponseToTreatment | 0.71 | 0.14 | 0.23 | 36 |
Staging | 0.65 | 0.66 | 0.65 | 82 |
TargetedTherapy | 0.90 | 0.89 | 0.89 | 765 |
Test | 0.68 | 0.68 | 0.68 | 445 |
Test_Result | 0.31 | 0.13 | 0.18 | 31 |
Tumor_Finding | 0.73 | 0.83 | 0.78 | 728 |
UnspecificTherapy | 0.69 | 0.82 | 0.75 | 79 |
micro-avg | 0.82 | 0.83 | 0.83 | 13836 |
macro-avg | 0.68 | 0.66 | 0.66 | 13836 |
weighted-avg | 0.82 | 0.83 | 0.82 | 13836 |
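The gap between the macro and weighted averages comes from how rare labels are counted. The following is a minimal sketch, assuming the usual definitions: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over labels, and the weighted average weights each label by its support. The function names are illustrative, and the inputs are the rounded values from the table, so recomputed figures can differ from the reported ones in the last digit.

```python
# (f1-score, support) per label, copied from the benchmarking table above.
REPORT = {
    "Age": (0.87, 46), "Biomarker": (0.82, 3491),
    "Biomarker_Measurement": (0.76, 733), "CancerDx": (0.93, 2873),
    "CancerModifier": (0.80, 1065), "CancerSurgery": (0.86, 282),
    "Chemotherapy": (0.93, 665), "Date": (0.80, 136),
    "Dosage": (0.48, 133), "Drug": (0.67, 37),
    "Duration": (0.41, 66), "Ethnicity": (0.80, 6),
    "Gender": (0.80, 84), "HormonalTherapy": (0.84, 135),
    "Immunotherapy": (0.91, 317), "Metastasis": (0.92, 568),
    "Oncogenes": (0.60, 670), "PerformanceStatus": (0.46, 7),
    "Predictive_Biomarkers": (0.00, 1), "Prognostic_Biomarkers": (0.27, 64),
    "Radiological_Test": (0.65, 101), "Radiological_Test_Result": (0.10, 13),
    "Radiotherapy": (0.93, 177), "ResponseToTreatment": (0.23, 36),
    "Staging": (0.65, 82), "TargetedTherapy": (0.89, 765),
    "Test": (0.68, 445), "Test_Result": (0.18, 31),
    "Tumor_Finding": (0.78, 728), "UnspecificTherapy": (0.75, 79),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(report):
    """Unweighted mean of per-label F1: every label counts equally."""
    return sum(f1 for f1, _ in report.values()) / len(report)


def weighted_average(report):
    """Mean of per-label F1 weighted by each label's support."""
    total_support = sum(support for _, support in report.values())
    return sum(f1 * support for f1, support in report.values()) / total_support


print(f"Age F1 from P/R: {f1_score(0.81, 0.93):.2f}")
print(f"macro-avg F1:    {macro_average(REPORT):.2f}")
print(f"weighted-avg F1: {weighted_average(REPORT):.2f}")
```

Labels with very small support, such as Predictive_Biomarkers with a single example, pull the macro average down while barely affecting the weighted average.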