Description
This model is trained to extract biomarkers, therapies, oncological, and other general concepts from text.
Predicted Entities
Oncogenes
, Tumor_Finding
, UnspecificTherapy
, Ethnicity
, Age
, ResponseToTreatment
, Biomarker
, HormonalTherapy
, Staging
, Drug
, CancerDx
, Radiotherapy
, CancerSurgery
, TargetedTherapy
, PerformanceStatus
, CancerModifier
, Radiological_Test_Result
, Biomarker_Measurement
, Metastasis
, Radiological_Test
, Chemotherapy
, Test
, Dosage
, Test_Result
, Immunotherapy
, Date
, Gender
, Prognostic_Biomarkers
, Duration
, Predictive_Biomarkers
Live Demo Open in Colab Copy S3 URI
How to use
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_clinical = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \
.setInputCols(['sentence', 'token']) \
.setOutputCol('embeddings')
clinical_ner = MedicalNerModel.pretrained("ner_biomarker", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin "]], ["text"]))
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_biomarker", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))
val data = Seq("""Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin """).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.med_ner.biomarker").predict("""Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin """)
Results
| | ner_chunk | entity | confidence |
|---:|:-------------------------|:----------------------|-------------:|
| 0 | intraductal | CancerModifier | 0.9934 |
| 1 | tubulopapillary | CancerModifier | 0.6403 |
| 2 | neoplasm of the pancreas | CancerDx | 0.758825 |
| 3 | clear cell | CancerModifier | 0.9633 |
| 4 | Immunohistochemistry | Test | 0.9534 |
| 5 | positivity | Biomarker_Measurement | 0.8795 |
| 6 | Pan-CK | Biomarker | 0.9975 |
| 7 | CK7 | Biomarker | 0.9975 |
| 8 | CK8/18 | Biomarker | 0.9987 |
| 9 | MUC1 | Biomarker | 0.9967 |
| 10 | MUC6 | Biomarker | 0.9972 |
| 11 | carbonic anhydrase IX | Biomarker | 0.937567 |
| 12 | CD10 | Biomarker | 0.9974 |
| 13 | EMA | Biomarker | 0.9899 |
| 14 | β-catenin | Biomarker | 0.8059 |
| 15 | e-cadherin | Biomarker | 0.9806 |
Model Information
Model Name: | ner_biomarker |
Compatibility: | Healthcare NLP 3.3.3+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Data Source
Trained on data sampled from Mimic-III, and annotated in-house.
Benchmarking
label tp fp fn prec rec f1
I-Oncogenes 73 65 64 0.5289855 0.5328467 0.53090906
B-Radiotherapy 130 8 12 0.942029 0.91549295 0.9285714
B-Chemotherapy 644 31 28 0.9540741 0.9583333 0.956199
I-Radiotherapy 72 4 8 0.94736844 0.9 0.92307687
B-Predictive_Biomarkers 0 0 2 0.0 0.0 0.0
I-Staging 71 11 30 0.86585367 0.7029703 0.77595633
B-Radiological_Test_Result 0 3 20 0.0 0.0 0.0
B-Drug 18 10 19 0.64285713 0.4864865 0.5538461
B-Dosage 123 20 28 0.86013985 0.81456953 0.8367347
I-Test_Result 22 11 44 0.6666667 0.33333334 0.44444448
I-CancerModifier 349 41 86 0.8948718 0.80229884 0.8460606
I-Predictive_Biomarkers 0 0 1 0.0 0.0 0.0
B-Date 131 19 34 0.87333333 0.7939394 0.831746
B-HormonalTherapy 114 5 12 0.9579832 0.9047619 0.9306123
B-Radiological_Test 105 38 21 0.73426574 0.8333333 0.78066915
B-Ethnicity 8 0 1 1.0 0.8888889 0.94117653
I-Radiological_Test 69 50 15 0.57983196 0.8214286 0.67980295
I-UnspecificTherapy 59 8 6 0.880597 0.9076923 0.8939394
I-Immunotherapy 100 25 22 0.8 0.8196721 0.80971664
B-UnspecificTherapy 92 16 12 0.8518519 0.88461536 0.8679245
I-ResponseToTreatment 5 18 76 0.2173913 0.061728396 0.09615384
B-ResponseToTreatment 6 18 38 0.25 0.13636364 0.1764706
B-Test_Result 23 17 20 0.575 0.53488374 0.55421686
I-Biomarker_Measurement 47 46 61 0.50537634 0.4351852 0.4676617
B-Test 286 145 138 0.6635731 0.6745283 0.6690058
B-TargetedTherapy 675 74 75 0.9012016 0.9 0.9006004
I-Biomarker 732 250 237 0.74541754 0.75541794 0.75038445
I-Radiological_Test_Result 8 6 86 0.5714286 0.08510638 0.14814815
B-CancerSurgery 194 29 34 0.8699552 0.85087717 0.86031044
I-Duration 37 47 57 0.44047618 0.39361703 0.41573036
B-Oncogenes 342 118 229 0.74347824 0.5989492 0.66343355
I-CancerDx 1272 131 123 0.90662867 0.911828 0.9092209
I-Age 19 4 4 0.82608694 0.82608694 0.826087
B-Immunotherapy 300 29 16 0.9118541 0.9493671 0.9302325
I-Prognostic_Biomarkers 4 3 7 0.5714286 0.36363637 0.44444445
B-Tumor_Finding 574 225 141 0.718398 0.8027972 0.75825626
B-CancerDx 2620 205 169 0.9274336 0.9394048 0.9333808
I-TargetedTherapy 317 70 38 0.8191214 0.89295775 0.8544474
B-Gender 52 14 10 0.7878788 0.83870965 0.81250006
B-Metastasis 584 41 44 0.9344 0.9299363 0.9321628
I-Dosage 69 16 19 0.8117647 0.78409094 0.7976879
B-CancerModifier 852 135 166 0.8632219 0.83693516 0.84987533
B-Staging 71 27 23 0.7244898 0.7553192 0.7395834
I-Tumor_Finding 79 58 92 0.57664233 0.4619883 0.512987
I-Test 168 96 123 0.6363636 0.57731956 0.60540545
B-Age 42 7 6 0.85714287 0.875 0.8659794
I-HormonalTherapy 54 7 3 0.8852459 0.94736844 0.91525424
B-PerformanceStatus 11 2 0 0.84615386 1.0 0.9166667
I-Chemotherapy 60 6 9 0.90909094 0.8695652 0.8888889
I-Date 116 15 9 0.8854962 0.928 0.90625
B-Prognostic_Biomarkers 33 11 35 0.75 0.4852941 0.58928573
B-Duration 30 50 38 0.375 0.44117647 0.40540543
I-Metastasis 32 14 45 0.6956522 0.41558442 0.5203252
B-Biomarker_Measurement 437 124 175 0.7789661 0.71405226 0.745098
I-CancerSurgery 128 17 30 0.8827586 0.8101266 0.8448845
I-Drug 2 0 8 1.0 0.2 0.3333333
B-Biomarker 3027 571 332 0.8413007 0.9011611 0.8702027
I-PerformanceStatus 37 15 0 0.71153843 1.0 0.83146065
Macro-average 15525 3026 3181 0.7223804 0.675604 0.69820964
Micro-average 15525 3026 3181 0.8368821 0.8299476 0.8334004