Description
This pre-trained model detects chemical compound and gene mentions in medical text. It is a version of the ner_chemprot_clinical model augmented with the langtest library. The table below compares robustness test results before and after augmentation.
test_type | before fail_count | after fail_count | before pass_count | after pass_count | minimum pass_rate | before pass_rate | after pass_rate |
---|---|---|---|---|---|---|---|
add_ocr_typo | 798 | 157 | 1096 | 1737 | 70% | 58% | 92% |
lowercase | 745 | 426 | 1160 | 1479 | 70% | 61% | 78% |
titlecase | 866 | 419 | 1066 | 1513 | 70% | 55% | 78% |
uppercase | 1458 | 418 | 473 | 1513 | 70% | 24% | 78% |
weighted average | 3867 | 1420 | 3795 | 6242 | 70% | 49.53% | 81.47% |
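The pass rates in the table follow directly from the counts: each rate is pass_count / (pass_count + fail_count), and the weighted average row pools the counts of all four test types. The following is a small Python check of that arithmetic, with the counts copied from the table; the function and variable names are illustrative only and are not part of langtest.

```python
# (fail_count, pass_count) per test type, copied from the table above.
before = {
    "add_ocr_typo": (798, 1096),
    "lowercase": (745, 1160),
    "titlecase": (866, 1066),
    "uppercase": (1458, 473),
}
after = {
    "add_ocr_typo": (157, 1737),
    "lowercase": (426, 1479),
    "titlecase": (419, 1513),
    "uppercase": (418, 1513),
}


def pass_rate(fail_count, pass_count):
    """Share of test cases that passed, as a percentage."""
    return 100.0 * pass_count / (fail_count + pass_count)


def pooled_pass_rate(counts):
    """Pass rate over all test types combined (the 'weighted average' row)."""
    total_fail = sum(fail for fail, _ in counts.values())
    total_pass = sum(passed for _, passed in counts.values())
    return pass_rate(total_fail, total_pass)


for name in before:
    print(f"{name}: {pass_rate(*before[name]):.0f}% -> {pass_rate(*after[name]):.0f}%")
print(f"weighted average: {pooled_pass_rate(before):.2f}% -> {pooled_pass_rate(after):.2f}%")
```

Pooling the counts is equivalent to weighting each test type's pass rate by its number of test cases, which is why the last row is not the plain mean of the four rates.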
Predicted Entities
CHEMICAL, GENE-Y, GENE-N
How to use
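from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\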
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_chemprot_clinical_langtest", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = model.transform(spark.createDataFrame([["Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium."]]).toDF("text"))
// Scala
import spark.implicits._  // provides .toDS() on Seq
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_chemprot_clinical_langtest", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
Results
+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|Keratinocyte growth factor     |GENE-Y   |
|acidic fibroblast growth factor|GENE-Y   |
+-------------------------------+---------+
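Each row in the table above is a chunk: a run of consecutive tokens that share one entity label. The following pure-Python sketch illustrates that grouping idea, assuming the NER stage emits IOB-style tags such as B-GENE-Y and I-GENE-Y. It is not the NerConverter implementation, the function name iob_to_chunks is hypothetical, and the tags below are hand-written to match the table rather than taken from actual model output.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def close_chunk():
        # Emit the open chunk, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        # Split only on the first hyphen so labels like GENE-Y stay intact.
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag, or an I- tag with a different label, starts a new chunk.
            close_chunk()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # An O tag ends any open chunk.
            close_chunk()
            current_tokens, current_label = [], None
    close_chunk()
    return chunks


# Hand-written tags for the start of the example sentence (an assumption).
tokens = ["Keratinocyte", "growth", "factor", "and",
          "acidic", "fibroblast", "growth", "factor", "are", "mitogens"]
tags = ["B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O",
        "B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O", "O"]

for chunk, label in iob_to_chunks(tokens, tags):
    print(chunk, label)
```

Splitting the tag on the first hyphen only matters here because two of the labels, GENE-Y and GENE-N, contain a hyphen themselves.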
Model Information
Model Name: | ner_chemprot_clinical_langtest |
Compatibility: | Healthcare NLP 5.1.1+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 14.6 MB |
References
This model was trained on the ChemProt corpus using ‘embeddings_clinical’ embeddings. Make sure you use the same embeddings when running the model.
Benchmarking
label         precision  recall  f1-score  support
CHEMICAL      0.92       0.93    0.92      2530
GENE-N        0.75       0.67    0.71      984
GENE-Y        0.84       0.87    0.86      1751
micro-avg     0.86       0.86    0.86      5265
macro-avg     0.84       0.82    0.83      5265
weighted-avg  0.86       0.86    0.86      5265
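The macro-avg and weighted-avg rows can be reproduced from the per-label rows: the macro average is the plain mean over the three labels, and the weighted average weights each label by its support. The following Python sketch uses the rounded values from the table, so it can only confirm the averages to two decimals; the names are illustrative. The micro-avg row is left out because it needs the underlying true-positive and false-positive counts, which the table does not give.

```python
# Per-label metrics copied from the table above (already rounded to 2 decimals).
per_label = {
    "CHEMICAL": {"precision": 0.92, "recall": 0.93, "f1": 0.92, "support": 2530},
    "GENE-N": {"precision": 0.75, "recall": 0.67, "f1": 0.71, "support": 984},
    "GENE-Y": {"precision": 0.84, "recall": 0.87, "f1": 0.86, "support": 1751},
}
metrics = ("precision", "recall", "f1")
total_support = sum(row["support"] for row in per_label.values())


def macro_average(metric):
    """Unweighted mean of a metric over all labels."""
    return sum(row[metric] for row in per_label.values()) / len(per_label)


def weighted_average(metric):
    """Mean of a metric over all labels, weighted by each label's support."""
    weighted_sum = sum(row[metric] * row["support"] for row in per_label.values())
    return weighted_sum / total_support


macro = {m: macro_average(m) for m in metrics}
weighted = {m: weighted_average(m) for m in metrics}

print("support:", total_support)
print("macro-avg:", {m: round(v, 2) for m, v in macro.items()})
print("weighted-avg:", {m: round(v, 2) for m, v in weighted.items()})
```

The macro average sits below the weighted average because GENE-N, the weakest label, also has the smallest support and so counts for less once labels are weighted.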