Description
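This model detects mentions of genes and human phenotypes (HP) in medical text. It is a version of the ner_human_phenotype_gene_clinical model that has been augmented with the langtest library. The table below compares pass and fail counts for each langtest test type before and after augmentation.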
test_type | before fail_count | after fail_count | before pass_count | after pass_count | minimum pass_rate | before pass_rate | after pass_rate |
---|---|---|---|---|---|---|---|
add_ocr_typo | 175 | 120 | 606 | 661 | 85% | 78% | 85% |
lowercase | 262 | 110 | 521 | 673 | 85% | 67% | 86% |
swap_entities | 123 | 112 | 600 | 614 | 80% | 83% | 85% |
titlecase | 704 | 155 | 79 | 628 | 75% | 10% | 80% |
uppercase | 709 | 174 | 74 | 609 | 75% | 9% | 78% |
weighted average | 1973 | 671 | 1880 | 3185 | 80% | 48.79% | 82.60% |
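The pass rates in the table can be recomputed from the fail and pass counts. The snippet below is a minimal sketch that assumes the pass rate is simply pass_count / (fail_count + pass_count); the dictionary layout and the names `counts` and `pass_rate` are illustrative and are not part of langtest. Under that assumption, pooling the counts over all test types gives the figures shown in the "weighted average" row.

```python
# (fail_count, pass_count) pairs copied from the table above,
# before and after augmentation.
counts = {
    "add_ocr_typo":  {"before": (175, 606), "after": (120, 661)},
    "lowercase":     {"before": (262, 521), "after": (110, 673)},
    "swap_entities": {"before": (123, 600), "after": (112, 614)},
    "titlecase":     {"before": (704, 79),  "after": (155, 628)},
    "uppercase":     {"before": (709, 74),  "after": (174, 609)},
}


def pass_rate(fail_count, pass_count):
    """Fraction of test cases that passed (assumed definition)."""
    return pass_count / (fail_count + pass_count)


def pooled(stage):
    """Sum fail and pass counts over all test types for one stage."""
    fails = sum(runs[stage][0] for runs in counts.values())
    passes = sum(runs[stage][1] for runs in counts.values())
    return fails, passes


# Per-test pass rates.
for test_type, runs in counts.items():
    before = pass_rate(*runs["before"])
    after = pass_rate(*runs["after"])
    print(f"{test_type:<14} {before:7.2%} -> {after:7.2%}")

# Pooled pass rates over all test types.
overall_before = pass_rate(*pooled("before"))
overall_after = pass_rate(*pooled("after"))
print(f"{'overall':<14} {overall_before:7.2%} -> {overall_after:7.2%}")
```

The per-test percentages in the table appear to be these values rounded to whole percents.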
Predicted Entities
GENE, HP
How to use
# Python (assumes an active SparkSession named `spark` with Healthcare NLP loaded)
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Wrap the raw text column in a document annotation
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical_langtest", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level NER tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

result = model.transform(spark.createDataFrame([["We will systematically examine seven genes (CHN1, MDH1, PCP4, RTN1, SLC14A1, SNAP25 and VSNL1) that are altered in the three neurodegenerative diseases."]]).toDF("text"))

// Scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("""We will systematically examine seven genes (CHN1, MDH1, PCP4, RTN1, SLC14A1, SNAP25 and VSNL1) that are altered in the three neurodegenerative diseases.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)
Results
+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|CHN1                      |GENE     |
|MDH1                      |GENE     |
|PCP4                      |GENE     |
|RTN1                      |GENE     |
|SLC14A1                   |GENE     |
|SNAP25                    |GENE     |
|VSNL1                     |GENE     |
|neurodegenerative diseases|HP       |
+--------------------------+---------+
Model Information
Model Name: | ner_human_phenotype_gene_clinical_langtest |
Compatibility: | Healthcare NLP 5.1.1+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 14.7 MB |
Benchmarking
label         precision  recall  f1-score  support
GENE          0.85       0.89    0.87      1082
HP            0.89       0.88    0.88      878
micro-avg     0.87       0.88    0.87      1960
macro-avg     0.87       0.88    0.87      1960
weighted-avg  0.87       0.88    0.87      1960
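As a sanity check on the table above, the per-label F1 scores can be recomputed from precision and recall using the standard harmonic-mean formula. This is a minimal sketch; the names `report` and `f1_score` are illustrative. The inputs are the two-decimal values printed in the table, so recomputed aggregates may differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per label, copied from the table above.
report = {
    "GENE": (0.85, 0.89, 1082),
    "HP": (0.89, 0.88, 878),
}

# Per-label F1 from the rounded precision and recall values.
for label, (precision, recall, support) in report.items():
    print(f"{label:<5} f1 = {f1_score(precision, recall):.2f} (support {support})")

# Support-weighted precision: each label contributes in proportion to its support.
total_support = sum(support for _, _, support in report.values())
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support
print(f"weighted precision = {weighted_precision:.2f} over {total_support} entities")
```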