Description
This model detects normalized mentions of genes (GO) and human phenotypes (HP) in medical text. It is a version of the ner_human_phenotype_go_clinical model that has been augmented with the langtest library. The table below compares robustness test results before and after augmentation.
test_type | before fail_count | after fail_count | before pass_count | after pass_count | minimum pass_rate | before pass_rate | after pass_rate |
---|---|---|---|---|---|---|---|
add_abbreviation | 116 | 53 | 475 | 538 | 85% | 80% | 91% |
add_ocr_typo | 517 | 93 | 192 | 616 | 70% | 27% | 87% |
add_typo | 201 | 92 | 480 | 594 | 75% | 70% | 87% |
lowercase | 71 | 44 | 622 | 649 | 90% | 90% | 94% |
titlecase | 701 | 123 | 8 | 586 | 70% | 1% | 83% |
uppercase | 707 | 186 | 2 | 523 | 70% | 0% | 74% |
weighted average | 2313 | 591 | 1779 | 3506 | 77% | 43.48% | 85.57% |
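The pass rates in the table follow from the counts: each rate is pass_count / (pass_count + fail_count), and the last row pools the counts over all test types. The sketch below recomputes the pooled rates from the rows above. The dictionary and function names are illustrative and not part of langtest.

```python
# (fail_count, pass_count) pairs copied from the table above.
# All names here are illustrative; none of them come from langtest.
before = {
    "add_abbreviation": (116, 475),
    "add_ocr_typo": (517, 192),
    "add_typo": (201, 480),
    "lowercase": (71, 622),
    "titlecase": (701, 8),
    "uppercase": (707, 2),
}
after = {
    "add_abbreviation": (53, 538),
    "add_ocr_typo": (93, 616),
    "add_typo": (92, 594),
    "lowercase": (44, 649),
    "titlecase": (123, 586),
    "uppercase": (186, 523),
}


def pooled_pass_rate(counts):
    """Return the pass rate in percent over all test cases combined."""
    total_fail = sum(fail for fail, _ in counts.values())
    total_pass = sum(passed for _, passed in counts.values())
    return 100.0 * total_pass / (total_pass + total_fail)


before_rate = pooled_pass_rate(before)
after_rate = pooled_pass_rate(after)
print(f"before: {before_rate:.2f}%  after: {after_rate:.2f}%")
```

Because the rows contain different numbers of test cases, the pooled rate weights each test type by its case count, which is why the last row is labelled a weighted average.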
Predicted Entities
GO, HP
How to use
# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes a Healthcare NLP Spark session is already available as `spark`.

# Wrap the raw text column in a document annotation.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences.
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings used as input features for the NER model.
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_go_clinical_langtest", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            clinical_ner,
                            ner_converter])

data = spark.createDataFrame([["""Another disease that shares two of the tumor components of CT, namely GIST and the tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad."""]]).toDF("text")

result = pipeline.fit(data).transform(data)

// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for Seq(...).toDF
// MedicalNerModel is provided by the licensed Healthcare NLP library and must be imported from it.

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_human_phenotype_go_clinical_langtest", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

// Note: the entity chunks are written to "entities" here, not "ner_chunk" as in the Python example.
val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("Another disease that shares two of the tumor components of CT, namely GIST and the tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.").toDF("text")

val result = pipeline.fit(data).transform(data)
Results
+------------------------+---------+
|chunk                   |ner_label|
+------------------------+---------+
|tumor                   |HP       |
|tricarboxylic acid cycle|GO       |
+------------------------+---------+
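The transformed DataFrame stores entity chunks as an array of annotations, so a table like the one above has to be flattened out of it. The following is a minimal sketch of one way to do that, assuming the chunk column is named "ner_chunk" as in the Python example and that each chunk annotation carries its label under the "entity" metadata key. The helper name chunks_to_table is hypothetical.

```python
def chunks_to_table(result, chunk_col="ner_chunk"):
    """Flatten NER chunk annotations into (chunk, ner_label) rows.

    `result` is the DataFrame returned by pipeline.fit(data).transform(data).
    `chunks_to_table` is a hypothetical helper, not part of Spark NLP.
    """
    # Imported lazily so that defining the helper does not require pyspark.
    from pyspark.sql import functions as F

    return (
        result
        # One row per chunk annotation.
        .select(F.explode(F.col(chunk_col)).alias("c"))
        # `result` holds the chunk text; the label is assumed to sit in metadata["entity"].
        .select(
            F.col("c.result").alias("chunk"),
            F.col("c.metadata").getItem("entity").alias("ner_label"),
        )
    )
```

With the Python pipeline above, `chunks_to_table(result).show(truncate=False)` would print the flattened rows. For the Scala pipeline the chunk column is "entities" instead.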
Model Information
Model Name: | ner_human_phenotype_go_clinical_langtest |
Compatibility: | Healthcare NLP 5.1.1+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 14.6 MB |
Benchmarking
label         precision  recall  f1-score  support
GO            0.89       0.81    0.85      1363
HP            0.84       0.85    0.85      762
micro-avg     0.87       0.82    0.85      2125
macro-avg     0.86       0.83    0.85      2125
weighted-avg  0.87       0.82    0.85      2125
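The weighted-avg row can be reproduced from the per-label rows by weighting each score with its support. The sketch below does this with the rounded values from the table, so a difference in the last digit from an exact computation is possible. The names are illustrative.

```python
# Per-label scores copied from the table above: (precision, recall, f1, support).
scores = {
    "GO": (0.89, 0.81, 0.85, 1363),
    "HP": (0.84, 0.85, 0.85, 762),
}


def weighted_average(scores, index):
    """Support-weighted mean of the metric stored at position `index`."""
    total_support = sum(row[3] for row in scores.values())
    return sum(row[index] * row[3] for row in scores.values()) / total_support


# Precision, recall, and f1, each rounded to two decimals as in the table.
weighted = tuple(round(weighted_average(scores, i), 2) for i in range(3))
print(weighted)
```

GO has nearly twice the support of HP (1363 versus 762), so the weighted averages sit closer to the GO scores than a plain mean of the two labels would.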