Detect Normalized Genes and Human Phenotypes

Description

This model detects normalized mentions of genes (GO) and human phenotypes (HP) in medical text.

Predicted Entities

GO, HP


How to use

Use this model as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to convert the token-level entity tags into full entity chunks.

...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_human_phenotype_go_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.")

...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_human_phenotype_go_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("en.med_ner.human_phenotype.go_clinical").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""")
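
The NerConverter step described above turns per-token tags such as B-GO and I-GO into whole chunks. The following plain-Python sketch illustrates that idea only; it is not Spark NLP's implementation. The function name `iob_to_chunks` and the token/tag lists are hypothetical, chosen to mirror the example sentence.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags into (chunk_text, entity_label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")  # "B-GO" -> ("B", "-", "GO")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag always opens a new chunk; an I- tag whose label does
            # not match the open chunk is treated as a new chunk as well.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)  # continue the open chunk
        else:
            flush()  # "O" closes any open chunk

    flush()
    return chunks


# Hypothetical tokens and tags for part of the example sentence.
tokens = ["the", "tumor", "components", "and", "tricarboxylic", "acid", "cycle"]
tags = ["O", "B-HP", "O", "O", "B-GO", "I-GO", "I-GO"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

This also explains why the benchmark reports B- and I- labels separately: the model is scored per token tag, while the converter output is reported per chunk.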

Results

+----+--------------------------+---------+-------+----------+
|    | chunk                    |   begin |   end | entity   |
+====+==========================+=========+=======+==========+
|  0 | tumor                    |      39 |    43 | HP       |
+----+--------------------------+---------+-------+----------+
|  1 | tricarboxylic acid cycle |      79 |   102 | GO       |
+----+--------------------------+---------+-------+----------+
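
The begin and end columns are character offsets into the input text, and in this table the end offset is inclusive. As a quick check, the sketch below copies the rows from the table and slices the example sentence with them; the helper name `slice_chunk` is hypothetical.

```python
# The example sentence used throughout this page.
text = (
    "Another disease that shares two of the tumor components of CT, "
    "namely GIST and tricarboxylic acid cycle is the Carney-Stratakis "
    "syndrome (CSS) or dyad."
)

# (chunk, begin, end, entity) rows copied from the Results table.
rows = [
    ("tumor", 39, 43, "HP"),
    ("tricarboxylic acid cycle", 79, 102, "GO"),
]


def slice_chunk(source: str, begin: int, end: int) -> str:
    """Return the span for an inclusive end offset (Python slices are exclusive)."""
    return source[begin:end + 1]


recovered = [slice_chunk(text, begin, end) for _, begin, end, _ in rows]
for (chunk, begin, end, entity), span in zip(rows, recovered):
    print(f"{entity}: [{begin}, {end}] -> {span!r} (matches table: {span == chunk})")
```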

Model Information

Model Name: ner_human_phenotype_go_clinical
Type: ner
Compatibility: Healthcare NLP 2.6.0+
Edition: Official
License: Licensed
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: [en]
Case sensitive: false
Dependencies: embeddings_clinical

Benchmarking

|    | label         |    tp |   fp |   fn |     prec |      rec |       f1 |
|---:|--------------:|------:|-----:|-----:|---------:|---------:|---------:|
|  0 | B-GO          | 1530  |  129 |   57 | 0.922242 | 0.964083 | 0.942699 |
|  1 | B-HP          |  950  |  133 |  130 | 0.877193 |  0.87963 |  0.87841 |
|  2 | I-HP          |  253  |   46 |   68 | 0.846154 | 0.788162 | 0.816129 |
|  3 | I-GO          | 4550  |  344 |  154 |  0.92971 | 0.967262 | 0.948114 |
|  4 | Macro-average | 7283  |  652 |  409 | 0.893825 | 0.899784 | 0.896795 |
|  5 | Micro-average | 7283  |  652 |  409 | 0.917832 | 0.946828 | 0.932105 |
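
The precision, recall, and F1 columns follow from the tp, fp, and fn counts. The sketch below recomputes them from the counts in the table. It assumes the micro-average pools the counts across labels, and that the macro-average takes the mean of the per-label precision and recall and derives F1 from those two means; both assumptions reproduce the reported figures. The function name `prf` is hypothetical.

```python
from typing import Dict, Tuple

# (tp, fp, fn) counts copied from the benchmark table.
counts: Dict[str, Tuple[int, int, int]] = {
    "B-GO": (1530, 129, 57),
    "B-HP": (950, 133, 130),
    "I-HP": (253, 46, 68),
    "I-GO": (4550, 344, 154),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1


per_label = {label: prf(*c) for label, c in counts.items()}

# Micro-average: pool the counts over all labels, then compute the metrics.
total_tp, total_fp, total_fn = (
    sum(c[i] for c in counts.values()) for i in range(3)
)
micro = prf(total_tp, total_fp, total_fn)

# Macro-average (assumed): mean of per-label precision and recall,
# with F1 taken as the harmonic mean of those two averages.
macro_prec = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

for label, (p, r, f) in per_label.items():
    print(f"{label}: prec={p:.6f} rec={r:.6f} f1={f:.6f}")
print(f"macro: prec={macro_prec:.6f} rec={macro_rec:.6f} f1={macro_f1:.6f}")
print(f"micro: prec={micro[0]:.6f} rec={micro[1]:.6f} f1={micro[2]:.6f}")
```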