Detect Genes and Human Phenotypes (LangTest)

Description

This model detects mentions of genes and human phenotypes (HP) in medical text. It is a version of the ner_human_phenotype_gene_clinical model that has been augmented with the langtest library to improve robustness.

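+----------------+-----------------+----------------+-----------------+----------------+-----------------+----------------+---------------+
|test_type       |before fail_count|after fail_count|before pass_count|after pass_count|minimum pass_rate|before pass_rate|after pass_rate|
+----------------+-----------------+----------------+-----------------+----------------+-----------------+----------------+---------------+
|add_ocr_typo    |175              |120             |606              |661             |85%              |78%             |85%            |
|lowercase       |262              |110             |521              |673             |85%              |67%             |86%            |
|swap_entities   |123              |112             |600              |614             |80%              |83%             |85%            |
|titlecase       |704              |155             |79               |628             |75%              |10%             |80%            |
|uppercase       |709              |174             |74               |609             |75%              |9%              |78%            |
|weighted average|1973             |671             |1880             |3185            |80%              |48.79%          |82.60%         |
+----------------+-----------------+----------------+-----------------+----------------+-----------------+----------------+---------------+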
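
The pass rates in the table can be recomputed from the fail and pass counts. The sketch below is plain Python and does not call langtest; the `counts` dictionary and the helper names are illustrative, not part of any library. It assumes the "weighted average" row pools all test cases, that is, total passes divided by total cases, which is consistent with the reported 48.79% and 82.60%.

```python
# (fail_count, pass_count) per test type, copied from the table above.
counts = {
    "add_ocr_typo":  {"before": (175, 606), "after": (120, 661)},
    "lowercase":     {"before": (262, 521), "after": (110, 673)},
    "swap_entities": {"before": (123, 600), "after": (112, 614)},
    "titlecase":     {"before": (704, 79),  "after": (155, 628)},
    "uppercase":     {"before": (709, 74),  "after": (174, 609)},
}


def pass_rate(fail, passed):
    """Pass rate in percent for one test type."""
    return 100.0 * passed / (fail + passed)


def overall_pass_rate(stage):
    """Pooled pass rate over all test types (assumed meaning of 'weighted average')."""
    total_fail = sum(c[stage][0] for c in counts.values())
    total_pass = sum(c[stage][1] for c in counts.values())
    return pass_rate(total_fail, total_pass)


for name, c in counts.items():
    before = pass_rate(*c["before"])
    after = pass_rate(*c["after"])
    print(f"{name:14s} {before:6.2f}% -> {after:6.2f}%")

print(f"{'overall':14s} {overall_pass_rate('before'):6.2f}% -> {overall_pass_rate('after'):6.2f}%")
```

Comparing each "after" rate with the minimum pass rate column shows which test types meet their threshold after augmentation.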

Predicted Entities

GENE, HP

How to use

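# Assumes an active Spark session named `spark` with Spark NLP for Healthcare loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\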
    .setInputCol("text")\
    .setOutputCol("document")
     
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")
    
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

result = model.transform(spark.createDataFrame([["We will systematically examine seven genes (CHN1, MDH1, PCP4, RTN1, SLC14A1, SNAP25 and VSNL1) that are altered in the three neurodegenerative diseases."]]).toDF("text"))

// Scala version of the same pipeline.
// Outside spark-shell, `.toDS()` below needs the session implicits.
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document") 
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("""We will systematically examine seven genes (CHN1, MDH1, PCP4, RTN1, SLC14A1, SNAP25 and VSNL1) that are altered in the three neurodegenerative diseases.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|CHN1                      |GENE     |
|MDH1                      |GENE     |
|PCP4                      |GENE     |
|RTN1                      |GENE     |
|SLC14A1                   |GENE     |
|SNAP25                    |GENE     |
|VSNL1                     |GENE     |
|neurodegenerative diseases|HP       |
+--------------------------+---------+
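
The table above pairs each detected chunk with its entity label. A minimal sketch of pulling those pairs out of the `ner_chunk` annotations follows. It assumes each annotation exposes a `result` field (the chunk text) and a `metadata` map with an `entity` key. `chunk_label_pairs` and the `Annotation` stand-in are hypothetical names used only so the example runs without a Spark session.

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation row; only the two fields used here.
Annotation = namedtuple("Annotation", ["result", "metadata"])


def chunk_label_pairs(annotations):
    """Return (chunk text, entity label) pairs from ner_chunk annotations."""
    return [(a.result, a.metadata["entity"]) for a in annotations]


# With a live Spark session (assumed, not run here):
#   row = result.select("ner_chunk").first()
#   pairs = chunk_label_pairs(row["ner_chunk"])

demo = [
    Annotation("CHN1", {"entity": "GENE"}),
    Annotation("neurodegenerative diseases", {"entity": "HP"}),
]

for chunk, label in chunk_label_pairs(demo):
    print(f"{chunk:26s} {label}")
```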

Model Information

Model Name: ner_human_phenotype_gene_clinical_langtest
Compatibility: Healthcare NLP 5.1.1+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.7 MB

Benchmarking

label         precision  recall  f1-score  support 
GENE          0.85       0.89    0.87      1082    
HP            0.89       0.88    0.88      878     
micro-avg     0.87       0.88    0.87      1960    
macro-avg     0.87       0.88    0.87      1960    
weighted-avg  0.87       0.88    0.87      1960
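
As a quick consistency check, the per-label f1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values in the table; `f1_score` is an illustrative helper, not a library call. Because the inputs are already rounded to two decimals, small differences from the reported figures are possible in general.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per label, copied from the table above.
reported = {"GENE": (0.85, 0.89), "HP": (0.89, 0.88)}

for label, (p, r) in reported.items():
    print(f"{label}: f1 = {f1_score(p, r):.2f}")
```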