Detect Normalized Genes and Human Phenotypes (biobert)

Description

This model detects normalized mentions of genes (label GO) and human phenotypes (label HP) in medical text.

Predicted Entities

HP, GO


How to use

...
embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_go_biobert", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))


...
val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_human_phenotype_go_biobert", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))

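// fit() expects a Dataset, so build one from the input text (requires import spark.implicits._)
val data = Seq("EXAMPLE_TEXT").toDF("text")

val result = pipeline.fit(data).transform(data)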
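
The `ner` column holds one tag per token, and the `ner_converter` stage (whose definition is elided above) merges those tags into entity chunks. As a rough illustration of that grouping step, the pure-Python sketch below assumes the tags follow the common IOB scheme (`B-HP`, `I-HP`, `B-GO`, `I-GO`, `O`), which this page does not state. The helper `group_iob_tags` and the sample tokens and tags are hypothetical and hand-written; they are not model output and not the library's implementation.

```python
from typing import List, Optional, Tuple


def group_iob_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags into (chunk_text, label) pairs.

    Hypothetical helper for illustration only; assumes tags such as
    "B-HP", "I-HP", "B-GO", "I-GO" and "O".
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A "B-" tag, or an "I-" tag whose label differs from the open chunk,
        # starts a new chunk.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            # Close whatever chunk is currently open.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_label = label

    # Flush a chunk that runs to the end of the sentence.
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hand-written example tags (an assumption, not actual model predictions).
tokens = ["Mutations", "cause", "muscle", "weakness", "and", "impair", "DNA", "repair", "."]
tags = ["O", "O", "B-HP", "I-HP", "O", "O", "B-GO", "I-GO", "O"]

chunks = group_iob_tags(tokens, tags)
for text, label in chunks:
    print(label, text)
```

In a real pipeline the converter stage performs this merging, so a helper like this is only useful for understanding or sanity-checking the token-level `ner` output.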

Model Information

Model Name: ner_human_phenotype_go_biobert
Compatibility: Spark NLP for Healthcare 3.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en

Benchmarking

entity      tp     fp     fn   total  precision  recall       f1
    GO  7637.0  579.0  441.0  8078.0     0.9295  0.9454   0.9374
    HP  1463.0  273.0  222.0  1685.0     0.8427  0.8682   0.8553
 macro       -      -      -       -          -       -  0.89635
 micro       -      -      -       -          -       -  0.92323
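
The per-entity figures can be re-derived from the tp, fp and fn counts. The sketch below is a minimal check, not the evaluation script used for this model: `prf` is a hypothetical helper, and computing the micro average by pooling the counts of the two rows is an assumption, since the page does not say how its micro figure was obtained. Under that assumption the pooled micro F1 lands within about 0.0001 of the reported value, while the per-entity and macro figures match the table after rounding.

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# (tp, fp, fn) per entity, copied from the benchmarking table.
counts: Dict[str, Tuple[int, int, int]] = {
    "GO": (7637, 579, 441),
    "HP": (1463, 273, 222),
}

per_entity = {label: prf(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-entity F1 scores.
macro_f1 = sum(f1 for _, _, f1 in per_entity.values()) / len(per_entity)

# Micro F1 (assumed definition): pool the counts first, then compute F1 once.
pooled = tuple(sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = prf(*pooled)[2]

for label, (p, r, f1) in per_entity.items():
    print(f"{label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
print(f"macro f1={macro_f1:.5f}")
print(f"micro f1={micro_f1:.5f}")
```

Note that the `total` column equals tp + fn, the number of gold mentions, which is the denominator of recall.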