Detect Genes and Human Phenotypes (biobert)

Description

This model detects normalized mentions of genes (GENE) and human phenotypes (HP) in medical text.

Predicted Entities

HP, GENE

How to use

...
embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_biobert", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[
    document_assembler, sentence_detector, tokenizer,
    embeddings_clinical, clinical_ner, ner_converter
])

# The embeddings and NER stages shown here are pretrained, so fit() runs on an empty DataFrame.
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))


...
val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_biobert", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
...
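val pipeline = new Pipeline().setStages(Array(
  document_assembler, sentence_detector, tokenizer,
  embeddings_clinical, ner, ner_converter))

import spark.implicits._ // provides .toDF on a Seq

val data = Seq("EXAMPLE_TEXT").toDF("text")

// fit() expects a DataFrame, not a Seq
val result = pipeline.fit(data).transform(data)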
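
The ner_converter stage is elided in both snippets above. Assuming the NER stage emits IOB-style tags (B-HP, I-HP, B-GENE, I-GENE, O), as Spark NLP NER models typically do, the grouping of tagged tokens into entity chunks can be sketched in plain Python. The function name iob_to_chunks and the sample tokens and tags are hypothetical, written by hand for illustration; they are not output of this model and not the library's implementation.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper: a plain-Python sketch of chunk grouping,
    not the Spark NLP NerConverter itself.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if any.
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")  # "B-HP" -> ("B", "-", "HP")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag, or an I- tag that does not continue the open chunk,
            # starts a new chunk.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" closes any open chunk.
            flush()
            current_tokens, current_label = [], None

    flush()
    return chunks


# Hand-written sample input (an assumption, not real model output).
sample_tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]
sample_tags = ["O", "O", "B-GENE", "O", "B-HP", "I-HP", "O"]
sample_chunks = iob_to_chunks(sample_tokens, sample_tags)
print(sample_chunks)
```

In the actual pipeline this grouping is done by the ner_converter stage, so a sketch like this is only useful for understanding or post-processing raw tag sequences.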

Model Information

Model Name: ner_human_phenotype_gene_biobert
Compatibility: Spark NLP for Healthcare 3.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en

Benchmarking

entity    tp   fp   fn  total (tp+fn)  precision  recall      f1
    HP  1761  198  342           2103     0.8989  0.8374  0.8671
  GENE  1600  290  361           1961     0.8466  0.8159  0.8310
 macro     -    -    -              -          -       -  0.8490
 micro     -    -    -              -          -       -  0.8496
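
The per-entity scores in the table follow from the tp, fp, and fn counts by the standard definitions of precision, recall, and F1. A minimal Python sketch, assuming those standard formulas were used (the source does not state how the figures were computed); the function name prf and the variable names are hypothetical:

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (tp, fp, fn) counts copied from the benchmark table.
counts: Dict[str, Tuple[int, int, int]] = {
    "HP": (1761, 198, 342),
    "GENE": (1600, 290, 361),
}

scores = {label: prf(*c) for label, c in counts.items()}

# Macro F1 is the unweighted mean of the per-entity F1 scores.
macro_f1 = sum(s[2] for s in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label:>5}  precision={p:.4f}  recall={r:.4f}  f1={f1:.4f}")
print(f"macro  f1={macro_f1:.4f}")
```

Rounded to four decimals, these values agree with the HP, GENE, and macro rows of the table.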