Detect Chemical Compounds and Genes


This pre-trained model automatically detects mentions of chemical compounds and genes in medical text.

Predicted Entities

CHEMICAL, GENE-Y, GENE-N


How to use

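from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark session named `spark` with Spark NLP for Healthcare loaded.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

# Every stage is pre-trained, so fitting on an empty DataFrame only builds the PipelineModel.
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium."]]).toDF("text"))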

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

import spark.implicits._  // required for .toDS()
val data = Seq("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("en.med_ner.chemprot.clinical").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""")


Results

|    | chunk                           |   begin |   end | entity   |
|---:|:--------------------------------|--------:|------:|:---------|
|  0 | Keratinocyte growth factor      |       0 |    25 | GENE-Y   |
|  1 | acidic fibroblast growth factor |      31 |    61 | GENE-Y   |
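
The `begin` and `end` values in the table are character offsets into the input text, with `end` inclusive. As a rough illustration of how token-level BIO tags (`B-GENE-Y`, `I-GENE-Y`, `O`) could be merged into chunks like these, here is a minimal pure-Python sketch. It is not Spark NLP's `NerConverter` implementation: the whitespace tokenization, the hand-written tag sequence, and the helper name `bio_to_chunks` are all assumptions made for illustration.

```python
import re

def bio_to_chunks(text, tags):
    """Merge per-token BIO tags into (chunk, begin, end, entity) tuples.

    Hypothetical helper: tokens are whitespace-separated and `end` is inclusive.
    """
    tokens = [(m.start(), m.end() - 1) for m in re.finditer(r"\S+", text)]
    chunks = []
    current = None  # [begin, end, entity] of the chunk being built
    for (begin, end), tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "I" and current is not None and current[2] == entity:
            current[1] = end  # extend the open chunk
            continue
        if current is not None:  # any other tag closes the open chunk
            chunks.append(current)
            current = None
        if prefix in ("B", "I"):  # start a new chunk (stray I- tags are treated as B-)
            current = [begin, end, entity]
    if current is not None:
        chunks.append(current)
    return [(text[b:e + 1], b, e, entity) for b, e, entity in chunks]

text = ("Keratinocyte growth factor and acidic fibroblast growth factor "
        "are mitogens for primary cultures of mammary epithelium.")
# Assumed tag sequence, written by hand to match the chunks in the results table.
tags = (["B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O",
         "B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "I-GENE-Y"] + ["O"] * 8)

chunks = bio_to_chunks(text, tags)
for chunk in chunks:
    print(chunk)
```

In the real pipeline the tags come from the `ner` column and the merged chunks from the NER converter's output column, so this helper is only useful for understanding the offsets.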

Model Information

Model Name: ner_chemprot_clinical
Compatibility: Healthcare NLP 3.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en

Data Source

This model was trained on the ChemProt corpus using ‘embeddings_clinical’ embeddings. Make sure you use the same embeddings when running the model.


Benchmarking

|    | label         |    tp |   fp |   fn | precision |   recall |       f1 |
|---:|:--------------|------:|-----:|-----:|----------:|---------:|---------:|
|  0 | B-GENE-Y      |  4650 | 1090 |  838 |  0.810105 | 0.847303 | 0.828286 |
|  1 | B-GENE-N      |  1732 |  981 | 1019 |  0.638408 | 0.629589 | 0.633968 |
|  2 | I-GENE-Y      |  1846 |  571 |  573 |  0.763757 | 0.763125 | 0.763441 |
|  3 | B-CHEMICAL    |  7512 |  804 | 1136 |  0.903319 |  0.86864 |  0.88564 |
|  4 | I-CHEMICAL    |  1059 |  169 |  253 |  0.862378 | 0.807165 | 0.833858 |
|  5 | I-GENE-N      |  1393 |  853 |  598 |  0.620214 | 0.699648 | 0.657541 |
|  6 | Macro-average | 18192 | 4468 | 4417 |  0.766363 | 0.769245 | 0.767801 |
|  7 | Micro-average | 18192 | 4468 | 4417 |  0.802824 | 0.804635 | 0.803729 |
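
To make the two average rows easier to interpret, the sketch below recomputes them from the per-label `tp`, `fp`, and `fn` counts in the table above. The helper name `prf` is hypothetical. The macro-average F1 is computed here as the harmonic mean of the macro-averaged precision and recall; that convention is an assumption chosen because it appears to match the table's value, whereas averaging the per-label F1 scores gives a slightly different number.

```python
# Counts copied from the table above: label -> (tp, fp, fn).
counts = {
    "B-GENE-Y":   (4650, 1090, 838),
    "B-GENE-N":   (1732, 981, 1019),
    "I-GENE-Y":   (1846, 571, 573),
    "B-CHEMICAL": (7512, 804, 1136),
    "I-CHEMICAL": (1059, 169, 253),
    "I-GENE-N":   (1393, 853, 598),
}

def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

per_label = {label: prf(*c) for label, c in counts.items()}

# Micro-average: pool the counts over all labels, then compute the metrics once.
micro = prf(*(sum(column) for column in zip(*counts.values())))

# Macro-average: unweighted mean of per-label precision and recall.
macro_p = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r, _ in per_label.values()) / len(per_label)
# Assumed convention: F1 of the averaged precision and recall.
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"Macro-average: precision={macro_p:.6f} recall={macro_r:.6f} f1={macro_f1:.6f}")
print("Micro-average: precision={:.6f} recall={:.6f} f1={:.6f}".format(*micro))
```

Because the micro-average pools counts, frequent labels such as `B-CHEMICAL` dominate it, while the macro-average weights the weaker `GENE-N` labels equally, which is why it is lower.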