Detect chemical compounds and genes

Description

This pre-trained model automatically detects chemical compound and gene mentions in medical text.

Predicted Entities:

CHEMICAL, GENE-Y, GENE-N

How to use

Use the model as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to merge token-level entity tags into full entity chunks.


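# Assumes a running Spark NLP for Healthcare session available as `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")
ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.")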
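
The final stage, NerConverter, turns token-level tags into entity chunks. The idea can be illustrated in plain Python, without Spark. This is a minimal sketch, not the library's implementation: the helper name `merge_iob` is hypothetical, the tags are assumed to follow the IOB scheme (`B-`, `I-`, `O`), and the example tags are illustrative values, not actual model output. Tokens are joined with single spaces for simplicity.

```python
from typing import List, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, entity_label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []
    label = None
    for token, tag in zip(tokens, tags):
        # Split on the first hyphen only, so "B-GENE-Y" yields ("B", "GENE-Y").
        prefix, _, entity = tag.partition("-")
        continues = prefix == "I" and entity == label
        # Close the open chunk when the current tag does not extend it.
        if current and not continues:
            chunks.append((" ".join(current), label))
            current, label = [], None
        # Start a new chunk on "B-", or on an "I-" that has nothing to extend.
        if prefix == "B" or (prefix == "I" and not current):
            current, label = [token], entity
        elif continues:
            current.append(token)
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Illustrative input: tags are assumed, chosen for demonstration only.
tokens = ["Keratinocyte", "growth", "factor", "and", "acidic",
          "fibroblast", "growth", "factor", "are", "mitogens"]
tags = ["B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O", "B-GENE-Y",
        "I-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O", "O"]

for chunk, entity in merge_iob(tokens, tags):
    print(f"{chunk}\t{entity}")
```

Because labels such as GENE-Y and GENE-N contain a hyphen themselves, the sketch separates the IOB prefix at the first hyphen only.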
    

Results

+----+---------------------------------+---------+-------+----------+
|    | chunk                           |   begin |   end | entity   |
+====+=================================+=========+=======+==========+
|  0 | Keratinocyte growth factor      |       0 |    25 | GENE-Y   |
+----+---------------------------------+---------+-------+----------+
|  1 | acidic fibroblast growth factor |      31 |    61 | GENE-Y   |
+----+---------------------------------+---------+-------+----------+
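
The begin and end columns in the table above are character offsets into the input sentence, and the end offset is inclusive. The short check below recovers each chunk from its offsets. It is a sketch that copies the two table rows by hand; the helper name `chunk_text` is hypothetical and not part of any library.

```python
text = ("Keratinocyte growth factor and acidic fibroblast growth factor "
        "are mitogens for primary cultures of mammary epithelium.")

# Rows copied from the results table above.
rows = [
    {"chunk": "Keratinocyte growth factor", "begin": 0, "end": 25, "entity": "GENE-Y"},
    {"chunk": "acidic fibroblast growth factor", "begin": 31, "end": 61, "entity": "GENE-Y"},
]


def chunk_text(source: str, begin: int, end: int) -> str:
    """Return the substring for an inclusive [begin, end] character span."""
    # Python slices exclude the stop index, so add 1 to include `end`.
    return source[begin:end + 1]


for row in rows:
    recovered = chunk_text(text, row["begin"], row["end"])
    print(recovered == row["chunk"], repr(recovered))
```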

Model Information

Model Name: ner_chemprot_clinical
Type: ner
Compatibility: Spark NLP for Healthcare 2.6.0+
Edition: Official
License: Licensed
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: [en]
Case sensitive: false

Data Source

This model was trained on the ChemProt corpus using ‘embeddings_clinical’ embeddings. Make sure you use the same embeddings when running the model.