Detect chemical compounds and genes


This pre-trained model automatically detects mentions of chemical compounds and genes in medical text.

How to use

Use this model as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to merge token-level entity tags into full entity chunks.

clinical_ner = NerDLModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

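# Fit on an empty DataFrame, since no stage needs training data (assumes an active `spark` session)
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))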

results = light_pipeline.fullAnnotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.")
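
`fullAnnotate` returns one dictionary per input text, keyed by output column name, with lists of annotation objects as values. The sketch below shows one way to flatten those annotations into rows of chunk, begin, end, and entity. It assumes the NerConverter output column is named `ner_chunk` (the converter's definition is not shown on this page) and that each annotation exposes `result`, `begin`, `end`, and a `metadata` dict with an `entity` key. The `chunks_to_rows` helper and the stand-in objects are hypothetical, used only so the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def chunks_to_rows(annotated, chunk_col="ner_chunk"):
    """Flatten one fullAnnotate result dict into (chunk, begin, end, entity) rows."""
    rows = []
    for ann in annotated.get(chunk_col, []):
        rows.append((ann.result, ann.begin, ann.end, ann.metadata.get("entity")))
    return rows


# Stand-in for results[0]; with a real pipeline these would be Spark NLP annotations.
# The column name "ner_chunk" is an assumption, not taken from this page.
fake_result = {
    "ner_chunk": [
        SimpleNamespace(result="Keratinocyte growth factor", begin=0, end=25,
                        metadata={"entity": "GENE-Y"}),
        SimpleNamespace(result="acidic fibroblast growth factor", begin=31, end=61,
                        metadata={"entity": "GENE-Y"}),
    ]
}

rows = chunks_to_rows(fake_result)
for row in rows:
    print(row)
```

With a live pipeline, the same helper would be called as `chunks_to_rows(results[0])`, adjusting `chunk_col` to whatever output column the NerConverter was given.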


|    | chunk                           |   begin |   end | entity   |
|---:|:--------------------------------|--------:|------:|:---------|
|  0 | Keratinocyte growth factor      |       0 |    25 | GENE-Y   |
|  1 | acidic fibroblast growth factor |      31 |    61 | GENE-Y   |
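
To illustrate what converting entity tokens into full entity chunks means, here is a minimal pure-Python sketch that merges token-level tags into chunks with inclusive character offsets, matching the `begin` and `end` columns above. It is not NerConverter's implementation. It assumes the model emits IOB-style tags such as `B-GENE-Y` and `I-GENE-Y`, uses a simple whitespace tokenizer, and the tag sequence is written by hand for the example; `whitespace_tokens` and `merge_iob` are hypothetical helper names.

```python
def whitespace_tokens(text):
    """Split on whitespace, keeping inclusive (begin, end) character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        begin = text.index(word, pos)
        end = begin + len(word) - 1
        tokens.append((word, begin, end))
        pos = end + 1
    return tokens


def merge_iob(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (begin, end, label) chunks."""
    chunks = []
    current = None  # [begin, end, label] of the chunk being built
    for (_, begin, end), tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[2] != label)
        )
        if starts_new:
            if current:
                chunks.append(tuple(current))
            current = [begin, end, label]
        elif prefix == "I":
            current[1] = end  # extend the open chunk
        else:  # "O" closes any open chunk
            if current:
                chunks.append(tuple(current))
            current = None
    if current:
        chunks.append(tuple(current))
    return chunks


text = "Keratinocyte growth factor and acidic fibroblast growth factor are mitogens"
# Hand-written tags for illustration (assumed IOB scheme), one per token.
tags = ["B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O",
        "B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O", "O"]

chunks = merge_iob(whitespace_tokens(text), tags)
for begin, end, label in chunks:
    print(text[begin:end + 1], begin, end, label)
```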

Model Information

Model Name: ner_chemprot_clinical
Type: ner
Compatibility: Spark NLP for Healthcare 2.6.0+
Edition: Official
License: Licensed
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: [en]
Case sensitive: false

Data Source

This model was trained on the ChemProt corpus using ‘embeddings_clinical’ embeddings. Make sure you use the same embeddings when running the model.