Description
Extract different types of chemical compounds mentioned in text using a pretrained NER model (trained with embeddings_clinical_large).
Predicted Entities
CHEM
How to use
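# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

documentAssembler = DocumentAssembler()\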
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
chemicals_ner = MedicalNerModel.pretrained("ner_chemicals_emb_clinical_large", "en", "clinical/models" ) \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("chemicals_ner")
chemicals_ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "chemicals_ner"]) \
.setOutputCol("chemicals_ner_chunk")
chemicals_ner_pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
chemicals_ner,
chemicals_ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
chemicals_ner_model = chemicals_ner_pipeline.fit(empty_data)
results = chemicals_ner_model.transform(spark.createDataFrame([['''Differential cell - protective function of two resveratrol (trans - 3, 5, 4 - trihydroxystilbene) glucosides against oxidative stress. Resveratrol (trans - 3, 5, 4 - trihydroxystilbene ; RSV) , a natural polyphenol, exerts a beneficial effect on health and diseases.
RSV targets and activates the NAD(+) - dependent protein deacetylase SIRT1; in turn, SIRT1 induces an intracellular antioxidative mechanism by inducing mitochondrial superoxide dismutase (SOD2). Most RSV found in plants is glycosylated, and the effect of these glycosylated forms on SIRT1 has not been studied. ''']]).toDF("text"))
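To inspect the extracted chunks, the `chemicals_ner_chunk` column can be flattened into one row per chunk. The sketch below assumes each annotation exposes `result`, `begin`, `end`, and a `metadata` map with an `entity` key, as Spark NLP annotations generally do. The helper name `chunks_to_rows` and the stand-in dictionaries are hypothetical; they are used so that the snippet runs without a Spark session.

```python
def chunks_to_rows(annotations):
    """Flatten chunk annotations into (chunk, begin, end, entity) tuples."""
    rows = []
    for ann in annotations:
        # Item access works for plain dicts and for collected Spark Row objects.
        rows.append((ann["result"], ann["begin"], ann["end"], ann["metadata"]["entity"]))
    return rows


# Hypothetical stand-in for the annotations of one collected row.
sample = [
    {"result": "resveratrol", "begin": 48, "end": 58, "metadata": {"entity": "CHEM"}},
    {"result": "RSV", "begin": 189, "end": 191, "metadata": {"entity": "CHEM"}},
]

rows = chunks_to_rows(sample)
for row in rows:
    print(row)

# With a live session, the same helper could be applied to the pipeline output
# (assumption: `results` is the DataFrame produced by the Python pipeline):
# for record in results.select("chemicals_ner_chunk").collect():
#     print(chunks_to_rows(record["chemicals_ner_chunk"]))
```

Keeping the flattening logic in plain Python makes it easy to check independently of the pipeline itself.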
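// Scala (the Spark NLP annotator classes are assumed to be in scope)
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS outside spark-shell

val document_assembler = new DocumentAssembler()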
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val chemicals_ner_model = MedicalNerModel.pretrained("ner_chemicals_emb_clinical_large", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("chemicals_ner")
val chemicals_ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "chemicals_ner"))
.setOutputCol("chemicals_ner_chunk")
val chemicals_pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
chemicals_ner_model,
chemicals_ner_converter))
val data = Seq(""" Differential cell - protective function of two resveratrol (trans - 3, 5, 4 - trihydroxystilbene) glucosides against oxidative stress. Resveratrol (trans - 3, 5, 4 - trihydroxystilbene ; RSV) , a natural polyphenol, exerts a beneficial effect on health and diseases.
RSV targets and activates the NAD(+) - dependent protein deacetylase SIRT1; in turn, SIRT1 induces an intracellular antioxidative mechanism by inducing mitochondrial superoxide dismutase (SOD2). Most RSV found in plants is glycosylated, and the effect of these glycosylated forms on SIRT1 has not been studied.""").toDS.toDF("text")
val result = chemicals_pipeline.fit(data).transform(data)
Results
| | chunks | begin | end | entities |
|---:|:-------------------------------------------------|--------:|------:|:-----------|
| 0 | resveratrol | 48 | 58 | CHEM |
| 1 | trans - 3, 5, 4 - trihydroxystilbene) glucosides | 61 | 108 | CHEM |
| 2 | Resveratrol | 136 | 146 | CHEM |
| 3 | trans - 3, 5, 4 - trihydroxystilbene | 149 | 185 | CHEM |
| 4 | RSV | 189 | 191 | CHEM |
| 5 | polyphenol | 206 | 215 | CHEM |
| 6 | NAD(+) | 300 | 305 | CHEM |
| 7 | superoxide | 436 | 445 | CHEM |
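
The `begin` and `end` columns appear to be zero-based character offsets with an inclusive end. The sketch below checks this for rows 0 and 1 against the first sentence of the sample text, written with the leading space used in the Scala snippet. This is an assumption about which input produced the table: without that space the offsets would shift by one. The helper `span` is hypothetical.

```python
# First sentence of the sample text, including the leading space from the Scala snippet.
text = (
    " Differential cell - protective function of two resveratrol "
    "(trans - 3, 5, 4 - trihydroxystilbene) glucosides against oxidative stress."
)

# (chunk, begin, end) copied from rows 0 and 1 of the results table.
reported = [
    ("resveratrol", 48, 58),
    ("trans - 3, 5, 4 - trihydroxystilbene) glucosides", 61, 108),
]


def span(source, begin, end):
    """Return the substring for an inclusive [begin, end] character range."""
    return source[begin:end + 1]


recovered = [span(text, begin, end) for _, begin, end in reported]
for (chunk, begin, end), got in zip(reported, recovered):
    print(begin, end, repr(got), got == chunk)
```

Because the end offset is inclusive, slicing needs `end + 1`; using `text[begin:end]` would drop the last character of each chunk.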
Model Information
| Field | Value |
|:--------------|:---------------------------------|
| Model Name | ner_chemicals_emb_clinical_large |
| Compatibility | Healthcare NLP 4.4.3+ |
| License | Licensed |
| Edition | Official |
| Input Labels | [document, token, embeddings] |
| Output Labels | [ner] |
| Language | en |
| Size | 2.8 MB |
Benchmarking
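| label | precision | recall | f1-score | support |
|:-------------|----------:|-------:|---------:|--------:|
| CHEM | 0.94 | 0.93 | 0.94 | 62001 |
| micro-avg | 0.94 | 0.93 | 0.94 | 62001 |
| macro-avg | 0.94 | 0.93 | 0.94 | 62001 |
| weighted-avg | 0.94 | 0.93 | 0.94 | 62001 |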
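
Since CHEM is the only label, the micro, macro, and weighted averages coincide with the per-label scores. The F1 score is the harmonic mean of precision and recall; the minimal sketch below recomputes it from the rounded values in the table. The function name `f1_score` is hypothetical and not part of the library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for CHEM, copied from the benchmark table.
f1 = f1_score(0.94, 0.93)
print(round(f1, 3))  # 0.935
```

The result from the rounded inputs is about 0.935, which is consistent with the reported 0.94 if that figure was computed from unrounded precision and recall.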