Description
Pretrained named entity recognition (NER) deep learning model for molecular biology terms. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Predicted Entities
DNA, RNA, Cell_type, Cell_line, Protein
How to use
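from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

# Assumes `spark` is an active Spark session with Spark NLP for Healthcare loaded.

# Wrap the raw text column in document annotations
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split each sentence into tokens
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Look up clinical word embeddings for every token
clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Tag each token with an entity label
ner_model = MedicalNerModel.pretrained("ner_cellular_emb_clinical_medium", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
])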
sample_df = spark.createDataFrame([["""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive."""]]).toDF("text")
result = pipeline.fit(sample_df).transform(sample_df)
Results
+-----------------------------------------------------------+-----+---+---------+
|chunk                                                      |begin|end|ner_label|
+-----------------------------------------------------------+-----+---+---------+
|intracellular signaling proteins                           |27   |58 |protein  |
|human T-cell leukemia virus type 1 promoter                |130  |172|DNA      |
|Tax                                                        |186  |188|protein  |
|Tax-responsive element 1                                   |193  |216|DNA      |
|cyclic AMP-responsive members                              |237  |265|protein  |
|CREB/ATF family                                            |274  |288|protein  |
|transcription factors                                      |293  |313|protein  |
|Tax                                                        |389  |391|protein  |
|human T-cell leukemia virus type 1 Tax-responsive element 1|396  |454|DNA      |
|TRE-1),                                                    |457  |463|DNA      |
|lacZ gene                                                  |582  |590|DNA      |
|CYC1 promoter                                              |617  |629|DNA      |
|TRE-1                                                      |663  |667|DNA      |
|cyclic AMP response element-binding protein                |695  |737|protein  |
|CREB                                                       |740  |743|protein  |
|CREB                                                       |749  |752|protein  |
|GAL4 activation domain                                     |767  |788|protein  |
|GAD                                                        |791  |793|protein  |
|reporter gene                                              |848  |860|DNA      |
|Tax                                                        |863  |865|protein  |
+-----------------------------------------------------------+-----+---+---------+
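The begin and end columns appear to be zero-based character offsets into the input text, with end inclusive. The following is a minimal sketch that checks this reading against the first two rows of the table, using only the opening of the sample text. The helper name `chunk_at` is hypothetical and is not part of Spark NLP.

```python
# Opening of the sample text from the example (truncated for brevity).
text = (
    "Detection of various other intracellular signaling proteins is also described. "
    "Genetic characterization of transactivation of the "
    "human T-cell leukemia virus type 1 promoter"
)

# First two rows of the results table: (chunk, begin, end, ner_label).
rows = [
    ("intracellular signaling proteins", 27, 58, "protein"),
    ("human T-cell leukemia virus type 1 promoter", 130, 172, "DNA"),
]


def chunk_at(source: str, begin: int, end: int) -> str:
    """Hypothetical helper: slice `source` using an inclusive end offset."""
    return source[begin:end + 1]


for chunk, begin, end, label in rows:
    # Assumption being checked: offsets are zero-based and `end` is inclusive.
    assert chunk_at(text, begin, end) == chunk
    print(f"{label:8s} {begin:4d}-{end:<4d} {chunk}")
```

Because the end offset is inclusive, a plain Python slice needs `end + 1` to recover the chunk text.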
Model Information
Model Name: ner_cellular_emb_clinical_medium
Compatibility: Healthcare NLP 4.4.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 2.8 MB
References
Trained on the JNLPBA corpus, which contains more than 2,404 publication abstracts. https://www.geniaproject.org/
Benchmarking
label         precision  recall  f1-score  support
cell_line     0.59       0.80    0.68      1489
cell_type     0.89       0.76    0.82      4912
protein       0.80       0.90    0.85      9841
RNA           0.79       0.83    0.81      305
DNA           0.78       0.86    0.82      2845
micro-avg     0.80       0.85    0.82      19392
macro-avg     0.77       0.83    0.79      19392
weighted-avg  0.80       0.85    0.82      19392
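As a rough cross-check of how the summary rows relate to the per-label rows, the sketch below recomputes F1 as the harmonic mean of precision and recall, and derives macro and support-weighted averages from the rounded figures in the table. It assumes the averages follow the usual definitions (unweighted mean for macro, support-weighted mean for weighted). Because the inputs are already rounded to two decimals, recomputed values can differ from the published ones in the last digit.

```python
# Per-label figures copied from the benchmark table: (precision, recall, support).
scores = {
    "cell_line": (0.59, 0.80, 1489),
    "cell_type": (0.89, 0.76, 4912),
    "protein": (0.80, 0.90, 9841),
    "RNA": (0.79, 0.83, 305),
    "DNA": (0.78, 0.86, 2845),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total_support = sum(s for _, _, s in scores.values())

# Macro average: unweighted mean over labels.
macro_precision = sum(p for p, _, _ in scores.values()) / len(scores)
macro_recall = sum(r for _, r, _ in scores.values()) / len(scores)

# Weighted average: mean over labels, weighted by support.
weighted_precision = sum(p * s for p, _, s in scores.values()) / total_support
weighted_recall = sum(r * s for _, r, s in scores.values()) / total_support

for label, (p, r, s) in scores.items():
    print(f"{label:10s} f1={f1(p, r):.2f} support={s}")
print(f"macro    precision={macro_precision:.2f} recall={macro_recall:.2f}")
print(f"weighted precision={weighted_precision:.2f} recall={weighted_recall:.2f}")
```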