Detect Cellular/Molecular Biology Entities (clinical_medium)

Description

Pretrained named entity recognition (NER) deep learning model for molecular biology terms. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Predicted Entities

DNA, RNA, cell_type, cell_line, protein


How to use

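from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

# Assumes `spark` is an active Spark session with Spark NLP and Healthcare NLP loaded.

# Wrap the raw text column in a document annotation.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")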

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_cellular_emb_clinical_medium", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
])

sample_df = spark.createDataFrame([["""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive."""]]).toDF("text")

result = pipeline.fit(sample_df).transform(sample_df)

Results

+-----------------------------------------------------------+-----+---+---------+
|chunk                                                      |begin|end|ner_label|
+-----------------------------------------------------------+-----+---+---------+
|intracellular signaling proteins                           |27   |58 |protein  |
|human T-cell leukemia virus type 1 promoter                |130  |172|DNA      |
|Tax                                                        |186  |188|protein  |
|Tax-responsive element 1                                   |193  |216|DNA      |
|cyclic AMP-responsive members                              |237  |265|protein  |
|CREB/ATF family                                            |274  |288|protein  |
|transcription factors                                      |293  |313|protein  |
|Tax                                                        |389  |391|protein  |
|human T-cell leukemia virus type 1 Tax-responsive element 1|396  |454|DNA      |
|TRE-1),                                                    |457  |463|DNA      |
|lacZ gene                                                  |582  |590|DNA      |
|CYC1 promoter                                              |617  |629|DNA      |
|TRE-1                                                      |663  |667|DNA      |
|cyclic AMP response element-binding protein                |695  |737|protein  |
|CREB                                                       |740  |743|protein  |
|CREB                                                       |749  |752|protein  |
|GAL4 activation domain                                     |767  |788|protein  |
|GAD                                                        |791  |793|protein  |
|reporter gene                                              |848  |860|DNA      |
|Tax                                                        |863  |865|protein  |
+-----------------------------------------------------------+-----+---+---------+
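The begin and end columns appear to be inclusive character offsets into the input text, so a chunk can be recovered with a slice that stops at end + 1. The following is a minimal sketch in plain Python, using only the first sentence of the sample text and the first row of the table. The helper name chunk_text is hypothetical, and the inclusive-end reading is inferred from the table rather than stated in the model card.

```python
# First sentence of the sample text used in the example above.
text = "Detection of various other intracellular signaling proteins is also described."


def chunk_text(text: str, begin: int, end: int) -> str:
    """Return the span for inclusive character offsets (hypothetical helper)."""
    return text[begin:end + 1]


# Offsets taken from the first row of the results table.
first_chunk = chunk_text(text, 27, 58)
print(first_chunk)  # intracellular signaling proteins
```

The same slice should apply to the other rows when it is run against the full sample text, since the offsets are counted from the start of the whole input string.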

Model Information

Model Name: ner_cellular_emb_clinical_medium
Compatibility: Healthcare NLP 4.4.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 2.8 MB

References

Trained on the JNLPBA corpus, which contains more than 2,404 publication abstracts. https://www.geniaproject.org/

Benchmarking

       label  precision    recall  f1-score   support
   cell_line       0.59      0.80      0.68      1489
   cell_type       0.89      0.76      0.82      4912
     protein       0.80      0.90      0.85      9841
         RNA       0.79      0.83      0.81       305
         DNA       0.78      0.86      0.82      2845
   micro-avg       0.80      0.85      0.82     19392
   macro-avg       0.77      0.83      0.79     19392
weighted-avg       0.80      0.85      0.82     19392
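The macro and weighted averages can be sanity-checked from the per-label rows. The sketch below assumes the usual definitions: macro-avg is the unweighted mean over labels, and weighted-avg is the mean weighted by support. It recomputes both from the rounded figures in the table, so a difference in the last digit is expected (the recomputed macro f1-score, for instance, lands slightly above the reported 0.79). The variable and function names are illustrative only.

```python
# Per-label rows copied from the benchmark table:
# label -> (precision, recall, f1-score, support)
rows = {
    "cell_line": (0.59, 0.80, 0.68, 1489),
    "cell_type": (0.89, 0.76, 0.82, 4912),
    "protein":   (0.80, 0.90, 0.85, 9841),
    "RNA":       (0.79, 0.83, 0.81, 305),
    "DNA":       (0.78, 0.86, 0.82, 2845),
}

total_support = sum(r[3] for r in rows.values())


def macro_avg(idx: int) -> float:
    """Unweighted mean of one metric column across labels."""
    return sum(r[idx] for r in rows.values()) / len(rows)


def weighted_avg(idx: int) -> float:
    """Mean of one metric column, weighted by each label's support."""
    return sum(r[idx] * r[3] for r in rows.values()) / total_support


print("total support:", total_support)
for idx, name in enumerate(["precision", "recall", "f1-score"]):
    print(f"{name:>9}  macro={macro_avg(idx):.3f}  weighted={weighted_avg(idx):.3f}")
```

The weighted averages are dominated by the protein label, which accounts for roughly half of the support, while the macro averages are pulled down by the lower precision on cell_line.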