Detect Genetic Cancer Entities

Description

Pretrained named entity recognition (NER) deep learning model for biology and genetics terms. The NER annotator trains a generic model with a deep learning architecture (character-level CNNs, BiLSTM, and CRF on top of word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Predicted Entities

DNA, RNA, cell_line, cell_type, protein.

How to use

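# Python. Assumes an active Spark session `spark` with Spark NLP for Healthcare loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")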
         
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_cancer_genetics", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.']], ["text"]))
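
// Scala. Assumes a SparkSession named `spark` with Spark NLP for Healthcare on the classpath.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel // licensed library; package path is an assumption
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for Seq(...).toDS()

val document_assembler = new DocumentAssembler()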
    .setInputCol("text")
    .setOutputCol("document")
         
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_cancer_genetics", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)

# NLU (Python)
import nlu
nlu.load("en.med_ner.cancer").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""")

Results

+-------------------+---------+
|              token|ner_label|
+-------------------+---------+
|                The|        O|
|              human|B-protein|
|              KCNJ9|I-protein|
|                  (|        O|
|                Kir|B-protein|
|                3.3|I-protein|
|                  ,|        O|
|              GIRK3|B-protein|
|                  )|        O|
|                 is|        O|
|                  a|        O|
|             member|        O|
|                 of|        O|
|                the|        O|
|G-protein-activated|B-protein|
|           inwardly|I-protein|
|         rectifying|I-protein|
|          potassium|I-protein|
|                  (|I-protein|
|               GIRK|I-protein|
|                  )|I-protein|
|            channel|I-protein|
|             family|I-protein|
|                  .|        O|
|               Here|        O|
|                 we|        O|
|           describe|        O|
|                the|        O|
|genomicorganization|        O|
|                 of|        O|
|                the|        O|
|              KCNJ9|    B-DNA|
|              locus|    I-DNA|
|                 on|        O|
|         chromosome|    B-DNA|
|            1q21-23|    I-DNA|
+-------------------+---------+
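
The B- and I- prefixes in ner_label follow the IOB scheme: B- opens an entity and I- continues it. The NerConverter stage of the pipeline merges such runs into the ner_chunk output. The grouping can be sketched in plain Python as below. This is only an illustration, not the library's implementation: the helper name iob_to_chunks is hypothetical, and tokens are joined with single spaces rather than recovered from their character offsets in the original text.

```python
from typing import List, Tuple


def iob_to_chunks(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, entity_type) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def flush() -> None:
        # Emit the chunk collected so far, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "I" and entity == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        elif prefix in ("B", "I"):
            # "B-" always opens a new chunk; an "I-" tag that does not match
            # the open entity is treated as the start of a new chunk too.
            flush()
            current_tokens, current_type = [token], entity
        else:
            # "O": outside any entity, so close whatever was open.
            flush()
            current_tokens, current_type = [], None
    flush()
    return chunks


# Rows excerpted from the results table (first nine and last six rows).
tokens = ["The", "human", "KCNJ9", "(", "Kir", "3.3", ",", "GIRK3", ")",
          "the", "KCNJ9", "locus", "on", "chromosome", "1q21-23"]
labels = ["O", "B-protein", "I-protein", "O", "B-protein", "I-protein", "O",
          "B-protein", "O",
          "O", "B-DNA", "I-DNA", "O", "B-DNA", "I-DNA"]

chunks = iob_to_chunks(tokens, labels)
for text, entity in chunks:
    print(f"{entity:10s} {text}")
# protein    human KCNJ9
# protein    Kir 3.3
# protein    GIRK3
# DNA        KCNJ9 locus
# DNA        chromosome 1q21-23
```

In the real pipeline the chunks are read from the ner_chunk column instead, so a helper like this is only useful for understanding the tagging scheme or for post-processing exported token-level labels.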

Model Information

Model Name: ner_cancer_genetics
Compatibility: Healthcare NLP 3.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en

Data Source

Trained on the Cancer Genetics (CG) task of the BioNLP Shared Task 2013, using embeddings_clinical: https://aclanthology.org/W13-2008/

Benchmarking

label        tp    fp   fn   prec         rec          f1

B-cell_line  581   148  151  0.79698217   0.79371583   0.79534566
I-DNA        2751  735  317  0.7891566    0.89667535   0.8394873
I-protein    4416  867  565  0.8358887    0.88656896   0.8604832
B-protein    5288  763  660  0.8739051    0.8890383    0.8814068
I-cell_line  1148  244  301  0.82471263   0.79227054   0.80816615
I-RNA        221   60   27   0.78647685   0.891129     0.83553874
B-RNA        157   40   36   0.79695433   0.8134715    0.8051282
B-cell_type  1127  292  250  0.7942213    0.8184459    0.8061516
I-cell_type  1547  392  263  0.7978339    0.85469615   0.82528675
B-DNA        1513  444  387  0.77312213   0.7963158    0.7845475

Macro-average  prec: 0.8069253,  rec: 0.84323275, f1: 0.82467955
Micro-average  prec: 0.82471186, rec: 0.86377037, f1: 0.84378934
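
The columns above appear to follow the usual token-level definitions: prec = tp / (tp + fp), rec = tp / (tp + fn), and f1 as their harmonic mean. The short sketch below recomputes the scores from the tp/fp/fn counts in the table as a sanity check. It assumes, based on the numbers rather than on any statement in this page, that the macro-average f1 is the harmonic mean of the macro-averaged precision and recall (not the mean of the per-label f1 values), and that the micro-average pools the counts over all labels. The function and variable names are illustrative.

```python
# (tp, fp, fn) per label, copied from the benchmark table.
counts = {
    "B-cell_line": (581, 148, 151),
    "I-DNA": (2751, 735, 317),
    "I-protein": (4416, 867, 565),
    "B-protein": (5288, 763, 660),
    "I-cell_line": (1148, 244, 301),
    "I-RNA": (221, 60, 27),
    "B-RNA": (157, 40, 36),
    "B-cell_type": (1127, 292, 250),
    "I-cell_type": (1547, 392, 263),
    "B-DNA": (1513, 444, 387),
}


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_label = {label: precision_recall_f1(*c) for label, c in counts.items()}

# Macro-average: unweighted mean over labels, then f1 from the averaged values.
n = len(per_label)
macro_prec = sum(p for p, _, _ in per_label.values()) / n
macro_rec = sum(r for _, r, _ in per_label.values()) / n
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

# Micro-average: pool the counts over all labels first.
tp_sum = sum(tp for tp, _, _ in counts.values())
fp_sum = sum(fp for _, fp, _ in counts.values())
fn_sum = sum(fn for _, _, fn in counts.values())
micro_prec, micro_rec, micro_f1 = precision_recall_f1(tp_sum, fp_sum, fn_sum)

print(f"Macro-average  prec: {macro_prec:.4f}, rec: {macro_rec:.4f}, f1: {macro_f1:.4f}")
print(f"Micro-average  prec: {micro_prec:.4f}, rec: {micro_rec:.4f}, f1: {micro_f1:.4f}")
```

Under these assumptions the recomputed values agree with the reported averages to about four decimal places; the remaining digits differ only by floating-point precision.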