Detect Cancer Genetics (LangTest)

Description

Pretrained named entity recognition deep learning model for biology and genetics terms. It is a version of the ner_bionlp model augmented with the langtest library to improve robustness to perturbed input.

test_type	before fail_count	after fail_count	before pass_count	after pass_count	minimum pass_rate	before pass_rate	after pass_rate
add_ocr_typo	654	121	610	1143	70%	48%	90%
lowercase	463	307	802	958	70%	63%	76%
strip_all_punctuation	220	219	1059	1060	70%	83%	83%
titlecase	714	373	563	904	60%	44%	71%
uppercase	1161	464	122	819	60%	10%	64%
weighted average	3212	1484	3156	4884	66%	49.56%	76.70%
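
The pass rates in the table follow from the pass and fail counts. As a minimal sketch (plain Python, no langtest required, and assuming the "weighted average" row is the pooled rate over all tests), the reported percentages can be recomputed like this:

```python
# (test_type, before_fail, after_fail, before_pass, after_pass), copied from the table above
rows = [
    ("add_ocr_typo", 654, 121, 610, 1143),
    ("lowercase", 463, 307, 802, 958),
    ("strip_all_punctuation", 220, 219, 1059, 1060),
    ("titlecase", 714, 373, 563, 904),
    ("uppercase", 1161, 464, 122, 819),
]


def pass_rate(passed, failed):
    """Fraction of test cases that passed."""
    return passed / (passed + failed)


# Per-test pass rate before and after augmentation.
per_test = {
    name: (pass_rate(bp, bf), pass_rate(ap, af))
    for name, bf, af, bp, ap in rows
}

# Pooled rate: total passes divided by total cases across all tests.
total_before = pass_rate(sum(r[3] for r in rows), sum(r[1] for r in rows))
total_after = pass_rate(sum(r[4] for r in rows), sum(r[2] for r in rows))

for name, (before, after) in per_test.items():
    print(f"{name}: {before:.0%} -> {after:.0%}")
print(f"overall: {total_before:.2%} -> {total_after:.2%}")
```

The pooled figures reproduce the 49.56% and 76.70% shown in the last row, which suggests the average is weighted by the number of test cases per test type.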

Predicted Entities

Amino_acid, Anatomical_system, Cancer, Cell, Cellular_component, Developing_anatomical_structure, Gene_or_gene_product, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism, Organism_subdivision, Organism_substance, Pathological_formation, Simple_chemical, Tissue

How to use

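# Python
# Assumes an active Spark session named `spark` started with Spark NLP for Healthcare.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
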
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_bionlp_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various human tissues including the pancreas, and two major insulin-responsive tissues. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes."""]], ["text"]))

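// Scala
// Assumes an active SparkSession named `spark`; the MedicalNerModel package path is assumed from the usual Healthcare NLP layout.
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import spark.implicits._ // needed for .toDS()

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
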
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_bionlp_langtest", "en", "clinical/models")
    .setInputCols("sentence", "token", "embeddings")
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various human tissues including the pancreas, and two major insulin-responsive tissues. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+-----------------------------+--------------------+
|chunk                        |ner_label           |
+-----------------------------+--------------------+
|human                        |Organism            |
|Kir 3.3                      |Gene_or_gene_product|
|GIRK3                        |Gene_or_gene_product|
|inwardly rectifying potassium|Gene_or_gene_product|
|GIRK                         |Gene_or_gene_product|
|chromosome 1q21-23           |Cellular_component  |
|Type II                      |Gene_or_gene_product|
|human                        |Organism            |
|tissues                      |Tissue              |
|pancreas                     |Organ               |
|insulin-responsive tissues   |Tissue              |
|KCNJ9                        |Gene_or_gene_product|
|KCNJ9                        |Gene_or_gene_product|
|locus                        |Cellular_component  |
+-----------------------------+--------------------+
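
The chunk and label pairs above come from the ner_converter stage, which groups token-level IOB tags (B-label, I-label, O) into multi-token entity chunks. The following is a minimal pure-Python sketch of that grouping idea, not the library's implementation; the function name iob_to_chunks and the tokens and tags are hand-written for illustration only.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")  # "B-Organ" -> ("B", "-", "Organ")
        if prefix == "I" and current and name == label:
            current.append(token)  # continue the open chunk
            continue
        if current:  # any other tag closes the open chunk
            chunks.append((" ".join(current), label))
            current, label = [], None
        if prefix in ("B", "I"):  # B- (or a stray I-) opens a new chunk
            current, label = [token], name
    if current:  # flush a chunk that runs to the end of the sentence
        chunks.append((" ".join(current), label))
    return chunks


# Hypothetical tokenization and tags for the start of the example sentence.
tokens = ["The", "human", "KCNJ9", "(", "Kir", "3.3", ",", "GIRK3", ")"]
tags = [
    "O", "B-Organism", "O", "O",
    "B-Gene_or_gene_product", "I-Gene_or_gene_product",
    "O", "B-Gene_or_gene_product", "O",
]

chunks = iob_to_chunks(tokens, tags)
for text, label in chunks:
    print(f"{text:<10} {label}")
```

With these hand-written tags, the sketch yields the first three rows of the table: human, Kir 3.3 and GIRK3.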

Model Information

Model Name:	ner_bionlp_langtest
Compatibility:	Healthcare NLP 5.1.1+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	14.6 MB

References

Trained on the Cancer Genetics (CG) task of the BioNLP Shared Task 2013.

Benchmarking

label                            precision  recall  f1-score  support 
Amino_acid                       0.96       0.62    0.75      37      
Anatomical_system                0.89       0.62    0.73      13      
Cancer                           0.92       0.90    0.91      570     
Cell                             0.93       0.92    0.92      806     
Cellular_component               0.86       0.89    0.87      141     
Developing_anatomical_structure  0.75       0.60    0.67      5       
Gene_or_gene_product             0.93       0.93    0.93      1818    
Immaterial_anatomical_entity     0.92       0.76    0.83      29      
Multi-tissue_structure           0.86       0.79    0.82      196     
Organ                            0.90       0.91    0.90      85      
Organism                         0.94       0.90    0.92      414     
Organism_subdivision             0.74       0.64    0.68      22      
Organism_substance               0.86       0.89    0.87      61      
Pathological_formation           0.78       0.76    0.77      46      
Simple_chemical                  0.94       0.93    0.93      538     
Tissue                           0.76       0.83    0.79      110     
micro-avg                        0.92       0.91    0.91      4891    
macro-avg                        0.87       0.80    0.83      4891    
weighted-avg                     0.92       0.91    0.91      4891    
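
For readers who want to check how the summary rows relate to the per-label rows, here is a small sketch in plain Python. It assumes the standard definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over labels, weighted average as the support-weighted mean). Because the table values are rounded to two decimals, recomputed numbers can differ from the table in the last digit.

```python
# (label, precision, recall, f1, support), copied from the table above
rows = [
    ("Amino_acid", 0.96, 0.62, 0.75, 37),
    ("Anatomical_system", 0.89, 0.62, 0.73, 13),
    ("Cancer", 0.92, 0.90, 0.91, 570),
    ("Cell", 0.93, 0.92, 0.92, 806),
    ("Cellular_component", 0.86, 0.89, 0.87, 141),
    ("Developing_anatomical_structure", 0.75, 0.60, 0.67, 5),
    ("Gene_or_gene_product", 0.93, 0.93, 0.93, 1818),
    ("Immaterial_anatomical_entity", 0.92, 0.76, 0.83, 29),
    ("Multi-tissue_structure", 0.86, 0.79, 0.82, 196),
    ("Organ", 0.90, 0.91, 0.90, 85),
    ("Organism", 0.94, 0.90, 0.92, 414),
    ("Organism_subdivision", 0.74, 0.64, 0.68, 22),
    ("Organism_substance", 0.86, 0.89, 0.87, 61),
    ("Pathological_formation", 0.78, 0.76, 0.77, 46),
    ("Simple_chemical", 0.94, 0.93, 0.93, 538),
    ("Tissue", 0.76, 0.83, 0.79, 110),
]


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# F1 recomputed from the rounded precision and recall of each label.
recomputed = {label: round(f1(p, r), 2) for label, p, r, _, _ in rows}

total_support = sum(r[4] for r in rows)
macro_f1 = sum(r[3] for r in rows) / len(rows)
weighted_f1 = sum(r[3] * r[4] for r in rows) / total_support

print(f"support total: {total_support}")
print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
```

The gap between the macro and weighted averages reflects the low-support labels (for example Developing_anatomical_structure with 5 examples), which pull the macro average down but barely affect the weighted one.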
