Detect Normalized Genes and Human Phenotypes (LangTest)

Description

This model can be used to detect normalized mentions of genes (go) and human phenotypes (hp) in medical text. It is the version of ner_human_phenotype_go_clinical model augmented with langtest library.

test_type	before fail_count	after fail_count	before pass_count	after pass_count	minimum pass_rate	before pass_rate	after pass_rate
add_abbreviation	116	53	475	538	85%	80%	91%
add_ocr_typo	517	93	192	616	70%	27%	87%
add_typo	201	92	480	594	75%	70%	87%
lowercase	71	44	622	649	90%	90%	94%
titlecase	701	123	8	586	70%	1%	83%
uppercase	707	186	2	523	70%	0%	74%
weighted average	2313	591	1779	3506	77%	43.48%	85.57%

Predicted Entities

GO, HP

Download Copy S3 URI

How to use

document_assembler = DocumentAssembler()\
	.setInputCol("text")\
	.setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
	.setInputCols(["document"]) \
	.setOutputCol("sentence")

tokenizer = Tokenizer()\
	.setInputCols(["sentence"])\
	.setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
	.setInputCols(["sentence", "token"])\
	.setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_go_clinical_langtest", "en", "clinical/models") \
	.setInputCols(["sentence", "token", "embeddings"]) \
	.setOutputCol("ner")

ner_converter = NerConverter() \
	.setInputCols(["sentence", "token", "ner"]) \
	.setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            clinical_ner,
                            ner_converter])

data = spark.createDataFrame([["""Another disease that shares two of the tumor components of CT, namely GIST and the tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad."""]]).toDF("text")

result = pipeline.fit(data).transform(data)

val document_assembler = new DocumentAssembler()
	.setInputCol("text")
	.setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
	.setInputCols("document")
	.setOutputCol("sentence")

val tokenizer = new Tokenizer()
	.setInputCols("sentence")
	.setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
	.setInputCols(Array("sentence", "token"))
	.setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_human_phenotype_go_clinical_langtest", "en", "clinical/models")
	.setInputCols(Array("sentence", "token", "embeddings"))
	.setOutputCol("ner")

val ner_converter = new NerConverter()
	.setInputCols(Array("sentence", "token", "ner"))
	.setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.").toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+------------------------+---------+
|chunk                   |ner_label|
+------------------------+---------+
|tumor                   |HP       |
|tricarboxylic acid cycle|GO       |
+------------------------+---------+

Model Information

Model Name:	ner_human_phenotype_go_clinical_langtest
Compatibility:	Healthcare NLP 5.1.1+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	14.6 MB

Benchmarking

label         precision  recall  f1-score  support 
GO            0.89       0.81    0.85      1363    
HP            0.84       0.85    0.85      762     
micro-avg     0.87       0.82    0.85      2125    
macro-avg     0.86       0.83    0.85      2125    
weighted-avg  0.87       0.82    0.85      2125    

PREVIOUSDetect Genes and Human Phenotypes (LangTest)

NEXTDate of Brith Contextual Parser Model