Description
Pretrained named entity recognition deep learning model for extracting covid-related clinical terminology from covid trials.
Predicted Entities
Stage
, Severity
, Virus
, Trial_Design
, Trial_Phase
, N_Patients
, Institution
, Statistical_Indicator
, Section_Header
, Cell_Type
, Cellular_component
, Viral_components
, Physiological_reaction
, Biological_molecules
, Admission_Discharge
, Age
, BMI
, Cerebrovascular_Disease
, Date
, Death_Entity
, Diabetes
, Disease_Syndrome_Disorder
, Dosage
, Drug_Ingredient
, Employment
, Frequency
, Gender
, Heart_Disease
, Hypertension
, Obesity
, Pulse
, Race_Ethnicity
, Respiration
, Route
, Smoking
, Time
, Total_Cholesterol
, Treatment
, VS_Finding
, Vaccine
, Vaccine_Name
Live Demo Open in Colab Copy S3 URI
How to use
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_covid_trials","en","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")\
.setLabelCasing("upper")
ner_converter = NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
ner_pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
ner_model = ner_pipeline.fit(empty_data)
text= """In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals."""
results= model.transform(spark.createDataFrame([[text]]).toDF('text'))
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_covid_trials", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter))
val data = Seq("""In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.med_ner.covid_trials").predict("""In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals.""")
Results
| | chunks | begin | end | sentence_id | entities |
|---:|:------------------------------------|--------:|------:|--------------:|:--------------------------|
| 0 | December 2019 | 3 | 15 | 0 | Date |
| 1 | acute respiratory disease | 48 | 72 | 0 | Disease_Syndrome_Disorder |
| 2 | beta-coronavirus | 146 | 161 | 1 | Virus |
| 3 | 2019 | 198 | 201 | 1 | Date |
| 4 | coronavirus infection | 203 | 223 | 1 | Disease_Syndrome_Disorder |
| 5 | SARS-CoV-2 | 228 | 237 | 2 | Virus |
| 6 | coronavirus | 244 | 254 | 2 | Virus |
| 7 | β-coronaviruses | 285 | 299 | 2 | Virus |
| 8 | subgenus Coronaviridae | 308 | 329 | 2 | Virus |
| 9 | SARS-CoV-2 | 337 | 346 | 3 | Virus |
| 10 | zoonotic coronavirus disease | 367 | 394 | 3 | Disease_Syndrome_Disorder |
| 11 | severe acute respiratory syndrome | 402 | 434 | 3 | Disease_Syndrome_Disorder |
| 12 | SARS | 438 | 441 | 3 | Disease_Syndrome_Disorder |
| 13 | Middle Eastern respiratory syndrome | 449 | 483 | 3 | Disease_Syndrome_Disorder |
| 14 | MERS | 487 | 490 | 3 | Disease_Syndrome_Disorder |
| 15 | SARS-CoV-2 | 513 | 522 | 4 | Virus |
| 16 | WHO | 543 | 545 | 4 | Institution |
| 17 | CDC | 549 | 551 | 4 | Institution |
| 18 | 2020 | 852 | 855 | 5 | Date |
| 19 | COVID‑19 vaccine | 868 | 883 | 5 | Vaccine_Name |
Model Information
Model Name: | ner_covid_trials |
Compatibility: | Spark NLP for Healthcare 4.2.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 14.8 MB |
References
This model is trained on data sampled from clinicaltrials.gov - covid trials, and annotated in-house.
Benchmarking
label tp fp fn total precision recall f1
Institution 34 8 20 55.0 0.7958 0.6343 0.706
VS_Finding 19 2 1 20.0 0.9048 0.95 0.9268
Respiration 5 0 0 5.0 1.0 1.0 1.0
Cerebrovascular_D... 5 2 2 7.0 0.7143 0.7143 0.7143
Cell_Type 152 27 14 167.0 0.8479 0.9123 0.8789
Heart_Disease 36 3 5 41.0 0.9231 0.878 0.9
Severity 57 25 3 60.0 0.6881 0.95 0.7981
N_Patients 27 3 1 29.0 0.8871 0.9483 0.9167
Pulse 12 2 0 12.0 0.8571 1.0 0.9231
Obesity 3 0 0 3.0 1.0 1.0 1.0
Admission_Discharge 85 3 0 85.0 0.9659 1.0 0.9827
Diabetes 8 0 0 8.0 1.0 1.0 1.0
Section_Header 94 8 13 108.0 0.9154 0.8711 0.8927
Age 22 1 0 22.0 0.9429 1.0 0.9706
Cellular_component 40 21 10 50.0 0.6534 0.8 0.7193
Hypertension 10 0 0 10.0 1.0 1.0 1.0
BMI 5 1 1 6.0 0.8333 0.8333 0.8333
Trial_Phase 13 0 1 14.0 0.9398 0.9286 0.9341
Employment 98 12 8 107.0 0.8874 0.9206 0.9037
Statistical_Indic... 76 29 11 88.0 0.7206 0.8689 0.7879
Time 2 0 1 3.0 1.0 0.6667 0.8
Total_Cholesterol 14 1 2 17.0 0.9355 0.8529 0.8923
Drug_Ingredient 327 33 67 395.0 0.9084 0.8281 0.8664
Physiological_rea... 27 7 14 41.0 0.7864 0.6585 0.7168
Treatment 66 4 25 92.0 0.9433 0.7228 0.8185
Vaccine 20 1 2 23.0 0.9531 0.8841 0.9173
Disease_Syndrome_... 774 70 41 816.0 0.9171 0.9495 0.933
Virus 121 8 23 144.0 0.9365 0.8403 0.8858
Frequency 57 1 2 59.9 0.9787 0.9556 0.967
Route 37 4 10 47.0 0.9024 0.7872 0.8409
Death_Entity 20 9 3 23.0 0.6897 0.8696 0.7692
Stage 4 0 7 12.0 1.0 0.3889 0.56
Vaccine_Name 10 1 0 10.0 0.9091 1.0 0.9524
Trial_Design 32 13 8 41.0 0.7149 0.7951 0.7529
Biological_molecules 251 91 53 305.0 0.7335 0.8233 0.7758
Date 98 5 2 100.0 0.9492 0.98 0.9643
Race_Ethnicity 0 0 2 2.0 0.0 0.0 0.0
Gender 46 1 0 46.0 0.9787 1.0 0.9892
Dosage 49 9 24 73.0 0.8376 0.6712 0.7452
Viral_components 18 10 15 34.0 0.6512 0.549 0.5957
macro - - - - - - 0.8382
micro - - - - - - 0.8704