Description
Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (NerDL) is inspired by a former state of the art model for NER: Chiu & Nicols, Named Entity Recognition with Bidirectional LSTM-CNN.
Definitions of Predicted Entities:
Age
: All mention of ages, past or present, related to the patient or with anybody else.Dosage
: Quantity prescribed by the physician for an active ingredient; measurement units are available described by FDA.Drug_Name
: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.Frequency
: Frequency of administration for a dose prescribed.Gender
: Gender-specific nouns and pronouns.Symptom
: All the symptoms mentioned in the document, of a patient or someone else.Allergen
: Allergen related extractions mentioned in the document.Blood_Pressure
: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.Modifier
: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in ICD-10 name of a specific disease, the respective modifier is not extracted separately.O2_Saturation
: Systemic arterial, venous or peripheral oxygen saturation measurements.Procedure
: All mentions of invasive medical or surgical procedures or treatments.Pulse
: Peripheral heart rate, without advanced information like measurement location.Respiration
: Number of breaths per minute.Route
: Drug and medication administration routes available described by FDA.Section_Header
: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels).Temperature
: All mentions that refer to body temperature.Weight
: All mentions related to a patients weight.
Predicted Entities
Age
, Diagnosis
, Dosage
, Drug_Name
, Frequency
, Gender
, Lab_Name
, Lab_Result
, Symptom_Name
, Allergenic_substance
, Blood_Pressure
, Causative_Agents_(Virus_and_Bacteria)
, Modifier
, Name
, Negation
, O2_Saturation
, Procedure
, Procedure_Name
, Pulse_Rate
, Respiratory_Rate
, Route
, Section_Name
, Substance_Name
, Temperature
, Weight
.
Live Demo Open in Colab Copy S3 URI
How to use
Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"]))
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_jsl_enriched", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))
val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text")
val result = pipeline.fit(data).transform(data)
Results
The output is a dataframe with a sentence per row and a "ner"
column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select "token.result"
and "ner.result"
from your output dataframe or add the "Finisher"
to the end of your pipeline.
+---------------------------+------------+
|chunk |ner |
+---------------------------+------------+
|21-day-old |Age |
|male |Gender |
|congestion |Symptom_Name|
|mom |Gender |
|suctioning yellow discharge|Symptom_Name|
|she |Gender |
|problems with his breathing|Symptom_Name|
|perioral cyanosis |Symptom_Name|
|retractions |Symptom_Name|
|mom |Gender |
|Tylenol |Drug_Name |
|His |Gender |
|his |Gender |
|respiratory congestion |Symptom_Name|
|He |Gender |
|tired |Symptom_Name|
|fussy |Symptom_Name|
|albuterol |Drug_Name |
+---------------------------+------------+
Model Information
Model Name: | ner_jsl_enriched_en_2.4.2_2.4 |
Type: | ner |
Compatibility: | Spark NLP 2.4.2 |
Edition: | Official |
License: | Licensed |
Input Labels: | [sentence,token, embeddings] |
Output Labels: | [ner] |
Language: | [en] |
Case sensitive: | false |
Data Source
Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/
Benchmarking
label tp fp fn prec rec f1
B-Pulse_Rate 80 26 9 0.754717 0.898876 0.820513
I-Diagnosis 2341 1644 1129 0.587453 0.67464 0.628035
I-Procedure_Name 2209 1128 1085 0.661972 0.670613 0.666265
B-Lab_Result 432 107 263 0.801484 0.621583 0.700162
B-Dosage 465 179 81 0.72205 0.851648 0.781513
I-Causative_Agents_(Virus_and_Bacteria) 9 3 10 0.75 0.473684 0.580645
B-Name 648 295 510 0.687169 0.559585 0.616849
I-Name 917 427 665 0.682292 0.579646 0.626794
B-Weight 52 25 9 0.675325 0.852459 0.753623
B-Symptom_Name 4244 1911 1776 0.689521 0.704983 0.697166
I-Maybe 25 15 63 0.625 0.284091 0.390625
I-Symptom_Name 1920 1584 2503 0.547945 0.434095 0.48442
B-Modifier 1399 704 942 0.66524 0.597608 0.629613
B-Blood_Pressure 82 21 7 0.796117 0.921348 0.854167
B-Frequency 290 93 97 0.75718 0.749354 0.753247
I-Gender 29 19 25 0.604167 0.537037 0.568627
I-Age 3 6 11 0.333333 0.214286 0.26087
B-Drug_Name 1762 500 271 0.778957 0.866699 0.820489
B-Substance_Name 143 32 53 0.817143 0.729592 0.770889
B-Temperature 58 23 11 0.716049 0.84058 0.773333
B-Section_Name 2700 294 177 0.901804 0.938478 0.919775
I-Route 131 165 177 0.442568 0.425325 0.433775
B-Maybe 108 47 164 0.696774 0.397059 0.505855
B-Gender 5156 685 68 0.882726 0.986983 0.931948
I-Dosage 435 182 87 0.705024 0.833333 0.763828
B-Causative_Agents_(Virus_and_Bacteria) 21 17 6 0.552632 0.777778 0.646154
I-Frequency 278 131 191 0.679707 0.592751 0.633257
B-Age 352 34 21 0.911917 0.9437 0.927536
I-Lab_Result 27 20 170 0.574468 0.137056 0.221311
B-Negation 1501 311 341 0.828366 0.814875 0.821565
B-Diagnosis 2657 1281 1049 0.674708 0.716945 0.695186
I-Section_Name 3876 1304 188 0.748263 0.95374 0.838598
B-Route 466 286 123 0.619681 0.791172 0.695004
I-Negation 80 152 190 0.344828 0.296296 0.318725
B-Procedure_Name 1453 739 562 0.662865 0.721092 0.690754
I-Allergenic_substance 6 1 7 0.857143 0.461538 0.6
B-Allergenic_substance 74 31 23 0.704762 0.762887 0.732673
I-Weight 46 43 17 0.516854 0.730159 0.605263
B-Lab_Name 639 189 287 0.771739 0.690065 0.72862
I-Modifier 104 156 417 0.4 0.199616 0.266325
I-Temperature 2 7 13 0.222222 0.133333 0.166667
I-Drug_Name 334 237 290 0.584939 0.535256 0.558996
I-Lab_Name 271 157 140 0.633178 0.659367 0.646007
B-Respiratory_Rate 46 6 5 0.884615 0.901961 0.893204
Macro-average 37896 15237 14343 0.621144 0.562248 0.59023
Micro-average 37896 15237 14343 0.713229 0.725435 0.71928