Description
This is a pretrained deep learning model for named entity recognition (NER) in clinical text. It is based on the bert_token_classifier_ner_jsl
model, but predicts a smaller set of more generalized entities. The model was trained with BertForTokenClassification from the transformers
library and imported into Spark NLP.
Definitions of Predicted Entities:
Death_Entity: Mentions that indicate the death of a patient.
Medical_Device: All mentions related to medical devices and supplies.
Vital_Signs_Header: Identifies section headers that correspond to Vital Signs of a patient.
Allergen: Allergen related extractions mentioned in the document.
Drug_BrandName: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
Clinical_Dept: Terms that indicate the medical and/or surgical departments.
Symptom: All the symptoms mentioned in the document, of a patient or someone else.
External_body_part_or_region: All mentions related to external body parts or organs that can be examined by the naked eye.
Admission_Discharge: Terms that indicate the admission and/or the discharge of a patient.
Age: All mentions of ages, past or present, related to the patient or anybody else.
Birth_Entity: Mentions that indicate giving birth.
Oncological: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else.
Substance_Quantity: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
Test_Result: Terms related to all the test results present in the document (clinical test results are included).
Test: Mentions of laboratory, pathology, and radiological tests.
Procedure: All mentions of invasive medical or surgical procedures or treatments.
Treatment: Includes therapeutic and minimally invasive treatments and procedures (invasive treatments or procedures are extracted as “Procedure”).
Disease_Syndrome_Disorder: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as “Heart_Disease” etc.).
Predicted Entities
Death_Entity, Medical_Device, Vital_Sign, Alergen, Drug, Clinical_Dept, Lifestyle, Symptom, Body_Part, Physical_Measurement, Admission_Discharge, Date_Time, Age, Birth_Entity, Header, Oncological, Substance_Quantity, Test_Result, Test, Procedure, Treatment, Disease_Syndrome_Disorder, Pregnancy_Newborn, Demographics
How to use
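# Python
# Assumes an active licensed Spark session named `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier

# Wrap the raw text column in a document annotation.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences.
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Predict one IOB tag per token.
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models")\
    .setInputCols(["token", "sentence"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

# Merge token-level tags into labeled chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter])
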
sample_text = """HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer."""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

// Scala
// Assumes an active licensed Spark session named `spark`.
// MedicalBertForTokenClassifier comes from the licensed Healthcare NLP library.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

// Same annotator class as in the Python example.
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models")
  .setInputCols(Array("token", "sentence"))
  .setOutputCol("ner")
  .setCaseSensitive(true)

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  tokenClassifier,
  ner_converter))

val data = Seq("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)

# NLU
import nlu
nlu.load("en.classify.token_bert.ner_jsl_slim").predict("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""")
Results
+----------------+------------+
|chunk           |ner_label   |
+----------------+------------+
|HISTORY:        |Header      |
|30-year-old     |Age         |
|female          |Demographics|
|mammography     |Test        |
|soft tissue lump|Symptom     |
|shoulder        |Body_Part   |
|breast cancer   |Oncological |
|her mother      |Demographics|
|age 58          |Age         |
|breast cancer   |Oncological |
+----------------+------------+
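
The table above lists chunk-level labels, whereas the classifier itself emits one IOB tag per token (the `B-`/`I-` prefixes that appear in the Benchmarking section). In the pipeline, NerConverter does the grouping. The following is only a minimal pure-Python sketch of that idea, not the library's implementation; the function name `merge_iob` and the hand-written token/tag lists are assumptions chosen to be consistent with three rows of the Results table.

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group token-level IOB tags into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, label = tag.partition("-")
        # A B- tag, or an I- tag whose label differs, starts a new chunk.
        if prefix == "B" or label != current_label:
            flush()
            current_label = label
        current_tokens.append(token)
    flush()
    return chunks


# Hypothetical tokens and tags, written by hand for illustration only.
tokens = ["30-year-old", "female", "presents", "a", "soft", "tissue", "lump"]
tags = ["B-Age", "B-Demographics", "O", "O", "B-Symptom", "I-Symptom", "I-Symptom"]

for chunk, label in merge_iob(tokens, tags):
    print(f"{chunk} -> {label}")
```

The sketch treats an `I-` tag with no preceding `B-` as the start of a new chunk, which is one common convention; the actual converter may handle such edge cases differently.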
Model Information
Model Name: | bert_token_classifier_ner_jsl_slim |
Compatibility: | Healthcare NLP 3.3.4+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [ner] |
Language: | en |
Size: | 404.4 MB |
Case sensitive: | true |
Max sentence length: | 256 |
Data Source
Trained on data annotated by JSL.
Benchmarking
label                        precision  recall  f1-score  support
B-Admission_Discharge             0.82    0.99      0.90      282
B-Age                             0.88    0.83      0.85      576
B-Body_Part                       0.84    0.91      0.87     8582
B-Clinical_Dept                   0.86    0.94      0.90      909
B-Date_Time                       0.82    0.77      0.79     1062
B-Death_Entity                    0.66    0.98      0.79       43
B-Demographics                    0.97    0.98      0.98     5285
B-Disease_Syndrome_Disorder       0.84    0.89      0.86     4259
B-Drug                            0.88    0.87      0.87     2555
B-Header                          0.97    0.66      0.78     3911
B-Lifestyle                       0.77    0.83      0.80      371
B-Medical_Device                  0.84    0.87      0.85     3605
B-Oncological                     0.86    0.91      0.89      408
B-Physical_Measurement            0.84    0.81      0.82      135
B-Pregnancy_Newborn               0.66    0.71      0.68      245
B-Procedure                       0.82    0.88      0.85     2654
B-Symptom                         0.83    0.86      0.85     6545
B-Test                            0.82    0.83      0.83     2448
B-Test_Result                     0.76    0.81      0.78     1280
B-Treatment                       0.70    0.76      0.73      275
B-Vital_Sign                      0.85    0.87      0.86      627
I-Age                             0.84    0.90      0.87      166
I-Alergen                         0.00    0.00      0.00        5
I-Body_Part                       0.86    0.89      0.88     4946
I-Clinical_Dept                   0.92    0.93      0.93      806
I-Date_Time                       0.82    0.91      0.86     1173
I-Demographics                    0.89    0.84      0.86      416
I-Disease_Syndrome_Disorder       0.87    0.85      0.86     4385
I-Drug                            0.83    0.86      0.85     5199
I-Header                          0.85    0.97      0.90     6763
I-Lifestyle                       0.77    0.69      0.73      134
I-Medical_Device                  0.86    0.86      0.86     2341
I-Oncological                     0.85    0.94      0.89      515
I-Physical_Measurement            0.88    0.94      0.91      329
I-Pregnancy_Newborn               0.66    0.70      0.68      273
I-Procedure                       0.87    0.86      0.87     3414
I-Symptom                         0.79    0.75      0.77     6485
I-Test                            0.82    0.77      0.79     2283
I-Test_Result                     0.67    0.56      0.61      649
I-Treatment                       0.69    0.72      0.70      194
I-Vital_Sign                      0.88    0.90      0.89      918
O                                 0.97    0.97      0.97   210520
accuracy                             -       -      0.94   297997
macro-avg                         0.74    0.74      0.73   297997
weighted-avg                      0.94    0.94      0.94   297997
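
The gap between the macro average (0.73 F1) and the weighted average (0.94 F1) follows from how the two are computed: the macro average gives every label equal weight, so rare, low-scoring labels such as I-Alergen (support 5) pull it down, while the weighted average is dominated by the O label (210520 of 297997 tokens). The sketch below illustrates both calculations on three rows copied from the table. Because it uses only a subset of rows and the rounded precision/recall values, its averages are illustrative and will not match the table's totals; the helper name `f1` is an assumption.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (label, precision, recall, support) copied from three rows of the table.
rows = [
    ("B-Death_Entity", 0.66, 0.98, 43),
    ("I-Alergen", 0.00, 0.00, 5),
    ("O", 0.97, 0.97, 210520),
]

scores = [(label, f1(p, r), support) for label, p, r, support in rows]
for label, score, support in scores:
    print(f"{label:<16} f1={score:.2f} support={support}")

total_support = sum(support for _, _, support in scores)

# Macro average: every label counts equally, regardless of support.
macro = sum(score for _, score, _ in scores) / len(scores)

# Weighted average: each label counts in proportion to its support.
weighted = sum(score * support for _, score, support in scores) / total_support

print(f"macro={macro:.2f} weighted={weighted:.2f}")
```

On this subset the weighted figure stays close to the score of the O label, while the macro figure drops sharply because of the single zero-scoring label, which mirrors the pattern in the full table.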