Description
This is a BertForTokenClassification NER model that identifies concepts related to drug development in free text, including Trial Groups, End Points, Hazard Ratio, and other entities.
Predicted Entities
Patient_Count, Duration, End_Point, Value, Trial_Group, Hazard_Ratio, Total_Patients
How to use
# Python
# Assumes an active `spark` session with Spark NLP for Healthcare loaded.
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

# Wrap the raw text column in a document annotation.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences.
sentenceDetector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Token-level NER tags (IOB scheme) from the pretrained model.
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\
    .setInputCols(["token", "sentence"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

test_sentence = """In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan."""

data = spark.createDataFrame([[test_sentence]]).toDF("text")
result = pipeline.fit(data).transform(data)
// Scala
// Assumes an active `spark` session with Spark NLP for Healthcare on the classpath.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")
  .setInputCols(Array("token", "sentence"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val data = Seq("In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.").toDF("text")

val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.ner.drug_development_trials").predict("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""")
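The `ner_chunk` column produced by the Python pipeline holds one annotation per detected chunk. A minimal sketch of turning those annotations into (chunk, label) pairs follows. It assumes each annotation exposes a `result` field with the chunk text and a `metadata` map whose `"entity"` key carries the label; the helper name `chunks_with_labels` and the stand-in objects are hypothetical and only there so the sketch runs without a Spark session.

```python
from types import SimpleNamespace


def chunks_with_labels(annotations):
    """Return (chunk_text, entity_label) pairs from NerConverter-style annotations."""
    pairs = []
    for ann in annotations:
        # Assumption: chunk text is in `result`, label is under metadata["entity"].
        label = ann.metadata.get("entity", "UNKNOWN")
        pairs.append((ann.result, label))
    return pairs


# Stand-in annotations (hypothetical) mimicking two rows of collected output.
sample = [
    SimpleNamespace(result="overall survival", metadata={"entity": "End_Point"}),
    SimpleNamespace(result="3.6 months", metadata={"entity": "Value"}),
]

for chunk, label in chunks_with_labels(sample):
    print(f"{chunk}\t{label}")
```

With a real pipeline run, the list passed to the helper would probably come from something like `result.select("ner_chunk").first()["ner_chunk"]`, assuming the collected rows keep the `result` and `metadata` fields.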
Results
+-----------------+-------------+
|chunk            |ner_label    |
+-----------------+-------------+
|median           |Duration     |
|overall survival |End_Point    |
|with             |Trial_Group  |
|without topotecan|Trial_Group  |
|4.0              |Value        |
|3.6 months       |Value        |
|23               |Patient_Count|
|63               |Patient_Count|
|55               |Patient_Count|
|33 patients      |Patient_Count|
|topotecan        |Trial_Group  |
|11               |Patient_Count|
|61               |Patient_Count|
|66               |Patient_Count|
|32 patients      |Patient_Count|
|without topotecan|Trial_Group  |
+-----------------+-------------+
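Multi-token chunks such as "33 patients" come from merging token-level tags that use the B- (begin) and I- (inside) prefixes listed under Benchmarking. The sketch below illustrates that merging idea in plain Python. It is not the NerConverter implementation, and the token-level tags in the example are assumed for illustration rather than taken from actual model output.

```python
def merge_iob(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the chunk collected so far, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            # Continue the chunk that is currently open.
            current_tokens.append(token)
        elif prefix in ("B", "I"):
            # "B" (or a stray "I" with a new label) starts a new chunk.
            flush()
            current_tokens, current_label = [token], label
        else:
            # "O" closes any open chunk.
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Illustrative tags (assumed) for a fragment of the test sentence.
tokens = ["33", "patients", ",", "respectively", ",", "with", "topotecan"]
tags = ["B-Patient_Count", "I-Patient_Count", "O", "O", "O", "O", "B-Trial_Group"]
print(merge_iob(tokens, tags))
```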
Model Information
Model Name: bert_token_classifier_drug_development_trials
Compatibility: Healthcare NLP 3.3.4+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Size: 404.4 MB
Case sensitive: true
Max sentence length: 256
References
Trained on data obtained from clinicaltrials.gov and annotated in-house.
Benchmarking
label            prec  rec   f1    support
B-Duration       0.93  0.94  0.93  1820
B-End_Point      0.99  0.98  0.98  5022
B-Hazard_Ratio   0.97  0.95  0.96  778
B-Patient_Count  0.81  0.88  0.85  300
B-Trial_Group    0.86  0.88  0.87  6751
B-Value          0.94  0.96  0.95  7675
I-Duration       0.71  0.82  0.76  185
I-End_Point      0.94  0.98  0.96  1491
I-Patient_Count  0.48  0.64  0.55  44
I-Trial_Group    0.78  0.75  0.77  4561
I-Value          0.93  0.95  0.94  1511
O                0.96  0.95  0.95  47423
accuracy         0.94  0.94  0.94  77608
macro-avg        0.79  0.82  0.80  77608
weighted-avg     0.94  0.94  0.94  77608
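The f1 column is conventionally the harmonic mean of precision and recall. As a quick check under that assumption, the sketch below recomputes the value for the I-Patient_Count row; the function name `f1_score` is just illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# I-Patient_Count row: prec 0.48, rec 0.64
print(round(f1_score(0.48, 0.64), 2))  # 0.55
```

Because the table reports precision and recall rounded to two decimals, recomputing other rows this way can differ from the listed f1 by about 0.01.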