Description
This Named Entity Recognition (NER) model was trained with BertForTokenClassification from the transformers library and imported into Spark NLP.
It extracts relevant entities from clinical trial abstracts, using a simplified version of the ontology specified by Sanchez Graillet, O., et al. to capture concepts related to trial design, diseases, drugs, population, statistics, and publication.
Predicted Entities
Age, AllocationRatio, Author, BioAndMedicalUnit, CTAnalysisApproach, CTDesign, Confidence, Country, DisorderOrSyndrome, DoseValue, Drug, DrugTime, Duration, Journal, NumberPatients, PMID, PValue, PercentagePatients, PublicationYear, TimePoint, Value
How to use
# Python
# Assumes an active SparkSession named `spark` started with Spark NLP and Healthcare NLP.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Assign one NER tag per token
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "en", "clinical/models")\
    .setInputCols(["token", "sentence"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

# Merge token-level tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter])

text = ["This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime."]

data = spark.createDataFrame([text]).toDF("text")

results = nlpPipeline.fit(data).transform(data)
// Scala
// Assumes the Spark NLP and Healthcare NLP annotators are already imported.
import org.apache.spark.ml.Pipeline
import spark.implicits._ // required for Seq(...).toDF

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "en", "clinical/models")
  .setInputCols(Array("token", "sentence"))
  .setOutputCol("ner")
  .setCaseSensitive(true)

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val text = "This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime."

val data = Seq(text).toDF("text")

val results = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.med_ner.clinical_trials_abstracts").predict("""This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime.""")
Results
+----------------+------------------+
|chunk           |ner_label         |
+----------------+------------------+
|open-label      |CTDesign          |
|parallel-group  |CTDesign          |
|two-arm         |CTDesign          |
|insulin glargine|Drug              |
|GLA             |Drug              |
|NPH insulin     |Drug              |
|metformin       |Drug              |
|28              |NumberPatients    |
|type 2 diabetes |DisorderOrSyndrome|
|61.5            |Age               |
|kg/m(2          |BioAndMedicalUnit |
|metformin       |Drug              |
|sulfonylurea    |Drug              |
|randomized      |CTDesign          |
|once-daily      |DrugTime          |
|GLA             |Drug              |
|NPH             |Drug              |
|bedtime         |DrugTime          |
+----------------+------------------+
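The chunks above are assembled from per-token IOB tags: the model labels each token with a `B-` (begin), `I-` (inside), or `O` (outside) tag, which is why the Benchmarking section reports separate `B-` and `I-` rows, and NerConverter then merges consecutive tags into chunks. The following is a minimal pure-Python sketch of that merging idea, not Spark NLP's actual implementation. The function name `merge_iob` is hypothetical, and the per-token tags in the example are assumed for illustration (chosen to be consistent with the results table).

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token IOB tags into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []        # tokens of the chunk being built
    label: Optional[str] = None    # label of the chunk being built

    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")   # "B-Drug" -> ("B", "-", "Drug")

        # An "I-" tag with the same label extends the open chunk.
        if prefix == "I" and name == label:
            current.append(token)
            continue

        # Any other tag closes the open chunk, if there is one.
        if current:
            chunks.append((" ".join(current), label))

        if prefix in ("B", "I"):
            # "B-" starts a chunk; a stray "I-" with a new label is
            # treated as a start as well (an assumption of this sketch).
            current, label = [token], name
        else:
            current, label = [], None

    # Close a chunk that runs to the end of the sentence.
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Assumed token-level tags for a fragment of the sample abstract.
tokens = ["adding", "insulin", "glargine", "(", "GLA", ")", "vs.", "NPH", "insulin"]
tags = ["O", "B-Drug", "I-Drug", "O", "B-Drug", "O", "O", "B-Drug", "I-Drug"]

print(merge_iob(tokens, tags))
# [('insulin glargine', 'Drug'), ('GLA', 'Drug'), ('NPH insulin', 'Drug')]
```

Joining tokens with a single space is a simplification; it is enough to show how `B-` and `I-` tags map onto the chunk and ner_label columns.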
Model Information
| Model Name: | bert_token_classifier_ner_clinical_trials_abstracts |
| Compatibility: | Healthcare NLP 3.5.3+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [ner] |
| Language: | en |
| Size: | 404.3 MB |
| Case sensitive: | true |
| Max sentence length: | 128 |
References
Benchmarking
label                 precision  recall  f1-score  support
B-Age                 0.93       0.88    0.90      16
B-AllocationRatio     1.00       1.00    1.00      7
B-Author              0.98       1.00    0.99      702
B-BioAndMedicalUnit   0.96       0.97    0.96      723
B-CTAnalysisApproach  1.00       1.00    1.00      5
B-CTDesign            0.93       0.95    0.94      384
B-Confidence          0.91       0.95    0.93      184
B-Country             0.88       0.91    0.90      115
B-DisorderOrSyndrome  0.92       0.96    0.94      393
B-DoseValue           0.97       0.98    0.97      117
B-Drug                0.97       0.98    0.97      3944
B-DrugTime            0.92       0.90    0.91      202
B-Duration            0.90       0.88    0.89      100
B-Journal             1.00       1.00    1.00      131
B-NumberPatients      0.94       0.98    0.96      165
B-PMID                1.00       1.00    1.00      239
B-PValue              0.86       0.89    0.88      132
B-PercentagePatients  0.93       0.97    0.95      105
B-PublicationYear     1.00       0.98    0.99      57
B-TimePoint           0.78       0.87    0.82      306
B-Value               0.89       0.87    0.88      407
I-Age                 1.00       0.45    0.62      22
I-AllocationRatio     1.00       1.00    1.00      14
I-Author              0.99       0.98    0.99      590
I-BioAndMedicalUnit   0.97       0.99    0.98      344
I-CTAnalysisApproach  0.90       1.00    0.95      18
I-CTDesign            0.84       0.89    0.87      183
I-Confidence          0.92       0.98    0.95      753
I-Country             0.00       0.00    0.00      10
I-DisorderOrSyndrome  0.99       0.98    0.99      600
I-DoseValue           0.99       0.98    0.98      164
I-Drug                0.90       0.89    0.90      393
I-DrugTime            0.96       0.80    0.88      192
I-Duration            0.90       0.84    0.87      165
I-Journal             0.98       0.99    0.99      238
I-NumberPatients      1.00       0.95    0.98      22
I-PValue              0.96       0.99    0.98      612
I-PercentagePatients  0.99       1.00    1.00      130
I-TimePoint           0.81       0.78    0.79      282
I-Value               0.93       0.96    0.95      787
O                     0.99       0.98    0.98      24184
accuracy              -          -       0.97      38137
macro-avg             0.92       0.91    0.91      38137
weighted-avg          0.97       0.97    0.97      38137
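The gap between the macro average (0.91) and the weighted average (0.97) comes from how rare labels are counted: the macro average gives every label equal weight, while the weighted average scales each label by its support. The sketch below illustrates this in plain Python using only four rows copied from the table, so the printed numbers show the mechanism and are not expected to match the table's summary rows. The helper names `macro_f1` and `weighted_f1` are illustrative.

```python
from typing import List, Tuple

# (label, f1-score, support) for a subset of rows from the benchmarking table
rows: List[Tuple[str, float, int]] = [
    ("B-Drug", 0.97, 3944),
    ("I-Age", 0.62, 22),
    ("I-Country", 0.00, 10),
    ("O", 0.98, 24184),
]


def macro_f1(rows: List[Tuple[str, float, int]]) -> float:
    """Unweighted mean: every label counts the same."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_f1(rows: List[Tuple[str, float, int]]) -> float:
    """Support-weighted mean: frequent labels dominate."""
    total_support = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total_support


print(f"macro:    {macro_f1(rows):.2f}")     # macro:    0.64
print(f"weighted: {weighted_f1(rows):.2f}")  # weighted: 0.98
```

Low-support labels such as I-Country (10 tokens) and I-Age (22 tokens) pull the macro average down sharply but barely move the weighted average, which is dominated by O and B-Drug.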