Description
This Named Entity Recognition (NER) model uses a deep learning architecture (character CNNs, BiLSTM, CRF, word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
It extracts relevant entities from clinical trial abstracts, using a simplified version of the ontology specified by Sanchez Graillet, O., et al. to cover concepts related to trial design, diseases, drugs, population, statistics and publication.
Predicted Entities
Age, AllocationRatio, Author, BioAndMedicalUnit, CTAnalysisApproach, CTDesign, Confidence, Country, DisorderOrSyndrome, DoseValue, Drug, DrugTime, Duration, Journal, NumberPatients, PMID, PValue, PercentagePatients, PublicationYear, TimePoint, Value
How to use
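# Python
# Assumes an active Healthcare NLP Spark session is available as `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel

documentAssembler = DocumentAssembler()\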
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_clinical_trials_abstracts", "en", "clinical/models")\
.setInputCols(["sentence","token", "embeddings"])\
.setOutputCol("ner")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner])
text = ["A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime."]
data = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
// Scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical_trials_abstracts", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner))
val text = "A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime."
import spark.implicits._ // provides toDF on Seq; already in scope in spark-shell
val data = Seq(text).toDF("text")
val results = pipeline.fit(data).transform(data)
# Python (NLU)
import nlu
nlu.load("en.ner.clinical_trials_abstracts").predict("""A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime.""")
Results
+-----------+--------------------+
| token| ner_label|
+-----------+--------------------+
| A| O|
| one-year| O|
| ,| O|
| randomised| B-CTDesign|
| ,| O|
|multicentre| B-CTDesign|
| trial| O|
| comparing| O|
| insulin| B-Drug|
| glargine| I-Drug|
| with| O|
| NPH| B-Drug|
| insulin| I-Drug|
| in| O|
|combination| O|
| with| O|
| oral| O|
| agents| O|
| in| O|
| patients| O|
| with| O|
| type|B-DisorderOrSyndrome|
| 2|I-DisorderOrSyndrome|
| diabetes|I-DisorderOrSyndrome|
| .| O|
| In| O|
| a| O|
|multicentre| B-CTDesign|
| ,| O|
| open| B-CTDesign|
| ,| O|
| randomised| B-CTDesign|
| study| O|
| ,| O|
| 570| B-NumberPatients|
| patients| O|
| with| O|
| Type|B-DisorderOrSyndrome|
| 2|I-DisorderOrSyndrome|
| diabetes|I-DisorderOrSyndrome|
| ,| O|
| aged| O|
| 34| B-Age|
| -| O|
| 80| B-Age|
| years| O|
| ,| O|
| were| O|
| treated| O|
| for| O|
| 52| B-Duration|
| weeks| I-Duration|
| with| O|
| insulin| B-Drug|
| glargine| I-Drug|
| or| O|
| NPH| B-Drug|
| insulin| I-Drug|
| given| O|
| once| B-DrugTime|
| daily| I-DrugTime|
| at| O|
| bedtime| B-DrugTime|
| .| O|
+-----------+--------------------+
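The table above is token-level: each entity is marked with a B- tag on its first token and I- tags on the tokens that continue it. As a rough illustration of how these tags map to whole entities, the following plain-Python sketch merges them into chunks. The function name `group_iob` and the sample lists are hypothetical and only mirror a few rows of the table; in a Spark NLP pipeline this step is normally handled by a converter stage such as NerConverter rather than by hand.

```python
from typing import List, Optional, Tuple


def group_iob(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, entity_type) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        # An I- tag of the same type extends the chunk that is already open.
        if prefix == "I" and entity == current_type:
            current_tokens.append(token)
            continue
        # Any other label closes the open chunk, if there is one.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # B- starts a new chunk; a stray I- tag is treated the same way.
        if prefix in ("B", "I"):
            current_tokens, current_type = [token], entity
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_type))
    return chunks


# Hypothetical sample copied from a few rows of the results table.
tokens = ["treated", "for", "52", "weeks", "with", "insulin", "glargine"]
labels = ["O", "O", "B-Duration", "I-Duration", "O", "B-Drug", "I-Drug"]
print(group_iob(tokens, labels))
# [('52 weeks', 'Duration'), ('insulin glargine', 'Drug')]
```

Under this grouping, "34" and "80" in the table stay separate Age chunks, since each carries its own B-Age tag and the hyphen between them is labelled O.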
Model Information
| Model Name: | ner_clinical_trials_abstracts |
| Compatibility: | Healthcare NLP 3.5.3+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | en |
| Size: | 14.7 MB |
References
Benchmarking
label               precision    recall  f1-score   support
Age                      0.88      0.61      0.72        38
AllocationRatio          1.00      1.00      1.00        24
Author                   0.93      0.92      0.92       789
BioAndMedicalUnit        0.95      0.94      0.95       785
CTAnalysisApproach       1.00      0.87      0.93        23
CTDesign                 0.91      0.95      0.93       410
Confidence               0.95      0.95      0.95       899
Country                  0.94      0.86      0.90       123
DisorderOrSyndrome       0.99      0.98      0.99       568
DoseValue                0.96      0.97      0.97       263
Drug                     0.96      0.95      0.96      1290
DrugTime                 0.97      0.85      0.91       377
Duration                 0.89      0.86      0.88       271
Journal                  0.95      0.93      0.94       175
NumberPatients           0.95      0.94      0.94       173
O                        0.98      0.98      0.98     21613
PMID                     1.00      1.00      1.00        55
PValue                   0.97      0.99      0.98       654
PercentagePatients       0.92      0.92      0.92       235
PublicationYear          0.86      0.96      0.91        57
TimePoint                0.85      0.75      0.80       514
Value                    0.94      0.94      0.94      1195
accuracy                    -         -      0.97     30531
macro-avg                0.94      0.91      0.93     30531
weighted-avg             0.97      0.97      0.97     30531
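To make the columns easier to read: the f1-score is the harmonic mean of precision and recall, a macro average weights every label equally, and a weighted average weights each label by its support. The sketch below illustrates this on three rows copied from the table. The helper names are hypothetical, and because the inputs are already rounded to two decimals and only a subset of rows is used, the averages it prints will not reproduce the summary rows of the full table.

```python
from typing import List, Tuple

Row = Tuple[str, float, float, int]  # (label, precision, recall, support)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(rows: List[Row]) -> float:
    """Unweighted mean of per-label F1: every label counts equally."""
    return sum(f1_score(p, r) for _, p, r, _ in rows) / len(rows)


def weighted_f1(rows: List[Row]) -> float:
    """Mean of per-label F1 weighted by each label's support."""
    total = sum(support for _, _, _, support in rows)
    return sum(f1_score(p, r) * support for _, p, r, support in rows) / total


# Three rows copied from the benchmark table above.
rows: List[Row] = [
    ("Age", 0.88, 0.61, 38),
    ("TimePoint", 0.85, 0.75, 514),
    ("PMID", 1.00, 1.00, 55),
]

for label, p, r, _ in rows:
    # Rounded to two decimals these match the table: 0.72, 0.80, 1.00.
    print(label, round(f1_score(p, r), 2))

print("macro:", macro_f1(rows))
print("weighted:", weighted_f1(rows))
```

In this subset the weighted average is lower than the macro average because TimePoint, the label with the largest support here, has a below-average F1.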