Detect Drugs and Posology Entities (ner_posology_greedy)

Description

This model detects drugs, dosage, form, frequency, duration, route, and drug strength in text. It differs from ner_posology in that it chunks together drug, dosage, form, strength, and route when they appear together, producing a single larger chunk. It was trained with embeddings_clinical, so use the same embeddings in the pipeline.

Predicted Entities

DRUG, STRENGTH, DURATION, FREQUENCY, FORM, DOSAGE, ROUTE.

How to use

Use this model as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to merge the token-level entity tags into full entity chunks.

...
clinical_ner = NerDLModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
   .setInputCols(["sentence", "token", "embeddings"]) \
   .setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])
model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day."]]).toDF("text"))
...
val clinical_ner = NerDLModel.pretrained("ner_posology_greedy", "en", "clinical/models")
     .setInputCols(Array("sentence", "token", "embeddings"))
     .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter))

// Requires `import spark.implicits._` for .toDS
val data = Seq("The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
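
As a rough illustration of what the chunk-conversion step does conceptually, the sketch below merges token-level IOB tags (B-/I- prefixes, as in the Benchmarking labels) into chunks. It is plain Python and does not use Spark NLP; the function name `iob_to_chunks` and the example tags are assumptions made for illustration, not output of the model or the actual NerConverter implementation.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper for illustration only; it joins tokens with single
    spaces and ignores character offsets.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk that is currently open, if any.
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag (or an I- tag with a new label) starts a new chunk.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # An I- tag with the same label extends the open chunk.
            current_tokens.append(token)
        else:
            # An O tag closes any open chunk.
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Hand-written tags (an assumption, not model output) for part of the example sentence.
example_tokens = ["1", "capsule", "of", "Advil", "10", "mg", "for", "5", "days", "and"]
example_tags = ["B-DRUG", "I-DRUG", "I-DRUG", "I-DRUG", "I-DRUG", "I-DRUG",
                "B-DURATION", "I-DURATION", "I-DURATION", "O"]
example_chunks = iob_to_chunks(example_tokens, example_tags)
print(example_chunks)
```

With these hand-written tags, the drug, its dosage, and its strength fall into one DRUG chunk, which is the "greedy" behavior described above.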

Results

+----+----------------------------------+---------+-------+------------+
|    | chunks                           |   begin |   end | entities   |
+----+----------------------------------+---------+-------+------------+
|  0 | 1 capsule of Advil 10 mg         |      27 |    50 | DRUG       |
|  1 | magnesium hydroxide 100mg/1ml PO |      67 |    98 | DRUG       |
|  2 | for 5 days                       |      52 |    61 | DURATION   |
|  3 | 40 units of insulin glargine     |     168 |   195 | DRUG       |
|  4 | at night                         |     197 |   204 | FREQUENCY  |
|  5 | 12 units of insulin lispro       |     207 |   232 | DRUG       |
|  6 | with meals                       |     234 |   243 | FREQUENCY  |
|  7 | metformin 1000 mg                |     250 |   266 | DRUG       |
|  8 | two times a day                  |     268 |   282 | FREQUENCY  |
+----+----------------------------------+---------+-------+------------+
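
The begin and end columns appear to be 0-based character offsets into the input text, with end inclusive. The sketch below checks that reading against rows 0 and 2 only, using the opening of the example sentence; `chunk_at` is a hypothetical helper, not part of Spark NLP.

```python
def chunk_at(text: str, begin: int, end: int) -> str:
    """Return the substring covered by inclusive, 0-based [begin, end] offsets."""
    return text[begin:end + 1]


# Opening of the example sentence from the usage section (truncated here).
text_start = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days"

row_0 = chunk_at(text_start, 27, 50)  # offsets from row 0 of the table
row_2 = chunk_at(text_start, 52, 61)  # offsets from row 2 of the table
print(repr(row_0), repr(row_2))
```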

Model Information

Model Name: ner_posology_greedy
Type: ner
Compatibility: Spark NLP 2.6.5+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Dependencies: embeddings_clinical

Data Source

Trained on an augmented i2b2_med7 + FDA dataset with embeddings_clinical. See https://www.i2b2.org/NLP/Medication

Benchmarking

label	 tp	 fp	 fn	 prec	 rec	 f1
B-DRUG	 29362	 1679	 1985	 0.9459103	 0.93667656	 0.94127077
B-STRENGTH	 14018	 1172	 864	 0.922844	 0.9419433	 0.9322958
I-DURATION	 6404	 935	 476	 0.87259847	 0.93081397	 0.9007666
I-STRENGTH	 16686	 1991	 1292	 0.8933983	 0.9281344	 0.9104351
I-FREQUENCY	 19743	 1088	 1081	 0.9477702	 0.94808877	 0.9479294
B-FORM	 2733	 526	 780	 0.8386008	 0.7779676	 0.80714715
B-DOSAGE	 2774	 474	 688	 0.85406405	 0.80127096	 0.8268257
I-DOSAGE	 1357	 490	 844	 0.7347049	 0.6165379	 0.67045456
I-DRUG	 37846	 4103	 3386	 0.90219074	 0.91787934	 0.9099674
I-ROUTE	 208	 30	 62	 0.8739496	 0.77037036	 0.8188976
B-ROUTE	 3061	 340	 451	 0.9000294	 0.87158316	 0.88557786
B-DURATION	 2491	 388	 276	 0.865231	 0.900253	 0.8823946
B-FREQUENCY	 13065	 608	 436	 0.9555328	 0.9677061	 0.9615809
I-FORM	 154	 69	 386	 0.69058293	 0.2851852	 0.40366974
tp: 149902 fp: 13893 fn: 13007 labels: 14
Macro-average	 prec: 0.8712434, rec: 0.82817215, f1: 0.849162
Micro-average	 prec: 0.91518056, rec: 0.92015785, f1: 0.9176625
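
The per-label and micro-average figures above can be reproduced from the tp, fp, and fn counts with the standard precision, recall, and F1 definitions. The following is a minimal sketch; the helper name `prf` is an assumption for illustration, and the counts are copied from the table.

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the B-DRUG row and from the totals line of the table.
b_drug = prf(tp=29362, fp=1679, fn=1985)
micro = prf(tp=149902, fp=13893, fn=13007)

print("B-DRUG prec=%.4f rec=%.4f f1=%.4f" % b_drug)
print("Micro  prec=%.4f rec=%.4f f1=%.4f" % micro)
```

The micro-average pools the counts across all 14 labels before computing the ratios, so frequent labels such as I-DRUG weigh more than rare ones such as I-FORM; the macro-average instead averages the per-label scores.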