Description
This model extracts mentions of treatments and posology information using unspecific (low-granularity) labels. It is a version of the ner_oncology_unspecific_posology model augmented with the langtest library.
Definitions of Predicted Entities:
Cancer_Therapy: Mentions of cancer treatments, including chemotherapy, radiotherapy, surgery, and others.
Posology_Information: Terms related to the posology of the treatment, including duration, frequencies, and dosage.
test_type | before fail_count | after fail_count | before pass_count | after pass_count | minimum pass_rate | before pass_rate | after pass_rate |
---|---|---|---|---|---|---|---|
add_ocr_typo | 658 | 228 | 630 | 1060 | 70% | 49% | 82% |
add_slangs | 20 | 14 | 1268 | 1274 | 60% | 98% | 99% |
add_typo | 167 | 142 | 1121 | 1146 | 60% | 87% | 89% |
lowercase | 166 | 116 | 1122 | 1172 | 70% | 87% | 91% |
titlecase | 600 | 200 | 688 | 1088 | 70% | 53% | 84% |
uppercase | 1195 | 268 | 93 | 1020 | 60% | 7% | 79% |
weighted average | 2806 | 968 | 4922 | 6760 | 65% | 63.69% | 87.47% |
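The pass rates in the table above follow directly from the fail and pass counts. The sketch below recomputes them, assuming that the "weighted average" row pools all test types (total passes over total cases), which is consistent with the counts shown. The names `counts`, `pass_rate`, and `overall_pass_rate` are illustrative and not part of langtest.

```python
# (fail_count, pass_count) pairs copied from the table, before and after augmentation.
counts = {
    "add_ocr_typo": {"before": (658, 630), "after": (228, 1060)},
    "add_slangs": {"before": (20, 1268), "after": (14, 1274)},
    "add_typo": {"before": (167, 1121), "after": (142, 1146)},
    "lowercase": {"before": (166, 1122), "after": (116, 1172)},
    "titlecase": {"before": (600, 688), "after": (200, 1088)},
    "uppercase": {"before": (1195, 93), "after": (268, 1020)},
}


def pass_rate(fail, passed):
    """Share of test cases that passed, as a percentage."""
    return 100.0 * passed / (fail + passed)


def overall_pass_rate(stage):
    """Pool every test type: total passes over total cases for one stage."""
    total_fail = sum(row[stage][0] for row in counts.values())
    total_pass = sum(row[stage][1] for row in counts.values())
    return pass_rate(total_fail, total_pass)


for name, row in counts.items():
    before = pass_rate(*row["before"])
    after = pass_rate(*row["after"])
    print(f"{name:13s} {before:6.2f}% -> {after:6.2f}%")

print(f"{'overall':13s} {overall_pass_rate('before'):6.2f}% -> {overall_pass_rate('after'):6.2f}%")
```

Rounded to whole percentages, the per-test values match the table, and the pooled values reproduce the 63.69% and 87.47% figures.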
Predicted Entities
Cancer_Therapy, Posology_Information
How to use
# Python. Imports assume the Spark NLP and licensed Healthcare NLP packages are installed.
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology_langtest", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

# Assumes `spark` is an active SparkSession started with Healthcare NLP.
data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition."]]).toDF("text")

result = pipeline.fit(data).transform(data)
// Scala. Assumes the Spark NLP and Healthcare NLP annotator classes are already in scope.
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes a SparkSession named `spark`

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner,
    ner_converter))

val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
Results
+----------------+--------------------+
|chunk           |ner_label           |
+----------------+--------------------+
|adriamycin      |Cancer_Therapy      |
|60 mg/m2        |Posology_Information|
|cyclophosphamide|Cancer_Therapy      |
|600 mg/m2       |Posology_Information|
|six courses     |Posology_Information|
|second cycle    |Posology_Information|
|chemotherapy    |Cancer_Therapy      |
+----------------+--------------------+
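A table like the one above can be built from the annotations in the `ner_chunk` column. The following is a minimal sketch, assuming each chunk annotation exposes a `result` attribute (the chunk text) and a `metadata` mapping with an `entity` key (the label), as Spark NLP chunk annotations generally do. `chunk_label_pairs` and `format_table` are hypothetical helper names, and `SimpleNamespace` stand-ins replace real pipeline output so the snippet runs without Spark.

```python
from types import SimpleNamespace


def chunk_label_pairs(annotations):
    """Return (chunk text, entity label) pairs from chunk-style annotations."""
    return [(a.result, a.metadata["entity"]) for a in annotations]


def format_table(pairs, headers=("chunk", "ner_label")):
    """Render the pairs as a fixed-width ASCII table."""
    # Column width is the longest cell in that column, header included.
    widths = [max(len(str(v)) for v in col) for col in zip(headers, *pairs)]
    rule = "+" + "+".join("-" * w for w in widths) + "+"

    def line(row):
        return "|" + "|".join(str(v).ljust(w) for v, w in zip(row, widths)) + "|"

    return "\n".join([rule, line(headers), rule, *(line(p) for p in pairs), rule])


# Stand-ins for two chunk annotations (assumption: real ones have the same shape).
demo = [
    SimpleNamespace(result="adriamycin", metadata={"entity": "Cancer_Therapy"}),
    SimpleNamespace(result="60 mg/m2", metadata={"entity": "Posology_Information"}),
]

pairs = chunk_label_pairs(demo)
print(format_table(pairs))
```

With real output, the list passed to `chunk_label_pairs` would presumably come from a collected row, for example `result.select("ner_chunk").first()["ner_chunk"]`.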
Model Information
Model Name: | ner_oncology_unspecific_posology_langtest |
Compatibility: | Healthcare NLP 5.1.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 14.6 MB |
References
In-house annotated oncology case reports.
Benchmarking
               label  precision  recall  f1-score  support
      Cancer_Therapy       0.90    0.90      0.90     1845
Posology_Information       0.87    0.86      0.87     1199
           micro-avg       0.89    0.89      0.89     3044
           macro-avg       0.89    0.88      0.88     3044
        weighted-avg       0.89    0.89      0.89     3044
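The source does not say how the averaged rows were computed. The sketch below assumes the usual definitions: the macro average is the plain mean over labels, and the weighted average is the mean weighted by each label's support. Because it starts from the two-decimal values in the table, it can only reproduce the reported averages to within about 0.01. The names `rows`, `macro_avg`, and `weighted_avg` are illustrative.

```python
# Per-label metrics copied from the benchmarking table (already rounded to two decimals).
rows = {
    "Cancer_Therapy": {"precision": 0.90, "recall": 0.90, "f1": 0.90, "support": 1845},
    "Posology_Information": {"precision": 0.87, "recall": 0.86, "f1": 0.87, "support": 1199},
}


def macro_avg(metric):
    """Unweighted mean of a metric over all labels."""
    return sum(r[metric] for r in rows.values()) / len(rows)


def weighted_avg(metric):
    """Mean of a metric over all labels, weighted by label support."""
    total = sum(r["support"] for r in rows.values())
    return sum(r[metric] * r["support"] for r in rows.values()) / total


total_support = sum(r["support"] for r in rows.values())
print("total support:", total_support)
for metric in ("precision", "recall", "f1"):
    print(f"{metric:9s} macro={macro_avg(metric):.3f} weighted={weighted_avg(metric):.3f}")
```

The micro average is left out because it needs raw true-positive and prediction counts, which the table does not provide.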