Description
This model maps extracted medical entities to ICD-O codes using BERT sentence embeddings.
Given an oncological entity found in the text (via NER models such as ner_jsl), it returns the top terms and resolutions along with the corresponding Morphology codes, which comprise Histology and Behavior codes.
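As a minimal sketch of how a Morphology code breaks down into its Histology and Behavior parts, the snippet below splits a code such as `8480/3` at the slash. The helper name `split_morphology_code` and the `MorphologyCode` tuple are hypothetical and not part of Spark NLP; the behavior labels follow the standard ICD-O behavior digits.

```python
from typing import NamedTuple

# Behavior digits used by ICD-O morphology codes.
BEHAVIOR_LABELS = {
    "0": "benign",
    "1": "uncertain whether benign or malignant",
    "2": "carcinoma in situ",
    "3": "malignant, primary site",
    "6": "malignant, metastatic site",
    "9": "malignant, uncertain whether primary or metastatic site",
}


class MorphologyCode(NamedTuple):
    histology: str       # four-digit histology (cell type) code
    behavior: str        # single behavior digit after the slash
    behavior_label: str  # human-readable meaning of the behavior digit


def split_morphology_code(code: str) -> MorphologyCode:
    """Split a morphology code like '8480/3' into histology and behavior."""
    histology, sep, behavior = code.strip().partition("/")
    valid = (
        sep == "/"
        and len(histology) == 4
        and histology.isdigit()
        and behavior in BEHAVIOR_LABELS
    )
    if not valid:
        raise ValueError(f"Not a morphology code of the form NNNN/B: {code!r}")
    return MorphologyCode(histology, behavior, BEHAVIOR_LABELS[behavior])


parsed = split_morphology_code("8480/3")
print(parsed.histology, parsed.behavior, parsed.behavior_label)
```

Applied to the resolver output, this makes it easy to separate, for example, in situ candidates (`/2`) from malignant ones (`/3`) among the returned codes.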
Predicted Entities
ICD-O codes and their normalized definitions, resolved with sbiobert_base_cased_mli embeddings.
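To illustrate the general idea of resolving an entity through embeddings, here is a toy sketch that ranks candidate codes by Euclidean distance between a query vector and stored vectors. This is not the library's implementation: the three-dimensional vectors, the `toy_index` dictionary, and the `top_k_codes` helper are all made up for illustration, whereas real sentence embeddings have far more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_codes(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k codes whose vectors lie closest to the query vector."""
    ranked = sorted(
        ((code, euclidean(query, vec)) for code, vec in index.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]


# Toy vectors only; they carry no real semantic meaning.
toy_index = {
    "8480/3": [0.9, 0.1, 0.0],
    "8481/3": [0.8, 0.2, 0.1],
    "9522/3": [0.0, 0.1, 0.9],
}
query_vector = [0.88, 0.12, 0.02]

for code, distance in top_k_codes(query_vector, toy_index, k=2):
    print(code, round(distance, 4))
```

The closest code plays the role of the top resolution, and the remaining ranked codes correspond to the alternative candidates a resolver can report.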
How to use
# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Chunk2Doc, LightPipeline
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, BertSentenceEmbeddings
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal, SentenceEntityResolverModel

# Assumes an active SparkSession named `spark` with the licensed library loaded.

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Keep only chunks labeled as Oncological
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Oncological"])

# Convert each chunk to a document so it can be embedded on its own
c2doc = Chunk2Doc()\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sentence_embeddings")

resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("resolution")\
    .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(stages=[
    document_assembler,
    sentenceDetectorDL,
    tokenizer,
    word_embeddings,
    ner,
    ner_converter,
    c2doc,
    sbert_embedder,
    resolver
])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = resolver_pipeline.fit(empty_data)

text = """In our patient experiencing intestinal bleeding and complaining of chest swelling, samples were taken, revealing the presence of adenomyoepitelioma and mucinous adenocarcinoma."""

lmodel = LightPipeline(model)
result = lmodel.fullAnnotate(text)
// Scala
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for Seq(...).toDF

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

// Keep only chunks labeled as Oncological
val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("Oncological"))

// Convert each chunk to a document so it can be embedded on its own
val c2doc = new Chunk2Doc()
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols(Array("ner_chunk_doc"))
    .setOutputCol("sentence_embeddings")

val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo", "en", "clinical/models")
    .setInputCols(Array("sentence_embeddings"))
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val resolver_pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentenceDetectorDL,
    tokenizer,
    word_embeddings,
    ner,
    ner_converter,
    c2doc,
    sbert_embedder,
    resolver))

// All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
val empty_data = Seq("").toDF("text")
val model = resolver_pipeline.fit(empty_data)

val text = "In our patient experiencing intestinal bleeding and complaining of chest swelling, samples were taken, revealing the presence of adenomyoepitelioma and mucinous adenocarcinoma."

val lmodel = new LightPipeline(model)
val result = lmodel.fullAnnotate(text)
Results
| | chunks | begin | end | entity | code | confidence | all_codes | resolutions |
|--:|------------------------:|------:|----:|------------:|-------:|-----------:|---------------------------------------------------------------------------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| 0 | adenomyoepitelioma | 129 | 146 | Oncological | 8983/3 | 0.9049 | [8983/3, 8562/3, 8413/3, 9522/3, 9523/3] | [Adenomyoepithelioma with carcinoma, Epithelial-myoepithelial carcinoma, Eccrine adenocarcinoma, Olfactory neuroblastoma, Olfactory neuroepithelioma] |
| 1 | mucinous adenocarcinoma | 152 | 174 | Oncological | 8480/3 | 0.74355 | [8480/3, 8481/3, 8420/3, 8253/2, 8550/3, 8290/3, 8262/3, 8253/3, 8323/3, 8213/3] | [Mucinous adenocarcinoma, Mucin-producing adenocarcinoma, Ceruminous adenocarcinoma, Adenocarcinoma in situ, mucinous, Acinar cell carcinoma, Oxyphilic adenocarcinoma, Villous adenocarcinoma, Invasive mucinous adenocarcinoma, Mixed cell adenocarcinoma, Serrated adenocarcinoma] |
Model Information
| Property | Value |
|---|---|
| Name: | sbiobertresolve_icdo |
| Type: | SentenceEntityResolverModel |
| Compatibility: | Spark NLP 2.6.4+ |
| License: | Licensed |
| Edition: | Official |
| Input labels: | [ner_chunk, chunk_embeddings] |
| Output labels: | [resolution] |
| Language: | en |
| Dependencies: | sbiobert_base_cased_mli |
Data Source
Trained on the ICD-O Histology Behaviour dataset with sbiobert_base_cased_mli sentence embeddings.
https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf