Extract Demographic Entities from Social Determinants of Health Texts

Description

This model extracts demographic information related to Social Determinants of Health from various kinds of biomedical documents.

Predicted Entities

Family_Member, Age, Gender, Geographic_Entity, Race_Ethnicity, Language, Spiritual_Beliefs

Live Demo Open in Colab Copy S3 URI

How to use

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_sdoh_demographics_wip", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter   
    ])

sample_texts = ["SOCIAL HISTORY: He is a former tailor from Korea.",
             "He lives alone,single and no children.",
             "Pt is a 61 years old married, Caucasian, Catholic woman. Pt speaks English reasonably well."]


data = spark.createDataFrame(sample_texts, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_sdoh_demographics_wip", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter   
))

val data = Seq("Pt is a 61 years old married, Caucasian, Catholic woman. Pt speaks English reasonably well.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+-----------------+-----+---+------------+
|ner_label        |begin|end|chunk       |
+-----------------+-----+---+------------+
|Gender           |16   |17 |He          |
|Geographic_Entity|43   |47 |Korea       |
|Gender           |0    |1  |He          |
|Family_Member    |29   |36 |children    |
|Age              |8    |19 |61 years old|
|Race_Ethnicity   |30   |38 |Caucasian   |
|Spiritual_Beliefs|41   |48 |Catholic    |
|Gender           |50   |54 |woman       |
|Language         |67   |73 |English     |
+-----------------+-----+---+------------+

Model Information

Model Name: ner_sdoh_demographics_wip
Compatibility: Healthcare NLP 4.2.8+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 858.4 KB

Benchmarking

	    label	     tp	    fp	   fn	   total	precision	   recall	       f1
              Age	 1346.0	  73.0	 74.0	  1420.0	 0.948555	 0.947887	 0.948221
Spiritual_Beliefs	  100.0	  13.0	 16.0	   116.0	 0.884956	 0.862069	 0.873362
    Family_Member	 4468.0	 134.0	 43.0	  4511.0	 0.970882	 0.990468	 0.980577
   Race_Ethnicity	   56.0	   0.0	 13.0	    69.0	 1.000000	 0.811594	 0.896000
           Gender	 9825.0	  67.0	247.0	 10072.0	 0.993227	 0.975477	 0.984272
Geographic_Entity	  225.0	   9.0	 29.0	   254.0	 0.961538	 0.885827	 0.922131
         Language	   51.0	   9.0	  5.0	    56.0	 0.850000	 0.910714	 0.879310