Extract Social Environment Entities from Social Determinants of Health Texts

Description

SDOH NER model is designed to detect and label social determinants of health (SDOH) social environment entities within text data. Social determinants of health are crucial factors that influence individuals’ health outcomes, encompassing various social, economic, and environmental elements. The model has been trained using advanced machine-learning techniques on a diverse range of text sources. The model’s accuracy and precision have been carefully validated against expert-labeled data to ensure reliable and consistent results. Here are the labels of the SDOH NER model with their description:

  • Chidhood_Event: Childhood events mentioned by the patient. “childhood trauma, childhood abuse, etc.”
  • Legal_Issues: Issues that have legal implications. “legal issues, legal problems, detention, in prison, etc.”
  • Social_Exclusion: Absence or lack of rights or accessibility to services or goods that are expected of the majority of the population. “social exclusion, social isolation, gender discrimination, etc.”
  • Social_Support: he presence of friends, family or other people to turn to for comfort or help. “social support, live with family, etc.”
  • Violence_Or_Abuse: Episodes of abuse or violence experienced and reported by the patient. “domestic violence, sexual abuse, etc.”

Predicted Entities

Chidhood_Event, Legal_Issues, Social_Exclusion, Social_Support, Violence_Or_Abuse

Live Demo Open in Colab Copy S3 URI

How to use

from pyspark.sql.types import StringType

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_sdoh_social_environment", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter   
    ])

sample_texts = ["He is the primary caregiver.",
"There is some evidence of abuse.",
"She stated that she was in a safe environment in prison, but that her siblings lived in an unsafe neighborhood, she was very afraid for them and witnessed their ostracism by other people.",
"Medical history: Jane was born in a low - income household and experienced significant trauma during her childhood, including physical abuse and emotional abuse."]

data = spark.createDataFrame(sample_texts, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_sdoh_social_environment", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter   
))

val data = Seq(Array("He is the primary caregiver.",
"There is some evidence of abuse.",
"She stated that she was in a safe environment in prison, but that her siblings lived in an unsafe neighborhood, she was very afraid for them and witnessed their ostracism by other people.",
"Medical history: Jane was born in a low - income household and experienced significant trauma during her childhood, including physical abuse and emotional abuse.")).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+---------------------------+-----+---+-----------------+
|chunk                      |begin|end|ner_label        |
+---------------------------+-----+---+-----------------+
|primary caregiver          |10   |26 |Social_Support   |
|abuse                      |26   |30 |Violence_Or_Abuse|
|in prison                  |46   |54 |Legal_Issues     |
|ostracism                  |161  |169|Social_Exclusion |
|trauma during her childhood|87   |113|Chidhood_Event   |
|physical abuse             |126  |139|Violence_Or_Abuse|
|emotional abuse            |145  |159|Violence_Or_Abuse|
+---------------------------+-----+---+-----------------+

Model Information

Model Name: ner_sdoh_social_environment
Compatibility: Healthcare NLP 4.4.4+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 850.8 KB
Dependencies: embeddings_clinical

References

Internal SDOH Project

Benchmarking

            label  precision    recall  f1-score   support
   Chidhood_Event       0.88      0.74      0.81        31
     Legal_Issues       0.86      0.90      0.88        42
 Social_Exclusion       0.85      0.82      0.84        28
   Social_Support       0.95      0.92      0.93       667
Violence_Or_Abuse       0.88      0.81      0.84        89
        micro-avg       0.93      0.90      0.91       857
        macro-avg       0.89      0.84      0.86       857
     weighted-avg       0.93      0.90      0.91       857