Professions & Occupations NER model in Spanish (meddroprof_scielowiki)

Description

NER model that detects professions and occupations in Spanish texts. It was trained with the embeddings_scielowiki_300d embeddings, so the same WordEmbeddingsModel must be used in the pipeline.

Predicted Entities

ACTIVIDAD, PROFESION, SITUACION_LABORAL


How to use

# Python
# Assumes `spark` is an active Spark NLP for Healthcare session.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\
     .setInputCol("text")\
     .setOutputCol("document")

sentence = SentenceDetector() \
     .setInputCols("document") \
     .setOutputCol("sentence")

tokenizer = Tokenizer() \
     .setInputCols("sentence") \
     .setOutputCol("token")

# Use the same embeddings the NER model was trained with.
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")\
     .setInputCols(["sentence", "token"])\
     .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")\
     .setInputCols(["sentence", "token", "embeddings"])\
     .setOutputCol("ner")

# Merge token-level IOB tags into entity chunks.
ner_converter = NerConverter() \
     .setInputCols(["sentence", "token", "ner"]) \
     .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter])

sample_text = """La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO"""

df = spark.createDataFrame([[sample_text]]).toDF("text")

result = pipeline.fit(df).transform(df)

// Scala
// Assumes `spark` is an active Spark NLP for Healthcare session.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel // licensed library; adjust the path if your version differs
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

// The output column must be "embeddings" to match the NER model's input columns.
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter))

val data = Seq("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

# NLU
import nlu
nlu.load("es.med_ner.scielowiki").predict("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""")

Results

+---------------------------------------+-----------------+
|chunk                                  |ner_label        |
+---------------------------------------+-----------------+
|estudiando 1o ESO                      |SITUACION_LABORAL|
|ATS                                    |PROFESION        |
|trabajan en diferentes centros de salud|PROFESION        |
|estudiando 1o ESO                      |SITUACION_LABORAL|
+---------------------------------------+-----------------+
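
The model itself emits token-level IOB tags (the B- and I- labels listed under Benchmarking), while the table above shows merged chunks. As a rough illustration of that merging step, here is a minimal pure-Python sketch. It is a simplified stand-in, not the NerConverter implementation: the function name `iob_to_chunks` is hypothetical, and the per-token tags are assumed values chosen to match the last three rows of the table.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-PROFESION" -> ("B", "PROFESION"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        # A chunk starts on B-, or on an I- whose label differs from the open chunk.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_label = label

    # Flush a chunk that runs to the end of the sentence.
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Second sentence of the sample text, split on whitespace.
tokens = ("Sus padres son ambos ATS , trabajan en diferentes "
          "centros de salud estudiando 1o ESO").split()
# Assumed token-level tags, for illustration only.
tags = (["O"] * 4 + ["B-PROFESION", "O", "B-PROFESION"] + ["I-PROFESION"] * 5
        + ["B-SITUACION_LABORAL"] + ["I-SITUACION_LABORAL"] * 2)

chunks = iob_to_chunks(tokens, tags)
for text, label in chunks:
    print(f"{text} | {label}")
```

Under these assumed tags, the sketch yields the chunks "ATS", "trabajan en diferentes centros de salud" and "estudiando 1o ESO" with the labels shown in the table.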

Model Information

Model Name: meddroprof_scielowiki
Compatibility: Healthcare NLP 4.2.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: es
Size: 14.8 MB

References

The model was trained with the MEDDOPROF data set:

The MEDDOPROF corpus is a collection of 1844 clinical cases from over 20 different specialties, annotated with professions and employment statuses. It was annotated by a team of linguists and clinical experts following specially prepared annotation guidelines; several cycles of quality control and annotation consistency analysis were carried out before the entire dataset was annotated. Figure 1 shows a screenshot of a sample manual annotation generated using the brat annotation tool.

Reference:

@article{meddoprof,
    title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2022 on automatic recognition, classification and normalization of professions and occupations from medical texts},
    author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
    journal={Procesamiento del Lenguaje Natural},
    volume={67},
    year={2022}
}

Benchmarking

label               precision recall f1-score support
B-ACTIVIDAD         0.82      0.36   0.50     25     
B-PROFESION         0.87      0.75   0.81     634    
B-SITUACION_LABORAL 0.79      0.67   0.72     310    
I-ACTIVIDAD         0.86      0.43   0.57     58     
I-PROFESION         0.87      0.80   0.83     944    
I-SITUACION_LABORAL 0.74      0.71   0.73     407    
O                   1.00      1.00   1.00     139880 
accuracy            -         -      0.99     142258 
macro-avg           0.85      0.67   0.74     142258 
weighted-avg        0.99      0.99   0.99     142258
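
To make the summary rows easier to read, the sketch below recomputes the macro-avg row as the unweighted mean of the seven per-label rows, and the total support as their sum. It only reuses the rounded values printed in the table above; the helper name `macro_average` is hypothetical. Because the inputs are already rounded to two decimals, the weighted-avg row cannot be reproduced exactly this way and is left out.

```python
from statistics import mean

# (precision, recall, f1, support) per label, copied from the table above.
rows = {
    "B-ACTIVIDAD":         (0.82, 0.36, 0.50, 25),
    "B-PROFESION":         (0.87, 0.75, 0.81, 634),
    "B-SITUACION_LABORAL": (0.79, 0.67, 0.72, 310),
    "I-ACTIVIDAD":         (0.86, 0.43, 0.57, 58),
    "I-PROFESION":         (0.87, 0.80, 0.83, 944),
    "I-SITUACION_LABORAL": (0.74, 0.71, 0.73, 407),
    "O":                   (1.00, 1.00, 1.00, 139880),
}


def macro_average(rows):
    """Unweighted mean of precision, recall and F1 across labels."""
    precision = mean(r[0] for r in rows.values())
    recall = mean(r[1] for r in rows.values())
    f1 = mean(r[2] for r in rows.values())
    return round(precision, 2), round(recall, 2), round(f1, 2)


macro = macro_average(rows)
total_support = sum(r[3] for r in rows.values())

print(macro)          # (0.85, 0.67, 0.74)
print(total_support)  # 142258
```

The macro average weights the rare ACTIVIDAD labels as heavily as the dominant O label, which is why its recall (0.67) sits well below the 0.99 accuracy.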