Professions & Occupations NER model in Spanish (meddroprof_scielowiki)

Description

Named entity recognition (NER) model that detects professions and occupations in Spanish texts. It was trained with the embeddings_scielowiki_300d word embeddings, so the same WordEmbeddingsModel must be included in the pipeline.

Predicted Entities

ACTIVIDAD, PROFESION, SITUACION_LABORAL

How to use

import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark NLP for Healthcare session is available as `spark`.

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# The output column must match the embeddings column the NER stage reads.
embeddings_stage = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")\
    .setInputCols(["sentence", "token", "word_embeddings"])\
    .setOutputCol("ner")

# Groups the token-level IOB tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence,
    tokenizer,
    embeddings_stage,
    clinical_ner,
    ner_converter
])

example = spark.createDataFrame(pd.DataFrame({'text': ["""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO"""]}))

result = pipeline.fit(example).transform(example)

// Scala version of the same pipeline.
// Assumes an active SparkSession named `spark` and that MedicalNerModel
// (from the licensed Spark NLP for Healthcare library) is already in scope.
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import spark.implicits._ // required for .toDS

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// The output column must match the embeddings column the NER stage reads.
val embeddings_stage = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("word_embeddings")

val clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")
    .setInputCols(Array("sentence", "token", "word_embeddings"))
    .setOutputCol("ner")

// Groups the token-level IOB tags into entity chunks.
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence, tokenizer, embeddings_stage, clinical_ner, ner_converter))

val data = Seq("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+--------------------+----------+-------------------+
|                text|     token|         prediction|
+--------------------+----------+-------------------+
|La paciente es la...|        La|                  O|
|La paciente es la...|  paciente|                  O|
|La paciente es la...|        es|                  O|
|La paciente es la...|        la|                  O|
|La paciente es la...|     mayor|                  O|
|La paciente es la...|        de|                  O|
|La paciente es la...|         2|                  O|
|La paciente es la...|  hermanos|                  O|
|La paciente es la...|         ,|                  O|
|La paciente es la...|     tiene|                  O|
|La paciente es la...|        un|                  O|
|La paciente es la...|   hermano|                  O|
|La paciente es la...|        de|                  O|
|La paciente es la...|        13|                  O|
|La paciente es la...|      años|                  O|
|La paciente es la...|estudiando|B-SITUACION_LABORAL|
|La paciente es la...|        1o|I-SITUACION_LABORAL|
|La paciente es la...|       ESO|I-SITUACION_LABORAL|
|La paciente es la...|         .|                  O|
|La paciente es la...|       Sus|                  O|
|La paciente es la...|    padres|                  O|
|La paciente es la...|       son|                  O|
|La paciente es la...|     ambos|                  O|
|La paciente es la...|       ATS|        B-PROFESION|
|La paciente es la...|         ,|                  O|
|La paciente es la...|  trabajan|        B-PROFESION|
|La paciente es la...|        en|        I-PROFESION|
|La paciente es la...|diferentes|        I-PROFESION|
|La paciente es la...|   centros|        I-PROFESION|
|La paciente es la...|        de|        I-PROFESION|
|La paciente es la...|     salud|        I-PROFESION|
|La paciente es la...|estudiando|B-SITUACION_LABORAL|
|La paciente es la...|        1o|I-SITUACION_LABORAL|
|La paciente es la...|       ESO|I-SITUACION_LABORAL|
+--------------------+----------+-------------------+
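
The B-/I-/O prefixes in the prediction column follow the usual BIO scheme: B- opens an entity, I- continues it, and O marks tokens outside any entity. As a rough illustration of how such tags collapse into entity chunks, here is a minimal pure-Python sketch. The helper name bio_to_chunks is hypothetical, and this is a simplified stand-in for illustration only, not the NerConverter implementation. The tokens and tags are copied from the second sentence of the table above.

```python
from typing import List, Tuple


def bio_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    chunks = []
    current_tokens = []
    current_label = None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A chunk ends on "O", on a new "B-", or on an "I-" with a different label.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_label = label
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Tokens and predictions of the second sentence, copied from the results table.
tokens = ["Sus", "padres", "son", "ambos", "ATS", ",", "trabajan", "en",
          "diferentes", "centros", "de", "salud", "estudiando", "1o", "ESO"]
tags = ["O", "O", "O", "O", "B-PROFESION", "O", "B-PROFESION", "I-PROFESION",
        "I-PROFESION", "I-PROFESION", "I-PROFESION", "I-PROFESION",
        "B-SITUACION_LABORAL", "I-SITUACION_LABORAL", "I-SITUACION_LABORAL"]

chunks = bio_to_chunks(tokens, tags)
for text, label in chunks:
    print(f"{label}: {text}")
```

With these inputs the sketch yields three chunks: "ATS" and "trabajan en diferentes centros de salud" as PROFESION, and "estudiando 1o ESO" as SITUACION_LABORAL.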

Model Information

Model Name: meddroprof_scielowiki
Compatibility: Spark NLP for Healthcare 3.1.3+
License: Licensed
Edition: Official
Input Labels: [sentence, token, word_embeddings]
Output Labels: [ner]
Language: es
Dependencies: embeddings_scielowiki_300d
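
The Input Labels above mean that the NER stage reads three columns, so every earlier stage has to write its output under exactly those names. The following pure-Python sketch shows one way to sanity-check that wiring before running anything on Spark. The stage names and the find_unresolved_inputs helper are hypothetical and not part of Spark NLP; only the column names for the NER stage come from the Input Labels and Output Labels fields.

```python
# Each stage is (name, input_columns, output_column).
# Stage names are illustrative; the NER column names follow Input/Output Labels.
stages = [
    ("document_assembler", ["text"], "document"),
    ("sentence", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("embeddings_stage", ["sentence", "token"], "word_embeddings"),
    ("clinical_ner", ["sentence", "token", "word_embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
]


def find_unresolved_inputs(stages, initial_columns=("text",)):
    """Return (stage_name, column) pairs whose input no earlier stage produces."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                problems.append((name, col))
        available.add(output)
    return problems


print(find_unresolved_inputs(stages))

# A common slip: the NER stage reads "embeddings" while the embeddings
# stage writes "word_embeddings".
broken = list(stages)
broken[4] = ("clinical_ner", ["sentence", "token", "embeddings"], "ner")
print(find_unresolved_inputs(broken))
```

The first call reports no problems, while the second flags the NER stage's "embeddings" input as unresolved.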

Data Source

The model was trained with the MEDDOPROF data set:

The MEDDOPROF corpus is a collection of 1844 clinical cases from over 20 different specialties annotated with professions and employment statuses. The corpus was annotated by a team of linguists and clinical experts following specially prepared annotation guidelines. Several cycles of quality control and annotation consistency analysis were carried out before the entire dataset was annotated. Figure 1 shows a screenshot of a sample manual annotation generated using the brat annotation tool.

Reference:

@article{meddoprof,
    title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
    author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
    journal={Procesamiento del Lenguaje Natural},
    volume={67},
    year={2021}
}

Benchmarking

label               precision recall f1-score support
B-ACTIVIDAD         0.82      0.36   0.50     25     
B-PROFESION         0.87      0.75   0.81     634    
B-SITUACION_LABORAL 0.79      0.67   0.72     310    
I-ACTIVIDAD         0.86      0.43   0.57     58     
I-PROFESION         0.87      0.80   0.83     944    
I-SITUACION_LABORAL 0.74      0.71   0.73     407    
O                   1.00      1.00   1.00     139880 
accuracy            -         -      0.99     142258 
macro-avg           0.85      0.67   0.74     142258 
weighted-avg        0.99      0.99   0.99     142258
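
The macro-avg row is the unweighted mean over all seven labels, including O, which scores 1.00 on every metric and accounts for most of the support. The sketch below recomputes that row from the per-label values in the table and then repeats the calculation over the six entity tags only. The macro_average helper is hypothetical, and the entity-only figures are derived here from the rounded table values rather than reported by the model authors.

```python
# Per-label scores copied from the benchmarking table: (precision, recall, f1).
scores = {
    "B-ACTIVIDAD": (0.82, 0.36, 0.50),
    "B-PROFESION": (0.87, 0.75, 0.81),
    "B-SITUACION_LABORAL": (0.79, 0.67, 0.72),
    "I-ACTIVIDAD": (0.86, 0.43, 0.57),
    "I-PROFESION": (0.87, 0.80, 0.83),
    "I-SITUACION_LABORAL": (0.74, 0.71, 0.73),
    "O": (1.00, 1.00, 1.00),
}


def macro_average(rows):
    """Unweighted mean of precision, recall, and f1 across the given rows."""
    n = len(rows)
    return tuple(sum(row[i] for row in rows) / n for i in range(3))


# Mean over all labels: should agree with the macro-avg row after rounding.
all_labels = macro_average(list(scores.values()))

# Mean over the six entity tags only, leaving out the dominant O class.
entity_only = macro_average([v for label, v in scores.items() if label != "O"])

print("all labels:  P=%.2f R=%.2f F1=%.2f" % all_labels)
print("entity tags: P=%.3f R=%.3f F1=%.3f" % entity_only)
```

Over all labels this reproduces the 0.85 / 0.67 / 0.74 reported in the table. Over the entity tags alone the macro F1 drops to roughly 0.69, which may be the more informative figure when comparing entity recognition quality.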