Spanish NER Model

Description

  • This model is imported from Hugging Face.

  • RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the RoBERTa base model and was pretrained on the largest Spanish corpus known to date, totalling 570 GB of clean, deduplicated text compiled from the web crawls performed by the National Library of Spain (Biblioteca Nacional de España) between 2009 and 2019.

Predicted Entities

OTH (other), PER (person), LOC (location), ORG (organization)
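The classifier emits token-level IOB2 tags (e.g. B-PER, I-PER, O), and the NerConverter stage in the pipelines below merges them into the entity chunks shown in the Results table. As a minimal, library-free sketch of that merging logic (the function name and the example tag sequence are illustrative, not part of the model's API):

```python
def merge_iob_chunks(tokens, tags):
    """Merge token-level IOB2 tags into (chunk, label) pairs.

    A chunk starts at a B- tag and extends over consecutive I- tags
    with the same label; an O tag (or a new B- tag) closes the chunk.
    """
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Me", "llamo", "Antonio", "y", "trabajo", "en", "Madrid"]
tags   = ["O", "O", "B-PER", "O", "O", "O", "B-LOC"]
print(merge_iob_chunks(tokens, tags))  # [('Antonio', 'PER'), ('Madrid', 'LOC')]
```

In the actual pipeline this conversion is handled for you by NerConverter, which additionally carries character offsets and metadata in its `ner_chunk` annotations.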


How to use

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, RoBertaForTokenClassification, NerConverter
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
       .setInputCols(["document"])\
       .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bne_capitel_ner", "es")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("ner")

ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)
text = """Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid."""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
       .setInputCols(Array("document"))
       .setOutputCol("sentence")

val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")

val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bne_capitel_ner", "es")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
      .setInputCols(Array("sentence", "token", "ner"))
      .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

import spark.implicits._
val example = Seq("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.").toDF("text")

val result = pipeline.fit(example).transform(example)

import nlu
nlu.load("es.ner.roberta").predict("""Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.""")

Results

+------------------------+---------+
|chunk                   |ner_label|
+------------------------+---------+
|Antonio                 |PER      |
|fábrica de Mercedes-Benz|ORG      |
|Madrid.                 |LOC      |
+------------------------+---------+

Model Information

Model Name: roberta_token_classifier_bne_capitel_ner
Compatibility: Spark NLP 3.3.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: es
Case sensitive: true
Max sentence length: 256

Data Source

https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner-plus

Benchmarking

label   score
   f1   0.8867