Spanish NER for Laws and Treaties/Agreements (Roberta)

Description

A legal RoBERTa-based Named Entity Recognition model for Spanish, able to recognize the following entities:

  • LEY: Law
  • TRAT_INTL: International Treaty (Agreement)

This model was originally trained on the scjn dataset, available here, and fine-tuned on internal documents, improving the coverage of the original version, published here.

Predicted Entities

LEY, TRAT_INTL


How to use

documentAssembler = nlp.DocumentAssembler() \
       .setInputCol("text") \
       .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
       .setInputCols(["document"])\
       .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("legner_laws_treaties","es", "legal/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

# Convert IOB token tags into whole-entity chunks, read below from "ner_chunk"
nerConverter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(
    stages=[documentAssembler, 
            sentenceDetector, 
            tokenizer, 
            tokenClassifier,
            nerConverter])

text = "Sin perjuicio de lo dispuesto en el párrafo b), los requisitos y los efectos de una reivindicación de prioridad presentada conforme al párrafo 1), serán los establecidos en el Artículo 4 del Acta de Estocolmo del Convenio de París para la Protección de la Propiedad Industrial."

# Empty DataFrame: used only to fit the pipeline; inference runs via LightPipeline
data = spark.createDataFrame([[""]]).toDF("text")

fitmodel = pipeline.fit(data)

light_model = nlp.LightPipeline(fitmodel)

light_result = light_model.fullAnnotate(text)

for n in light_result[0]['ner_chunk']:
    print(f"{n.result} ({n.metadata['entity']})")
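For downstream use, the same loop can collect (chunk, label) pairs instead of printing them. A minimal pure-Python sketch, using a stand-in `Chunk` class to mirror the `.result` / `.metadata` shape that `fullAnnotate` returns (the class and helper names here are illustrative, not part of the library):

```python
from dataclasses import dataclass, field

# Stand-in for Spark NLP's Annotation objects (illustrative only):
# each chunk exposes the matched text in .result and its label in .metadata['entity'].
@dataclass
class Chunk:
    result: str
    metadata: dict = field(default_factory=dict)

def collect_entities(annotated):
    """Turn a fullAnnotate-style result into a list of (text, label) pairs."""
    return [(c.result, c.metadata.get("entity", "O"))
            for c in annotated[0]["ner_chunk"]]

# Mimic a fullAnnotate result for one document
light_result = [{"ner_chunk": [
    Chunk("Convenio de París para la Protección de la Propiedad Industrial",
          {"entity": "TRAT_INTL"}),
]}]

print(collect_entities(light_result))
```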

Results

para la Protección de la Propiedad Industrial. (TRAT_INTL)

Model Information

Model Name: legner_laws_treaties
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: es
Size: 464.4 MB
Case sensitive: true
Max sentence length: 128
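Because the model caps sentences at 128 tokens, very long sentences risk silent truncation. One possible mitigation, sketched here as an assumption rather than a library feature, is to pre-split token sequences into overlapping windows before classification:

```python
def token_windows(tokens, max_len=128, stride=64):
    """Split a token list into overlapping windows no longer than max_len.

    Overlap (max_len - stride tokens) gives entities near a window edge
    a chance to appear whole in the next window.
    """
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

tokens = [f"t{i}" for i in range(300)]
wins = token_windows(tokens)
print([len(w) for w in wins])  # each window is at most 128 tokens
```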

References

This model was originally trained on the scjn dataset, available here, and fine-tuned on scraped documents (such as this one), improving the coverage of the original version, published here.

Benchmarking

        label        prec        rec          f1
Macro-average   0.9361195  0.9294152   0.9368145 
Micro-average   0.9856711  0.9857456   0.9851656
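The gap between the two rows reflects how each average weights labels: macro averages per-label scores equally, while micro pools all decisions, so frequent labels dominate. A small worked sketch with made-up counts (not the model's actual confusion data):

```python
def prf(tp, fp, fn):
    """Precision, recall, F1 from true-positive / false-positive / false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical per-label (tp, fp, fn): a frequent label scored well, a rare one less so.
counts = {"LEY": (90, 5, 5), "TRAT_INTL": (8, 2, 2)}

# Macro: average the per-label F1 scores equally.
f1s = [prf(*c)[2] for c in counts.values()]
macro_f1 = sum(f1s) / len(f1s)

# Micro: pool counts across labels, then compute a single F1.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = prf(tp, fp, fn)[2]

print(round(macro_f1, 3), round(micro_f1, 3))
```

With these counts the rare label drags the macro score below the micro score, the same ordering seen in the table above.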