Description
Pretrained Spanish Named Entity Recognition model for detecting laws and monetary amounts. This model was trained in-house using the available annotations of this dataset and weak labelling from this model.
Predicted Entities
LAW, MONEY
How to use
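from johnsnowlabs import nlp

# Start a Spark session (assumes the johnsnowlabs package and a valid Legal NLP license)
spark = nlp.start()

# Turn raw text into Spark NLP documents
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Multilingual sentence splitter
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("legner_law_money", "es", "legal/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

# Merge token-level tags into entity chunks; this produces the "ner_chunk" output read below
nerConverter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        tokenClassifier,
        nerConverter])

text = "La recaudación del ministerio del interior fue de 20,000,000 euros así constatado por el artículo 24 de la Constitución Española."

# Fit on an empty DataFrame: every stage is pretrained or rule-based, so no training data is needed
data = spark.createDataFrame([[""]]).toDF("text")
fitmodel = pipeline.fit(data)

light_model = nlp.LightPipeline(fitmodel)
light_result = light_model.fullAnnotate(text)

# Print each detected chunk with its entity label
for n in light_result[0]['ner_chunk']:
    print(f"{n.result} ({n.metadata['entity']})")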
Results
20,000,000 euros (MONEY)
artículo 24 de la Constitución Española (LAW)
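The chunks printed above can also be collected into a dictionary keyed by entity label, which is often easier to work with downstream. The sketch below is only an illustration: `Chunk` is a hypothetical stand-in that mimics the `result` and `metadata` fields read from each annotation, and `group_by_entity` is an illustrative helper, not part of Spark NLP.

```python
from collections import defaultdict
from typing import NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in exposing the two annotation fields used here."""
    result: str
    metadata: dict


def group_by_entity(chunks):
    """Map each entity label to the list of chunk texts carrying it."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.metadata["entity"]].append(chunk.result)
    return dict(grouped)


# Sample data copied from the Results section
sample = [
    Chunk("20,000,000 euros", {"entity": "MONEY"}),
    Chunk("artículo 24 de la Constitución Española", {"entity": "LAW"}),
]

grouped = group_by_entity(sample)
print(grouped)
```

With a real pipeline, the list stored under `light_result[0]['ner_chunk']` would presumably be passed in place of `sample`.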
Model Information
Model Name: | legner_law_money |
Compatibility: | Legal NLP 1.0.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [document, token] |
Output Labels: | [ner] |
Language: | es |
Size: | 414.2 MB |
Case sensitive: | true |
Max sentence length: | 128 |
References
This model was trained in-house using the available annotations of this dataset and weak labelling from this model.
Benchmarking
label          precision  recall  f1-score  support
LAW            0.95       0.96    0.96      20
MONEY          0.98       0.99    0.99      106
accuracy       -          -       0.98      126
macro-avg      0.97       0.98    0.97      126
weighted-avg   0.98       0.99    0.99      126
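The macro and weighted averages can be approximated from the per-label rows. The sketch below assumes the usual definitions: macro is the unweighted mean over labels, and weighted is the mean weighted by support. Because it starts from the rounded per-label scores, its results may differ from the reported averages in the last digit. The `average` helper is illustrative only.

```python
# Per-label scores copied from the benchmark table: (precision, recall, f1, support)
scores = {
    "LAW": (0.95, 0.96, 0.96, 20),
    "MONEY": (0.98, 0.99, 0.99, 106),
}


def average(scores, weighted=False):
    """Return the averaged (precision, recall, f1) over all labels."""
    # Weight each label by its support, or equally for the macro average
    weights = [s[3] if weighted else 1 for s in scores.values()]
    total = sum(weights)
    return tuple(
        sum(w * s[i] for w, s in zip(weights, scores.values())) / total
        for i in range(3)
    )


macro = average(scores)
weighted_avg = average(scores, weighted=True)

print("macro-avg:   ", [f"{v:.3f}" for v in macro])
print("weighted-avg:", [f"{v:.3f}" for v in weighted_avg])
```

The weighted figures sit close to the MONEY row because that label accounts for 106 of the 126 test examples.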