Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts ADDRESS, AMOUNT, DATE, ORGANISATION, and PERSON entities from Portuguese documents.
Predicted Entities
ADDRESS, AMOUNT, DATE, ORGANISATION, PERSON
How to use
# Assumes the johnsnowlabs package is installed and a valid license is configured.
from johnsnowlabs import nlp, legal

spark = nlp.start()

# Wrap the raw text column as Spark NLP documents.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences with the multilingual ("xx") detector.
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Portuguese cased BERT embeddings consumed by the NER model.
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_pt_cased", "pt")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained("legner_mapa", "pt", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline.
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)

text = ["""Nos termos dos Decretos da Garda Síochána (6), só pode ser admitido como estagiário para integrar a força policial nacional quem tiver pelo menos 18 anos, mas menos de 35 anos de idade, no primeiro dia do mês em que tenha sido publicado pela primeira vez, num jornal nacional, o anúncio da vaga a que o recrutamento respeita."""]

result = model.transform(spark.createDataFrame([text]).toDF("text"))
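The pipeline writes its entities to the ner_chunk column, where each annotation carries the chunk text in its result field and the label under the entity key of its metadata. As a minimal sketch, assuming that layout, the hypothetical helper below (chunk_label_pairs is not part of the library) turns a list of such annotations into (chunk, ner_label) pairs. The sample list is stand-in data so the snippet runs without Spark.

```python
def chunk_label_pairs(annotations):
    """Return (chunk, ner_label) pairs from a list of ner_chunk annotations.

    Each annotation is expected to expose "result" (the chunk text) and
    "metadata" (a mapping with an "entity" key). Plain dicts work here;
    collected Spark rows are assumed to support the same key access.
    """
    return [(ann["result"], ann["metadata"]["entity"]) for ann in annotations]


# Stand-in data shaped like one collected "ner_chunk" cell (illustrative only).
sample = [
    {"result": "Garda Síochána", "metadata": {"entity": "ORGANISATION"}},
    {"result": "18 anos", "metadata": {"entity": "AMOUNT"}},
]

for chunk, label in chunk_label_pairs(sample):
    print(f"{chunk}\t{label}")
```

With the real pipeline output, the annotations for the first input row could presumably be obtained with result.select("ner_chunk").first()["ner_chunk"] and passed to the same helper.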
Results
+-----------------------+------------+
|chunk                  |ner_label   |
+-----------------------+------------+
|Garda Síochána         |ORGANISATION|
|força policial nacional|ORGANISATION|
|18 anos                |AMOUNT      |
|35 anos                |AMOUNT      |
+-----------------------+------------+
Model Information
| Model Name:    | legner_mapa                   |
| Compatibility: | Legal NLP 1.0.0+              |
| License:       | Licensed                      |
| Edition:       | Official                      |
| Input Labels:  | [sentence, token, embeddings] |
| Output Labels: | [ner]                         |
| Language:      | pt                            |
| Size:          | 1.4 MB                        |
References
The dataset is available here.
Benchmarking
label         precision  recall  f1-score  support
ADDRESS       0.91       0.91    0.91      23
AMOUNT        1.00       0.83    0.91      6
DATE          1.00       0.95    0.97      61
ORGANISATION  0.85       0.77    0.81      30
PERSON        0.88       0.91    0.89      65
micro-avg     0.92       0.90    0.91      185
macro-avg     0.93       0.87    0.90      185
weighted-avg  0.92       0.90    0.91      185
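The aggregate rows can be cross-checked against the per-label rows. The sketch below recomputes the unweighted (macro) and support-weighted averages from the figures in the table; the function names are illustrative, and because the inputs are already rounded to two decimals the check is approximate rather than a reproduction of the original evaluation.

```python
# Per-label rows copied from the benchmark table:
# label -> (precision, recall, f1-score, support)
per_label = {
    "ADDRESS": (0.91, 0.91, 0.91, 23),
    "AMOUNT": (1.00, 0.83, 0.91, 6),
    "DATE": (1.00, 0.95, 0.97, 61),
    "ORGANISATION": (0.85, 0.77, 0.81, 30),
    "PERSON": (0.88, 0.91, 0.89, 65),
}


def macro_average(rows):
    """Unweighted mean of precision, recall and f1 across labels."""
    n = len(rows)
    return tuple(round(sum(r[i] for r in rows.values()) / n, 2) for i in range(3))


def weighted_average(rows):
    """Mean of precision, recall and f1 weighted by each label's support."""
    total = sum(r[3] for r in rows.values())
    return tuple(
        round(sum(r[i] * r[3] for r in rows.values()) / total, 2) for i in range(3)
    )


print(macro_average(per_label))     # (0.93, 0.87, 0.9)
print(weighted_average(per_label))  # (0.92, 0.9, 0.91)
```

Both results agree with the 0.93 / 0.87 / 0.90 and 0.92 / 0.90 / 0.91 rows of the table, and the supports sum to 185.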