Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts ADDRESS, AMOUNT, DATE, ORGANISATION, and PERSON entities from Romanian documents.
Predicted Entities
ADDRESS, AMOUNT, DATE, ORGANISATION, PERSON
How to use
from johnsnowlabs import nlp, legal

# Start a Spark session (assumes a valid John Snow Labs license is configured)
spark = nlp.start()

# Convert raw text into document annotations
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences with the multilingual ("xx") detector
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Romanian cased BERT embeddings consumed by the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_ro_cased", "ro")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

# Token-level entity tagger
ner_model = legal.NerModel.pretrained("legner_mapa", "ro", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)

text = ["""Or, rezultă din hotărârea Curții de Apel București din 12 iunie 2013 că instanța română a aplicat greșit dreptul Uniunii (32) atunci când a respins excepția de litispendență invocată de domnul Liberato, întemeiată pe cererile referitoare la legătura matrimonială."""]

result = model.transform(spark.createDataFrame([text]).toDF("text"))
Results
+---------------+---------+
|chunk          |ner_label|
+---------------+---------+
|București      |ADDRESS  |
|12 iunie 2013  |DATE     |
|domnul Liberato|PERSON   |
+---------------+---------+
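Each row of the table corresponds to one entry of the `ner_chunk` column. The following is a minimal sketch of how such entries could be flattened into (chunk, label) pairs. It assumes every annotation exposes a `result` string and a `metadata` map with an `entity` key, which is how `NerConverter` output is usually laid out. The `Annotation` tuple and the `chunks_to_rows` helper are hypothetical stand-ins, and the sample data is copied from the table rather than produced by running the model, so the snippet runs without a Spark session.

```python
from typing import Dict, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    """Hypothetical stand-in for one entry of the `ner_chunk` column."""
    result: str
    metadata: Dict[str, str]


def chunks_to_rows(annotations: List[Annotation]) -> List[Tuple[str, str]]:
    """Return (chunk text, entity label) pairs in their original order."""
    return [(a.result, a.metadata["entity"]) for a in annotations]


# Stand-in data mirroring the results table; not generated by the model here.
sample = [
    Annotation("București", {"entity": "ADDRESS"}),
    Annotation("12 iunie 2013", {"entity": "DATE"}),
    Annotation("domnul Liberato", {"entity": "PERSON"}),
]

rows = chunks_to_rows(sample)
for chunk, label in rows:
    print(f"{chunk:<15} | {label}")
```

With the real pipeline, the same helper could presumably be applied to `result.select("ner_chunk").first()["ner_chunk"]`, since collected annotations expose the same two fields.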
Model Information
| Model Name: | legner_mapa |
| Compatibility: | Legal NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | ro |
| Size: | 1.4 MB |
References
The dataset is available here.
Benchmarking
label         precision  recall  f1-score  support
ADDRESS            0.88    0.96      0.92       23
AMOUNT             1.00    0.67      0.80        3
DATE               0.97    0.97      0.97       31
ORGANISATION       0.67    0.71      0.69       28
PERSON             0.91    0.83      0.87       48
micro-avg          0.86    0.86      0.86      133
macro-avg          0.88    0.83      0.85      133
weighted-avg       0.87    0.86      0.86      133
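The macro and weighted averages can be re-derived from the per-label rows as a sanity check. The sketch below assumes the usual definitions: the macro average is the unweighted mean over labels, and the weighted average weights each label by its support. Because the inputs are already rounded to two decimals, the recomputed values should only be expected to match the table to within about 0.01.

```python
# (label, precision, recall, f1-score, support), copied from the per-label rows
per_label = [
    ("ADDRESS", 0.88, 0.96, 0.92, 23),
    ("AMOUNT", 1.00, 0.67, 0.80, 3),
    ("DATE", 0.97, 0.97, 0.97, 31),
    ("ORGANISATION", 0.67, 0.71, 0.69, 28),
    ("PERSON", 0.91, 0.83, 0.87, 48),
]

total_support = sum(row[4] for row in per_label)


def macro(col: int) -> float:
    """Unweighted mean of one metric column across labels."""
    return sum(row[col] for row in per_label) / len(per_label)


def weighted(col: int) -> float:
    """Support-weighted mean of one metric column across labels."""
    return sum(row[col] * row[4] for row in per_label) / total_support


# Columns 1, 2, 3 are precision, recall, f1-score
macro_avg = tuple(macro(c) for c in (1, 2, 3))
weighted_avg = tuple(weighted(c) for c in (1, 2, 3))

print("support     ", total_support)
print("macro-avg   ", [f"{v:.3f}" for v in macro_avg])
print("weighted-avg", [f"{v:.3f}" for v in weighted_avg])
```

The rows dominated by small support, such as AMOUNT with 3 examples, pull the macro average away from the weighted one, which is why the two recall figures differ.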