Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts ADDRESS, AMOUNT, DATE, ORGANISATION, and PERSON entities from Greek documents.
Predicted Entities
ADDRESS, AMOUNT, DATE, ORGANISATION, PERSON
How to use
# Assemble the raw text into a document annotation
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split the document into sentences (multilingual model)
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenize each sentence
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Greek cased BERT embeddings consumed by the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_el_cased", "el")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

# Pretrained legal NER model for Greek
ner_model = legal.NerModel.pretrained("legner_mapa", "el", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge IOB tags into full entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# Fit on an empty DataFrame (all stages are pretrained)
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = ["""86 Στην υπόθεση της κύριας δίκης, προκύπτει ότι ορισμένοι εργαζόμενοι της Martin‑Meat αποσπάσθηκαν στην Αυστρία κατά την περίοδο μεταξύ του έτους 2007 και του έτους 2012, για την εκτέλεση εργασιών τεμαχισμού κρέατος σε εγκαταστάσεις της Alpenrind."""]

result = model.transform(spark.createDataFrame([text]).toDF("text"))
Results
+-----------+------------+
|chunk |ner_label |
+-----------+------------+
|Martin‑Meat|ORGANISATION|
|Αυστρία |ADDRESS |
|2007 |DATE |
|2012 |DATE |
|Alpenrind |ORGANISATION|
+-----------+------------+
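Downstream code often needs these chunks grouped by label rather than as a flat table. A minimal post-processing sketch in plain Python, assuming the (chunk, label) pairs above have already been collected from the `ner_chunk` column; the `group_entities` helper is illustrative, not part of the library:

```python
from collections import defaultdict

# (chunk, label) pairs as produced by the pipeline on the example sentence
rows = [
    ("Martin-Meat", "ORGANISATION"),
    ("Αυστρία", "ADDRESS"),
    ("2007", "DATE"),
    ("2012", "DATE"),
    ("Alpenrind", "ORGANISATION"),
]

def group_entities(pairs):
    """Group extracted chunks under their NER label."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)

print(group_entities(rows))
# {'ORGANISATION': ['Martin-Meat', 'Alpenrind'], 'ADDRESS': ['Αυστρία'], 'DATE': ['2007', '2012']}
```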
Model Information
Model Name:     legner_mapa
Compatibility:  Legal NLP 1.0.0+
License:        Licensed
Edition:        Official
Input Labels:   [sentence, token, embeddings]
Output Labels:  [ner]
Language:       el
Size:           16.4 MB
References
The dataset is available here.
Benchmarking
label         precision  recall  f1-score  support
ADDRESS            0.89    1.00      0.94       16
AMOUNT             0.82    0.75      0.78       12
DATE               0.98    0.98      0.98       65
ORGANISATION       0.85    0.85      0.85       40
PERSON             0.90    0.95      0.92       38
micro-avg          0.91    0.93      0.92      171
macro-avg          0.89    0.91      0.90      171
weighted-avg       0.91    0.93      0.92      171
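The macro and weighted averages can be reproduced from the per-label rows; a quick sanity check in plain Python, with values copied from the table above:

```python
# Per-label (precision, recall, support) rows from the benchmarking table
per_label = {
    "ADDRESS":      (0.89, 1.00, 16),
    "AMOUNT":       (0.82, 0.75, 12),
    "DATE":         (0.98, 0.98, 65),
    "ORGANISATION": (0.85, 0.85, 40),
    "PERSON":       (0.90, 0.95, 38),
}

total_support = sum(s for _, _, s in per_label.values())  # 171

# Macro average: unweighted mean over the five labels
macro_precision = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_recall = sum(r for _, r, _ in per_label.values()) / len(per_label)

# Weighted average: mean weighted by each label's support
weighted_precision = sum(p * s for p, _, s in per_label.values()) / total_support
weighted_recall = sum(r * s for _, r, s in per_label.values()) / total_support

print(round(macro_precision, 2), round(macro_recall, 2))        # 0.89 0.91
print(round(weighted_precision, 2), round(weighted_recall, 2))  # 0.91 0.93
```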