Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts ADDRESS, AMOUNT, DATE, ORGANISATION, and PERSON entities from Italian documents.
Predicted Entities
ADDRESS, AMOUNT, DATE, ORGANISATION, PERSON
How to use
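from johnsnowlabs import nlp, legal

# Start a Spark session with the licensed libraries
# (assumes the johnsnowlabs package is installed and a valid license is configured)
spark = nlp.start()

document_assembler = nlp.DocumentAssembler()\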
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_it_cased", "it")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_mapa", "it", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text = ["""In pendenza del giudizio relativo alla responsabilità genitoriale instaurato in Italia, la sig.ra Grigorescu, il 30 settembre 2009, ha adito la Judecătoria București ( Tribunale di primo grado di Bucarest ) chiedendo il divorzio, l’affidamento esclusivo del figlio e un contributo al mantenimento del figlio a carico del padre a titolo di mantenimento della prole."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
Results
+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|Italia           |ADDRESS  |
|sig.ra Grigorescu|PERSON   |
|30 settembre 2009|DATE     |
|Bucarest         |ADDRESS  |
+-----------------+---------+
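The pipeline code stops at `model.transform`, so the step from `result` to the chunk/label table is not shown. The sketch below is one possible way to do it, not the card's own method: it assumes each collected `ner_chunk` annotation exposes a `result` string and a `metadata` map with an `entity` key, which is how `NerConverter` output is usually laid out. `Annotation` and `chunks_to_rows` are hypothetical names introduced for illustration, and the sample values are copied from the Results table so the snippet runs without a Spark session.

```python
from collections import namedtuple

# Hypothetical stand-in for a collected Spark NLP annotation row;
# only the two fields used below are modelled.
Annotation = namedtuple("Annotation", ["result", "metadata"])


def chunks_to_rows(annotations):
    """Return (chunk, ner_label) pairs from a list of ner_chunk annotations."""
    return [(a.result, a.metadata["entity"]) for a in annotations]


# Sample annotations built from the Results table above (not live model output).
sample = [
    Annotation("Italia", {"entity": "ADDRESS"}),
    Annotation("sig.ra Grigorescu", {"entity": "PERSON"}),
    Annotation("30 settembre 2009", {"entity": "DATE"}),
    Annotation("Bucarest", {"entity": "ADDRESS"}),
]

rows = chunks_to_rows(sample)
for chunk, label in rows:
    print(f"{chunk:<18}{label}")
```

With a live session, the annotations would presumably come from something like `result.select("ner_chunk").first()["ner_chunk"]` instead of the hand-built `sample` list.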
Model Information
| Model Name: | legner_mapa |
| Compatibility: | Legal NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | it |
| Size: | 1.4 MB |
References
The dataset is available here.
Benchmarking
label precision recall f1-score support
ADDRESS 1.00 1.00 1.00 14
AMOUNT 1.00 1.00 1.00 3
DATE 1.00 1.00 1.00 45
ORGANISATION 0.89 0.89 0.89 9
PERSON 0.92 1.00 0.96 12
micro-avg 0.98 0.99 0.98 83
macro-avg 0.96 0.98 0.97 83
weighted-avg 0.98 0.99 0.98 83
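As a sanity check on the summary rows, the macro average (0.96, 0.98, 0.97) and the weighted average (0.98, 0.99, 0.98) can be recomputed from the five per-label rows. This is a minimal sketch that works from the rounded values printed in the table, so it assumes rounding error is small enough not to change the second decimal; the function names are illustrative. The micro average needs the raw true-positive, false-positive, and false-negative counts, which the table does not give, so it is not recomputed here.

```python
# Per-label scores copied from the table: (precision, recall, f1, support)
scores = {
    "ADDRESS": (1.00, 1.00, 1.00, 14),
    "AMOUNT": (1.00, 1.00, 1.00, 3),
    "DATE": (1.00, 1.00, 1.00, 45),
    "ORGANISATION": (0.89, 0.89, 0.89, 9),
    "PERSON": (0.92, 1.00, 0.96, 12),
}


def macro_average(scores):
    """Unweighted mean of each metric across labels."""
    n = len(scores)
    return tuple(
        round(sum(row[i] for row in scores.values()) / n, 2) for i in range(3)
    )


def weighted_average(scores):
    """Mean of each metric across labels, weighted by label support."""
    total = sum(row[3] for row in scores.values())
    return tuple(
        round(sum(row[i] * row[3] for row in scores.values()) / total, 2)
        for i in range(3)
    )


macro = macro_average(scores)
weighted = weighted_average(scores)
print("macro-avg   ", macro)
print("weighted-avg", weighted)
```

The macro average treats the 3-example AMOUNT label the same as the 45-example DATE label, which is why the lower ORGANISATION and PERSON scores pull it down more than they pull down the weighted average.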