Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts ADDRESS, AMOUNT, DATE, ORGANISATION, and PERSON entities from Danish documents.
Predicted Entities
ADDRESS, AMOUNT, DATE, ORGANISATION, PERSON
How to use
from johnsnowlabs import nlp, legal

# Start a Spark session (requires a valid Legal NLP license in the environment).
spark = nlp.start()

# Wrap the raw text column into document annotations.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences with the multilingual ("xx") detector.
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Danish BERT embeddings; the model is cased, so keep case sensitivity on.
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_da_cased", "da")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained("legner_mapa", "da", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level NER tags into entity chunks.
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline.
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)

text = ["""Fra den 1. februar 2012 til den 31. januar 2014, og således også under den omtvistede periode, blev arbejdstagere hos Martimpex udsendt til Østrig for at udføre det samme arbejde."""]
result = model.transform(spark.createDataFrame([text]).toDF("text"))
Results
+---------------+------------+
|chunk          |ner_label   |
+---------------+------------+
|1. februar 2012|DATE        |
|31. januar 2014|DATE        |
|Martimpex      |ORGANISATION|
|Østrig         |ADDRESS     |
+---------------+------------+
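The chunks above come from merging token-level tags into entity spans. The sketch below illustrates that merging step in plain Python. It is not Spark NLP's implementation: the `iob_to_chunks` helper, the whitespace tokenization, and the IOB tags are all assumptions made for illustration, chosen to be consistent with the result table.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, label) pairs (hypothetical helper)."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if any.
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts here (or a stray I- tag opens one).
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close whatever was open.
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Assumed tokens and tags for a fragment of the example sentence.
tokens = ["den", "1.", "februar", "2012", "hos", "Martimpex", "udsendt", "til", "Østrig"]
tags = ["O", "B-DATE", "I-DATE", "I-DATE", "O", "B-ORGANISATION", "O", "O", "B-ADDRESS"]

chunks = iob_to_chunks(tokens, tags)
for chunk, label in chunks:
    print(f"{chunk}\t{label}")
```

Joining tokens with a single space is a simplification; a production converter would more likely slice the original text by character offsets so that spacing and punctuation are preserved exactly.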
Model Information
| Model Name: | legner_mapa |
| Compatibility: | Legal NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | da |
| Size: | 1.4 MB |
References
The dataset is available here.
Benchmarking
label         precision  recall  f1-score  support
ADDRESS       0.95       0.90    0.93      21
AMOUNT        1.00       1.00    1.00      4
DATE          0.98       0.98    0.98      54
ORGANISATION  0.74       0.74    0.74      31
PERSON        0.79       0.86    0.82      43
micro-avg     0.87       0.89    0.88      153
macro-avg     0.89       0.90    0.89      153
weighted-avg  0.87       0.89    0.88      153
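As a sanity check on the summary rows, the macro and weighted averages can be recomputed from the per-label rows. The sketch below uses only the rounded values printed in the table, so it is an approximation of how the reported figures were derived; the `rows`, `macro_average`, and `weighted_average` names are illustrative and not part of any library.

```python
# Per-label rows copied from the benchmark table: (precision, recall, f1, support).
rows = {
    "ADDRESS": (0.95, 0.90, 0.93, 21),
    "AMOUNT": (1.00, 1.00, 1.00, 4),
    "DATE": (0.98, 0.98, 0.98, 54),
    "ORGANISATION": (0.74, 0.74, 0.74, 31),
    "PERSON": (0.79, 0.86, 0.82, 43),
}


def macro_average(rows):
    """Unweighted mean of each metric across labels."""
    n = len(rows)
    return tuple(round(sum(r[i] for r in rows.values()) / n, 2) for i in range(3))


def weighted_average(rows):
    """Mean of each metric across labels, weighted by label support."""
    total = sum(r[3] for r in rows.values())
    return tuple(
        round(sum(r[i] * r[3] for r in rows.values()) / total, 2) for i in range(3)
    )


total_support = sum(r[3] for r in rows.values())
macro = macro_average(rows)
weighted = weighted_average(rows)

print("support :", total_support)  # 153
print("macro   :", macro)          # (0.89, 0.9, 0.89)
print("weighted:", weighted)       # (0.87, 0.89, 0.88)
```

The macro average treats the four AMOUNT examples the same as the 54 DATE examples, which is why it sits slightly above the support-weighted figure here.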