Description
The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union.
This model extracts ADDRESS, AMOUNT, DATE, ORGANISATION, and PERSON entities from German documents.
Predicted Entities
ADDRESS, AMOUNT, DATE, ORGANISATION, PERSON
How to use
# Assumption: the johnsnowlabs package is installed with a valid Legal NLP license
from johnsnowlabs import nlp, legal

# Assumption: no Spark session exists yet; nlp.start() creates one
spark = nlp.start()

# Wrap the raw text column into Spark NLP documents
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences with the multilingual detector
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# German cased BERT embeddings consumed by the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_de_cased", "de")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained("legner_mapa", "de", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)

text = ["""Herr Liberato und Frau Grigorescu heirateten am 22 Oktober 2005 in Rom (Italien) und lebten in diesem Mitgliedstaat bis zur Geburt ihres Kindes am 20 Februar 2006 zusammen."""]

result = model.transform(spark.createDataFrame([text]).toDF("text"))
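The pipeline leaves the recognised entities in the `ner_chunk` column of `result`. The snippet below is a minimal sketch of how chunk text and label could be pulled out of that column. It assumes each chunk annotation exposes a `result` field and an `entity` key in its `metadata` map, as Spark NLP chunk annotations usually do. The helper `chunk_label_pairs` is a hypothetical name, not part of the library, and the `sample` list is a hand-built stand-in for one collected row so that the block runs without Spark.

```python
# Hypothetical helper (not a library function): flatten one row's ner_chunk
# annotations into (chunk text, entity label) pairs.
def chunk_label_pairs(ner_chunks):
    """Return a list of (text, label) tuples for the given annotations.

    Assumes each annotation supports ["result"] and ["metadata"]["entity"],
    which holds for plain dicts and for collected Spark Row objects.
    """
    return [(c["result"], c["metadata"]["entity"]) for c in ner_chunks]


# Hand-built stand-in for one collected row, using two chunks from the example text
sample = [
    {"result": "Herr Liberato", "metadata": {"entity": "PERSON"}},
    {"result": "22 Oktober 2005", "metadata": {"entity": "DATE"}},
]

pairs = chunk_label_pairs(sample)
for chunk, label in pairs:
    print(f"{chunk}\t{label}")

# With the real pipeline output, the same helper would be applied per row:
# for row in result.select("ner_chunk").collect():
#     print(chunk_label_pairs(row["ner_chunk"]))
```

Collecting to the driver is only sensible for small inputs such as this single sentence. For larger DataFrames the same fields would be selected with Spark column expressions instead.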
Results
+---------------+---------+
|chunk          |ner_label|
+---------------+---------+
|Herr Liberato  |PERSON   |
|Frau Grigorescu|PERSON   |
|22 Oktober 2005|DATE     |
|Rom (Italien)  |ADDRESS  |
|20 Februar 2006|DATE     |
+---------------+---------+
Model Information
| Model Name: | legner_mapa |
| Compatibility: | Legal NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | de |
| Size: | 1.4 MB |
References
The dataset is available here.
Benchmarking
label         precision  recall  f1-score  support
ADDRESS       0.69       0.85    0.76      13
AMOUNT        1.00       0.75    0.86      4
DATE          0.92       0.93    0.93      61
ORGANISATION  0.64       0.77    0.70      30
PERSON        0.85       0.87    0.86      46
micro-avg     0.82       0.87    0.84      154
macro-avg     0.82       0.83    0.82      154
weighted-avg  0.83       0.87    0.85      154
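The macro and weighted averages in the table can be re-derived from the per-label rows. The sketch below copies those rows and recomputes both, assuming the usual definitions: macro is the unweighted mean over labels, and weighted is the mean weighted by each label's support. The function names are illustrative only. A micro average is not recomputed, because it needs raw true and false positive counts that the table does not list.

```python
# Per-label scores copied from the benchmarking table:
# label -> (precision, recall, f1-score, support)
scores = {
    "ADDRESS":      (0.69, 0.85, 0.76, 13),
    "AMOUNT":       (1.00, 0.75, 0.86, 4),
    "DATE":         (0.92, 0.93, 0.93, 61),
    "ORGANISATION": (0.64, 0.77, 0.70, 30),
    "PERSON":       (0.85, 0.87, 0.86, 46),
}

SUPPORT = 3  # index of the support column in each tuple


def macro_average(rows, index):
    """Unweighted mean of one metric across labels."""
    return sum(row[index] for row in rows) / len(rows)


def weighted_average(rows, index):
    """Mean of one metric across labels, weighted by label support."""
    total = sum(row[SUPPORT] for row in rows)
    return sum(row[index] * row[SUPPORT] for row in rows) / total


rows = list(scores.values())
total_support = sum(row[SUPPORT] for row in rows)

# Indices 0, 1, 2 are precision, recall, f1-score
macro = [round(macro_average(rows, i), 2) for i in range(3)]
weighted = [round(weighted_average(rows, i), 2) for i in range(3)]

print("support     ", total_support)
print("macro-avg   ", macro)
print("weighted-avg", weighted)
```

Because the inputs are already rounded to two decimals, a recomputation like this can differ from the reported figures in the last digit for other models. For this table, the rounded results match the reported values.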