## Description
This model extracts ADDRESS, AMOUNT, DATE, ORGANISATION, and PERSON entities from Czech legal documents.
It was trained on a dataset of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legislative texts in the 24 official languages of the European Union.
## Predicted Entities
ADDRESS, AMOUNT, DATE, ORGANISATION, PERSON
## How to use
```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_czech_legal", "cs")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained("legner_mapa", "cs", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = ["""V roce 2007 uzavřela společnost Alpenrind, dříve S GmbH, se společností Martin-Meat usazenou v Maďarsku smlouvu, podle níž se posledně uvedená společnost zavázala k porcování masa a jeho balení v rozsahu 25 půlek jatečně upravených těl skotu týdně."""]

result = model.transform(spark.createDataFrame([text]).toDF("text"))
```
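The `NerConverter` stage merges the token-level IOB tags produced by the NER model into the entity chunks shown in the Results section below. A minimal pure-Python sketch of that merging logic (the `merge_bio` helper and the tag sequence are illustrative, not part of the library):

```python
def merge_bio(tokens, tags):
    """Merge token-level BIO tags into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A "B-" tag starts a new chunk; flush any open one first.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An "I-" tag with a matching label continues the open chunk.
            current.append(tok)
        else:
            # "O" (or a mismatched tag) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Hypothetical tags for part of the example sentence above
tokens = ["V", "roce", "2007", "uzavřela", "společnost", "Alpenrind"]
tags = ["O", "O", "B-DATE", "O", "O", "B-ORGANISATION"]
print(merge_bio(tokens, tags))  # [('2007', 'DATE'), ('Alpenrind', 'ORGANISATION')]
```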
## Results
```
+-----------+------------+
|chunk      |ner_label   |
+-----------+------------+
|2007       |DATE        |
|Alpenrind  |ORGANISATION|
|Martin-Meat|ORGANISATION|
|Maďarsku   |ADDRESS     |
|25 půlek   |AMOUNT      |
+-----------+------------+
```
## Model Information
| Model Name: | legner_mapa |
|:---|:---|
| Compatibility: | Legal NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | cs |
| Size: | 1.4 MB |
## References
The dataset is available here.
## Benchmarking
| label | precision | recall | f1-score | support |
|:---|---:|---:|---:|---:|
| ADDRESS | 0.80 | 0.67 | 0.73 | 36 |
| AMOUNT | 1.00 | 1.00 | 1.00 | 5 |
| DATE | 0.98 | 0.98 | 0.98 | 56 |
| ORGANISATION | 0.64 | 0.66 | 0.65 | 32 |
| PERSON | 0.75 | 0.82 | 0.78 | 66 |
| micro-avg | 0.81 | 0.82 | 0.81 | 195 |
| macro-avg | 0.83 | 0.82 | 0.83 | 195 |
| weighted-avg | 0.81 | 0.82 | 0.81 | 195 |
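As a sanity check, the weighted-avg row follows directly from the per-label scores and supports (up to rounding of the published figures):

```python
# Per-label (precision, recall, support) copied from the benchmark table
scores = {
    "ADDRESS": (0.80, 0.67, 36),
    "AMOUNT": (1.00, 1.00, 5),
    "DATE": (0.98, 0.98, 56),
    "ORGANISATION": (0.64, 0.66, 32),
    "PERSON": (0.75, 0.82, 66),
}

total = sum(s for _, _, s in scores.values())  # 195 entities
weighted_p = sum(p * s for p, _, s in scores.values()) / total
weighted_r = sum(r * s for _, r, s in scores.values()) / total

print(round(weighted_p, 2), round(weighted_r, 2))  # 0.81 0.82
```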