Named Entity Recognition in Romanian Official Documents (Small)

Description

This is a small version of the NER model; it extracts only PER (Person), LOC (Location), ORG (Organization), and DATE entities from Romanian official documents.

Predicted Entities

PER, LOC, ORG, DATE


How to use

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "ro")\
    .setInputCols("sentence", "token")\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained("legner_romanian_official_sm", "ro", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")
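The NerConverter stage merges the token-level IOB tags emitted by the NER model (e.g. B-PER, I-PER, O) into whole entity chunks. As a rough illustration of that merging logic, here is a minimal pure-Python sketch (independent of Spark NLP; the function name and inputs are illustrative, not part of the library):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk; flush any open one first.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the current chunk with a matching label.
            current.append(token)
        else:
            # O tag (or label mismatch) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Sorina", "Pintea", ",", "București", ",", "28", "februarie", "2019"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE", "I-DATE", "I-DATE"]
# → [("Sorina Pintea", "PER"), ("București", "LOC"), ("28 februarie 2019", "DATE")]
```

The actual converter additionally tracks character offsets and confidence scores; this sketch only shows the tag-merging idea.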

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
    ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

data = spark.createDataFrame([["""Prezentul ordin se publică în Monitorul Oficial al României, Partea I. Ministrul sănătății, Sorina Pintea București, 28 februarie 2019."""]]).toDF("text")
result = model.transform(data)

Results

+-----------------------------+-----+
|chunk                        |label|
+-----------------------------+-----+
|Monitorul Oficial al României|ORG  |
|Sorina Pintea                |PER  |
|București                    |LOC  |
|28 februarie 2019            |DATE |
+-----------------------------+-----+

Model Information

Model Name: legner_romanian_official_sm
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: ro
Size: 16.4 MB

References

The dataset is available here.

Benchmarking


label         precision  recall  f1-score  support 
DATE          0.87       0.96    0.91      397     
LOC           0.87       0.78    0.83      190     
ORG           0.90       0.93    0.91      559     
PER           0.98       0.93    0.95      108     
micro-avg     0.89       0.92    0.90      1254    
macro-avg     0.91       0.90    0.90      1254    
weighted-avg  0.89       0.92    0.90      1254
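The macro and weighted averages above can be cross-checked directly from the per-label rows. A quick sketch (micro averages require per-token counts, so they are not recomputed here):

```python
# (precision, recall, f1, support) per label, copied from the table above.
rows = {
    "DATE": (0.87, 0.96, 0.91, 397),
    "LOC":  (0.87, 0.78, 0.83, 190),
    "ORG":  (0.90, 0.93, 0.91, 559),
    "PER":  (0.98, 0.93, 0.95, 108),
}

total = sum(s for *_, s in rows.values())  # total support: 1254

# Macro average: unweighted mean over labels.
macro_f1 = sum(f1 for _, _, f1, _ in rows.values()) / len(rows)

# Weighted average: mean over labels, weighted by support.
weighted_f1 = sum(f1 * s for _, _, f1, s in rows.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.9 0.9
```

Both come out to 0.90, matching the macro-avg and weighted-avg f1-score rows of the table.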