Description
This is a small NER model that extracts only PER (Person), LOC (Location), ORG (Organization), and DATE entities from Romanian official documents.
Predicted Entities
PER
LOC
ORG
DATE
How to use
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "ro")\
.setInputCols("sentence", "token")\
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = legal.NerModel.pretrained("legner_romanian_official_sm", "ro", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter
])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([["""Prezentul ordin se publică în Monitorul Oficial al României, Partea I. Ministrul sănătății, Sorina Pintea București, 28 februarie 2019."""]]).toDF("text")
result = model.transform(data)
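Under the hood, the NerConverter stage merges consecutive IOB tags (e.g. B-PER followed by I-PER) emitted by the NER stage into entity chunks. A minimal pure-Python sketch of that merging logic; the token/tag pairs below are illustrative, not actual model output:

```python
# Sketch of IOB-to-chunk merging as performed by a NER converter stage.
def merge_iob_chunks(tokens, tags):
    """Merge (token, IOB-tag) pairs into (chunk_text, label) tuples."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the current entity
        else:                             # "O" tag or inconsistent I- tag
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:                    # flush a trailing entity
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Ministrul", "sănătății", ",", "Sorina", "Pintea",
          "București", ",", "28", "februarie", "2019"]
tags = ["O", "O", "O", "B-PER", "I-PER",
        "B-LOC", "O", "B-DATE", "I-DATE", "I-DATE"]
print(merge_iob_chunks(tokens, tags))
# [('Sorina Pintea', 'PER'), ('București', 'LOC'), ('28 februarie 2019', 'DATE')]
```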
Results
+-----------------------------+-----+
|chunk |label|
+-----------------------------+-----+
|Monitorul Oficial al României|ORG |
|Sorina Pintea |PER |
|București |LOC |
|28 februarie 2019 |DATE |
+-----------------------------+-----+
Model Information
Model Name: legner_romanian_official_sm
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: ro
Size: 16.4 MB
References
Dataset is available here.
Benchmarking
label         precision  recall  f1-score  support
DATE          0.87       0.96    0.91      397
LOC           0.87       0.78    0.83      190
ORG           0.90       0.93    0.91      559
PER           0.98       0.93    0.95      108
micro-avg     0.89       0.92    0.90      1254
macro-avg     0.91       0.90    0.90      1254
weighted-avg  0.89       0.92    0.90      1254
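The macro and weighted averages in the table follow directly from the per-label scores: macro is the unweighted mean over labels, weighted is the mean weighted by each label's support. A quick Python check using the table values (micro averages are not recomputed here, since they require the raw TP/FP/FN counts, which are not reported):

```python
# Per-label (precision, recall, support) from the benchmarking table.
scores = {
    "DATE": (0.87, 0.96, 397),
    "LOC":  (0.87, 0.78, 190),
    "ORG":  (0.90, 0.93, 559),
    "PER":  (0.98, 0.93, 108),
}
total_support = sum(s for _, _, s in scores.values())  # 1254

# Macro average: unweighted mean over the four labels.
macro_p = sum(p for p, _, _ in scores.values()) / len(scores)
# Weighted average: mean over labels, weighted by support.
weighted_p = sum(p * s for p, _, s in scores.values()) / total_support

print(f"macro precision ~ {macro_p:.2f}, weighted precision ~ {weighted_p:.2f}")
```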