Description
This model uses Urdu word embeddings to find 7 different types of entities in Urdu text. It is trained using urduvec_140M_300d
word embeddings, so please use the same embeddings in the pipeline.
Predicted Entities
Persons-PER
, Locations-LOC
, Organizations-ORG
, Dates-DATE
, Designations-DESIGNATION
, Times-TIME
, Numbers-NUMBER
.
How to use
Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
...
word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur" ) \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔")
...
val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("uner_mk_140M_300d", "ur")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val result = pipeline.fit(Seq.empty["بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔"].toDS.toDF("text")).transform(data)
Results
| | ner_chunk | entity |
|---:|---------------:|-------------:|
| 0 |بریگیڈیئر | DESIGNATION |
| 1 |ایڈ بٹلر | PERSON |
| 2 |سنہ دوہزارچھ | DATE |
| 3 |ہلمند | LOCATION |
Model Information
Model Name: | uner_mk_140M_300d |
Type: | ner |
Compatibility: | Spark NLP 2.7.0+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [document, token, word_embeddings] |
Output Labels: | [ner] |
Language: | ur |
Dependencies: | urduvec_140M_300d |
Data Source
This model is trained using the following datasets: https://www.researchgate.net/publication/312218764_Named_Entity_Dataset_for_Urdu_Named_Entity_Recognition_Task https://www.researchgate.net/publication/332653135_Urdu_Named_Entity_Recognition_Corpus_Generation_and_Deep_Learning_Applications
Benchmarking
| | label | tp | fp | fn | prec | rec | f1 |
|---:|:---------------|-------:|------:|-----:|---------:|---------:|---------:|
| 0 | I-TIME | 12 | 10 | 1 | 0.545455 | 0.923077 | 0.685714 |
| 1 | B-PERSON | 2808 | 846 | 535 | 0.768473 | 0.839964 | 0.80263 |
| 2 | B-DATE | 34 | 6 | 6 | 0.85 | 0.85 | 0.85 |
| 3 | I-DATE | 45 | 1 | 2 | 0.978261 | 0.957447 | 0.967742 |
| 4 | B-DESIGNATION | 49 | 30 | 16 | 0.620253 | 0.753846 | 0.680556 |
| 5 | I-LOCATION | 2110 | 750 | 701 | 0.737762 | 0.750623 | 0.744137 |
| 6 | B-TIME | 11 | 9 | 3 | 0.55 | 0.785714 | 0.647059 |
| 7 | I-ORGANIZATION | 2006 | 772 | 760 | 0.722102 | 0.725235 | 0.723665 |
| 8 | I-NUMBER | 18 | 6 | 2 | 0.75 | 0.9 | 0.818182 |
| 9 | B-LOCATION | 5428 | 1255 | 582 | 0.81221 | 0.903161 | 0.855275 |
| 10 | B-NUMBER | 194 | 36 | 27 | 0.843478 | 0.877828 | 0.86031 |
| 11 | I-DESIGNATION | 25 | 15 | 6 | 0.625 | 0.806452 | 0.704225 |
| 12 | I-PERSON | 3562 | 759 | 433 | 0.824346 | 0.891614 | 0.856662 |
| 13 | B-ORGANIZATION | 1114 | 466 | 641 | 0.705063 | 0.634758 | 0.668066 |
| 14 | Macro-average | 17416 | 4961 | 3715 | 0.738029 | 0.828551 | 0.780675 |
| 15 | Micro-average | 17416 | 4961 | 3715 | 0.778299 | 0.824192 | 0.800588 |