Detect Persons, Locations, Organizations, Dates, Time, Numbers, and Designation Entities in Urdu (urduvec_140M_300d)

Description

This model uses Urdu word embeddings to detect seven types of entities in Urdu text. It was trained with the urduvec_140M_300d word embeddings, so use the same embeddings in the pipeline.

Predicted Entities

Persons-PERSON, Locations-LOCATION, Organizations-ORGANIZATION, Dates-DATE, Designations-DESIGNATION, Times-TIME, Numbers-NUMBER.

How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to merge token-level entity tags into full entity chunks.

...
ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur") \
  .setInputCols(["sentence", "token", "word_embeddings"]) \
  .setOutputCol("ner")
ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔")
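
The value returned by `fullAnnotate` is a list with one dictionary per input text, keyed by output column. The following is a minimal sketch of flattening the `ner_chunk` column into (chunk, entity) pairs. It assumes each chunk annotation exposes `.result` and a `.metadata` dictionary with an `"entity"` key; the helper name `chunks_to_rows` and the stand-in objects are hypothetical and are used only so the snippet runs without a Spark session.

```python
from types import SimpleNamespace

def chunks_to_rows(annotations, column="ner_chunk"):
    """Flatten fullAnnotate-style output into (chunk_text, entity_label) pairs."""
    rows = []
    for doc in annotations:                  # one dict per input text
        for chunk in doc.get(column, []):    # one annotation per entity chunk
            rows.append((chunk.result, chunk.metadata.get("entity")))
    return rows

# Stand-in for the output of light_pipeline.fullAnnotate(...).
# These are hypothetical objects, not real Spark NLP annotations.
fake_annotations = [{
    "ner_chunk": [
        SimpleNamespace(result="ایڈ بٹلر", metadata={"entity": "PERSON"}),
        SimpleNamespace(result="ہلمند", metadata={"entity": "LOCATION"}),
    ]
}]

rows = chunks_to_rows(fake_annotations)
for text, entity in rows:
    print(text, entity)
```

With a real pipeline, `chunks_to_rows(annotations)` would be called on the `annotations` variable from the snippet above.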

...
val ner_model = NerDLModel.pretrained("uner_mk_140M_300d", "ur")
        .setInputCols(Array("sentence", "token", "word_embeddings"))
        .setOutputCol("ner")
val ner_converter = new NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
// toDS requires `import spark.implicits._` to be in scope
val data = Seq("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)

Results

|    | ner_chunk      | entity       |
|---:|---------------:|-------------:|
|  0 |بریگیڈیئر      | DESIGNATION  |
|  1 |ایڈ بٹلر       | PERSON       |
|  2 |سنہ دوہزارچھ   | DATE         |
|  3 |ہلمند           | LOCATION     |
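
The chunks in this table come from merging token-level tags (`B-` for the first token of an entity, `I-` for its continuation, `O` for everything else) into spans. The sketch below illustrates that merging step in plain Python. It is not Spark NLP's NerConverter implementation; the function name `iob_to_chunks`, the whitespace tokenization, and the per-token tags are assumptions chosen to be consistent with the table above.

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, entity_label) pairs."""
    chunks = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "I" and words and entity == label:
            words.append(token)              # continue the open chunk
            continue
        if words:                            # close the open chunk, if any
            chunks.append((" ".join(words), label))
        if prefix in ("B", "I"):             # B- opens a chunk; a stray I- is treated as an opener
            words, label = [token], entity
        else:                                # "O": outside any entity
            words, label = [], None
    if words:                                # close a chunk that runs to the end
        chunks.append((" ".join(words), label))
    return chunks

sentence = "بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔"
tokens = sentence.split()                    # whitespace split, for illustration only

# Assumed token-level tags, chosen to match the Results table.
tags = ["B-DESIGNATION", "B-PERSON", "I-PERSON", "B-DATE", "I-DATE",
        "O", "B-LOCATION", "O", "O", "O", "O"]

chunks = iob_to_chunks(tokens, tags)
for text, entity in chunks:
    print(text, entity)
```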

Model Information

Model Name: uner_mk_140M_300d
Type: ner
Compatibility: Spark NLP 2.7.0+
License: Open Source
Edition: Official
Input Labels: [document, token, word_embeddings]
Output Labels: [ner]
Language: ur
Dependencies: urduvec_140M_300d

Data Source

This model is trained using the following datasets:
https://www.researchgate.net/publication/312218764_Named_Entity_Dataset_for_Urdu_Named_Entity_Recognition_Task
https://www.researchgate.net/publication/332653135_Urdu_Named_Entity_Recognition_Corpus_Generation_and_Deep_Learning_Applications

Benchmarking

|    | label          |     tp |    fp |   fn |     prec |      rec |       f1 |
|---:|:---------------|-------:|------:|-----:|---------:|---------:|---------:|
|  0 | I-TIME         |     12 |    10 |    1 | 0.545455 | 0.923077 | 0.685714 |
|  1 | B-PERSON       |   2808 |   846 |  535 | 0.768473 | 0.839964 | 0.80263  |
|  2 | B-DATE         |     34 |     6 |    6 | 0.85     | 0.85     | 0.85     |
|  3 | I-DATE         |     45 |     1 |    2 | 0.978261 | 0.957447 | 0.967742 |
|  4 | B-DESIGNATION  |     49 |    30 |   16 | 0.620253 | 0.753846 | 0.680556 |
|  5 | I-LOCATION     |   2110 |   750 |  701 | 0.737762 | 0.750623 | 0.744137 |
|  6 | B-TIME         |     11 |     9 |    3 | 0.55     | 0.785714 | 0.647059 |
|  7 | I-ORGANIZATION |   2006 |   772 |  760 | 0.722102 | 0.725235 | 0.723665 |
|  8 | I-NUMBER       |     18 |     6 |    2 | 0.75     | 0.9      | 0.818182 |
|  9 | B-LOCATION     |   5428 |  1255 |  582 | 0.81221  | 0.903161 | 0.855275 |
| 10 | B-NUMBER       |    194 |    36 |   27 | 0.843478 | 0.877828 | 0.86031  |
| 11 | I-DESIGNATION  |     25 |    15 |    6 | 0.625    | 0.806452 | 0.704225 |
| 12 | I-PERSON       |   3562 |   759 |  433 | 0.824346 | 0.891614 | 0.856662 |
| 13 | B-ORGANIZATION |   1114 |   466 |  641 | 0.705063 | 0.634758 | 0.668066 |
| 14 | Macro-average  | 17416  | 4961  | 3715 | 0.738029 | 0.828551 | 0.780675 |
| 15 | Micro-average  | 17416  | 4961  | 3715 | 0.778299 | 0.824192 | 0.800588 |
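
The precision, recall, and F1 columns can be recomputed from the tp/fp/fn counts. The sketch below does this with the counts copied from the table. The micro-average pools the counts across labels. The macro-average row appears to average per-label precision and recall and then take the harmonic mean of those two averages; that reading is inferred from the numbers, not stated by the source, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts, guarding against empty denominators."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# (tp, fp, fn) per label, copied from the benchmarking table.
counts = {
    "I-TIME": (12, 10, 1),
    "B-PERSON": (2808, 846, 535),
    "B-DATE": (34, 6, 6),
    "I-DATE": (45, 1, 2),
    "B-DESIGNATION": (49, 30, 16),
    "I-LOCATION": (2110, 750, 701),
    "B-TIME": (11, 9, 3),
    "I-ORGANIZATION": (2006, 772, 760),
    "I-NUMBER": (18, 6, 2),
    "B-LOCATION": (5428, 1255, 582),
    "B-NUMBER": (194, 36, 27),
    "I-DESIGNATION": (25, 15, 6),
    "I-PERSON": (3562, 759, 433),
    "B-ORGANIZATION": (1114, 466, 641),
}

per_label = {label: precision_recall_f1(*c) for label, c in counts.items()}

# Micro-average: pool the counts over all labels, then compute the metrics once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = precision_recall_f1(tp, fp, fn)

# Macro-average: unweighted mean of per-label precision and recall.
# F1 is taken as the harmonic mean of those two means (assumption inferred
# from the table); averaging per-label F1 directly gives a slightly different value.
n = len(per_label)
macro_prec = sum(p for p, _, _ in per_label.values()) / n
macro_rec = sum(r for _, r, _ in per_label.values()) / n
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

print("micro:", [round(x, 6) for x in micro])
print("macro:", [round(x, 6) for x in (macro_prec, macro_rec, macro_f1)])
```

Because the macro-average weights every label equally, rare labels such as I-TIME and B-TIME (about a dozen true positives each) influence it as much as B-LOCATION does, which is why the two averages differ.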