Detect Person, Location, Organization, and Miscellaneous entities in Arabic (ANERcorp)

Description

This model uses Arabic word embeddings to find 4 different types of entities in Arabic text. It is trained using arabic_w2v_cc_300d word embeddings, so please use the same embeddings in the pipeline.

Predicted Entities

Persons-PER, Locations-LOC, Organizations-ORG, Miscellaneous-MISC.

Download

How to use

Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.

...
ner = NerDLModel.pretrained("aner_cc_300d", "ar") \
   .setInputCols(["sentence", "token", "word_embeddings"]) \
   .setOutputCol("ner")
ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز")

...
val ner_model = NerDLModel.pretrained("aner_cc_300d", "ar")
        .setInputCols(Array("sentence", "token", "word_embeddings"))
        .setOutputCol("ner")
val ner_converter = NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val result = pipeline.fit(Seq.empty["في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز"].toDS.toDF("text")).transform(data)

Results

|    | ner_chunk                | entity       |
|---:|-------------------------:|-------------:|
|  0 | قوات الثورة العربية    | ORG          |
|  1 | دمشق                    | LOC          |
|  2 | الإنكليز                 | PER          |

Model Information

Model Name: aner_cc_300d
Type: ner
Compatibility: Spark NLP 2.7.0+
License: Open Source
Edition: Official
Input Labels: [document, token, word_embeddings]
Output Labels: [ner]
Language: ar
Dependencies: arabic_w2v_cc_300d

Data Source

This model is trained on data obtained from http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp

Benchmarking

|    | label          |     tp |    fp |   fn |     prec |      rec |       f1 |
|---:|:---------------|-------:|------:|-----:|---------:|---------:|---------:|
|  0 | B-LOC          |    163 |    28 |   34 | 0.853403 | 0.827411 | 0.840206 |
|  1 | I-ORG          |     60 |    10 |    5 | 0.857142 | 0.923077 | 0.888889 |
|  2 | I-MIS          |    124 |    53 |   53 | 0.700565 | 0.700565 | 0.700565 |
|  3 | I-LOC          |     64 |    20 |   23 | 0.761904 | 0.735632 | 0.748538 |
|  4 | B-MIS          |    297 |    71 |   52 | 0.807065 | 0.851003 | 0.828452 |
|  5 | I-PER          |     84 |    23 |   13 | 0.785046 | 0.865979 | 0.823530 |
|  6 | B-ORG          |     54 |     9 |   12 | 0.857142 | 0.818181 | 0.837210 |
|  7 | B-PER          |    182 |    26 |   33 | 0.875    | 0.846512 | 0.860520 |
|  8 | Macro-average  |   1028 |   240 |  225 | 0.812159 | 0.821045 | 0.816578 |
|  9 | Micro-average  |   1028 |   240 |  225 | 0.810726 | 0.820431 | 0.815550 |