Description
AraBert v2 model, trained in house, using the dataset available here and augmenting with financial and legal information.
The entities you can find in this model are:
PERS (person) EVENT CARDINAL NORP (group of people) DATE ORDINAL OCC (occupation) TIME PERCENT ORG (organization) LANGUAGE QUANTITY GPE (geopolitical entity) WEBSITE UNIT LOC (geographical location) LAW MONEY FAC (facility: landmarks places) PRODUCT CURR (currency)
Predicted Entities
NORP, PERS, LOC, MONEY, TIME, ORG, WEBSITE, ORDINAL, PERCENT, EVENT, QUANTITY, OCC, LANGUAGE, CARDINAL, DATE, GPE, PRODUCT, CURR, FAC, UNIT, LAW
How to use
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
tokenClassifier = finance.BertForTokenClassification.pretrained("finner_arabert_arabic", "ar", "finance/models")\
.setInputCols("token", "document")\
.setOutputCol("label")
pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
import pandas as pd
example = spark.createDataFrame(pd.DataFrame({'text': ["""أمثلة:
جامعة بيرزيت وبالتعاون مع مؤسسة ادوارد سعيد تنظم مهرجان للفن الشعبي سيبدأ الساعة الرابعة عصرا، بتاريخ 16/5/2016.
بورصة فلسطين تسجل ارتفاعا بنسبة 0.08% ، في جلسة بلغت قيمة تداولاتها أكثر من نصف مليون دولار .
إنتخاب رئيس هيئة سوق رأس المال وتعديل مادة (4) في القانون الأساسي.
مسيرة قرب باب العامود والذي 700 متر عن المسجد الأقصى."""]}))
result = pipeline.fit(example).transform(example)
Results
["أمثلة:","O"]
["جامعة","B-ORG"]
["بيرزيت","I-ORG"]
["وبالتعاون","O"]
["مع","O"]
["مؤسسة","B-ORG"]
["ادوارد","B-PERS"]
["سعيد","I-PERS"]
["تنظم","O"]
["مهرجان","B-EVENT"]
["للفن","I-EVENT"]
["الشعبي","I-EVENT"]
["سيبدأ","O"]
["الساعة","B-TIME"]
["الرابعة","I-TIME"]
["عصرا،","I-TIME"]
["بتاريخ","B-DATE"]
["16/5/2016.","I-DATE"]
["بورصة","B-ORG"]
["فلسطين","I-ORG"]
["تسجل","O"]
["ارتفاعا","O"]
["بنسبة","O"]
["0.08%","B-PERCENT"]
["،","O"]
["في","O"]
["جلسة","O"]
["بلغت","O"]
["قيمة","O"]
["تداولاتها","O"]
["أكثر","O"]
["من","O"]
["نصف","B-MONEY"]
["مليون","I-MONEY"]
["دولار","B-CURR"]
[".","O"]
["إنتخاب","O"]
["رئيس","B-OCC"]
["هيئة","B-ORG"]
["سوق","I-ORG"]
["رأس","I-ORG"]
["المال","I-ORG"]
["وتعديل","O"]
["مادة","B-LAW"]
["(4)","I-LAW"]
["في","O"]
["القانون","B-LAW"]
["الأساسي.","O"]
["مسيرة","O"]
["قرب","O"]
["باب","B-FAC"]
["العامود","I-FAC"]
["والذي","O"]
["700","B-QUANTITY"]
["متر","B-UNIT"]
["عن","O"]
["المسجد","B-FAC"]
["الأقصى.","I-FAC"]
Model Information
| Model Name: | finner_arabert_arabic |
| Type: | finance |
| Compatibility: | Finance NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [ner] |
| Language: | ar |
| Size: | 505.7 MB |
| Case sensitive: | true |
| Max sentence length: | 512 |
References
https://ontology.birzeit.edu/Wojood/
Benchmarking
label precision recall f1-score support
B-CARDINAL 0.93 0.87 0.80 19
B-DATE 0.88 0.93 0.90 106
B-EVENT 1.00 0.86 0.92 14
B-FAC 1.00 0.67 0.80 3
B-GPE 0.88 0.85 0.87 89
B-LAW 1.00 0.50 0.67 6
B-NORP 0.72 0.81 0.76 32
B-OCC 0.88 0.83 0.85 52
B-ORDINAL 0.76 0.80 0.78 35
B-ORG 0.81 0.87 0.84 103
B-PERS 0.78 0.89 0.83 47
B-WEBSITE 0.62 1.00 0.77 5
I-CARDINAL 0.33 0.62 0.43 8
I-DATE 0.98 0.99 0.98 447
I-EVENT 0.91 0.91 0.91 23
I-FAC 0.75 0.43 0.55 7
I-GPE 0.80 0.92 0.86 53
I-LAW 1.00 0.85 0.92 13
I-NORP 0.65 0.48 0.55 23
I-OCC 0.97 0.86 0.91 96
I-ORG 0.87 0.91 0.89 139
I-PERS 0.94 1.00 0.97 60
I-WEBSITE 0.94 1.00 0.97 15
O 0.98 0.97 0.98 3062
accuracy - - 0.95 4468
macro-avg 0.83 0.81 0.81 4468
weighted-avg 0.95 0.95 0.95 4468