Detect Persons, Locations, Organizations and Misc Entities in English

Description

This NER model, trained with GloVe 100d word embeddings, annotates text to identify entities such as the names of people, places, and organizations.

nerdl_model = NerDLModel.pretrained("Ner_conll2003_100d", "en", "@gokhanturer")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

Predicted Entities

PER, LOC, ORG, MISC


How to use
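
A minimal end-to-end sketch of how the pretrained model could be dropped into a standard Spark NLP prediction pipeline. Only the NerDLModel call is taken from the snippet above; the surrounding stages and the example sentence are illustrative assumptions.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# The model was trained on GloVe 100d vectors, so the same embeddings are used here
embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("Ner_conll2003_100d", "en", "@gokhanturer")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, converter])

# Example sentence (illustrative only)
data = spark.createDataFrame([["Mo Salah plays for Liverpool in the Premier League."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner_chunk.result) as chunk").show(truncate=False)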

Colab Setup
In [1]:
! pip install -q pyspark==3.1.2 spark-nlp

! pip install -q spark-nlp-display
     |████████████████████████████████| 212.4 MB 82 kB/s 
     |████████████████████████████████| 140 kB 61.0 MB/s 
     |████████████████████████████████| 198 kB 73.0 MB/s 
  Building wheel for pyspark (setup.py) ... done
     |████████████████████████████████| 95 kB 2.0 MB/s 
     |████████████████████████████████| 66 kB 3.5 MB/s 
In [3]:
import sparknlp

spark = sparknlp.start(gpu = True) 

from sparknlp.base import *
from sparknlp.annotator import *
import pyspark.sql.functions as F
from sparknlp.training import CoNLL

print("Spark NLP version", sparknlp.version())

print("Apache Spark version:", spark.version)

spark
Spark NLP version 3.4.0
Apache Spark version: 3.1.2
Out[3]:
SparkSession - in-memory

SparkContext

Spark UI

Version: v3.1.2 | Master: local[*] | AppName: Spark NLP
CoNLL Data Prep
In [2]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
Train Data
In [5]:
with open ("eng.train") as f:
  train_data = f.read()
print (train_data[:500])
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O

The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
said VBD B-VP O
on IN B-PP O
Thursday NNP B-NP O
it PRP B-NP O
disagreed VBD B-VP O
with IN B-PP O
German JJ B-NP B-MISC
advice NN I-NP O
to TO B-PP O
consumers NNS B-NP
In [6]:
train_data = CoNLL().readDataset(spark, 'eng.train')

train_data.show(3)
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows

In [7]:
train_data.count()
Out[7]:
14041
In [8]:
train_data.select(F.explode(F.arrays_zip('token.result', 'pos.result',  'label.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("pos"),
        F.expr("cols['2']").alias("ner_label")).show(truncate=50)
+----------+---+---------+
|     token|pos|ner_label|
+----------+---+---------+
|        EU|NNP|    B-ORG|
|   rejects|VBZ|        O|
|    German| JJ|   B-MISC|
|      call| NN|        O|
|        to| TO|        O|
|   boycott| VB|        O|
|   British| JJ|   B-MISC|
|      lamb| NN|        O|
|         .|  .|        O|
|     Peter|NNP|    B-PER|
| Blackburn|NNP|    I-PER|
|  BRUSSELS|NNP|    B-LOC|
|1996-08-22| CD|        O|
|       The| DT|        O|
|  European|NNP|    B-ORG|
|Commission|NNP|    I-ORG|
|      said|VBD|        O|
|        on| IN|        O|
|  Thursday|NNP|        O|
|        it|PRP|        O|
+----------+---+---------+
only showing top 20 rows

In [9]:
train_data.select(F.explode(F.arrays_zip("token.result","label.result")).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth")).groupBy("ground_truth").count().orderBy("count", ascending=False).show(100,truncate=False)
+------------+------+
|ground_truth|count |
+------------+------+
|O           |169578|
|B-LOC       |7140  |
|B-PER       |6600  |
|B-ORG       |6321  |
|I-PER       |4528  |
|I-ORG       |3704  |
|B-MISC      |3438  |
|I-LOC       |1157  |
|I-MISC      |1155  |
+------------+------+

In [10]:
# train_data.select(F.countDistinct("label.result")).show()
# train_data.groupBy("label.result").count().show(truncate=False)

# Keep only the sentences that contain at least one non-"O" label
train_data = train_data.withColumn('unique', F.array_distinct("label.result"))\
                       .withColumn('c', F.size('unique'))\
                       .filter(F.col('c')>1)

train_data.select(F.explode(F.arrays_zip('token.result','label.result')).alias("cols")) \
          .select(F.expr("cols['0']").alias("token"),
                  F.expr("cols['1']").alias("ground_truth"))\
          .groupBy('ground_truth')\
          .count()\
          .orderBy('count', ascending=False)\
          .show(100,truncate=False)
+------------+------+
|ground_truth|count |
+------------+------+
|O           |137736|
|B-LOC       |7125  |
|B-PER       |6596  |
|B-ORG       |6288  |
|I-PER       |4528  |
|I-ORG       |3704  |
|B-MISC      |3437  |
|I-LOC       |1157  |
|I-MISC      |1155  |
+------------+------+

Test Data
In [11]:
with open ("eng.testa") as f:
  test_data = f.read()
print (test_data[:500])
-DOCSTART- -X- -X- O

CRICKET NNP B-NP O
- : O O
LEICESTERSHIRE NNP B-NP B-ORG
TAKE NNP I-NP O
OVER IN B-PP O
AT NNP B-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O

LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O

West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP B-PER
Simmons NNP I-NP I-PER
took VBD B-VP O
four CD B-NP O
for IN B-PP O
38 CD B-NP O
on IN B-PP O
Friday NNP B-NP O
as IN B-PP O
Leicestershire NNP B-NP B-ORG
beat VBD B-VP
In [12]:
test_data = CoNLL().readDataset(spark, 'eng.testa')
test_data.show(3)
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|
|   LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows

In [13]:
test_data.count()
Out[13]:
3250
In [14]:
test_data.select(F.explode(F.arrays_zip("token.result","label.result")).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth")).groupBy("ground_truth").count().orderBy("count", ascending=False).show(100,truncate=False)
+------------+-----+
|ground_truth|count|
+------------+-----+
|O           |42759|
|B-PER       |1842 |
|B-LOC       |1837 |
|B-ORG       |1341 |
|I-PER       |1307 |
|B-MISC      |922  |
|I-ORG       |751  |
|I-MISC      |346  |
|I-LOC       |257  |
+------------+-----+

NerDL Model with GloVe 100d
In [15]:
glove_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
In [16]:
glove_embeddings.transform(test_data).write.parquet('test_data_embeddings.parquet')
In [17]:
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(8)\
    .setLr(0.002)\
    .setDropout(0.5)\
    .setBatchSize(16)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setTestDataset('test_data_embeddings.parquet')\
    .setEnableMemoryOptimizer(False)

ner_pipeline = Pipeline(stages=[
      glove_embeddings,
      nerTagger
])
In [19]:
%%time

ner_model = ner_pipeline.fit(train_data)
CPU times: user 10.6 s, sys: 1.08 s, total: 11.7 s
Wall time: 35min 21s
In [20]:
!cd ~/annotator_logs/ && ls -lt
total 16
-rw-r--r-- 1 root root 13178 Feb  6 17:05 NerDLApproach_c5bf4e4c6211.log
In [21]:
!cat ~/annotator_logs/NerDLApproach_c5bf4e4c6211.log
Name of the selected graph: ner-dl/blstm_10_100_128_120.pb
Training started - total epochs: 8 - lr: 0.002 - batch size: 16 - labels: 9 - chars: 82 - training examples: 11079


Epoch 1/8 started, lr: 0.002, dataset size: 11079


Epoch 1/8 - 159.93s - loss: 2234.436 - batches: 695
Quality on test dataset: 
time to finish evaluation: 17.74s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1695	 94	 142	 0.94745666	 0.92270005	 0.93491447
I-ORG	 528	 76	 223	 0.8741722	 0.7030626	 0.77933586
I-MISC	 255	 88	 91	 0.7434402	 0.7369942	 0.74020314
I-LOC	 189	 14	 68	 0.9310345	 0.73540854	 0.8217391
I-PER	 1270	 59	 37	 0.95560575	 0.9716909	 0.9635812
B-MISC	 797	 142	 125	 0.84877527	 0.8644252	 0.85652876
B-ORG	 1139	 170	 202	 0.8701299	 0.8493661	 0.85962266
B-PER	 1802	 176	 40	 0.91102123	 0.9782845	 0.94345546
tp: 7675 fp: 819 fn: 928 labels: 8
Macro-average	 prec: 0.8852045, rec: 0.84524155, f1: 0.86476153
Micro-average	 prec: 0.903579, rec: 0.8921307, f1: 0.8978184


Epoch 2/8 started, lr: 0.0019900498, dataset size: 11079
Name of the selected graph: ner-dl/blstm_10_100_128_120.pb
Training started - total epochs: 8 - lr: 0.002 - batch size: 16 - labels: 9 - chars: 82 - training examples: 11079


Epoch 1/8 started, lr: 0.002, dataset size: 11079


Epoch 2/8 - 246.28s - loss: 839.1736 - batches: 695
Quality on test dataset: 
time to finish evaluation: 19.66s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1762	 124	 75	 0.9342524	 0.95917255	 0.9465484
I-ORG	 585	 76	 166	 0.8850227	 0.77896136	 0.82861185
I-MISC	 247	 39	 99	 0.8636364	 0.71387285	 0.7816456
I-LOC	 233	 74	 24	 0.7589577	 0.9066148	 0.8262412
I-PER	 1275	 54	 32	 0.95936793	 0.97551644	 0.9673748
B-MISC	 791	 70	 131	 0.9186992	 0.85791755	 0.88726866
B-ORG	 1150	 151	 191	 0.88393545	 0.857569	 0.8705526
B-PER	 1800	 147	 42	 0.9244992	 0.9771987	 0.9501188
tp: 7843 fp: 735 fn: 760 labels: 8
Macro-average	 prec: 0.89104635, rec: 0.8783529, f1: 0.88465416
Micro-average	 prec: 0.9143157, rec: 0.9116587, f1: 0.9129852


Epoch 3/8 started, lr: 0.001980198, dataset size: 11079


Epoch 1/8 - 254.10s - loss: 2203.116 - batches: 695
Quality on test dataset: 
time to finish evaluation: 22.30s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1660	 82	 177	 0.95292765	 0.90364724	 0.9276334
I-ORG	 560	 123	 191	 0.81991214	 0.74567246	 0.781032
I-MISC	 227	 65	 119	 0.7773973	 0.65606934	 0.7115987
I-LOC	 155	 10	 102	 0.93939394	 0.6031128	 0.73459715
I-PER	 1259	 60	 48	 0.954511	 0.96327466	 0.9588728
B-MISC	 762	 110	 160	 0.8738532	 0.82646424	 0.8494984
B-ORG	 1160	 237	 181	 0.83035076	 0.8650261	 0.8473338
B-PER	 1785	 170	 57	 0.9130435	 0.96905535	 0.94021595
tp: 7568 fp: 857 fn: 1035 labels: 8
Macro-average	 prec: 0.88267374, rec: 0.81654024, f1: 0.84832007
Micro-average	 prec: 0.89827895, rec: 0.87969315, f1: 0.8888889


Epoch 2/8 started, lr: 0.0019900498, dataset size: 11079


Epoch 3/8 - 257.88s - loss: 610.81525 - batches: 695
Quality on test dataset: 
time to finish evaluation: 18.07s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1764	 104	 73	 0.9443255	 0.9602613	 0.9522267
I-ORG	 640	 140	 111	 0.82051283	 0.85219705	 0.8360548
I-MISC	 227	 22	 119	 0.9116466	 0.65606934	 0.7630252
I-LOC	 223	 43	 34	 0.8383459	 0.8677043	 0.8527725
I-PER	 1265	 31	 42	 0.97608024	 0.96786535	 0.9719554
B-MISC	 785	 62	 137	 0.9268005	 0.85141	 0.8875071
B-ORG	 1207	 174	 134	 0.87400436	 0.90007454	 0.8868479
B-PER	 1795	 94	 47	 0.9502382	 0.97448426	 0.96220857
tp: 7906 fp: 670 fn: 697 labels: 8
Macro-average	 prec: 0.90524435, rec: 0.87875825, f1: 0.8918047
Micro-average	 prec: 0.921875, rec: 0.91898173, f1: 0.9204261


Epoch 4/8 started, lr: 0.0019704434, dataset size: 11079


Epoch 2/8 - 252.19s - loss: 828.8285 - batches: 695
Quality on test dataset: 
time to finish evaluation: 17.37s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1722	 67	 115	 0.9625489	 0.93739796	 0.9498069
I-ORG	 624	 134	 127	 0.823219	 0.83089215	 0.8270378
I-MISC	 230	 30	 116	 0.88461536	 0.6647399	 0.75907594
I-LOC	 199	 13	 58	 0.9386792	 0.77431905	 0.8486141
I-PER	 1274	 44	 33	 0.9666161	 0.97475135	 0.9706667
B-MISC	 787	 70	 135	 0.9183197	 0.85357916	 0.8847667
B-ORG	 1212	 204	 129	 0.8559322	 0.9038031	 0.87921643
B-PER	 1807	 109	 35	 0.94311064	 0.98099893	 0.9616817
tp: 7855 fp: 671 fn: 748 labels: 8
Macro-average	 prec: 0.9116301, rec: 0.8650602, f1: 0.88773483
Micro-average	 prec: 0.9212996, rec: 0.9130536, f1: 0.917158


Epoch 3/8 started, lr: 0.001980198, dataset size: 11079


Epoch 4/8 - 250.45s - loss: 512.68085 - batches: 695
Quality on test dataset: 
time to finish evaluation: 17.78s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1767	 78	 70	 0.95772356	 0.9618944	 0.9598045
I-ORG	 658	 75	 93	 0.89768076	 0.8761651	 0.8867924
I-MISC	 257	 38	 89	 0.87118644	 0.74277455	 0.801872
I-LOC	 229	 18	 28	 0.9271255	 0.8910506	 0.9087302
I-PER	 1264	 21	 43	 0.9836576	 0.9671002	 0.97530866
B-MISC	 841	 127	 81	 0.86880165	 0.9121475	 0.8899471
B-ORG	 1202	 114	 139	 0.9133739	 0.89634603	 0.90477985
B-PER	 1799	 87	 43	 0.95387065	 0.9766558	 0.9651288
tp: 8017 fp: 558 fn: 586 labels: 8
Macro-average	 prec: 0.92167753, rec: 0.9030168, f1: 0.9122517
Micro-average	 prec: 0.9349271, rec: 0.9318842, f1: 0.93340325


Epoch 5/8 started, lr: 0.0019607844, dataset size: 11079


Epoch 3/8 - 252.61s - loss: 604.5874 - batches: 695
Quality on test dataset: 
time to finish evaluation: 18.28s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1764	 112	 73	 0.9402985	 0.9602613	 0.95017505
I-ORG	 614	 84	 137	 0.87965614	 0.8175766	 0.847481
I-MISC	 244	 34	 102	 0.8776978	 0.70520234	 0.78205127
I-LOC	 220	 29	 37	 0.88353413	 0.8560311	 0.8695652
I-PER	 1268	 38	 39	 0.9709035	 0.97016066	 0.97053194
B-MISC	 799	 96	 123	 0.89273745	 0.8665944	 0.87947166
B-ORG	 1205	 123	 136	 0.9073795	 0.8985832	 0.90295994
B-PER	 1792	 110	 50	 0.94216615	 0.97285557	 0.95726496
tp: 7906 fp: 626 fn: 697 labels: 8
Macro-average	 prec: 0.9117967, rec: 0.88090813, f1: 0.89608634
Micro-average	 prec: 0.9266292, rec: 0.91898173, f1: 0.92278963


Epoch 4/8 started, lr: 0.0019704434, dataset size: 11079


Epoch 5/8 - 257.56s - loss: 437.73123 - batches: 695
Quality on test dataset: 
time to finish evaluation: 17.89s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1806	 163	 31	 0.91721684	 0.9831247	 0.94902784
I-ORG	 606	 26	 145	 0.95886075	 0.8069241	 0.8763557
I-MISC	 287	 99	 59	 0.7435233	 0.82947975	 0.78415304
I-LOC	 233	 54	 24	 0.8118467	 0.9066148	 0.85661757
I-PER	 1273	 26	 34	 0.9799846	 0.9739862	 0.9769762
B-MISC	 846	 146	 76	 0.8528226	 0.9175705	 0.8840125
B-ORG	 1149	 37	 192	 0.9688027	 0.85682327	 0.9093787
B-PER	 1797	 77	 45	 0.9589114	 0.97557	 0.96716905
tp: 7997 fp: 628 fn: 606 labels: 8
Macro-average	 prec: 0.8989962, rec: 0.90626174, f1: 0.9026143
Micro-average	 prec: 0.9271884, rec: 0.92955947, f1: 0.9283724


Epoch 6/8 started, lr: 0.0019512196, dataset size: 11079


Epoch 4/8 - 255.39s - loss: 508.80334 - batches: 695
Quality on test dataset: 
time to finish evaluation: 17.63s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1799	 270	 38	 0.8695022	 0.9793141	 0.921147
I-ORG	 616	 97	 135	 0.86395514	 0.82023966	 0.8415301
I-MISC	 253	 33	 93	 0.88461536	 0.73121387	 0.8006329
I-LOC	 236	 117	 21	 0.66855526	 0.91828793	 0.77377045
I-PER	 1256	 18	 51	 0.98587126	 0.96097934	 0.9732662
B-MISC	 799	 66	 123	 0.92369944	 0.8665944	 0.89423615
B-ORG	 1162	 106	 179	 0.9164038	 0.86651754	 0.89076275
B-PER	 1754	 52	 88	 0.9712071	 0.95222586	 0.96162283
tp: 7875 fp: 759 fn: 728 labels: 8
Macro-average	 prec: 0.8854762, rec: 0.8869216, f1: 0.8861983
Micro-average	 prec: 0.91209173, rec: 0.91537833, f1: 0.9137321


Epoch 5/8 started, lr: 0.0019607844, dataset size: 11079


Epoch 6/8 - 262.11s - loss: 382.8735 - batches: 695
Quality on test dataset: 
time to finish evaluation: 17.92s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1749	 61	 88	 0.96629834	 0.9520958	 0.95914453
I-ORG	 682	 136	 69	 0.83374083	 0.9081225	 0.8693435
I-MISC	 268	 40	 78	 0.8701299	 0.7745665	 0.81957185
I-LOC	 215	 14	 42	 0.93886465	 0.83657587	 0.8847737
I-PER	 1280	 39	 27	 0.97043216	 0.979342	 0.97486675
B-MISC	 837	 96	 85	 0.8971061	 0.90780914	 0.90242594
B-ORG	 1232	 120	 109	 0.9112426	 0.9187174	 0.91496474
B-PER	 1795	 93	 47	 0.9507415	 0.97448426	 0.96246654
tp: 8058 fp: 599 fn: 545 labels: 8
Macro-average	 prec: 0.91731954, rec: 0.9064642, f1: 0.9118596
Micro-average	 prec: 0.9308074, rec: 0.93665, f1: 0.9337195


Epoch 7/8 started, lr: 0.0019417477, dataset size: 11079


Epoch 5/8 - 263.75s - loss: 450.50388 - batches: 695
Quality on test dataset: 
time to finish evaluation: 17.58s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1749	 64	 88	 0.9646994	 0.9520958	 0.95835614
I-ORG	 689	 180	 62	 0.79286534	 0.9174434	 0.85061723
I-MISC	 275	 83	 71	 0.7681564	 0.79479766	 0.78124994
I-LOC	 210	 16	 47	 0.9292035	 0.8171206	 0.8695652
I-PER	 1271	 33	 36	 0.97469324	 0.972456	 0.9735733
B-MISC	 825	 103	 97	 0.88900864	 0.8947939	 0.89189196
B-ORG	 1239	 127	 102	 0.90702784	 0.9239374	 0.9154045
B-PER	 1791	 71	 51	 0.96186894	 0.9723127	 0.96706253
tp: 8049 fp: 677 fn: 554 labels: 8
Macro-average	 prec: 0.89844036, rec: 0.90561974, f1: 0.90201575
Micro-average	 prec: 0.9224158, rec: 0.93560386, f1: 0.92896307


Epoch 6/8 started, lr: 0.0019512196, dataset size: 11079


Epoch 7/8 - 262.34s - loss: 330.9146 - batches: 695
Quality on test dataset: 
time to finish evaluation: 18.09s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1760	 74	 77	 0.95965105	 0.9580838	 0.9588668
I-ORG	 630	 36	 121	 0.9459459	 0.8388815	 0.88920254
I-MISC	 283	 93	 63	 0.75265956	 0.8179191	 0.7839335
I-LOC	 225	 20	 32	 0.9183673	 0.8754864	 0.8964143
I-PER	 1273	 32	 34	 0.97547895	 0.9739862	 0.974732
B-MISC	 837	 113	 85	 0.8810526	 0.90780914	 0.8942308
B-ORG	 1230	 96	 111	 0.9276018	 0.91722596	 0.92238474
B-PER	 1801	 70	 41	 0.9625869	 0.9777416	 0.97010505
tp: 8039 fp: 534 fn: 564 labels: 8
Macro-average	 prec: 0.915418, rec: 0.9083917, f1: 0.91189134
Micro-average	 prec: 0.9377114, rec: 0.93444145, f1: 0.93607366


Epoch 8/8 started, lr: 0.0019323673, dataset size: 11079


Epoch 6/8 - 264.34s - loss: 384.8886 - batches: 695
Quality on test dataset: 
time to finish evaluation: 18.07s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1772	 84	 65	 0.95474136	 0.96461624	 0.9596534
I-ORG	 635	 50	 116	 0.9270073	 0.8455393	 0.88440114
I-MISC	 274	 78	 72	 0.77840906	 0.7919075	 0.7851002
I-LOC	 228	 19	 29	 0.9230769	 0.8871595	 0.9047619
I-PER	 1273	 32	 34	 0.97547895	 0.9739862	 0.974732
B-MISC	 842	 125	 80	 0.8707342	 0.9132321	 0.8914769
B-ORG	 1218	 86	 123	 0.93404907	 0.9082774	 0.920983
B-PER	 1791	 65	 51	 0.96497846	 0.9723127	 0.9686317
tp: 8033 fp: 539 fn: 570 labels: 8
Macro-average	 prec: 0.9160594, rec: 0.90712893, f1: 0.9115723
Micro-average	 prec: 0.93712085, rec: 0.933744, f1: 0.9354294


Epoch 7/8 started, lr: 0.0019417477, dataset size: 11079


Epoch 8/8 - 266.21s - loss: 301.41052 - batches: 695
Quality on test dataset: 
time to finish evaluation: 18.45s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1768	 68	 69	 0.962963	 0.96243876	 0.96270084
I-ORG	 658	 49	 93	 0.9306931	 0.8761651	 0.9026063
I-MISC	 267	 56	 79	 0.8266254	 0.7716763	 0.7982063
I-LOC	 228	 14	 29	 0.94214875	 0.8871595	 0.91382766
I-PER	 1272	 35	 35	 0.9732211	 0.9732211	 0.9732211
B-MISC	 834	 98	 88	 0.8948498	 0.9045553	 0.8996764
B-ORG	 1239	 97	 102	 0.9273952	 0.9239374	 0.925663
B-PER	 1806	 94	 36	 0.9505263	 0.98045605	 0.9652592
tp: 8072 fp: 511 fn: 531 labels: 8
Macro-average	 prec: 0.9260528, rec: 0.90995115, f1: 0.9179313
Micro-average	 prec: 0.9404637, rec: 0.93827736, f1: 0.9393693


Epoch 7/8 - 256.62s - loss: 335.06775 - batches: 695
Quality on test dataset: 
time to finish evaluation: 8.79s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1791	 128	 46	 0.9332986	 0.9749592	 0.95367414
I-ORG	 639	 78	 112	 0.8912134	 0.8508655	 0.8705722
I-MISC	 262	 48	 84	 0.8451613	 0.75722545	 0.7987805
I-LOC	 238	 60	 19	 0.7986577	 0.92607003	 0.8576577
I-PER	 1260	 19	 47	 0.9851446	 0.9640398	 0.97447795
B-MISC	 811	 72	 111	 0.9184598	 0.8796095	 0.89861494
B-ORG	 1215	 95	 126	 0.92748094	 0.90604025	 0.9166353
B-PER	 1786	 56	 56	 0.96959823	 0.96959823	 0.96959823
tp: 8002 fp: 556 fn: 601 labels: 8
Macro-average	 prec: 0.90862685, rec: 0.903551, f1: 0.9060818
Micro-average	 prec: 0.93503153, rec: 0.9301407, f1: 0.9325797


Epoch 8/8 started, lr: 0.0019323673, dataset size: 11079


Epoch 8/8 - 133.22s - loss: 299.64578 - batches: 695
Quality on test dataset: 
time to finish evaluation: 8.91s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1746	 56	 91	 0.9689234	 0.9504627	 0.95960426
I-ORG	 673	 77	 78	 0.8973333	 0.8961385	 0.8967355
I-MISC	 270	 43	 76	 0.8626198	 0.7803468	 0.8194234
I-LOC	 223	 10	 34	 0.95708156	 0.8677043	 0.9102041
I-PER	 1272	 41	 35	 0.9687738	 0.9732211	 0.9709923
B-MISC	 832	 109	 90	 0.88416576	 0.9023861	 0.893183
B-ORG	 1264	 143	 77	 0.8983653	 0.94258016	 0.9199418
B-PER	 1801	 76	 41	 0.95950985	 0.9777416	 0.9685399
tp: 8081 fp: 555 fn: 522 labels: 8
Macro-average	 prec: 0.9245966, rec: 0.9113227, f1: 0.91791165
Micro-average	 prec: 0.93573415, rec: 0.9393235, f1: 0.9375254
In [22]:
import pyspark.sql.functions as F

predictions = ner_model.transform(test_data)

predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(truncate=False)
+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |O         |
|-             |O           |O         |
|LEICESTERSHIRE|B-ORG       |B-ORG     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|.             |O           |O         |
|LONDON        |B-LOC       |B-LOC     |
|1996-08-30    |O           |O         |
|West          |B-MISC      |B-MISC    |
|Indian        |I-MISC      |I-MISC    |
|all-rounder   |O           |O         |
|Phil          |B-PER       |B-PER     |
|Simmons       |I-PER       |I-PER     |
|took          |O           |O         |
|four          |O           |O         |
+--------------+------------+----------+
only showing top 20 rows

In [23]:
from sklearn.metrics import classification_report

preds_df = predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).toPandas()

print (classification_report(preds_df['ground_truth'], preds_df['prediction']))
              precision    recall  f1-score   support

       B-LOC       0.97      0.95      0.96      1837
      B-MISC       0.88      0.90      0.89       922
       B-ORG       0.90      0.94      0.92      1341
       B-PER       0.96      0.98      0.97      1842
       I-LOC       0.96      0.87      0.91       257
      I-MISC       0.86      0.78      0.82       346
       I-ORG       0.90      0.90      0.90       751
       I-PER       0.97      0.97      0.97      1307
           O       1.00      1.00      1.00     42759

    accuracy                           0.99     51362
   macro avg       0.93      0.92      0.93     51362
weighted avg       0.99      0.99      0.99     51362
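
The classification_report above scores individual IOB tags. As an optional extra that is not part of the original notebook, entity-level (chunk) metrics could be sketched with the seqeval library; for simplicity the whole test set is passed as a single tag sequence, which ignores sentence boundaries.

# Optional sketch, not in the original notebook: chunk-level metrics with seqeval
# !pip install -q seqeval
from seqeval.metrics import classification_report as chunk_classification_report

# seqeval expects lists of label sequences; here the whole test set is one sequence
y_true = [preds_df['ground_truth'].tolist()]
y_pred = [preds_df['prediction'].tolist()]

print(chunk_classification_report(y_true, y_pred))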

Saving the Trained Model
In [24]:
ner_model.stages
Out[24]:
[WORD_EMBEDDINGS_MODEL_48cffc8b9a76, NerDLModel_6a88a8ead3fd]
In [25]:
ner_model.stages[1].write().overwrite().save("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")
Prediction Pipeline
In [28]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')
    
glove_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

loaded_ner_model = NerDLModel.load("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(stages = [
      document,
      sentence,
      token,
      glove_embeddings,
      loaded_ner_model,
      converter
  ])

empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
In [33]:
text = '''
The final has its own Merseyside subplot, as it will pit Liverpool forwards Mo Salah (of Egypt: pictured above, in white, in the semi-final) and Sadio Mané (of Senegal) against each other. They are just two of the African stars to play for European clubs—the world’s strongest. In fact, only four teams in the English Premier League don’t have a player from the continent. Besides Mr Salah and Mr Mané, Riyad Mahrez of Algeria is at Manchester City, Wilfred Ndidi of Nigeria and Chelsea boasts Edouard Mendy, Senegal’s goalkeeper, and Hakim Ziyech of Morocco. In Italy’s Serie A, Kalidou Koulibaly of Senegal plays for Napoli and Franck Kessie of the Ivory Coast turns out for AC Milan. Eric Maxim Choupo-Moting of Cameroon and Bouna Sarr of Senegal both play for Bayern Munich, the dominant club in Germany’s Bundesliga.
'''

sample_data = spark.createDataFrame([[text]]).toDF("text")

sample_data.show(truncate=False)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
The final has its own Merseyside subplot, as it will pit Liverpool forwards Mo Salah (of Egypt: pictured above, in white, in the semi-final) and Sadio Mané (of Senegal) against each other. They are just two of the African stars to play for European clubs—the world’s strongest. In fact, only four teams in the English Premier League don’t have a player from the continent. Besides Mr Salah and Mr Mané, Riyad Mahrez of Algeria is at Manchester City, Wilfred Ndidi of Nigeria and Chelsea boasts Edouard Mendy, Senegal’s goalkeeper, and Hakim Ziyech of Morocco. In Italy’s Serie A, Kalidou Koulibaly of Senegal plays for Napoli and Franck Kessie of the Ivory Coast turns out for AC Milan. Eric Maxim Choupo-Moting of Cameroon and Bouna Sarr of Senegal both play for Bayern Munich, the dominant club in Germany’s Bundesliga.
|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In [34]:
preds = prediction_model.transform(sample_data)

preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
     .select(F.expr("entities['0']").alias("chunk"),
             F.expr("entities['1'].entity").alias("entity")).show(truncate=False)
+---------------+------+
|chunk          |entity|
+---------------+------+
|Merseyside     |ORG   |
|Liverpool      |ORG   |
|Mo Salah       |PER   |
|Egypt          |LOC   |
|Sadio Mané     |PER   |
|Senegal        |LOC   |
|African        |MISC  |
|European       |MISC  |
|English        |MISC  |
|Premier League |ORG   |
|Mr Salah       |PER   |
|Mr Mané        |PER   |
|Riyad Mahrez   |PER   |
|Algeria        |LOC   |
|Manchester City|LOC   |
|Wilfred Ndidi  |PER   |
|Nigeria        |LOC   |
|Chelsea        |ORG   |
|Edouard Mendy  |PER   |
|Senegal’s      |PER   |
+---------------+------+
only showing top 20 rows

In [35]:
from sparknlp.base import LightPipeline

light_model = LightPipeline(prediction_model)

result = light_model.annotate(text)

list(zip(result['token'], result['ner']))
Out[35]:
[('The', 'O'),
 ('final', 'O'),
 ('has', 'O'),
 ('its', 'O'),
 ('own', 'O'),
 ('Merseyside', 'B-ORG'),
 ('subplot', 'O'),
 (',', 'O'),
 ('as', 'O'),
 ('it', 'O'),
 ('will', 'O'),
 ('pit', 'O'),
 ('Liverpool', 'B-ORG'),
 ('forwards', 'O'),
 ('Mo', 'B-PER'),
 ('Salah', 'I-PER'),
 ('(', 'O'),
 ('of', 'O'),
 ('Egypt', 'B-LOC'),
 (':', 'O'),
 ('pictured', 'O'),
 ('above', 'O'),
 (',', 'O'),
 ('in', 'O'),
 ('white', 'O'),
 (',', 'O'),
 ('in', 'O'),
 ('the', 'O'),
 ('semi-final', 'O'),
 (')', 'O'),
 ('and', 'O'),
 ('Sadio', 'B-PER'),
 ('Mané', 'I-PER'),
 ('(', 'O'),
 ('of', 'O'),
 ('Senegal', 'B-LOC'),
 (')', 'O'),
 ('against', 'O'),
 ('each', 'O'),
 ('other', 'O'),
 ('.', 'O'),
 ('They', 'O'),
 ('are', 'O'),
 ('just', 'O'),
 ('two', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('African', 'B-MISC'),
 ('stars', 'O'),
 ('to', 'O'),
 ('play', 'O'),
 ('for', 'O'),
 ('European', 'B-MISC'),
 ('clubs—the', 'O'),
 ('world’s', 'O'),
 ('strongest', 'O'),
 ('.', 'O'),
 ('In', 'O'),
 ('fact', 'O'),
 (',', 'O'),
 ('only', 'O'),
 ('four', 'O'),
 ('teams', 'O'),
 ('in', 'O'),
 ('the', 'O'),
 ('English', 'B-MISC'),
 ('Premier', 'B-ORG'),
 ('League', 'I-ORG'),
 ('don’t', 'O'),
 ('have', 'O'),
 ('a', 'O'),
 ('player', 'O'),
 ('from', 'O'),
 ('the', 'O'),
 ('continent', 'O'),
 ('.', 'O'),
 ('Besides', 'O'),
 ('Mr', 'B-PER'),
 ('Salah', 'I-PER'),
 ('and', 'O'),
 ('Mr', 'B-PER'),
 ('Mané', 'I-PER'),
 (',', 'O'),
 ('Riyad', 'B-PER'),
 ('Mahrez', 'I-PER'),
 ('of', 'O'),
 ('Algeria', 'B-LOC'),
 ('is', 'O'),
 ('at', 'O'),
 ('Manchester', 'B-LOC'),
 ('City', 'I-LOC'),
 (',', 'O'),
 ('Wilfred', 'B-PER'),
 ('Ndidi', 'I-PER'),
 ('of', 'O'),
 ('Nigeria', 'B-LOC'),
 ('and', 'O'),
 ('Chelsea', 'B-ORG'),
 ('boasts', 'O'),
 ('Edouard', 'B-PER'),
 ('Mendy', 'I-PER'),
 (',', 'O'),
 ('Senegal’s', 'B-PER'),
 ('goalkeeper', 'O'),
 (',', 'O'),
 ('and', 'O'),
 ('Hakim', 'B-PER'),
 ('Ziyech', 'I-PER'),
 ('of', 'O'),
 ('Morocco', 'B-LOC'),
 ('.', 'O'),
 ('In', 'O'),
 ('Italy’s', 'B-MISC'),
 ('Serie', 'I-MISC'),
 ('A', 'I-MISC'),
 (',', 'O'),
 ('Kalidou', 'B-PER'),
 ('Koulibaly', 'I-PER'),
 ('of', 'O'),
 ('Senegal', 'B-LOC'),
 ('plays', 'O'),
 ('for', 'O'),
 ('Napoli', 'B-ORG'),
 ('and', 'O'),
 ('Franck', 'B-PER'),
 ('Kessie', 'I-PER'),
 ('of', 'O'),
 ('the', 'O'),
 ('Ivory', 'B-LOC'),
 ('Coast', 'I-LOC'),
 ('turns', 'O'),
 ('out', 'O'),
 ('for', 'O'),
 ('AC', 'B-ORG'),
 ('Milan', 'I-ORG'),
 ('.', 'O'),
 ('Eric', 'B-PER'),
 ('Maxim', 'I-PER'),
 ('Choupo-Moting', 'I-PER'),
 ('of', 'O'),
 ('Cameroon', 'B-LOC'),
 ('and', 'O'),
 ('Bouna', 'B-PER'),
 ('Sarr', 'I-PER'),
 ('of', 'O'),
 ('Senegal', 'B-LOC'),
 ('both', 'O'),
 ('play', 'O'),
 ('for', 'O'),
 ('Bayern', 'B-ORG'),
 ('Munich', 'I-ORG'),
 (',', 'O'),
 ('the', 'O'),
 ('dominant', 'O'),
 ('club', 'O'),
 ('in', 'O'),
 ('Germany’s', 'B-MISC'),
 ('Bundesliga', 'I-MISC'),
 ('.', 'O')]
In [37]:
import pandas as pd

result = light_model.fullAnnotate(text)

ner_df= pd.DataFrame([(int(x.metadata['sentence']), x.result, x.begin, x.end, y.result) for x,y in zip(result[0]["token"], result[0]["ner"])], 
                      columns=['sent_id','token','start','end','ner'])
ner_df.head(15)
Out[37]:
sent_id	token	start	end	ner
0	0	The	1	3	O
1	0	final	5	9	O
2	0	has	11	13	O
3	0	its	15	17	O
4	0	own	19	21	O
5	0	Merseyside	23	32	B-ORG
6	0	subplot	34	40	O
7	0	,	41	41	O
8	0	as	43	44	O
9	0	it	46	47	O
10	0	will	49	52	O
11	0	pit	54	56	O
12	0	Liverpool	58	66	B-ORG
13	0	forwards	68	75	O
14	0	Mo	77	78	B-PER
Highlight Entities
In [38]:
ann_text = light_model.fullAnnotate(text)[0]
ann_text.keys()
Out[38]:
dict_keys(['document', 'ner_span', 'token', 'ner', 'embeddings', 'sentence'])
In [39]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
print ('Standard Output')
visualiser.display(ann_text, label_col='ner_span', document_col='document')
Standard Output

The final has its own Merseyside ORG subplot, as it will pit Liverpool ORG forwards Mo Salah PER (of Egypt LOC: pictured above, in white, in the semi-final) and Sadio Mané PER (of Senegal LOC) against each other. They are just two of the African MISC stars to play for European MISC clubs—the world’s strongest. In fact, only four teams in the English MISC Premier League ORG don’t have a player from the continent. Besides Mr Salah PER and Mr Mané PER, Riyad Mahrez PER of Algeria LOC is at Manchester City LOC, Wilfred Ndidi PER of Nigeria LOC and Chelsea ORG boasts Edouard Mendy PER, Senegal’s PER goalkeeper, and Hakim Ziyech PER of Morocco LOC. In Italy’s Serie A MISC, Kalidou Koulibaly PER of Senegal LOC plays for Napoli ORG and Franck Kessie PER of the Ivory Coast LOC turns out for AC Milan ORG. Eric Maxim Choupo-Moting PER of Cameroon LOC and Bouna Sarr PER of Senegal LOC both play for Bayern Munich ORG, the dominant club in Germany’s Bundesliga MISC.
Streamlit
In [14]:
! pip install -q pyspark==3.1.2 spark-nlp

! pip install -q spark-nlp-display
In [ ]:
!pip install streamlit

!pip install pyngrok==4.1.1
In [2]:
! wget https://raw.githubusercontent.com/gokhanturer/JSL/main/streamlit_me_ner_model.py
--2022-02-06 22:39:33--  https://raw.githubusercontent.com/gokhanturer/JSL/main/streamlit_me_ner_model.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7979 (7.8K) [text/plain]
Saving to: streamlit_me_ner_model.py.3

streamlit_me_ner_mo 100%[===================>]   7.79K  --.-KB/s    in 0s      

2022-02-06 22:39:34 (93.4 MB/s) - streamlit_me_ner_model.py.3 saved [7979/7979]
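
The demo itself lives in the downloaded streamlit_me_ner_model.py, whose contents are not reproduced here. As a rough, hypothetical sketch (not the actual script), a minimal Streamlit front end built around the pipeline trained above could look like the following; the saved model path is the one used earlier, and everything else is an assumption.

# hypothetical_ner_app.py -- a minimal Streamlit sketch, NOT the real streamlit_me_ner_model.py
import streamlit as st
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Rebuild the prediction pipeline around the saved NerDLModel
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
ner = NerDLModel.load("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")\
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_span")

empty_df = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = Pipeline(stages=[document, sentence, token, embeddings, ner, converter]).fit(empty_df)
light_model = LightPipeline(pipeline_model)

st.title("CoNLL-2003 NER demo")
user_text = st.text_area("Enter some text", "Mo Salah plays for Liverpool.")

if st.button("Annotate"):
    annotations = light_model.fullAnnotate(user_text)[0]
    rows = [(chunk.result, chunk.metadata["entity"]) for chunk in annotations["ner_span"]]
    st.table(rows)

In a real app the Spark session and fitted pipeline would be created once (and cached) rather than rebuilt on every Streamlit rerun.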

In [3]:
!ngrok authtoken 24jtZ2Watn1mc1bSG6v19fel7p1_2bYeRjRkniKqqhfgRs6ub
Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml
In [5]:
!streamlit run streamlit_me_ner_model.py &>/dev/null&
In [6]:
from pyngrok import ngrok

public_url = ngrok.connect(port='8501')
public_url
Out[6]:
'http://2d54-34-125-109-11.ngrok.io'
In [7]:
!killall ngrok

public_url = ngrok.connect(port='8501')
public_url
Out[7]:
'http://df30-34-125-109-11.ngrok.io'

Results

+---------------+------+
|chunk          |entity|
+---------------+------+
|Merseyside     |ORG   |
|Liverpool      |ORG   |
|Mo Salah       |PER   |
|Egypt          |LOC   |
|Sadio Mané     |PER   |
|Senegal        |LOC   |
|African        |MISC  |
|European       |MISC  |
|English        |MISC  |
|Premier League |ORG   |
|Mr Salah       |PER   |
|Mr Mané        |PER   |
|Riyad Mahrez   |PER   |
|Algeria        |LOC   |
|Manchester City|LOC   |
|Wilfred Ndidi  |PER   |
|Nigeria        |LOC   |
|Chelsea        |ORG   |
|Edouard Mendy  |PER   |
|Senegal’s      |PER   |
+---------------+------+

Model Information

Model Name: Ner_conll2003_100d
Type: ner
Compatibility: Spark NLP 3.1.2+
License: Open Source
Edition: Community
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.3 MB
Dependencies: glove100d

References

This model was trained on the CoNLL 2003 data from:
https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa

Benchmarking

       label  precision    recall  f1-score   support
       B-LOC       0.97      0.95      0.96      1837
      B-MISC       0.88      0.90      0.89       922
       B-ORG       0.90      0.94      0.92      1341
       B-PER       0.96      0.98      0.97      1842
       I-LOC       0.96      0.87      0.91       257
      I-MISC       0.86      0.78      0.82       346
       I-ORG       0.90      0.90      0.90       751
       I-PER       0.97      0.97      0.97      1307
           O       1.00      1.00      1.00     42759
    accuracy          -         -      0.99     51362
   macro-avg       0.93      0.92      0.93     51362
weighted-avg       0.99      0.99      0.99     51362