Detect Entities in Twitter texts

Description

  • This model was trained on the Broad Twitter Corpus (BTC) dataset, so it can detect entities in Twitter-based texts.
  • It was trained with the BertForTokenClassification annotator, which uses a deep learning (torch) approach.
  • The bert_base_cased embeddings are bundled inside the model, so you don’t need a separate embeddings stage in the NLP pipeline; a sketch of the remaining pipeline stages follows the Python example below.

Predicted Entities

PER, ORG, LOC

How to use

...

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")\
  .setInputCols(["document", "token"])\
  .setOutputCol("ner")\
  .setCaseSensitive(True)

ner_converter = NerConverter()\
        .setInputCols(["document","token","ner"])\
        .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])

model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
test_sentences = ["""Wengers big mistakes is not being ruthless enough with bad players.""", """my dream FUUUUUULHAAAAAAM !!!.."""]
result = model.transform(spark.createDataFrame(pd.DataFrame({'text': test_sentences})))
...
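
The snippet above elides the pipeline prelude. A minimal sketch of those stages, assuming the standard Spark NLP DocumentAssembler and Tokenizer (imports and session setup included for completeness):

import pandas as pd
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Wraps the raw "text" column into a "document" annotation column.
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Splits each document into tokens for the classifier.
tokenizer = Tokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")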

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")
  .setInputCols("document", "token")
  .setOutputCol("ner")
  .setCaseSensitive(true)

val ner_converter = NerConverter()
        .setInputCols(Array("document","token","ner"))
        .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))

val data = Seq("Wengers big mistakes is not being ruthless enough with bad players.", "my dream FUUUUUULHAAAAAAM !!!..").toDF("text")

val result = pipeline.fit(data).transform(data)
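
To flatten the result into the chunk/label table shown under Results, you can explode the ner_chunk annotations. A minimal Python sketch, assuming the column names from the pipeline above:

from pyspark.sql import functions as F

# Each ner_chunk entry is an annotation struct: "result" holds the chunk
# text, and metadata["entity"] holds the predicted label.
result.select(F.explode("ner_chunk").alias("col"))\
  .select(F.col("col.result").alias("chunk"),
          F.col("col.metadata")["entity"].alias("ner_label"))\
  .show(truncate=False)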

Results

+----------------+---------+
|chunk           |ner_label|
+----------------+---------+
|Wengers         |PER      |
|FUUUUUULHAAAAAAM|ORG      |
+----------------+---------+

Model Information

Model Name: bert_token_classifier_ner_btc
Compatibility: Spark NLP 3.2.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Case sensitive: true
Max sentence length: 128
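
Texts longer than the 128-token limit should be split into sentences before they reach the classifier. A minimal sketch, assuming the standard Spark NLP SentenceDetector (the stage wiring is illustrative):

from sparknlp.annotator import SentenceDetector

# Splits long documents into sentences so each unit stays within the
# model's 128-token limit.
sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# The tokenizer and token classifier would then read from "sentence"
# instead of "document".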

Data Source

https://github.com/juand-r/entity-recognition-datasets/tree/master/data/BTC

Benchmarking

              precision    recall  f1-score   support

       B-LOC       0.90      0.79      0.84       536
       B-ORG       0.80      0.79      0.79       821
       B-PER       0.95      0.62      0.75      1575
       I-LOC       0.96      0.76      0.85       181
       I-ORG       0.88      0.81      0.84       217
       I-PER       0.99      0.91      0.95       315
           O       0.97      0.99      0.98     26217

    accuracy                           0.96     29862
   macro avg       0.92      0.81      0.86     29862
weighted avg       0.96      0.96      0.96     29862