Description
- This model was trained on the Broad Twitter Corpus (BTC) dataset, so it can successfully detect entities in Twitter-based texts.
- It is based on bert_base_cased embeddings, which are bundled with the model, so you don't need a separate embeddings component in the NLP pipeline.
Predicted Entities
PER, LOC, ORG
How to use
import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForTokenClassification, NerConverter

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")
tokenizer = Tokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")\
  .setInputCols(["token", "document"])\
  .setOutputCol("ner")\
  .setCaseSensitive(True)
ner_converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])
model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
test_sentences = ["""Pentagram's Dominic Lippa is working on a new identity for University of Arts London."""]
result = model.transform(spark.createDataFrame(pd.DataFrame({'text': test_sentences})))
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Tokenizer, BertForTokenClassification, NerConverter}
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")
  .setInputCols("token", "document")
  .setOutputCol("ner")
  .setCaseSensitive(true)
val ner_converter = new NerConverter()
  .setInputCols(Array("document", "token", "ner"))
  .setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))
val data = Seq("Pentagram's Dominic Lippa is working on a new identity for University of Arts London.").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.classify.token_bert.classifier_ner_btc").predict("""Pentagram's Dominic Lippa is working on a new identity for University of Arts London.""")
Results
+--------------------------+---------+
|chunk |ner_label|
+--------------------------+---------+
|Pentagram's |ORG |
|Dominic Lippa |PER |
|University of Arts London |ORG |
+--------------------------+---------+
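The chunks above come from NerConverter, which merges token-level BIO tags (B-PER, I-PER, etc.) into whole entity spans. As a rough illustration of that merging logic, here is a standalone sketch; the tokens and tags mirror the example sentence but are illustrative, not actual model output.

```python
def merge_bio(tokens, tags):
    """Merge token-level BIO tags into (chunk, label) pairs.

    A B- tag starts a new chunk; I- tags of the same type extend it;
    anything else (O, or a mismatched I- tag) closes the open chunk.
    """
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Pentagram's", "Dominic", "Lippa", "is", "working", "on", "a", "new",
          "identity", "for", "University", "of", "Arts", "London", "."]
tags = ["B-ORG", "B-PER", "I-PER", "O", "O", "O", "O", "O",
        "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
for chunk, label in merge_bio(tokens, tags):
    print(chunk, label)
```

Run on these tags, the sketch reproduces the three chunks in the table above.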
Model Information
Model Name: bert_token_classifier_ner_btc
Compatibility: Spark NLP 3.2.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Case sensitive: true
Max sentence length: 128
Data Source
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/BTC
Benchmarking
label precision recall f1-score support
B-LOC 0.90 0.79 0.84 536
B-ORG 0.80 0.79 0.79 821
B-PER 0.95 0.62 0.75 1575
I-LOC 0.96 0.76 0.85 181
I-ORG 0.88 0.81 0.84 217
I-PER 0.99 0.91 0.95 315
O 0.97 0.99 0.98 26217
accuracy - - 0.96 29862
macro-avg 0.92 0.81 0.86 29862
weighted-avg 0.96 0.96 0.96 29862
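The macro-avg row is the unweighted mean of the seven per-label rows, which can be cross-checked from the figures in the table above:

```python
# Per-label scores copied from the Benchmarking table
# (order: B-LOC, B-ORG, B-PER, I-LOC, I-ORG, I-PER, O).
precision = [0.90, 0.80, 0.95, 0.96, 0.88, 0.99, 0.97]
recall    = [0.79, 0.79, 0.62, 0.76, 0.81, 0.91, 0.99]
f1        = [0.84, 0.79, 0.75, 0.85, 0.84, 0.95, 0.98]

def macro(xs):
    """Unweighted mean, rounded to two decimals as in the table."""
    return round(sum(xs) / len(xs), 2)

print(macro(precision), macro(recall), macro(f1))  # 0.92 0.81 0.86
```

These match the reported macro-avg of 0.92 / 0.81 / 0.86; the weighted-avg differs because it weights each label by its support, so the O class (26217 tokens) dominates.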