Extract Tickers from Financial Texts (RoBERTa)

Description

This model detects trading symbols (tickers) in financial texts. You can then use Chunk Mappers to retrieve more information about the company a ticker belongs to (a sketch of that step follows the Results section below). This is a RoBERTa-based model; other, lighter versions of this model are available in the Models Hub.

Predicted Entities

TICKER


How to use

# The pipeline below assumes the johnsnowlabs library (with a Finance NLP license);
# nlp.start() creates the licensed Spark session.
from johnsnowlabs import nlp

spark = nlp.start()

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("finner_roberta_ticker", "en", "finance/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline().setStages([document_assembler,
                                     tokenizer,
                                     tokenClassifier,
                                     ner_converter])

text = ["""There are some serious purchases and sales of AMZN stock today."""]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res = model.transform(test_data)

# Collect only the chunk text, i.e. the detected ticker
res.select("ner_chunk.result").collect()

Results

['AMZN']
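
As noted in the Description, the extracted ticker chunk can be fed to a Chunk Mapper to retrieve company information. The sketch below is a hypothetical extension of the pipeline above: the mapper name finmapper_nasdaq_ticker_companyname and the relation company_name are assumptions for illustration only, so check the Models Hub for the exact pretrained mapper and its available relations.

from johnsnowlabs import finance

# Hypothetical mapper and relation names; verify the exact pretrained
# ChunkMapper and its relations in the Models Hub before use.
chunk_mapper = finance.ChunkMapperModel.pretrained("finmapper_nasdaq_ticker_companyname", "en", "finance/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["company_name"])

mapping_pipeline = nlp.Pipeline().setStages([document_assembler,
                                             tokenizer,
                                             tokenClassifier,
                                             ner_converter,
                                             chunk_mapper])

mapping_model = mapping_pipeline.fit(test_data)
mapping_model.transform(test_data).select("mappings.result").show(truncate=False)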

Model Information

Model Name: finner_roberta_ticker
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Size: 465.3 MB
Case sensitive: true
Max sentence length: 256

References

Original dataset (https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020), combined with weak labelling on in-house texts.

Benchmarking

       label  precision    recall  f1-score   support
      TICKER       0.98      0.97      0.98      9823
   micro-avg       0.98      0.97      0.98      9823
   macro-avg       0.98      0.97      0.98      9823
weighted-avg       0.98      0.97      0.98      9823