Description
This model detects trading symbols (tickers) in text. You can then use Chunk Mappers to retrieve more information about the company each ticker belongs to. It is a RoBERTa-based model; lighter versions are available in the Models Hub.
Predicted Entities
TICKER
How to use
from johnsnowlabs import nlp

# Start a Spark session (assumes a valid John Snow Labs license is configured)
spark = nlp.start()

# Wrap the raw text column into Spark NLP documents
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into tokens
tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Tag each token with the pretrained RoBERTa ticker classifier
tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("finner_roberta_ticker", "en", "finance/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter,
])

text = ["""There are some serious purchases and sales of AMZN stock today."""]
test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res = model.transform(test_data)
res.select("ner_chunk").collect()
Results
['AMZN']
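The `collect()` call returns Spark rows whose `ner_chunk` field holds annotation structs, not bare strings, so getting to a list like the one shown takes one more step. A minimal sketch of that step follows. The helper name `extract_chunks` and the `Annotation` stand-in are hypothetical, introduced only so the example runs without a Spark session; the stand-in models just a few of the fields a real annotation carries.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark NLP annotation struct; only the fields
# used in this sketch are modelled.
Annotation = namedtuple("Annotation", ["begin", "end", "result", "metadata"])


def extract_chunks(rows, column="ner_chunk"):
    """Return one list of chunk strings per collected row."""
    return [[ann.result for ann in row[column]] for row in rows]


text = "There are some serious purchases and sales of AMZN stock today."
begin = text.index("AMZN")

# Stand-in for the collected rows: one row containing one TICKER chunk.
# The end offset is written as inclusive here.
rows = [
    {"ner_chunk": [Annotation(begin, begin + len("AMZN") - 1, "AMZN", {"entity": "TICKER"})]}
]

print(extract_chunks(rows))  # [['AMZN']]
```

With a live pipeline, the same helper should accept the collected rows directly, for example `extract_chunks(res.select("ner_chunk").collect())`, since Spark `Row` objects support both name-based indexing and attribute access.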
Model Information
| Model Name: | finner_roberta_ticker |
| Compatibility: | Finance NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [ner] |
| Language: | en |
| Size: | 465.3 MB |
| Case sensitive: | true |
| Max sentence length: | 256 |
References
Original dataset (https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020) and weak labelling on in-house texts
Benchmarking
label         precision  recall  f1-score  support
TICKER        0.98       0.97    0.98      9823
micro-avg     0.98       0.97    0.98      9823
macro-avg     0.98       0.97    0.98      9823
weighted-avg  0.98       0.97    0.98      9823
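All the average rows repeat the TICKER row because the model predicts a single entity label. The sketch below illustrates this for the macro and weighted averages, using only the rounded figures from the table above. The function names are illustrative, not part of any library. The micro average is left out because it needs raw true-positive, false-positive, and false-negative counts, which the table does not report.

```python
# Per-label scores copied from the benchmark table (rounded values).
per_label = {
    "TICKER": {"precision": 0.98, "recall": 0.97, "f1-score": 0.98, "support": 9823},
}


def macro_avg(scores, metric):
    """Unweighted mean of a metric over all labels."""
    return sum(s[metric] for s in scores.values()) / len(scores)


def weighted_avg(scores, metric):
    """Mean of a metric over all labels, weighted by each label's support."""
    total = sum(s["support"] for s in scores.values())
    return sum(s[metric] * s["support"] for s in scores.values()) / total


for metric in ("precision", "recall", "f1-score"):
    # With one label, both averages reduce to that label's own score.
    print(metric, round(macro_avg(per_label, metric), 2), round(weighted_avg(per_label, metric), 2))
```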