Extract Tickers from Financial Texts

Description

This model detects trading symbols (tickers) in text. You can then use Chunk Mappers to retrieve more information about the company each ticker belongs to.

This is a light version of the model, trained on tweets. You can find heavier, transformer-based models (specifically RoBERTa-based) in our Models Hub.

Predicted Entities

TICKER


How to use

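from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Assumption: a Finance NLP license is already configured in this environment.
spark = nlp.start()

# Wrap the raw text column in Spark NLP's document annotation
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")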

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler, 
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
    ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

data = spark.createDataFrame([["""TTSLA, DTV, AMZN, NFLX and GPRO continue to look good here. All ıf them need to continue and make it into"""]]).toDF("text")

result = model.transform(data)

# Pair each extracted chunk with its metadata, then show one row per chunk
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
    .select(F.expr("cols['0']").alias("ticker"),
            F.expr("cols['1']['entity']").alias("label")).show()

Results

+------+------+
|ticker| label|
+------+------+
| TTSLA|TICKER|
|   DTV|TICKER|
|  AMZN|TICKER|
|  NFLX|TICKER|
|  GPRO|TICKER|
+------+------+
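
Once the chunks are collected to the driver, a small post-processing step is often useful. The following is a minimal sketch in plain Python, not part of the model or the library: it assumes the rows have already been collected as (ticker, label) pairs, and the helper names normalize_ticker and count_tickers are hypothetical. It strips a leading "$" cashtag, upper-cases each symbol, and counts occurrences.

```python
from collections import Counter


def normalize_ticker(chunk: str) -> str:
    """Trim whitespace, drop a leading '$' cashtag, and upper-case the symbol."""
    return chunk.strip().lstrip("$").upper()


def count_tickers(rows):
    """Count normalized symbols among (chunk, label) pairs labeled TICKER."""
    return Counter(
        normalize_ticker(chunk) for chunk, label in rows if label == "TICKER"
    )


# Rows copied from the Results table above, as they might look after collection.
rows = [
    ("TTSLA", "TICKER"),
    ("DTV", "TICKER"),
    ("AMZN", "TICKER"),
    ("NFLX", "TICKER"),
    ("GPRO", "TICKER"),
]

counts = count_tickers(rows)
print(sorted(counts))
```

Normalizing before counting keeps "$amzn" and "AMZN" from being treated as different symbols when results from many texts are aggregated.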

Model Information

Model Name: finner_ticker
Type: finance
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 1.2 MB

References

Original dataset (https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020), plus weak labelling on in-house texts.

Benchmarking

       label  precision    recall  f1-score   support
      TICKER       0.97      0.96      0.97      9823
   micro-avg       0.97      0.96      0.97      9823
   macro-avg       0.97      0.96      0.97      9823
weighted-avg       0.97      0.96      0.97      9823
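
As a reading aid for the table above, here is a small sketch of how the F1 score relates to precision and recall (it is their harmonic mean). The function name f1_score is illustrative only. Recomputing from the rounded values in the table gives roughly 0.965, which is consistent with the reported 0.97, presumably because the report was computed from unrounded precision and recall. Because the model predicts a single label, the micro, macro, and weighted averages coincide with the TICKER row.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the TICKER row of the benchmark table.
f1 = f1_score(0.97, 0.96)
print(round(f1, 3))
```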