Description
This model detects trading symbols (tickers) in text. You can then use Chunk Mappers to retrieve more information about the company a ticker belongs to.
This is a light version of the model, trained on tweets. Heavier, transformer-based versions (more specifically, RoBERTa-based) are available in our Models Hub.
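The ticker-to-company lookup itself is done by a separate Chunk Mapper model; the sketch below only illustrates the idea with a hypothetical, hand-built dictionary (the `COMPANY_INFO` table and `enrich` helper are assumptions for illustration, not part of this model or the Chunk Mapper API):

```python
# Hypothetical lookup table standing in for a real Chunk Mapper;
# the NER model itself only labels spans as TICKER.
COMPANY_INFO = {
    "AMZN": {"name": "Amazon.com, Inc.", "exchange": "NASDAQ"},
    "NFLX": {"name": "Netflix, Inc.", "exchange": "NASDAQ"},
}

def enrich(tickers):
    """Attach company metadata to each detected ticker, when known."""
    return {t: COMPANY_INFO.get(t) for t in tickers}

print(enrich(["AMZN", "NFLX", "GPRO"]))
```

Tickers absent from the mapping resolve to `None`, mirroring how a mapper can only enrich entities it has relations for.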
Predicted Entities
TICKER
How to use
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter
])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([["""TTSLA, DTV, AMZN, NFLX and GPRO continue to look good here. All ıf them need to continue and make it into"""]]).toDF("text")
result = model.transform(data)
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("ticker"),
F.expr("cols['1']['entity']").alias("label")).show()
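The `arrays_zip` + `explode` projection pairs each chunk with its metadata entry. The same pairing logic in plain Python, using the chunk values shown in the Results section (illustrative stand-ins for the Spark arrays, not actual pipeline output):

```python
# ner_chunk.result and ner_chunk.metadata are parallel arrays per row;
# arrays_zip + explode is equivalent to zipping and flattening them.
chunks = ["TTSLA", "DTV", "AMZN", "NFLX", "GPRO"]   # ner_chunk.result
metadata = [{"entity": "TICKER"}] * len(chunks)      # ner_chunk.metadata
rows = [(chunk, meta["entity"]) for chunk, meta in zip(chunks, metadata)]
print(rows)
```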
Results
+------+------+
|ticker| label|
+------+------+
| TTSLA|TICKER|
| DTV|TICKER|
| AMZN|TICKER|
| NFLX|TICKER|
| GPRO|TICKER|
+------+------+
Model Information
| Model Name: | finner_ticker |
| Type: | finance |
| Compatibility: | Finance NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | en |
| Size: | 1.2 MB |
References
Original dataset (https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020) and weak labelling on in-house texts
Benchmarking
label         precision  recall  f1-score  support
TICKER             0.97    0.96      0.97     9823
micro-avg          0.97    0.96      0.97     9823
macro-avg          0.97    0.96      0.97     9823
weighted-avg       0.97    0.96      0.97     9823
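As a sanity check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). With the rounded values shown (0.97 precision, 0.96 recall) this evaluates to roughly 0.965; the reported 0.97 comes from the unrounded scores:

```python
# Recompute F1 from the (rounded) precision and recall in the table.
precision, recall = 0.97, 0.96
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # → 0.965, consistent with the reported 0.97 after rounding
```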