Description
This is a Named Entity Recognition (NER) model for numeric financial items. It identifies 12 numeric financial entities in 10-Q and 10-K reports. The entities are annotated with eXtensible Business Reporting Language (XBRL) tags. Annotation primarily targets numeric tokens, so the surrounding context is crucial for assigning the correct entity type out of the 139 most common financial entities in the dataset.
This is a large (md) model, trained with 200K sentences.
Predicted Entities
SharePrice
SharebasedCompensationArrangementBySharebasedPaymentAwardAwardVestingRightsPercentage
StockRepurchaseProgramRemainingAuthorizedRepurchaseAmount1
ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodWeightedAverageGrantDateFairValue
ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAuthorized
ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAvailableForGrant
SharebasedCompensationArrangementBySharebasedPaymentAwardExpirationPeriod
ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodGross
StockRepurchasedAndRetiredDuringPeriodShares
StockRepurchaseProgramAuthorizedAmount1
ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsExercisesInPeriodTotalIntrinsicValue
StockIssuedDuringPeriodSharesNewIssues
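Two of these names are spelled with "Sharebased" and the rest with "ShareBased", so downstream code that matches on label names should compare them exactly as listed. The snippet below is a minimal sketch of such a check. The helper names `base_entity` and `is_known_entity` are hypothetical, and the sketch assumes the model emits labels with the usual `B-`/`I-` prefixes (the Results table shows a `B-` label).

```python
# Entity names copied verbatim from the "Predicted Entities" list.
PREDICTED_ENTITIES = frozenset({
    "SharePrice",
    "SharebasedCompensationArrangementBySharebasedPaymentAwardAwardVestingRightsPercentage",
    "StockRepurchaseProgramRemainingAuthorizedRepurchaseAmount1",
    "ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodWeightedAverageGrantDateFairValue",
    "ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAuthorized",
    "ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAvailableForGrant",
    "SharebasedCompensationArrangementBySharebasedPaymentAwardExpirationPeriod",
    "ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodGross",
    "StockRepurchasedAndRetiredDuringPeriodShares",
    "StockRepurchaseProgramAuthorizedAmount1",
    "ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsExercisesInPeriodTotalIntrinsicValue",
    "StockIssuedDuringPeriodSharesNewIssues",
})


def base_entity(label):
    """Return the entity name behind a tagged label, or None for 'O'."""
    if label == "O":
        return None
    prefix, sep, name = label.partition("-")
    # Strip only a recognised B-/I- prefix; leave any other label unchanged.
    return name if sep and prefix in ("B", "I") else label


def is_known_entity(label):
    """True if the label (with or without prefix) is one of the 12 entities."""
    return base_entity(label) in PREDICTED_ENTITIES
```

The comparison is deliberately case-sensitive, so a label such as `B-sharePrice` is rejected rather than silently normalised.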
How to use
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark session with the licensed Finance NLP library available
spark = nlp.start()

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Split punctuation and currency symbols off the numeric tokens
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

nerTagger = finance.NerModel.pretrained('finner_10q_xbrl_md_subset11', 'en', 'finance/models') \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

text = "The fair value of the stock option grants below were estimated on the date of the grant using a Black - Scholes valuation model and the assumptions in the following table : On December 1 , 2015 , the Company granted non - qualifed stock options under the Plan for 75,000 shares each to four directors : Sardar Biglari , Philip Cooley , Christopher Hogg and S."

df = spark.createDataFrame([[text]]).toDF("text")
fit = pipeline.fit(df)
result = fit.transform(df)

# One row per token: the token text, its NER label, and the label's confidence
result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ner_label"),
            F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)
Results
+-----------+-------------------------------------------------------------------------------------+----------+
|token |ner_label |confidence|
+-----------+-------------------------------------------------------------------------------------+----------+
|The |O |1.0 |
|fair |O |1.0 |
|value |O |1.0 |
|of |O |1.0 |
|the |O |1.0 |
|stock |O |1.0 |
|option |O |1.0 |
|grants |O |1.0 |
|below |O |1.0 |
|were |O |1.0 |
|estimated |O |1.0 |
|on |O |1.0 |
|the |O |1.0 |
|date |O |1.0 |
|of |O |1.0 |
|the |O |1.0 |
|grant |O |1.0 |
|using |O |1.0 |
|a |O |1.0 |
|Black |O |1.0 |
|- |O |1.0 |
|Scholes |O |1.0 |
|valuation |O |1.0 |
|model |O |1.0 |
|and |O |1.0 |
|the |O |1.0 |
|assumptions|O |1.0 |
|in |O |1.0 |
|the |O |1.0 |
|following |O |1.0 |
|table |O |1.0 |
|: |O |1.0 |
|On |O |1.0 |
|December |O |1.0 |
|1 |O |1.0 |
|, |O |1.0 |
|2015 |O |1.0 |
|, |O |1.0 |
|the |O |1.0 |
|Company |O |1.0 |
|granted |O |1.0 |
|non |O |0.9995 |
|- |O |1.0 |
|qualifed |O |1.0 |
|stock |O |1.0 |
|options |O |1.0 |
|under |O |1.0 |
|the |O |1.0 |
|Plan |O |1.0 |
|for |O |1.0 |
|75,000 |B-ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodGross|0.9989 |
|shares |O |1.0 |
|each |O |1.0 |
|to |O |1.0 |
|four |O |1.0 |
|directors |O |1.0 |
|: |O |1.0 |
|Sardar |O |1.0 |
|Biglari |O |1.0 |
|, |O |1.0 |
|Philip |O |1.0 |
|Cooley |O |1.0 |
|, |O |1.0 |
|Christopher|O |1.0 |
|Hogg |O |1.0 |
|and |O |1.0 |
|S |O |1.0 |
|. |O |1.0 |
+-----------+-------------------------------------------------------------------------------------+----------+
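Almost every row in this output is labelled `O`, so in practice it helps to keep only the tagged tokens. The following is a minimal sketch of that post-processing in plain Python. The function name `extract_entities` and the `min_confidence` parameter are hypothetical, and the sketch assumes each row is a `(token, ner_label, confidence)` triple with the confidence given as a string, as in the table above.

```python
def extract_entities(rows, min_confidence=0.0):
    """Collapse (token, ner_label, confidence) rows into tagged entities.

    Consecutive B-/I- tokens of the same entity are merged into one mention;
    a mention's confidence is the lowest confidence among its tokens.
    """
    mentions = []
    current = None  # mention currently being extended, if any
    for token, label, confidence in rows:
        if label == "O":
            current = None
            continue
        prefix, _, entity = label.partition("-")
        score = float(confidence)
        if prefix == "I" and current is not None and current["entity"] == entity:
            # Continuation of the mention opened by the preceding token
            current["tokens"].append(token)
            current["scores"].append(score)
        else:
            # A B- tag (or a stray I- tag) opens a new mention
            current = {"entity": entity, "tokens": [token], "scores": [score]}
            mentions.append(current)
    return [
        {
            "entity": m["entity"],
            "text": " ".join(m["tokens"]),
            "confidence": min(m["scores"]),
        }
        for m in mentions
        if min(m["scores"]) >= min_confidence
    ]


# Three rows taken from the table above
sample_rows = [
    ("for", "O", "1.0"),
    ("75,000", "B-ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodGross", "0.9989"),
    ("shares", "O", "1.0"),
]
entities = extract_entities(sample_rows, min_confidence=0.9)
```

On a small result, the same function could be fed the collected rows, for example `extract_entities(result_df.collect())`, since each collected row unpacks into the three columns selected in the pipeline code.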
Model Information
Model Name: finner_10q_xbrl_md_subset11
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.5 MB
References
An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.