Description
This is a Named Entity Recognition (NER) model focused on financial numeric items. It identifies 12 numeric financial entity types in 10-Q and 10-K reports, annotated with eXtensible Business Reporting Language (XBRL) tags. The annotation process primarily targets numerical tokens, so the surrounding context plays a crucial role in assigning the correct entity type out of the 139 most common financial entities available in the dataset.
This is a large (md) model, trained with 200K sentences.
Predicted Entities
RevenueFromContractWithCustomerIncludingAssessedTax
ShareBasedCompensationArrangementByShareBasedPaymentAwardAwardVestingPeriod1
SaleOfStockPricePerShare
ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriod
Revenues
ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsVestedInPeriodTotalFairValue
SaleOfStockNumberOfSharesIssuedInTransaction
ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriodWeightedAverageGrantDateFairValue
RevenueFromRelatedParties
RevenueRemainingPerformanceObligation
ShareBasedCompensation
ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsNonvestedNumber
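The model predicts these classes as BIO tags (e.g. B-Revenues, I-Revenues). If you want to verify the full tag set on the loaded model itself, the snippet below is a minimal sketch; it assumes finance.NerModel exposes Spark NLP's standard getClasses() method.

# Minimal sketch: print the BIO tag set of the pretrained tagger.
# Assumes an active Spark session and that finance.NerModel inherits
# getClasses() from Spark NLP's NerDLModel.
from johnsnowlabs import finance

ner = finance.NerModel.pretrained("finner_10q_xbrl_md_subset10", "en", "finance/models")
print(ner.getClasses())  # e.g. ['O', 'B-Revenues', 'B-ShareBasedCompensation', ...]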
How to use
# `spark` is an active Spark session (in the johnsnowlabs library it is
# usually created with nlp.start()).
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

# SEC-BERT embeddings, pretrained on financial filings.
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

nerTagger = finance.NerModel.pretrained("finner_10q_xbrl_md_subset10", "en", "finance/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[documentAssembler,
                                sentence,
                                tokenizer,
                                embeddings,
                                nerTagger])

text = "During the six months ended June 30 , 2015 we granted 776,000 MSUs at a total grant - date fair value of $ 4.3 million ."

df = spark.createDataFrame([[text]]).toDF("text")

model = pipeline.fit(df)
result = model.transform(df)

# One row per token, with its NER label and the tagger's confidence.
result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ner_label"),
            F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)
Results
+-------+----------------------------------------------------------------------------------------------------------+----------+
|token |ner_label |confidence|
+-------+----------------------------------------------------------------------------------------------------------+----------+
|During |O |1.0 |
|the |O |1.0 |
|six |O |1.0 |
|months |O |1.0 |
|ended |O |1.0 |
|June |O |1.0 |
|30 |O |1.0 |
|, |O |1.0 |
|2015 |O |1.0 |
|we |O |1.0 |
|granted|O |1.0 |
|776,000|B-ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriod|0.9562 |
|MSUs |O |1.0 |
|at |O |1.0 |
|a |O |1.0 |
|total |O |1.0 |
|grant |O |1.0 |
|- |O |1.0 |
|date |O |1.0 |
|fair |O |1.0 |
|value |O |1.0 |
|of |O |1.0 |
|$ |O |1.0 |
|4.3 |O |0.8671 |
|million|O |1.0 |
|. |O |1.0 |
+-------+----------------------------------------------------------------------------------------------------------+----------+
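For quick inference on a handful of strings, the fitted pipeline can also be wrapped in a LightPipeline instead of transforming a DataFrame. The snippet below is a sketch that assumes the johnsnowlabs nlp.LightPipeline wrapper and reuses the model fitted in the How to use section above.

# Sketch: annotate a single sentence without building a DataFrame.
# `model` is the fitted pipeline from the How to use section above.
light_model = nlp.LightPipeline(model)

annotations = light_model.fullAnnotate(
    "During the six months ended June 30 , 2015 we granted 776,000 MSUs "
    "at a total grant - date fair value of $ 4.3 million ."
)[0]

# Each NER annotation carries its label in `result` and its confidence in the metadata.
for token_ner in annotations["ner"]:
    print(token_ner.result, token_ner.metadata.get("confidence"))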
Model Information
Model Name: finner_10q_xbrl_md_subset10
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.5 MB
References
An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.
Benchmarking
label precision recall f1-score support
B-RevenueFromContractWithCustomerIncludingAssessedTax 0.8369 0.6310 0.7195 187
B-RevenueFromRelatedParties 0.9418 0.9056 0.9233 625
B-RevenueRemainingPerformanceObligation 0.9801 0.9975 0.9887 395
B-Revenues 0.7830 0.9324 0.8512 414
B-SaleOfStockNumberOfSharesIssuedInTransaction 0.9106 0.9912 0.9492 226
B-SaleOfStockPricePerShare 0.8645 0.9710 0.9147 138
B-ShareBasedCompensation 0.9760 0.9896 0.9828 288
B-ShareBasedCompensationArrangementByShareBasedPaymentAwardAwardVestingPeriod1 0.9432 0.9659 0.9545 499
B-ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriod 0.9378 0.9944 0.9653 894
B-ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriodWeightedAverageGrantDateFairValue 0.9501 0.9932 0.9712 441
B-ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsNonvestedNumber 0.9120 0.9048 0.9084 126
B-ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsVestedInPeriodTotalFairValue 0.9775 0.9491 0.9631 275
I-SaleOfStockNumberOfSharesIssuedInTransaction 0.0000 0.0000 0.0000 1
I-ShareBasedCompensationArrangementByShareBasedPaymentAwardAwardVestingPeriod1 0.9457 0.9897 0.9672 387
I-ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriod 0.0000 0.0000 0.0000 1
O 0.9995 0.9979 0.9987 98997
accuracy - - 0.9959 103894
macro-avg 0.8099 0.8258 0.8161 103894
weighted-avg 0.9961 0.9959 0.9959 103894
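These figures are token-level precision, recall, and F1 per label, with support being the number of gold tokens. As an illustration only, and not the evaluation script used for this card, metrics of this form can be reproduced from aligned gold and predicted tag sequences with scikit-learn:

# Illustrative sketch: token-level metrics from flat lists of gold and
# predicted BIO tags (hypothetical y_true / y_pred, not the card's data).
from sklearn.metrics import classification_report

y_true = ["O", "O", "B-Revenues", "O"]
y_pred = ["O", "O", "B-Revenues", "O"]

print(classification_report(y_true, y_pred, digits=4, zero_division=0))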