Finance NER (10-Q, 10-K, md, XBRL)

Description

This is a Named Entity Recognition (NER) model for financial numeric items. It identifies 6 numeric financial entities in diverse 10-Q and 10-K reports, annotated with eXtensible Business Reporting Language (XBRL) tags. Because annotation targets mostly numeric tokens, the surrounding context is what determines which of the 139 most common financial entity types in the dataset applies; this model covers a 6-entity subset of them.

Predicted Entities

TreasuryStockValueAcquiredCostMethod, StockRepurchasedDuringPeriodShares, TreasuryStockAcquiredAverageCostPerShare, UnrecognizedTaxBenefits, SupplementalInformationForPropertyCasualtyInsuranceUnderwritersPriorYearClaimsAndClaimsAdjustmentExpense, TreasuryStockSharesAcquired

How to use

from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark NLP session (a Finance NLP license is required for this model)
spark = nlp.start()

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Keep punctuation and currency symbols as separate tokens
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

# SEC-BERT embeddings pre-trained on financial filings
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

nerTagger = finance.NerModel.pretrained("finner_10q_xbrl_md_subset12", "en", "finance/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
              
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])
text = "At both May 31 , 2016 and November 30 , 2015 , the Company had $ 12.3 million of gross unrecognized tax benefits . "

df = spark.createDataFrame([[text]]).toDF("text")
model = pipeline.fit(df)

result = model.transform(df)

result_df = result.select(F.explode(F.arrays_zip(result.token.result,
                                                 result.ner.result,
                                                 result.ner.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ner_label"),
            F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)
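
To get entity chunks instead of token-level IOB tags, a NerConverter stage can be appended to the same pipeline. This is a minimal sketch of an optional extension, not part of the published pipeline:

# Optional: group IOB tags into entity chunks (assumed extension)
converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

chunk_pipeline = nlp.Pipeline(stages=[documentAssembler, sentence, tokenizer,
                                      embeddings, nerTagger, converter])

# One row per detected entity chunk, e.g. "12.3" for UnrecognizedTaxBenefits
chunk_pipeline.fit(df).transform(df) \
    .select(F.explode("ner_chunk.result").alias("entity")).show(truncate=False)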

Results


+------------+-------------------------+----------+
|token       |ner_label                |confidence|
+------------+-------------------------+----------+
|At          |O                        |1.0       |
|both        |O                        |1.0       |
|May         |O                        |1.0       |
|31          |O                        |1.0       |
|,           |O                        |1.0       |
|2016        |O                        |1.0       |
|and         |O                        |1.0       |
|November    |O                        |1.0       |
|30          |O                        |1.0       |
|,           |O                        |1.0       |
|2015        |O                        |1.0       |
|,           |O                        |1.0       |
|the         |O                        |1.0       |
|Company     |O                        |1.0       |
|had         |O                        |1.0       |
|$           |O                        |1.0       |
|12.3        |B-UnrecognizedTaxBenefits|1.0       |
|million     |O                        |1.0       |
|of          |O                        |1.0       |
|gross       |O                        |1.0       |
|unrecognized|O                        |1.0       |
|tax         |O                        |1.0       |
|benefits    |O                        |1.0       |
|.           |O                        |1.0       |
+------------+-------------------------+----------+
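
For quick inference on a single text without the DataFrame round-trip, the fitted model can be wrapped in a LightPipeline. A minimal sketch, assuming the `model` fitted above:

light = nlp.LightPipeline(model)
annotations = light.annotate(text)

# Pair each token with its predicted IOB tag
print(list(zip(annotations["token"], annotations["ner"])))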

Model Information

Model Name: finner_10q_xbrl_md_subset12
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.4 MB

References

An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.

Benchmarking


label                                                                                                       precision    recall  f1-score   support
B-StockRepurchasedDuringPeriodShares                                                                           0.5798    0.4523    0.5082       241
B-SupplementalInformationForPropertyCasualtyInsuranceUnderwritersPriorYearClaimsAndClaimsAdjustmentExpense     0.9008    1.0000    0.9478       236
B-TreasuryStockAcquiredAverageCostPerShare                                                                     0.8359    0.9819    0.9030       166
B-TreasuryStockSharesAcquired                                                                                  0.6911    0.8193    0.7497       415
B-TreasuryStockValueAcquiredCostMethod                                                                         0.7214    0.5153    0.6012       196
B-UnrecognizedTaxBenefits                                                                                      0.9897    0.9897    0.9897       291
I-TreasuryStockSharesAcquired                                                                                  0.0000    0.0000    0.0000         1
O                                                                                                              0.9969    0.9962    0.9965     32427
accuracy                                                                                                           -         -     0.9873     33973
macro-avg                                                                                                      0.7144    0.7193    0.7120     33973
weighted-avg                                                                                                   0.9870    0.9873    0.9869     33973
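
The table follows the layout of scikit-learn's classification_report computed over flattened token-level tags. An illustrative sketch with made-up placeholder sequences (the actual evaluation uses the held-out split of the modified finer-139 data):

from sklearn.metrics import classification_report

# Placeholder gold and predicted IOB tag sequences, one tag per token
y_true = ["O", "O", "B-UnrecognizedTaxBenefits", "O", "O"]
y_pred = ["O", "O", "B-UnrecognizedTaxBenefits", "O", "O"]

print(classification_report(y_true, y_pred, digits=4))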