Finance NER (10-K, 10-Q, md, XBRL)

Description

This Named Entity Recognition (NER) model focuses on financial numeric items. It identifies 12 numeric financial entities in 10-Q and 10-K reports, annotated with eXtensible Business Reporting Language (XBRL) tags. The annotation primarily targets numerical tokens, so the surrounding context is crucial for assigning the appropriate entity type from the 139 most common financial entities available in the dataset.

Predicted Entities

FiniteLivedIntangibleAssetUsefulLife, EquityMethodInvestmentOwnershipPercentage, Goodwill, GoodwillImpairmentLoss, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognized, EmployeeServiceShareBasedCompensationTaxBenefitFromCompensationExpense, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedShareBasedAwardsOtherThanOptions, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedPeriodForRecognition1, EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate, GainsLossesOnExtinguishmentOfDebt, EquityMethodInvestments, GuaranteeObligationsMaximumExposure

How to use

 
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark session with the licensed Finance NLP libraries
spark = nlp.start()

# Wrap the raw text column in a document annotation
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into sentences
sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenize, treating punctuation and currency symbols as separate tokens
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

# SEC-BERT embeddings, computed on the same sentence/token columns the NER stage consumes
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

# Pretrained financial NER model (IOB tags over XBRL entity types)
nerTagger = finance.NerModel.pretrained('finner_10q_xbrl_md_subset5', 'en', 'finance/models') \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

text = "In its June 29 , 2015 ruling , the District Court determined that the Company will be permitted to stay the judgment during appeal by posting a bond in the amount of $ 223.4 million related to pending litigation .    "

df = spark.createDataFrame([[text]]).toDF("text")
fit = pipeline.fit(df)

result = fit.transform(df)

# One row per token: the token text, its predicted label, and the model confidence
result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ner_label"),
            F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

Results


+----------+-------------------------------------+----------+
|token     |ner_label                            |confidence|
+----------+-------------------------------------+----------+
|In        |O                                    |1.0       |
|its       |O                                    |1.0       |
|June      |O                                    |1.0       |
|29        |O                                    |1.0       |
|,         |O                                    |1.0       |
|2015      |O                                    |1.0       |
|ruling    |O                                    |1.0       |
|,         |O                                    |1.0       |
|the       |O                                    |1.0       |
|District  |O                                    |1.0       |
|Court     |O                                    |1.0       |
|determined|O                                    |1.0       |
|that      |O                                    |1.0       |
|the       |O                                    |1.0       |
|Company   |O                                    |1.0       |
|will      |O                                    |1.0       |
|be        |O                                    |1.0       |
|permitted |O                                    |1.0       |
|to        |O                                    |1.0       |
|stay      |O                                    |1.0       |
|the       |O                                    |1.0       |
|judgment  |O                                    |1.0       |
|during    |O                                    |1.0       |
|appeal    |O                                    |1.0       |
|by        |O                                    |1.0       |
|posting   |O                                    |1.0       |
|a         |O                                    |1.0       |
|bond      |O                                    |1.0       |
|in        |O                                    |1.0       |
|the       |O                                    |1.0       |
|amount    |O                                    |1.0       |
|of        |O                                    |1.0       |
|$         |O                                    |1.0       |
|223.4     |B-GuaranteeObligationsMaximumExposure|0.9804    |
|million   |O                                    |1.0       |
|related   |O                                    |1.0       |
|to        |O                                    |1.0       |
|pending   |O                                    |1.0       |
|litigation|O                                    |1.0       |
|.         |O                                    |1.0       |
+----------+-------------------------------------+----------+
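The output above is token-level and uses IOB prefixes: B- marks the first token of an entity and I- marks continuation tokens. The following is a minimal sketch, in plain Python and independent of Spark, of how such rows could be collapsed into entity spans. The helper name merge_iob and the min_confidence field are illustrative assumptions, not part of Finance NLP; the sample rows are copied from the table above.

```python
def merge_iob(rows):
    """Collapse (token, label, confidence) rows into entity spans.

    Illustrative helper, not a Finance NLP API. An I- tag that does not
    continue an open span of the same entity type is dropped.
    """
    spans, current = [], None
    for token, label, confidence in rows:
        prefix, _, entity = label.partition("-")
        if prefix == "B":
            # A B- tag always opens a new span, closing any open one first
            if current is not None:
                spans.append(current)
            current = {"entity": entity, "text": token, "min_confidence": confidence}
        elif prefix == "I" and current is not None and current["entity"] == entity:
            # Continuation of the open span: extend text, keep the lowest confidence
            current["text"] += " " + token
            current["min_confidence"] = min(current["min_confidence"], confidence)
        else:
            # "O" or an unmatched I- tag closes the open span, if any
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return spans


# Three consecutive rows taken from the Results table
rows = [
    ("$", "O", 1.0),
    ("223.4", "B-GuaranteeObligationsMaximumExposure", 0.9804),
    ("million", "O", 1.0),
]
spans = merge_iob(rows)
print(spans)
```

Keeping the minimum confidence per span is one possible design choice; it makes it easy to discard spans in which any token was predicted with low confidence.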


Model Information

Model Name: finner_10q_xbrl_md_subset5
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.5 MB

References

An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.

Benchmarking



label                                                                                                                        precision    recall  f1-score   support
B-EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate                                                           0.9531    1.0000    0.9760       427
B-EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognized                                     0.7563    0.7098    0.7323       634
B-EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedPeriodForRecognition1                0.9271    0.9946    0.9597       742
B-EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedShareBasedAwardsOtherThanOptions     0.4677    0.6230    0.5343       244
B-EmployeeServiceShareBasedCompensationTaxBenefitFromCompensationExpense                                                        0.9742    0.9869    0.9805       153
B-EquityMethodInvestmentOwnershipPercentage                                                                                     0.9828    0.9896    0.9862       866
B-EquityMethodInvestments                                                                                                       0.9970    0.8770    0.9331       374
B-FiniteLivedIntangibleAssetUsefulLife                                                                                          0.9970    0.9432    0.9693       352
B-GainsLossesOnExtinguishmentOfDebt                                                                                             0.9811    0.9962    0.9886       261
B-Goodwill                                                                                                                      0.9759    0.9824    0.9791       454
B-GoodwillImpairmentLoss                                                                                                        0.9881    0.9022    0.9432       184
B-GuaranteeObligationsMaximumExposure                                                                                           0.9651    0.9881    0.9765       252
I-EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedPeriodForRecognition1                0.9423    0.9245    0.9333        53
I-FiniteLivedIntangibleAssetUsefulLife                                                                                          1.0000    0.9221    0.9595        77
O                                                                                                                               0.9990    0.9986    0.9988    133006
accuracy                                                                                                                             -         -    0.9958    138079
macro-avg                                                                                                                       0.9271    0.9225    0.9234    138079
weighted-avg                                                                                                                    0.9961    0.9958    0.9959    138079
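
The gap between the macro and weighted averages comes from class imbalance: the O label accounts for roughly 96% of the support, so it dominates any support-weighted figure. The sketch below recomputes both F1 averages from the per-label rows of the table, assuming the usual definitions (macro = unweighted mean over labels, weighted = mean weighted by support). Because the inputs are already rounded to four decimals, the results should land close to the reported 0.9234 and 0.9959 but may differ in the last digit.

```python
# (label, precision, recall, f1, support) copied from the benchmark table
rows = [
    ("B-EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate", 0.9531, 1.0000, 0.9760, 427),
    ("B-EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognized", 0.7563, 0.7098, 0.7323, 634),
    ("B-EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedPeriodForRecognition1", 0.9271, 0.9946, 0.9597, 742),
    ("B-EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedShareBasedAwardsOtherThanOptions", 0.4677, 0.6230, 0.5343, 244),
    ("B-EmployeeServiceShareBasedCompensationTaxBenefitFromCompensationExpense", 0.9742, 0.9869, 0.9805, 153),
    ("B-EquityMethodInvestmentOwnershipPercentage", 0.9828, 0.9896, 0.9862, 866),
    ("B-EquityMethodInvestments", 0.9970, 0.8770, 0.9331, 374),
    ("B-FiniteLivedIntangibleAssetUsefulLife", 0.9970, 0.9432, 0.9693, 352),
    ("B-GainsLossesOnExtinguishmentOfDebt", 0.9811, 0.9962, 0.9886, 261),
    ("B-Goodwill", 0.9759, 0.9824, 0.9791, 454),
    ("B-GoodwillImpairmentLoss", 0.9881, 0.9022, 0.9432, 184),
    ("B-GuaranteeObligationsMaximumExposure", 0.9651, 0.9881, 0.9765, 252),
    ("I-EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedPeriodForRecognition1", 0.9423, 0.9245, 0.9333, 53),
    ("I-FiniteLivedIntangibleAssetUsefulLife", 1.0000, 0.9221, 0.9595, 77),
    ("O", 0.9990, 0.9986, 0.9988, 133006),
]

f1 = [r[3] for r in rows]
support = [r[4] for r in rows]

# Macro average: every label counts equally, regardless of how often it occurs
macro_f1 = sum(f1) / len(f1)

# Weighted average: each label's F1 is weighted by its number of test tokens
weighted_f1 = sum(score * n for score, n in zip(f1, support)) / sum(support)

# Share of the test tokens that carry the O label
o_share = support[-1] / sum(support)

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
print(f"O share:     {o_share:.2%}")
```

For judging how well the model recognizes the financial entities themselves, the per-label rows and the macro average are therefore more informative than accuracy or the weighted average.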