Finance NER (10-K, 10-Q, md, XBRL)

Description

This Named Entity Recognition (NER) model targets numeric financial items. It identifies 12 numeric financial entities in 10-Q and 10-K reports. The entities are annotated with eXtensible Business Reporting Language (XBRL) tags. Annotation primarily targets numerical tokens, so the surrounding context is crucial for assigning the correct entity type. The 12 entities covered here are drawn from the 139 most common financial entities available in the dataset.

This is a medium (md) model, trained with 200K sentences.

Predicted Entities

PreferredStockSharesAuthorized, RelatedPartyTransactionExpensesFromTransactionsWithRelatedParty, PropertyPlantAndEquipmentUsefulLife, RestructuringCharges, PaymentsToAcquireBusinessesNetOfCashAcquired, ProceedsFromIssuanceOfCommonStock, PreferredStockDividendRatePercentage, RelatedPartyTransactionAmountsOfTransaction, RepaymentsOfDebt, RevenueFromContractWithCustomerExcludingAssessedTax, PublicUtilitiesRequestedRateIncreaseDecreaseAmount, RestructuringAndRelatedCostExpectedCost1


How to use

 
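from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark session with the licensed Finance NLP libraries
spark = nlp.start()

# Turn the raw text column into Spark NLP document annotations
documentAssembler = nlp.DocumentAssembler() \
   .setInputCol("text") \
   .setOutputCol("document")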

sentence = nlp.SentenceDetector() \
   .setInputCols(["document"]) \
   .setOutputCol("sentence") 

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])

# Embed per sentence so the embeddings align with the NER model's sentence input
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")\
  .setMaxSentenceLength(512)

nerTagger = finance.NerModel.pretrained('finner_10q_xbrl_md_subset9', 'en', 'finance/models')\
   .setInputCols(["sentence", "token", "embeddings"])\
   .setOutputCol("ner")
              
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

text = "The Company recorded an expense of $ 407,094 in the second quarter of fiscal 2015 as a restructuring charge as an estimate for the difference between the rent that the Company pays its landlord and the rent received from the sub - tenant over the term of the sub - lease ."

df = spark.createDataFrame([[text]]).toDF("text")
fit = pipeline.fit(df)

result = fit.transform(df)

# Pair each token with its predicted label and confidence, one row per token
result_df = result.select(
    F.explode(
        F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("token"),
    F.expr("cols['1']").alias("ner_label"),
    F.expr("cols['2']['confidence']").alias("confidence")
)

result_df.show(50, truncate=100)

Results


+-------------+----------------------+----------+
|token        |ner_label             |confidence|
+-------------+----------------------+----------+
|The          |O                     |1.0       |
|Company      |O                     |1.0       |
|recorded     |O                     |1.0       |
|an           |O                     |1.0       |
|expense      |O                     |1.0       |
|of           |O                     |1.0       |
|$            |O                     |1.0       |
|407,094      |B-RestructuringCharges|0.997     |
|in           |O                     |1.0       |
|the          |O                     |1.0       |
|second       |O                     |1.0       |
|quarter      |O                     |1.0       |
|of           |O                     |1.0       |
|fiscal       |O                     |1.0       |
|2015         |O                     |1.0       |
|as           |O                     |1.0       |
|a            |O                     |1.0       |
|restructuring|O                     |1.0       |
|charge       |O                     |1.0       |
|as           |O                     |1.0       |
|an           |O                     |1.0       |
|estimate     |O                     |1.0       |
|for          |O                     |1.0       |
|the          |O                     |1.0       |
|difference   |O                     |1.0       |
|between      |O                     |1.0       |
|the          |O                     |1.0       |
|rent         |O                     |1.0       |
|that         |O                     |1.0       |
|the          |O                     |1.0       |
|Company      |O                     |1.0       |
|pays         |O                     |1.0       |
|its          |O                     |1.0       |
|landlord     |O                     |1.0       |
|and          |O                     |1.0       |
|the          |O                     |1.0       |
|rent         |O                     |1.0       |
|received     |O                     |1.0       |
|from         |O                     |1.0       |
|the          |O                     |1.0       |
|sub          |O                     |1.0       |
|-            |O                     |1.0       |
|tenant       |O                     |1.0       |
|over         |O                     |1.0       |
|the          |O                     |1.0       |
|term         |O                     |1.0       |
|of           |O                     |1.0       |
|the          |O                     |1.0       |
|sub          |O                     |1.0       |
|-            |O                     |1.0       |
|lease        |O                     |1.0       |
|.            |O                     |1.0       |
+-------------+----------------------+----------+
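The table above lists one row per token, and most rows are labeled O. A minimal sketch of how the tagged tokens could be collapsed into entity chunks is shown here in plain Python. It assumes the rows have already been collected as (token, ner_label, confidence) tuples. The helper name `group_entities` and the choice to report the lowest token confidence per chunk are illustrative assumptions, not part of the model or the library.

```python
from typing import Iterable, List, Tuple

# One row of the results table: (token, ner_label, confidence)
Row = Tuple[str, str, float]


def group_entities(rows: Iterable[Row]) -> List[dict]:
    """Collapse B-/I- tagged tokens into entity chunks (hypothetical helper)."""
    chunks = []
    current = None
    for token, label, confidence in rows:
        if label.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current is not None:
                chunks.append(current)
            current = {"entity": label[2:], "tokens": [token], "confs": [confidence]}
        elif (
            label.startswith("I-")
            and current is not None
            and current["entity"] == label[2:]
        ):
            # An I- tag of the same type extends the open chunk.
            current["tokens"].append(token)
            current["confs"].append(confidence)
        else:
            # O tags, and I- tags with no matching open chunk, close the chunk.
            if current is not None:
                chunks.append(current)
            current = None
    if current is not None:
        chunks.append(current)
    # Assumption: report the lowest token confidence as the chunk confidence.
    return [
        {
            "entity": c["entity"],
            "text": " ".join(c["tokens"]),
            "confidence": min(c["confs"]),
        }
        for c in chunks
    ]


# A few rows taken from the results table above.
sample_rows = [
    ("of", "O", 1.0),
    ("$", "O", 1.0),
    ("407,094", "B-RestructuringCharges", 0.997),
    ("in", "O", 1.0),
]
entities = group_entities(sample_rows)
print(entities)
```

For the sample sentence this yields a single RestructuringCharges chunk with the text 407,094.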

Model Information

Model Name: finner_10q_xbrl_md_subset9
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.4 MB

References

An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.

Benchmarking


label                                                              precision    recall  f1-score   support                                                
B-PaymentsToAcquireBusinessesNetOfCashAcquired                        0.9801    0.9610    0.9705       154
B-PreferredStockDividendRatePercentage                                0.9822    1.0000    0.9910       166
B-PreferredStockSharesAuthorized                                      1.0000    1.0000    1.0000       113
B-ProceedsFromIssuanceOfCommonStock                                   0.9846    0.9014    0.9412        71
B-PropertyPlantAndEquipmentUsefulLife                                 0.9672    0.9743    0.9707       272
B-PublicUtilitiesRequestedRateIncreaseDecreaseAmount                  1.0000    0.9894    0.9947       188
B-RelatedPartyTransactionAmountsOfTransaction                         0.8750    0.3853    0.5350       218
B-RelatedPartyTransactionExpensesFromTransactionsWithRelatedParty     0.7215    0.9620    0.8245       447
B-RepaymentsOfDebt                                                    0.9044    0.9762    0.9389       126
B-RestructuringAndRelatedCostExpectedCost1                            0.8871    0.9483    0.9167       174
B-RestructuringCharges                                                0.9428    0.9450    0.9439       872
B-RevenueFromContractWithCustomerExcludingAssessedTax                 0.9772    0.9062    0.9403       661
I-PreferredStockSharesAuthorized                                      1.0000    1.0000    1.0000         4
I-PropertyPlantAndEquipmentUsefulLife                                 1.0000    0.8171    0.8993        82
O                                                                     0.9989    0.9992    0.9990     77740
accuracy                                                                   -         -    0.9954     81288
macro-avg                                                             0.9481    0.9177    0.9244     81288
weighted-avg                                                          0.9957    0.9954    0.9952     81288
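As a reading aid for the table above, the following sketch recomputes two of its figures using the standard definitions: F1 as the harmonic mean of precision and recall, and macro-avg F1 as the unweighted mean of the per-label F1 scores. The function and variable names are illustrative, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-label F1 scores copied from the table (15 rows, including O).
f1_by_label = {
    "B-PaymentsToAcquireBusinessesNetOfCashAcquired": 0.9705,
    "B-PreferredStockDividendRatePercentage": 0.9910,
    "B-PreferredStockSharesAuthorized": 1.0000,
    "B-ProceedsFromIssuanceOfCommonStock": 0.9412,
    "B-PropertyPlantAndEquipmentUsefulLife": 0.9707,
    "B-PublicUtilitiesRequestedRateIncreaseDecreaseAmount": 0.9947,
    "B-RelatedPartyTransactionAmountsOfTransaction": 0.5350,
    "B-RelatedPartyTransactionExpensesFromTransactionsWithRelatedParty": 0.8245,
    "B-RepaymentsOfDebt": 0.9389,
    "B-RestructuringAndRelatedCostExpectedCost1": 0.9167,
    "B-RestructuringCharges": 0.9439,
    "B-RevenueFromContractWithCustomerExcludingAssessedTax": 0.9403,
    "I-PreferredStockSharesAuthorized": 1.0000,
    "I-PropertyPlantAndEquipmentUsefulLife": 0.8993,
    "O": 0.9990,
}

# F1 for B-RelatedPartyTransactionAmountsOfTransaction from its precision and recall.
related_party_f1 = f1_score(0.8750, 0.3853)
print(round(related_party_f1, 4))  # 0.535 (table: 0.5350)

# Macro average: every label counts equally, regardless of support.
macro_f1 = sum(f1_by_label.values()) / len(f1_by_label)
print(round(macro_f1, 4))  # 0.9244 (table: 0.9244)
```

Because the macro average ignores support, the low recall of B-RelatedPartyTransactionAmountsOfTransaction (0.3853) lowers it noticeably, while the weighted average stays close to the score of the dominant O label.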