Finance NER (10-K, 10-Q, md, XBRL)

Description

This is a Named Entity Recognition (NER) model for numeric financial items. It identifies 12 numeric financial entities in 10-Q and 10-K reports. The entities are annotated with eXtensible Business Reporting Language (XBRL) tags. Annotation primarily targets numerical tokens, so the surrounding context is essential for assigning the correct entity type. The 12 entities covered here are a subset of the 139 most common financial entities in the dataset.

Predicted Entities

ContractWithCustomerLiability, DebtInstrumentFairValue, DebtInstrumentMaturityDate, DebtInstrumentInterestRateStatedPercentage, DebtInstrumentRedemptionPricePercentage, DebtInstrumentInterestRateEffectivePercentage, DebtInstrumentBasisSpreadOnVariableRate1, CumulativeEffectOfNewAccountingPrincipleInPeriodOfAdoption, DebtInstrumentConvertibleConversionPrice1, ContractWithCustomerLiabilityRevenueRecognized, DebtInstrumentCarryingAmount, DebtInstrumentFaceAmount

How to use

 
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark session with the licensed Finance NLP libraries loaded
spark = nlp.start()

# Wrap the raw text column in a document annotation
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into sentences
sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenize, splitting punctuation and currency symbols off numbers
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

# SEC-BERT embeddings, computed per sentence so they line up with the NER input
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

# Pretrained financial NER model
nerTagger = finance.NerModel.pretrained("finner_10q_xbrl_md_subset3", "en", "finance/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

text = "Common Stock The authorized capital of the Company is 200,000,000 common shares , par value $ 0.001 , of which 12,481,724 are issued or outstanding ."

df = spark.createDataFrame([[text]]).toDF("text")
fit = pipeline.fit(df)

result = fit.transform(df)

# One row per token: token text, predicted label, and model confidence
result_df = result.select(
    F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols")
).select(
    F.expr("cols['0']").alias("token"),
    F.expr("cols['1']").alias("ner_label"),
    F.expr("cols['2']['confidence']").alias("confidence")
)

result_df.show(50, truncate=100)

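Results

The sample output below comes from a different input than the text in the snippet above: a sentence beginning "Notes and Other Debts Payable In November 2013, the Rialto segment originally issued $ 250 million aggregate principal amount".
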
+----------+--------------------------+----------+
|token     |ner_label                 |confidence|
+----------+--------------------------+----------+
|Notes     |O                         |1.0       |
|and       |O                         |1.0       |
|Other     |O                         |1.0       |
|Debts     |O                         |1.0       |
|Payable   |O                         |1.0       |
|In        |O                         |1.0       |
|November  |O                         |1.0       |
|2013      |O                         |0.9999    |
|,         |O                         |1.0       |
|the       |O                         |1.0       |
|Rialto    |O                         |1.0       |
|segment   |O                         |1.0       |
|originally|O                         |1.0       |
|issued    |O                         |1.0       |
|$         |O                         |0.9999    |
|250       |B-DebtInstrumentFaceAmount|0.5981    |
|million   |O                         |1.0       |
|aggregate |O                         |1.0       |
|principal |O                         |1.0       |
|amount    |O                         |1.0       |
+----------+--------------------------+----------+
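
The token-level rows above can be collapsed into entity spans for downstream use. The following is a minimal sketch in plain Python, assuming the rows have already been collected as (token, ner_label, confidence) tuples with the confidence cast to float (Spark NLP returns metadata values as strings). The helper name group_entities and the min_conf threshold are hypothetical and not part of Finance NLP.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (token, ner_label, confidence)


def group_entities(rows: List[Row], min_conf: float = 0.0) -> List[Dict]:
    """Merge B-/I- tagged tokens into entity spans (hypothetical helper)."""
    spans = []
    current = None
    for token, label, conf in rows:
        if label.startswith("B-"):
            # A B- tag always opens a new span
            if current:
                spans.append(current)
            current = {"entity": label[2:], "tokens": [token], "confs": [conf]}
        elif label.startswith("I-") and current and current["entity"] == label[2:]:
            # An I- tag continues the open span of the same entity type
            current["tokens"].append(token)
            current["confs"].append(conf)
        else:
            # "O" or an orphan I- tag closes any open span and is dropped
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)

    # Score each span by its least confident token and apply the threshold
    return [
        {"entity": s["entity"], "text": " ".join(s["tokens"]), "confidence": min(s["confs"])}
        for s in spans
        if min(s["confs"]) >= min_conf
    ]


# Three rows copied from the results table above
rows = [
    ("$", "O", 0.9999),
    ("250", "B-DebtInstrumentFaceAmount", 0.5981),
    ("million", "O", 1.0),
]
spans = group_entities(rows)
print(spans)
```

Scoring a span by its minimum token confidence is a conservative choice; an average would also be reasonable. With min_conf=0.6, the single prediction in this sample (confidence 0.5981) would be filtered out.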

Model Information

Model Name: finner_10q_xbrl_md_subset3
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.5 MB

References

An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.

Benchmarking


label                                                           precision    recall  f1-score   support
B-ContractWithCustomerLiability                                    0.9847    0.8344    0.9033       308
B-ContractWithCustomerLiabilityRevenueRecognized                   0.9657    0.9888    0.9771       627
B-CumulativeEffectOfNewAccountingPrincipleInPeriodOfAdoption       0.9817    0.9267    0.9534       232
B-DebtInstrumentBasisSpreadOnVariableRate1                         0.9332    0.9828    0.9573      2145
B-DebtInstrumentCarryingAmount                                     0.7082    0.4751    0.5687       562
B-DebtInstrumentConvertibleConversionPrice1                        0.9770    0.9290    0.9524       183
B-DebtInstrumentFaceAmount                                         0.8648    0.7501    0.8034      1901
B-DebtInstrumentFairValue                                          0.9728    0.7150    0.8242       200
B-DebtInstrumentInterestRateEffectivePercentage                    0.8225    0.8024    0.8123       410
B-DebtInstrumentInterestRateStatedPercentage                       0.9410    0.9454    0.9432      2731
B-DebtInstrumentMaturityDate                                       0.9143    0.4812    0.6305       266
B-DebtInstrumentRedemptionPricePercentage                          0.9518    0.9518    0.9518       436
I-DebtInstrumentMaturityDate                                       0.9143    0.4812    0.6305       798
O                                                                  0.9947    0.9984    0.9966    246872
accuracy                                                               -          -    0.9917    257671
macro-avg                                                          0.9233    0.8045    0.8504    257671
weighted-avg                                                       0.9912    0.9917    0.9911    257671
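
For readers interpreting the table, the f1-score column is the harmonic mean of precision and recall, and the macro-avg row is the unweighted mean over the 14 labels. The sketch below recomputes two figures from the values printed above as a sanity check; the helper name f1 is illustrative, and small differences in the last digit can occur because the table values are already rounded.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# B-DebtInstrumentCarryingAmount row: precision 0.7082, recall 0.4751
carrying_amount_f1 = f1(0.7082, 0.4751)

# Precision column for the 14 labels, in table order
precisions = [
    0.9847, 0.9657, 0.9817, 0.9332, 0.7082, 0.9770, 0.8648,
    0.9728, 0.8225, 0.9410, 0.9143, 0.9518, 0.9143, 0.9947,
]
macro_precision = sum(precisions) / len(precisions)

print(round(carrying_amount_f1, 4))  # table reports 0.5687
print(round(macro_precision, 4))     # table reports 0.9233
```

The weighted-avg row instead weights each label by its support, which is why it sits close to the scores of the dominant O label.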