Finance NER (10-K, 10-Q, md, XBRL)

Description

This Named Entity Recognition (NER) model targets numeric financial items. It identifies 12 numeric financial entities in 10-Q and 10-K reports, annotated with eXtensible Business Reporting Language (XBRL) tags. Annotation primarily targets numeric tokens, so the surrounding context is crucial for assigning the correct entity type. The 12 entities covered here are drawn from the 139 most common financial entities in the dataset.

Predicted Entities

AllocatedShareBasedCompensationExpense, AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount, AccrualForEnvironmentalLossContingencies, BusinessCombinationAcquisitionRelatedCosts, AreaOfRealEstateProperty, AmortizationOfIntangibleAssets, BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued, AssetImpairmentCharges, BusinessCombinationConsiderationTransferred1, BusinessAcquisitionPercentageOfVotingInterestsAcquired, AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife, AmortizationOfFinancingCosts

How to use

 
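from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Assumes the johnsnowlabs library is installed with a valid Finance NLP license.
spark = nlp.start()

documentAssembler = nlp.DocumentAssembler() \
   .setInputCol("text") \
   .setOutputCol("document")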

sentence = nlp.SentenceDetector() \
   .setInputCols(["document"]) \
   .setOutputCol("sentence") 

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])

# Embed per sentence so the embeddings align with the tokens and with the NER input columns.
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")\
  .setMaxSentenceLength(512)

nerTagger = finance.NerModel.pretrained('finner_10q_xbrl_md_subset1', 'en', 'finance/models')\
   .setInputCols(["sentence", "token", "embeddings"])\
   .setOutputCol("ner")
              
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])
text = "Common Stock The authorized capital of the Company is 200,000,000 common shares , par value $ 0.001 , of which 12,481,724 are issued or outstanding ."

df = spark.createDataFrame([[text]]).toDF("text")
fit = pipeline.fit(df)

result = fit.transform(df)

# Zip tokens with their predicted labels and metadata, then emit one row per token.
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("token"),\
      F.expr("cols['1']").alias("ner_label"),\
      F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

Results


+-----------+------------------------------------------------------------------------+----------+
|token      |ner_label                                                               |confidence|
+-----------+------------------------------------------------------------------------+----------+
|Common     |O                                                                       |1.0       |
|Stock      |O                                                                       |1.0       |
|The        |O                                                                       |1.0       |
|authorized |O                                                                       |1.0       |
|capital    |O                                                                       |1.0       |
|of         |O                                                                       |1.0       |
|the        |O                                                                       |1.0       |
|Company    |O                                                                       |1.0       |
|is         |O                                                                       |1.0       |
|200,000,000|B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued|0.9905    |
|common     |O                                                                       |1.0       |
|shares     |O                                                                       |1.0       |
|,          |O                                                                       |1.0       |
|par        |O                                                                       |1.0       |
|value      |O                                                                       |1.0       |
|$          |O                                                                       |1.0       |
|0.001      |O                                                                       |0.995     |
|,          |O                                                                       |1.0       |
|of         |O                                                                       |1.0       |
|which      |O                                                                       |1.0       |
+-----------+------------------------------------------------------------------------+----------+
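The table above is token-level output in B-/I-/O form. As a minimal sketch, assuming the rows have been collected into plain Python tuples of (token, ner_label, confidence), the tagged tokens can be merged into entity spans as follows. The function name `merge_bio` and the choice to report the minimum token confidence per span are illustrative assumptions, not part of the model or library. The second input list is made up to show an `I-` continuation and is not model output. Inside a pipeline, Spark NLP's `NerConverter` annotator serves a similar purpose.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (token, ner_label, confidence)


def merge_bio(rows: List[Row]) -> List[Dict[str, object]]:
    """Merge B-/I- tagged tokens into entity spans (hypothetical helper)."""
    spans = []
    current = None
    for token, label, confidence in rows:
        if label.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current is not None:
                spans.append(current)
            current = {"label": label[2:], "tokens": [token], "confs": [confidence]}
        elif (label.startswith("I-") and current is not None
              and current["label"] == label[2:]):
            # An I- tag extends the open span only if the entity type matches.
            current["tokens"].append(token)
            current["confs"].append(confidence)
        else:
            # "O" (or a stray I- tag) closes the open span; the token is dropped.
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [
        {"text": " ".join(s["tokens"]), "label": s["label"], "confidence": min(s["confs"])}
        for s in spans
    ]


# Three rows taken from the results table above.
table_rows = [
    ("is", "O", 1.0),
    ("200,000,000",
     "B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued",
     0.9905),
    ("common", "O", 1.0),
]

# Made-up rows (NOT model output) illustrating a multi-token entity.
hypothetical_rows = [
    ("10", "B-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife", 0.99),
    ("years", "I-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife", 0.98),
    (".", "O", 1.0),
]

table_spans = merge_bio(table_rows)
hypothetical_spans = merge_bio(hypothetical_rows)
print(table_spans)
print(hypothetical_spans)
```

Reporting the minimum confidence is a conservative choice: a span is only as trustworthy as its least certain token. An average would be an equally reasonable alternative.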

Model Information

Model Name: finner_10q_xbrl_md_subset1
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.5 MB

References

An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.

Benchmarking

label                                                                    precision    recall  f1-score   support
B-AccrualForEnvironmentalLossContingencies                                  1.0000    0.9386    0.9683       228
B-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife              0.9968    0.9778    0.9872       316
B-AllocatedShareBasedCompensationExpense                                    0.9931    0.9885    0.9908      1735
B-AmortizationOfFinancingCosts                                              0.9806    0.9268    0.9530       164
B-AmortizationOfIntangibleAssets                                            0.9910    0.9821    0.9865      1227
B-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount     0.9949    1.0000    0.9975      1570
B-AreaOfRealEstateProperty                                                  0.9421    1.0000    0.9702       114
B-AssetImpairmentCharges                                                    0.9298    0.9815    0.9550       270
B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued    0.9760    0.9606    0.9683       127
B-BusinessAcquisitionPercentageOfVotingInterestsAcquired                    0.9572    0.9968    0.9766       314
B-BusinessCombinationAcquisitionRelatedCosts                                0.9597    0.9375    0.9485       432
B-BusinessCombinationConsiderationTransferred1                              0.9706    0.9354    0.9527       495
I-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife              0.9804    0.8929    0.9346        56
I-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount     1.0000    1.0000    1.0000         2
I-AreaOfRealEstateProperty                                                  1.0000    1.0000    1.0000         1
O                                                                           0.9992    0.9995    0.9993    164664
accuracy                                                                    -         -         0.9986    171715
macro-avg                                                                   0.9795    0.9699    0.9743    171715
weighted-avg                                                                0.9986    0.9986    0.9986    171715
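The summary rows can be reproduced from the per-label rows. The sketch below assumes the usual conventions: F1 is the harmonic mean of precision and recall, macro-avg is the unweighted mean over the 16 labels, and weighted-avg weights each label by its support. The variable names are illustrative, and the values are copied from the table above.

```python
# (label, f1-score, support) copied from the benchmarking table above.
rows = [
    ("B-AccrualForEnvironmentalLossContingencies", 0.9683, 228),
    ("B-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife", 0.9872, 316),
    ("B-AllocatedShareBasedCompensationExpense", 0.9908, 1735),
    ("B-AmortizationOfFinancingCosts", 0.9530, 164),
    ("B-AmortizationOfIntangibleAssets", 0.9865, 1227),
    ("B-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount", 0.9975, 1570),
    ("B-AreaOfRealEstateProperty", 0.9702, 114),
    ("B-AssetImpairmentCharges", 0.9550, 270),
    ("B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued", 0.9683, 127),
    ("B-BusinessAcquisitionPercentageOfVotingInterestsAcquired", 0.9766, 314),
    ("B-BusinessCombinationAcquisitionRelatedCosts", 0.9485, 432),
    ("B-BusinessCombinationConsiderationTransferred1", 0.9527, 495),
    ("I-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife", 0.9346, 56),
    ("I-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount", 1.0000, 2),
    ("I-AreaOfRealEstateProperty", 1.0000, 1),
    ("O", 0.9993, 164664),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(support for _, _, support in rows)
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(total_support)                        # 171715
print(round(macro_f1, 4))                   # 0.9743
print(round(weighted_f1, 4))                # 0.9986
print(round(f1_score(1.0000, 0.9386), 4))   # 0.9683 (first table row)
```

The weighted average sits close to the `O` score because `O` accounts for 164664 of the 171715 tokens, so the macro average is the more informative summary of entity-level quality.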