Finance Revenue NER (10-Q, 10-K, XBRL, lg)

Description

This Named Entity Recognition (NER) model targets numeric financial items. It identifies 7 numeric financial entities in 10-Q and 10-K reports, a subset of the 139 most common financial entities available in the dataset. The entities are annotated with eXtensible Business Reporting Language (XBRL) tags. Annotation primarily targets numeric tokens, so the surrounding context is what determines which entity type a given number receives.

Predicted Entities

NumberOfReportableSegments, RevenueFromContractWithCustomerExcludingAssessedTax, ContractWithCustomerLiabilityRevenueRecognized, CumulativeEffectOfNewAccountingPrincipleInPeriodOfAdoption, NumberOfOperatingSegments, RevenueFromRelatedParties, RevenueFromContractWithCustomerIncludingAssessedTax

How to use

from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark session with the licensed Finance NLP libraries loaded
spark = nlp.start()

# Wrap the raw text column as a document annotation
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into sentences
sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenize, splitting off punctuation and currency symbols from numbers
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

# SEC-BERT token embeddings consumed by the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

nerTagger = finance.NerModel.pretrained('finner_10q_xbrl_lg_revenue', 'en', 'finance/models') \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

text = "In addition , during the first quarter of 2016 , the Company made the decision to divide the Southeast Florida operating division into two operating segments to maximize operational efficiencies given the continued growth of the division ."

df = spark.createDataFrame([[text]]).toDF("text")
fit = pipeline.fit(df)

result = fit.transform(df)

# One row per token: token text, predicted label, and model confidence
result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ner_label"),
            F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

Results


+------------+---------------------------+----------+
|token       |ner_label                  |confidence|
+------------+---------------------------+----------+
|In          |O                          |1.0       |
|addition    |O                          |1.0       |
|,           |O                          |1.0       |
|during      |O                          |1.0       |
|the         |O                          |1.0       |
|first       |O                          |1.0       |
|quarter     |O                          |1.0       |
|of          |O                          |1.0       |
|2016        |O                          |1.0       |
|,           |O                          |1.0       |
|the         |O                          |1.0       |
|Company     |O                          |1.0       |
|made        |O                          |1.0       |
|the         |O                          |1.0       |
|decision    |O                          |1.0       |
|to          |O                          |1.0       |
|divide      |O                          |1.0       |
|the         |O                          |1.0       |
|Southeast   |O                          |1.0       |
|Florida     |O                          |1.0       |
|operating   |O                          |1.0       |
|division    |O                          |1.0       |
|into        |O                          |1.0       |
|two         |B-NumberOfOperatingSegments|0.9831    |
|operating   |O                          |1.0       |
|segments    |O                          |1.0       |
|to          |O                          |1.0       |
|maximize    |O                          |1.0       |
|operational |O                          |1.0       |
|efficiencies|O                          |1.0       |
|given       |O                          |1.0       |
|the         |O                          |1.0       |
|continued   |O                          |1.0       |
|growth      |O                          |1.0       |
|of          |O                          |1.0       |
|the         |O                          |1.0       |
|division    |O                          |1.0       |
|.           |O                          |1.0       |
+------------+---------------------------+----------+
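
Almost every token in the output is tagged O, so downstream code usually keeps only the tagged numeric tokens. The following is a minimal sketch in plain Python, assuming the rows have already been collected from result_df as (token, ner_label, confidence) tuples. The helper name extract_entities and the min_confidence threshold are illustrative and not part of the library. Confidence values are read from the annotation metadata, so they may arrive as strings, which is why the sketch casts them to float.

```python
from typing import Iterable, List, Tuple, Union

# (token, ner_label, confidence); confidence may be a string from the metadata map
Row = Tuple[str, str, Union[str, float]]


def extract_entities(rows: Iterable[Row], min_confidence: float = 0.0) -> List[dict]:
    """Hypothetical helper: keep tagged tokens and strip the IOB prefix."""
    entities = []
    for token, label, confidence in rows:
        score = float(confidence)
        if label == "O" or score < min_confidence:
            continue
        # Labels look like "B-NumberOfOperatingSegments"; split off the "B" / "I" prefix.
        prefix, _, entity_type = label.partition("-")
        entities.append(
            {"token": token, "entity": entity_type, "iob": prefix, "confidence": score}
        )
    return entities


# Three rows taken from the results table above
rows = [
    ("into", "O", "1.0"),
    ("two", "B-NumberOfOperatingSegments", "0.9831"),
    ("operating", "O", "1.0"),
]
print(extract_entities(rows))
```

Raising min_confidence (for example to 0.9) is one way to trade recall for precision on the weaker labels, although the right threshold depends on the application.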

Model Information

Model Name: finner_10q_xbrl_lg_revenue
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.3 MB

References

An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.

Benchmarking


label                                                           precision    recall  f1-score   support
            B-ContractWithCustomerLiabilityRevenueRecognized     0.9516    0.9502    0.9509       642
B-CumulativeEffectOfNewAccountingPrincipleInPeriodOfAdoption     0.8588    0.9648    0.9087       227
                                 B-NumberOfOperatingSegments     0.8159    0.8654    0.8399       379
                                B-NumberOfReportableSegments     0.9186    0.8899    0.9040       672
       B-RevenueFromContractWithCustomerExcludingAssessedTax     0.7234    0.9223    0.8108       618
       B-RevenueFromContractWithCustomerIncludingAssessedTax     0.9130    0.2500    0.3925       168
                                 B-RevenueFromRelatedParties     0.9161    0.9192    0.9176       594
                                                           O     0.9998    0.9988    0.9993     80111
                                                    accuracy       -           -     0.9942     83411
                                                   macro-avg     0.8872    0.8451    0.8405     83411
                                                weighted-avg     0.9947    0.9942    0.9940     83411
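
The gap between macro-avg and weighted-avg comes mostly from B-RevenueFromContractWithCustomerIncludingAssessedTax, whose recall of 0.2500 drags down the unweighted mean while the dominant O class keeps the support-weighted mean high. The sketch below recomputes those figures from the table, assuming the standard definitions (F1 as the harmonic mean of precision and recall, macro as the unweighted mean over labels, weighted as the support-weighted mean). The dictionary and function names are illustrative.

```python
# (precision, recall, support) copied from the benchmark table above
report = {
    "B-ContractWithCustomerLiabilityRevenueRecognized": (0.9516, 0.9502, 642),
    "B-CumulativeEffectOfNewAccountingPrincipleInPeriodOfAdoption": (0.8588, 0.9648, 227),
    "B-NumberOfOperatingSegments": (0.8159, 0.8654, 379),
    "B-NumberOfReportableSegments": (0.9186, 0.8899, 672),
    "B-RevenueFromContractWithCustomerExcludingAssessedTax": (0.7234, 0.9223, 618),
    "B-RevenueFromContractWithCustomerIncludingAssessedTax": (0.9130, 0.2500, 168),
    "B-RevenueFromRelatedParties": (0.9161, 0.9192, 594),
    "O": (0.9998, 0.9988, 80111),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(s for _, _, s in report.values())

# Macro average: every label counts equally, regardless of support
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: each label counts in proportion to its support
weighted_recall = sum(r * s for _, r, s in report.values()) / total_support

weak_label = "B-RevenueFromContractWithCustomerIncludingAssessedTax"
p, r, _ = report[weak_label]
print(f"F1 for weakest label: {f1(p, r):.4f}")
print(f"macro recall:    {macro_recall:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

Because the inputs are already rounded to four decimals, recomputed values can differ from the table in the last digit.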