Description
This is a Named Entity Recognition (NER) model for numeric financial items. It detects 12 entity types in 10-Q and 10-K reports, each labeled with an eXtensible Business Reporting Language (XBRL) tag. Annotation primarily targets numerical tokens, so the surrounding context determines which entity type a number receives. The 12 types are drawn from the 139 most common financial entities available in the dataset.
Predicted Entities
AllocatedShareBasedCompensationExpense
AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount
AccrualForEnvironmentalLossContingencies
BusinessCombinationAcquisitionRelatedCosts
AreaOfRealEstateProperty
AmortizationOfIntangibleAssets
BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued
AssetImpairmentCharges
BusinessCombinationConsiderationTransferred1
BusinessAcquisitionPercentageOfVotingInterestsAcquired
AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife
AmortizationOfFinancingCosts
How to use
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark session (this model requires a valid Finance NLP license)
spark = nlp.start()

# Wrap the raw text column as document annotations
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split documents into sentences
sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenize, splitting surrounding punctuation and currency symbols from tokens
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

# SEC-BERT embeddings, the input features for the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

nerTagger = finance.NerModel.pretrained('finner_10q_xbrl_md_subset1', 'en', 'finance/models') \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

text = "Common Stock The authorized capital of the Company is 200,000,000 common shares , par value $ 0.001 , of which 12,481,724 are issued or outstanding ."

df = spark.createDataFrame([[text]]).toDF("text")
fit = pipeline.fit(df)
result = fit.transform(df)

# One row per token with its predicted label and confidence
result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ner_label"),
            F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)
Results
+-----------+------------------------------------------------------------------------+----------+
|token      |ner_label                                                               |confidence|
+-----------+------------------------------------------------------------------------+----------+
|Common     |O                                                                       |1.0       |
|Stock      |O                                                                       |1.0       |
|The        |O                                                                       |1.0       |
|authorized |O                                                                       |1.0       |
|capital    |O                                                                       |1.0       |
|of         |O                                                                       |1.0       |
|the        |O                                                                       |1.0       |
|Company    |O                                                                       |1.0       |
|is         |O                                                                       |1.0       |
|200,000,000|B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued|0.9905    |
|common     |O                                                                       |1.0       |
|shares     |O                                                                       |1.0       |
|,          |O                                                                       |1.0       |
|par        |O                                                                       |1.0       |
|value      |O                                                                       |1.0       |
|$          |O                                                                       |1.0       |
|0.001      |O                                                                       |0.995     |
|,          |O                                                                       |1.0       |
|of         |O                                                                       |1.0       |
|which      |O                                                                       |1.0       |
+-----------+------------------------------------------------------------------------+----------+
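The model emits one IOB label per token (B- opens an entity, I- continues it, O is outside any entity), so downstream code usually needs to merge tagged tokens into entity chunks. The following is a minimal pure-Python sketch of that merging step, not part of the pipeline itself; the function name iob_to_chunks is hypothetical, and the sample tokens and labels are taken from the output above.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity_type, text) chunks.

    Hypothetical helper for illustration: an I- tag that does not
    continue an open chunk of the same type is ignored.
    """
    chunks: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the open chunk, if any, and reset the state
        nonlocal current_type, current_tokens
        if current_type is not None and current_tokens:
            chunks.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            flush()
    flush()
    return chunks


# Tokens and labels copied from the sample output
tokens = ["Company", "is", "200,000,000", "common", "shares"]
labels = [
    "O",
    "O",
    "B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued",
    "O",
    "O",
]
chunks = iob_to_chunks(tokens, labels)
print(chunks)
```

Most predictions in this model are single-token B- tags, but the benchmark does list a few I- labels (for example multi-token useful-life or area values), which is why the sketch handles continuation.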
Model Information
Model Name: | finner_10q_xbrl_md_subset1 |
Compatibility: | Finance NLP 1.0.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 16.5 MB |
References
An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.
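The exact filtering procedure is not documented here. As a rough illustration only, tag density can be read as the fraction of tokens in a sentence that carry a non-O label, and a filter keeps sentences at or above some threshold. The function names and the 0.05 threshold below are assumptions, not the values used to build this dataset.

```python
from typing import List


def tag_density(labels: List[str]) -> float:
    """Fraction of tokens in a sentence that carry a non-"O" label."""
    if not labels:
        return 0.0
    tagged = sum(1 for label in labels if label != "O")
    return tagged / len(labels)


def filter_by_density(sentences: List[List[str]], min_density: float = 0.05) -> List[List[str]]:
    """Keep label sequences whose tag density is at least min_density.

    min_density is a hypothetical threshold chosen for illustration.
    """
    return [labels for labels in sentences if tag_density(labels) >= min_density]


# Toy label sequences: one with a tagged token, one without
sentences = [
    ["O", "O", "B-AssetImpairmentCharges", "O"],
    ["O", "O", "O", "O"],
]
kept = filter_by_density(sentences)
print(len(kept), "of", len(sentences), "sentences kept")
```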
Benchmarking
| label | precision | recall | f1-score | support |
| B-AccrualForEnvironmentalLossContingencies | 1.0000 | 0.9386 | 0.9683 | 228 |
| B-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife | 0.9968 | 0.9778 | 0.9872 | 316 |
| B-AllocatedShareBasedCompensationExpense | 0.9931 | 0.9885 | 0.9908 | 1735 |
| B-AmortizationOfFinancingCosts | 0.9806 | 0.9268 | 0.9530 | 164 |
| B-AmortizationOfIntangibleAssets | 0.9910 | 0.9821 | 0.9865 | 1227 |
| B-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount | 0.9949 | 1.0000 | 0.9975 | 1570 |
| B-AreaOfRealEstateProperty | 0.9421 | 1.0000 | 0.9702 | 114 |
| B-AssetImpairmentCharges | 0.9298 | 0.9815 | 0.9550 | 270 |
| B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued | 0.9760 | 0.9606 | 0.9683 | 127 |
| B-BusinessAcquisitionPercentageOfVotingInterestsAcquired | 0.9572 | 0.9968 | 0.9766 | 314 |
| B-BusinessCombinationAcquisitionRelatedCosts | 0.9597 | 0.9375 | 0.9485 | 432 |
| B-BusinessCombinationConsiderationTransferred1 | 0.9706 | 0.9354 | 0.9527 | 495 |
| I-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife | 0.9804 | 0.8929 | 0.9346 | 56 |
| I-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount | 1.0000 | 1.0000 | 1.0000 | 2 |
| I-AreaOfRealEstateProperty | 1.0000 | 1.0000 | 1.0000 | 1 |
| O | 0.9992 | 0.9995 | 0.9993 | 164664 |
| accuracy | - | - | 0.9986 | 171715 |
| macro-avg | 0.9795 | 0.9699 | 0.9743 | 171715 |
| weighted-avg | 0.9986 | 0.9986 | 0.9986 | 171715 |
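As a reading aid, the per-label F1 is the harmonic mean of precision and recall, and the macro average is the unweighted mean over the 16 labels (so the dominant O class counts no more than a label with a single example). The sketch below recomputes those figures from the rows of the table above; the variable and function names are illustrative, and a small tolerance is used because the published values are rounded to four decimals.

```python
from typing import List, Tuple

# (label, precision, recall, f1, support), copied from the benchmark table
ROWS: List[Tuple[str, float, float, float, int]] = [
    ("B-AccrualForEnvironmentalLossContingencies", 1.0000, 0.9386, 0.9683, 228),
    ("B-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife", 0.9968, 0.9778, 0.9872, 316),
    ("B-AllocatedShareBasedCompensationExpense", 0.9931, 0.9885, 0.9908, 1735),
    ("B-AmortizationOfFinancingCosts", 0.9806, 0.9268, 0.9530, 164),
    ("B-AmortizationOfIntangibleAssets", 0.9910, 0.9821, 0.9865, 1227),
    ("B-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount", 0.9949, 1.0000, 0.9975, 1570),
    ("B-AreaOfRealEstateProperty", 0.9421, 1.0000, 0.9702, 114),
    ("B-AssetImpairmentCharges", 0.9298, 0.9815, 0.9550, 270),
    ("B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued", 0.9760, 0.9606, 0.9683, 127),
    ("B-BusinessAcquisitionPercentageOfVotingInterestsAcquired", 0.9572, 0.9968, 0.9766, 314),
    ("B-BusinessCombinationAcquisitionRelatedCosts", 0.9597, 0.9375, 0.9485, 432),
    ("B-BusinessCombinationConsiderationTransferred1", 0.9706, 0.9354, 0.9527, 495),
    ("I-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife", 0.9804, 0.8929, 0.9346, 56),
    ("I-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount", 1.0000, 1.0000, 1.0000, 2),
    ("I-AreaOfRealEstateProperty", 1.0000, 1.0000, 1.0000, 1),
    ("O", 0.9992, 0.9995, 0.9993, 164664),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Largest difference between the recomputed and the reported per-label F1
max_gap = max(abs(f1_score(p, r) - f1) for _, p, r, f1, _ in ROWS)

# Macro average: unweighted mean of the reported per-label F1 scores
macro_f1 = sum(row[3] for row in ROWS) / len(ROWS)

# Total number of evaluated tokens
total_support = sum(row[4] for row in ROWS)

print(f"max F1 gap: {max_gap:.5f}")
print(f"macro F1: {macro_f1:.4f}")
print(f"total support: {total_support}")
```

The recomputed macro F1 and total support agree with the macro-avg row (0.9743 and 171715), and every per-label F1 matches its precision and recall to within rounding.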