Description
This model is a Named Entity Recognition (NER) model focused on financial numeric items. It identifies 12 numeric financial entities from diverse 10-Q and 10-K reports. These entities are annotated using eXtensible Business Reporting Language (XBRL) tags. The annotation process primarily targets numerical tokens, and the context plays a crucial role in accurately assigning the appropriate entity type from the 139 most common financial entities available in the dataset.
This is a large (md
) model, trained with 200K sentences.
Predicted Entities
OperatingLeaseExpense
, OperatingLeasesRentExpenseNet
, OperatingLossCarryforwards
, OperatingLeaseLiability
, OperatingLeaseWeightedAverageRemainingLeaseTerm1
, OperatingLeaseCost
, NumberOfRealEstateProperties
, OperatingLeaseWeightedAverageDiscountRatePercent
, OperatingLeaseRightOfUseAsset
, NumberOfReportableSegments
, OperatingLeasePayments
, PaymentsToAcquireBusinessesGross
How to use
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")\
.setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
nerTagger = finance.NerModel.pretrained('finner_10q_xbrl_md_subset8', 'en', 'finance/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
pipeline = nlp.Pipeline(stages=[documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
])
text = "On February 1 , 2016 , we acquired Tideland Signal Corporation ( “ Tideland ” ) , a leading producer of analytics solutions in the coastal and ocean management sectors , for $ 70 million . "
df = spark.createDataFrame([[text]]).toDF("text")
fit = pipeline.fit(df)
result = fit.transform(df)
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("token"),\
F.expr("cols['1']").alias("ner_label"),\
F.expr("cols['2']['confidence']").alias("confidence"))
result_df.show(50, truncate=100)
Results
+-----------+----------------------------------+----------+
|token |ner_label |confidence|
+-----------+----------------------------------+----------+
|On |O |1.0 |
|February |O |1.0 |
|1 |O |1.0 |
|, |O |1.0 |
|2016 |O |1.0 |
|, |O |1.0 |
|we |O |1.0 |
|acquired |O |1.0 |
|Tideland |O |1.0 |
|Signal |O |1.0 |
|Corporation|O |1.0 |
|( |O |1.0 |
|“ |O |1.0 |
|Tideland |O |1.0 |
|” |O |1.0 |
|) |O |1.0 |
|, |O |1.0 |
|a |O |1.0 |
|leading |O |1.0 |
|producer |O |1.0 |
|of |O |1.0 |
|analytics |O |1.0 |
|solutions |O |1.0 |
|in |O |1.0 |
|the |O |1.0 |
|coastal |O |1.0 |
|and |O |1.0 |
|ocean |O |1.0 |
|management |O |1.0 |
|sectors |O |1.0 |
|, |O |1.0 |
|for |O |1.0 |
|$ |O |1.0 |
|70 |B-PaymentsToAcquireBusinessesGross|1.0 |
|million |O |1.0 |
|. |O |1.0 |
+-----------+----------------------------------+----------+
Model Information
Model Name: | finner_10q_xbrl_md_subset8 |
Compatibility: | Finance NLP 1.0.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 16.5 MB |
References
An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-splited and filtered to focus on sentences with bigger density of tags.
Benchmarking
label precision recall f1-score support
B-NumberOfRealEstateProperties 0.9556 0.9894 0.9722 283
B-NumberOfReportableSegments 0.9862 0.9984 0.9923 645
B-OperatingLeaseCost 0.7669 0.6793 0.7205 184
B-OperatingLeaseExpense 0.5362 0.1979 0.2891 187
B-OperatingLeaseLiability 0.8584 0.9597 0.9062 695
B-OperatingLeasePayments 0.9389 1.0000 0.9685 169
B-OperatingLeaseRightOfUseAsset 0.9609 0.9022 0.9306 818
B-OperatingLeaseWeightedAverageDiscountRatePercent 0.9471 0.9801 0.9633 201
B-OperatingLeaseWeightedAverageRemainingLeaseTerm1 0.9123 0.9905 0.9498 210
B-OperatingLeasesRentExpenseNet 0.5714 0.9346 0.7092 214
B-OperatingLossCarryforwards 0.9135 1.0000 0.9548 169
B-PaymentsToAcquireBusinessesGross 0.9033 0.9972 0.9479 356
I-OperatingLeaseWeightedAverageRemainingLeaseTerm1 0.7222 0.8125 0.7647 16
O 0.9996 0.9984 0.9990 109729
accuracy - - 0.9954 113876
macro-avg 0.8552 0.8886 0.8620 113876
weighted-avg 0.9955 0.9954 0.9952 113876