Finance Numeric NER (10-Q, 10-K, lg, 139 entities, XBRL)

Description

This Named Entity Recognition (NER) model targets numeric items in financial text. It recognizes the 139 most frequent financial entities in the dataset, which were extracted from a range of 10-Q and 10-K reports and annotated with eXtensible Business Reporting Language (XBRL) tags. Annotation primarily targets numerical tokens, so the surrounding context determines which of the 139 entity types a given number receives.

This is a large (lg) model, trained on 200K sentences.

Predicted Entities

DeferredFinanceCostsNet, DisposalGroupIncludingDiscontinuedOperationConsideration, DebtInstrumentCarryingAmount, CommonStockSharesAuthorized, RestructuringCharges, DeferredFinanceCostsGross, OperatingLeasesRentExpenseNet, EquityMethodInvestmentOwnershipPercentage, ClassOfWarrantOrRightExercisePriceOfWarrantsOrRights1, DebtInstrumentTerm, DebtInstrumentRedemptionPricePercentage, CommonStockCapitalSharesReservedForFutureIssuance, LossContingencyAccrualAtCarryingValue, SaleOfStockPricePerShare, MinorityInterestOwnershipPercentageByParent, PropertyPlantAndEquipmentUsefulLife, TreasuryStockAcquiredAverageCostPerShare, Goodwill, SupplementalInformationForPropertyCasualtyInsuranceUnderwritersPriorYearClaimsAndClaimsAdjustmentExpense, CommonStockParOrStatedValuePerShare, OperatingLeaseWeightedAverageDiscountRatePercent, DebtInstrumentConvertibleConversionPrice1, AmortizationOfIntangibleAssets, PreferredStockSharesAuthorized, OperatingLeasePayments, DebtInstrumentMaturityDate, ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodWeightedAverageGrantDateFairValue, EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate, AllocatedShareBasedCompensationExpense, PreferredStockDividendRatePercentage, StockRepurchaseProgramRemainingAuthorizedRepurchaseAmount1, TreasuryStockValueAcquiredCostMethod, ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsVestedInPeriodTotalFairValue, IncomeTaxExpenseBenefit, DerivativeFixedInterestRate, RelatedPartyTransactionExpensesFromTransactionsWithRelatedParty, PublicUtilitiesRequestedRateIncreaseDecreaseAmount, RestructuringAndRelatedCostExpectedCost1, StockRepurchaseProgramAuthorizedAmount1, ShareBasedCompensation, ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriodWeightedAverageGrantDateFairValue, LongTermDebtFairValue, LineOfCreditFacilityUnusedCapacityCommitmentFeePercentage, 
LineOfCreditFacilityCurrentBorrowingCapacity, ShareBasedCompensationArrangementByShareBasedPaymentAwardAwardVestingPeriod1, SharebasedCompensationArrangementBySharebasedPaymentAwardAwardVestingRightsPercentage, PaymentsToAcquireBusinessesGross, MinorityInterestOwnershipPercentageByNoncontrollingOwners, AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount, NumberOfReportableSegments, BusinessCombinationRecognizedIdentifiableAssetsAcquiredAndLiabilitiesAssumedIntangibleAssetsOtherThanGoodwill, OperatingLeaseCost, BusinessCombinationConsiderationTransferred1, UnrecognizedTaxBenefitsThatWouldImpactEffectiveTaxRate, CommonStockDividendsPerShareDeclared, AreaOfRealEstateProperty, LesseeOperatingLeaseTermOfContract, RevenueRemainingPerformanceObligation, RelatedPartyTransactionAmountsOfTransaction, InterestExpense, OperatingLeaseExpense, StockIssuedDuringPeriodSharesNewIssues, DebtInstrumentFaceAmount, CapitalizedContractCostAmortization, DebtInstrumentBasisSpreadOnVariableRate1, ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsNonvestedNumber, GainsLossesOnExtinguishmentOfDebt, LineOfCreditFacilityRemainingBorrowingCapacity, OperatingLeaseRightOfUseAsset, OperatingLeaseWeightedAverageRemainingLeaseTerm1, OperatingLossCarryforwards, ConcentrationRiskPercentage1, GuaranteeObligationsMaximumExposure, StockRepurchasedAndRetiredDuringPeriodShares, LesseeOperatingLeaseRenewalTerm, ContractWithCustomerLiabilityRevenueRecognized, DefinedBenefitPlanContributionsByEmployer, ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodGross, RepaymentsOfDebt, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognized, BusinessAcquisitionPercentageOfVotingInterestsAcquired, DebtInstrumentInterestRateEffectivePercentage, AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife, DebtInstrumentUnamortizedDiscount, 
ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAuthorized, BusinessCombinationContingentConsiderationLiability, DebtInstrumentInterestRateStatedPercentage, LeaseAndRentalExpense, RevenueFromContractWithCustomerExcludingAssessedTax, SharePrice, CommonStockSharesOutstanding, ContractWithCustomerLiability, DerivativeNotionalAmount, RevenueFromRelatedParties, ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsExercisesInPeriodTotalIntrinsicValue, Revenues, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedShareBasedAwardsOtherThanOptions, AccrualForEnvironmentalLossContingencies, ProceedsFromIssuanceOfCommonStock, EmployeeServiceShareBasedCompensationTaxBenefitFromCompensationExpense, IncomeLossFromEquityMethodInvestments, NumberOfOperatingSegments, UnrecognizedTaxBenefits, RevenueFromContractWithCustomerIncludingAssessedTax, LossContingencyDamagesSoughtValue, SharebasedCompensationArrangementBySharebasedPaymentAwardExpirationPeriod, TreasuryStockSharesAcquired, FiniteLivedIntangibleAssetUsefulLife, BusinessCombinationRecognizedIdentifiableAssetsAcquiredAndLiabilitiesAssumedIntangibles, EffectiveIncomeTaxRateContinuingOperations, LossContingencyEstimateOfPossibleLoss, ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAvailableForGrant, BusinessCombinationAcquisitionRelatedCosts, StockRepurchasedDuringPeriodShares, CashAndCashEquivalentsFairValueDisclosure, LineOfCreditFacilityInterestRateAtPeriodEnd, ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriod, CumulativeEffectOfNewAccountingPrincipleInPeriodOfAdoption, LettersOfCreditOutstandingAmount, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedPeriodForRecognition1, NumberOfRealEstateProperties, DebtWeightedAverageInterestRate, SaleOfStockNumberOfSharesIssuedInTransaction, AssetImpairmentCharges, Depreciation, 
DebtInstrumentFairValue, DefinedContributionPlanCostRecognized, InterestExpenseDebt, LossContingencyPendingClaimsNumber, PaymentsToAcquireBusinessesNetOfCashAcquired, BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued, GoodwillImpairmentLoss, LineOfCredit, AmortizationOfFinancingCosts, EquityMethodInvestments, LineOfCreditFacilityCommitmentFeePercentage, LongTermDebt, LineOfCreditFacilityMaximumBorrowingCapacity, OperatingLeaseLiability

How to use

from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

spark = nlp.start()

documentAssembler = nlp.DocumentAssembler() \
   .setInputCol("text") \
   .setOutputCol("document")

sentence = nlp.SentenceDetector() \
   .setInputCols(["document"]) \
   .setOutputCol("sentence") 

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])

# Embeddings read the same "sentence" column that the tokenizer and NER model use.
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

ner_model = finance.NerModel.pretrained('finner_10q_xbrl', 'en', 'finance/models')\
   .setInputCols(["sentence", "token", "embeddings"])\
   .setOutputCol("ner")
              
pipeline = nlp.Pipeline(stages=[documentAssembler,
                            sentence,
                            tokenizer,
                            embeddings,
                            ner_model
                                ])

text = """Common Stock The authorized capital of the Company is 200,000,000 common shares , par value $ 0.001 , of which 12,481,724 are issued or outstanding ."""
df = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(df).transform(df)
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

Results

+-----------+-------------------------------------+----------+
|      token|                            ner_label|confidence|
+-----------+-------------------------------------+----------+
|     Common|                                    O|       1.0|
|      Stock|                                    O|       1.0|
|        The|                                    O|       1.0|
| authorized|                                    O|       1.0|
|    capital|                                    O|       1.0|
|         of|                                    O|       1.0|
|        the|                                    O|       1.0|
|    Company|                                    O|       1.0|
|         is|                                    O|       1.0|
|200,000,000|        B-CommonStockSharesAuthorized|    0.9932|
|     common|                                    O|       1.0|
|     shares|                                    O|       1.0|
|          ,|                                    O|       1.0|
|        par|                                    O|       1.0|
|      value|                                    O|       1.0|
|          $|                                    O|       1.0|
|      0.001|B-CommonStockParOrStatedValuePerShare|    0.9988|
|          ,|                                    O|       1.0|
|         of|                                    O|       1.0|
|      which|                                    O|       1.0|
| 12,481,724|       B-CommonStockSharesOutstanding|    0.9649|
|        are|                                    O|       1.0|
|     issued|                                    O|       1.0|
|         or|                                    O|       1.0|
|outstanding|                                    O|       1.0|
|          .|                                    O|       1.0|
+-----------+-------------------------------------+----------+
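
Most rows in this output are labelled O, so downstream code usually wants only the tagged numbers. The following is a minimal sketch in plain Python, not part of the model's API. The function name `tagged_values` and the 0.99 threshold are illustrative assumptions, and the rows are copied by hand from the table above. In the Spark DataFrame the confidence comes from annotation metadata and is a string, so it would need casting to a float first.

```python
# Rows copied from the results table above: (token, ner_label, confidence).
rows = [
    ("is", "O", 1.0),
    ("200,000,000", "B-CommonStockSharesAuthorized", 0.9932),
    ("common", "O", 1.0),
    ("0.001", "B-CommonStockParOrStatedValuePerShare", 0.9988),
    ("12,481,724", "B-CommonStockSharesOutstanding", 0.9649),
    ("outstanding", "O", 1.0),
]


def tagged_values(rows, min_confidence=0.0):
    """Return (xbrl_tag, token) pairs for non-O rows at or above min_confidence.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for token, label, confidence in rows:
        if label == "O" or confidence < min_confidence:
            continue
        # Strip the "B-" or "I-" prefix to recover the bare XBRL tag name.
        tag = label.split("-", 1)[1]
        pairs.append((tag, token))
    return pairs


# All tagged numbers, then only those at or above an illustrative threshold.
print(tagged_values(rows))
print(tagged_values(rows, min_confidence=0.99))
```

With the threshold at 0.99, the CommonStockSharesOutstanding prediction (confidence 0.9649) is dropped and the other two are kept.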

Model Information

Model Name: finner_10q_xbrl
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 17.0 MB

References

An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.

Benchmarking

label          tp     fp     fn     prec       rec        f1
Macro-average  53613  10309  10243  0.8324958  0.8049274  0.8184795
Micro-average  53613  10309  10243  0.8387253  0.8395922  0.8391586
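
The micro-average row can be checked directly from its tp, fp and fn totals using the standard precision, recall and F1 definitions. The sketch below does that. The function name `micro_scores` is an illustrative assumption. The macro-average row cannot be reproduced this way, because macro scores average per-label metrics, which are not listed here.

```python
def micro_scores(tp, fp, fn):
    """Compute precision, recall and F1 from pooled counts (standard definitions)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Totals taken from the benchmark table above.
precision, recall, f1 = micro_scores(tp=53613, fp=10309, fn=10243)
print(f"prec={precision:.7f} rec={recall:.7f} f1={f1:.7f}")
```

Rounded to seven decimals, these agree with the micro-average precision, recall and F1 reported in the table.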