Description
This Named Entity Recognition (NER) model tags financial numeric items. It recognizes the 139 most frequent financial entity types found in 10-Q and 10-K reports, which are annotated with eXtensible Business Reporting Language (XBRL) tags. Annotations fall mainly on numeric tokens, so the model relies on the surrounding context to choose the correct entity type among the 139 available in the dataset.
This is a large (lg) model, trained with 200K sentences.
Predicted Entities
DeferredFinanceCostsNet, DisposalGroupIncludingDiscontinuedOperationConsideration, DebtInstrumentCarryingAmount, CommonStockSharesAuthorized, RestructuringCharges, DeferredFinanceCostsGross, OperatingLeasesRentExpenseNet, EquityMethodInvestmentOwnershipPercentage, ClassOfWarrantOrRightExercisePriceOfWarrantsOrRights1, DebtInstrumentTerm, DebtInstrumentRedemptionPricePercentage, CommonStockCapitalSharesReservedForFutureIssuance, LossContingencyAccrualAtCarryingValue, SaleOfStockPricePerShare, MinorityInterestOwnershipPercentageByParent, PropertyPlantAndEquipmentUsefulLife, TreasuryStockAcquiredAverageCostPerShare, Goodwill, SupplementalInformationForPropertyCasualtyInsuranceUnderwritersPriorYearClaimsAndClaimsAdjustmentExpense, CommonStockParOrStatedValuePerShare, OperatingLeaseWeightedAverageDiscountRatePercent, DebtInstrumentConvertibleConversionPrice1, AmortizationOfIntangibleAssets, PreferredStockSharesAuthorized, OperatingLeasePayments, DebtInstrumentMaturityDate, ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodWeightedAverageGrantDateFairValue, EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate, AllocatedShareBasedCompensationExpense, PreferredStockDividendRatePercentage, StockRepurchaseProgramRemainingAuthorizedRepurchaseAmount1, TreasuryStockValueAcquiredCostMethod, ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsVestedInPeriodTotalFairValue, IncomeTaxExpenseBenefit, DerivativeFixedInterestRate, RelatedPartyTransactionExpensesFromTransactionsWithRelatedParty, PublicUtilitiesRequestedRateIncreaseDecreaseAmount, RestructuringAndRelatedCostExpectedCost1, StockRepurchaseProgramAuthorizedAmount1, ShareBasedCompensation, ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriodWeightedAverageGrantDateFairValue, LongTermDebtFairValue, LineOfCreditFacilityUnusedCapacityCommitmentFeePercentage, 
LineOfCreditFacilityCurrentBorrowingCapacity, ShareBasedCompensationArrangementByShareBasedPaymentAwardAwardVestingPeriod1, SharebasedCompensationArrangementBySharebasedPaymentAwardAwardVestingRightsPercentage, PaymentsToAcquireBusinessesGross, MinorityInterestOwnershipPercentageByNoncontrollingOwners, AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount, NumberOfReportableSegments, BusinessCombinationRecognizedIdentifiableAssetsAcquiredAndLiabilitiesAssumedIntangibleAssetsOtherThanGoodwill, OperatingLeaseCost, BusinessCombinationConsiderationTransferred1, UnrecognizedTaxBenefitsThatWouldImpactEffectiveTaxRate, CommonStockDividendsPerShareDeclared, AreaOfRealEstateProperty, LesseeOperatingLeaseTermOfContract, RevenueRemainingPerformanceObligation, RelatedPartyTransactionAmountsOfTransaction, InterestExpense, OperatingLeaseExpense, StockIssuedDuringPeriodSharesNewIssues, DebtInstrumentFaceAmount, CapitalizedContractCostAmortization, DebtInstrumentBasisSpreadOnVariableRate1, ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsNonvestedNumber, GainsLossesOnExtinguishmentOfDebt, LineOfCreditFacilityRemainingBorrowingCapacity, OperatingLeaseRightOfUseAsset, OperatingLeaseWeightedAverageRemainingLeaseTerm1, OperatingLossCarryforwards, ConcentrationRiskPercentage1, GuaranteeObligationsMaximumExposure, StockRepurchasedAndRetiredDuringPeriodShares, LesseeOperatingLeaseRenewalTerm, ContractWithCustomerLiabilityRevenueRecognized, DefinedBenefitPlanContributionsByEmployer, ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodGross, RepaymentsOfDebt, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognized, BusinessAcquisitionPercentageOfVotingInterestsAcquired, DebtInstrumentInterestRateEffectivePercentage, AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife, DebtInstrumentUnamortizedDiscount, 
ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAuthorized, BusinessCombinationContingentConsiderationLiability, DebtInstrumentInterestRateStatedPercentage, LeaseAndRentalExpense, RevenueFromContractWithCustomerExcludingAssessedTax, SharePrice, CommonStockSharesOutstanding, ContractWithCustomerLiability, DerivativeNotionalAmount, RevenueFromRelatedParties, ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsExercisesInPeriodTotalIntrinsicValue, Revenues, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedShareBasedAwardsOtherThanOptions, AccrualForEnvironmentalLossContingencies, ProceedsFromIssuanceOfCommonStock, EmployeeServiceShareBasedCompensationTaxBenefitFromCompensationExpense, IncomeLossFromEquityMethodInvestments, NumberOfOperatingSegments, UnrecognizedTaxBenefits, RevenueFromContractWithCustomerIncludingAssessedTax, LossContingencyDamagesSoughtValue, SharebasedCompensationArrangementBySharebasedPaymentAwardExpirationPeriod, TreasuryStockSharesAcquired, FiniteLivedIntangibleAssetUsefulLife, BusinessCombinationRecognizedIdentifiableAssetsAcquiredAndLiabilitiesAssumedIntangibles, EffectiveIncomeTaxRateContinuingOperations, LossContingencyEstimateOfPossibleLoss, ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAvailableForGrant, BusinessCombinationAcquisitionRelatedCosts, StockRepurchasedDuringPeriodShares, CashAndCashEquivalentsFairValueDisclosure, LineOfCreditFacilityInterestRateAtPeriodEnd, ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriod, CumulativeEffectOfNewAccountingPrincipleInPeriodOfAdoption, LettersOfCreditOutstandingAmount, EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedPeriodForRecognition1, NumberOfRealEstateProperties, DebtWeightedAverageInterestRate, SaleOfStockNumberOfSharesIssuedInTransaction, AssetImpairmentCharges, Depreciation, 
DebtInstrumentFairValue, DefinedContributionPlanCostRecognized, InterestExpenseDebt, LossContingencyPendingClaimsNumber, PaymentsToAcquireBusinessesNetOfCashAcquired, BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued, GoodwillImpairmentLoss, LineOfCredit, AmortizationOfFinancingCosts, EquityMethodInvestments, LineOfCreditFacilityCommitmentFeePercentage, LongTermDebt, LineOfCreditFacilityMaximumBorrowingCapacity, OperatingLeaseLiability
How to use
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark session with the John Snow Labs libraries loaded
spark = nlp.start()

# Wrap the raw "text" column in a document annotation
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into sentences
sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenize; the context characters are split off from token boundaries
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

# SEC-BERT token embeddings consumed by the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

# Pretrained XBRL NER model
ner_model = finance.NerModel.pretrained('finner_10q_xbrl', 'en', 'finance/models') \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    ner_model,
])

text = """Common Stock The authorized capital of the Company is 200,000,000 common shares , par value $ 0.001 , of which 12,481,724 are issued or outstanding ."""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Zip tokens, labels and metadata together, then emit one row per token
result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ner_label"),
            F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)
Results
+-----------+-------------------------------------+----------+
| token| ner_label|confidence|
+-----------+-------------------------------------+----------+
| Common| O| 1.0|
| Stock| O| 1.0|
| The| O| 1.0|
| authorized| O| 1.0|
| capital| O| 1.0|
| of| O| 1.0|
| the| O| 1.0|
| Company| O| 1.0|
| is| O| 1.0|
|200,000,000| B-CommonStockSharesAuthorized| 0.9932|
| common| O| 1.0|
| shares| O| 1.0|
| ,| O| 1.0|
| par| O| 1.0|
| value| O| 1.0|
| $| O| 1.0|
| 0.001|B-CommonStockParOrStatedValuePerShare| 0.9988|
| ,| O| 1.0|
| of| O| 1.0|
| which| O| 1.0|
| 12,481,724| B-CommonStockSharesOutstanding| 0.9649|
| are| O| 1.0|
| issued| O| 1.0|
| or| O| 1.0|
|outstanding| O| 1.0|
| .| O| 1.0|
+-----------+-------------------------------------+----------+
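Most rows in the table above carry the `O` (outside) label, so downstream code usually wants only the tagged numbers. The following is a minimal sketch of that filtering step in plain Python. It assumes the rows have already been collected as `(token, ner_label, confidence)` tuples, for example from `result_df.collect()`; the helper name `extract_entities` and the `min_confidence` parameter are hypothetical and not part of the johnsnowlabs API.

```python
# Hypothetical helper for illustration; not part of the johnsnowlabs API.
def extract_entities(rows, min_confidence=0.0):
    """Return (token, entity_type, confidence) for every token not labelled 'O'."""
    entities = []
    for token, label, confidence in rows:
        if label == "O":
            continue
        # Drop the "B-" / "I-" prefix to recover the XBRL tag name
        entity_type = label.split("-", 1)[1] if "-" in label else label
        # Metadata values may arrive as strings, so cast before comparing
        confidence = float(confidence)
        if confidence >= min_confidence:
            entities.append((token, entity_type, confidence))
    return entities


# A few rows copied from the Results table
rows = [
    ("is", "O", 1.0),
    ("200,000,000", "B-CommonStockSharesAuthorized", 0.9932),
    ("0.001", "B-CommonStockParOrStatedValuePerShare", 0.9988),
    ("12,481,724", "B-CommonStockSharesOutstanding", 0.9649),
    (".", "O", 1.0),
]

for token, entity_type, confidence in extract_entities(rows, min_confidence=0.97):
    print(f"{token}\t{entity_type}\t{confidence}")
```

With the threshold set to 0.97, the `12,481,724` row (confidence 0.9649) is dropped. The threshold itself is an arbitrary choice for the example, not a recommended value.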
Model Information
| Model Name: | finner_10q_xbrl |
| Compatibility: | Finance NLP 1.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | en |
| Size: | 17.0 MB |
References
An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags.
Benchmarking
| label | tp | fp | fn | prec | rec | f1 |
|---|---|---|---|---|---|---|
| Macro-average | 53613 | 10309 | 10243 | 0.8324958 | 0.8049274 | 0.8184795 |
| Micro-average | 53613 | 10309 | 10243 | 0.8387253 | 0.8395922 | 0.8391586 |
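The micro-averaged figures can be reproduced from the pooled counts with the standard precision, recall and F1 formulas. The sketch below does that arithmetic; the function name `precision_recall_f1` is made up for the example. The last lines also check the macro row: its F1 appears to be the harmonic mean of the reported macro precision and recall, which is an observation from the numbers and not something the source states.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Pooled counts from the benchmarking table
prec, rec, f1 = precision_recall_f1(tp=53613, fp=10309, fn=10243)
print(f"micro prec={prec:.7f} rec={rec:.7f} f1={f1:.7f}")

# Harmonic mean of the reported macro precision and recall
macro_prec, macro_rec = 0.8324958, 0.8049274
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)
print(f"macro f1={macro_f1:.7f}")
```

Macro precision and recall themselves cannot be recomputed here, because they average per-label scores and the per-label counts are not listed.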