Financial 10K Filings NER (Summary page, SEC embeddings)

Description

IMPORTANT: Don’t run this model on the whole financial report. Instead:

  • Split by paragraphs;
  • Use the finclf_form_10k_summary_item Text Classifier to select only the summary-page paragraphs, then run this model on those.
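
The two steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the library's own pipeline: `split_paragraphs` and `select_paragraphs` are hypothetical helper names, paragraphs are assumed to be separated by blank lines, and `is_summary_item` is a stand-in for whatever wraps the finclf_form_10k_summary_item classifier's prediction. The toy report and the keyword rule are made up for illustration.

```python
import re
from typing import Callable, List


def split_paragraphs(report: str) -> List[str]:
    """Split a report into paragraphs on blank lines (assumed delimiter)."""
    parts = re.split(r"\n\s*\n", report)
    return [p.strip() for p in parts if p.strip()]


def select_paragraphs(paragraphs: List[str],
                      is_summary_item: Callable[[str], bool]) -> List[str]:
    """Keep only the paragraphs accepted by the classifier stand-in."""
    return [p for p in paragraphs if is_summary_item(p)]


# Toy stand-in for the classifier. In practice this would return the
# prediction of finclf_form_10k_summary_item; the keyword rule below is
# for illustration only.
def keyword_stub(paragraph: str) -> bool:
    return "Commission File Number" in paragraph


# Made-up two-paragraph input: one summary-page paragraph, one that is not.
report = (
    "Commission File Number: 001-38856\nPAGERDUTY, INC.\n\n"
    "Item 1A. Risk Factors\nInvesting in our common stock involves risk."
)
selected = select_paragraphs(split_paragraphs(report), keyword_stub)
print(selected)
```

Only the paragraphs in `selected` would then be passed to the NER pipeline, one row per paragraph.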

This Financial NER model processes the first (summary) page of 10-K filings and extracts information about the company submitting the filing: trading data, address and phone number, CFN (Commission File Number), IRS (I.R.S. Employer Identification Number), etc.

This is another version of finner_10k_summary, trained with better financial embeddings (bert_embeddings_sec_bert_base).

Predicted Entities

ADDRESS, CFN, FISCAL_YEAR, IRS, ORG, PHONE, STATE, STOCK_EXCHANGE, TICKER, TITLE_CLASS, TITLE_CLASS_VALUE

How to use

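from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Start a Spark session with the licensed Finance NLP libraries loaded
spark = nlp.start()

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")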

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds(["\n\n"])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_model = finance.NerModel.pretrained("finner_sec_10k_summary","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
    ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

data = spark.createDataFrame([["""ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES AND EXCHANGE ACT OF 1934
For the annual period ended January 31, 2021
or
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from________to_______
Commission File Number: 001-38856
PAGERDUTY, INC.
(Exact name of registrant as specified in its charter)
Delaware
27-2793871
(State or other jurisdiction of
incorporation or organization)
(I.R.S. Employer
Identification Number)
600 Townsend St., Suite 200, San Francisco, CA 94103
(844) 800-3889
(Address, including zip code, and telephone number, including area code, of registrant’s principal executive offices)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
Trading symbol(s)
Name of each exchange on which registered
Common Stock, $0.000005 par value,
PD
New York Stock Exchange"""]]).toDF("text")

result = model.transform(data)

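# One row per extracted chunk: its text and its entity label.
# The first column is aliased "ticker" but holds the text of every chunk, not only tickers.
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
               .select(F.expr("cols['0']").alias("ticker"),
                       F.expr("cols['1']['entity']").alias("label")).show(50, truncate = False)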

Results

+----------------------------------------------+-----------------+
|ticker                                        |label            |
+----------------------------------------------+-----------------+
|January 31, 2021                              |FISCAL_YEAR      |
|001-38856                                     |CFN              |
|PAGERDUTY, INC                                |ORG              |
|Delaware                                      |STATE            |
|27-2793871                                    |IRS              |
|600 Townsend St., Suite 200, San Francisco, CA|ADDRESS          |
|(844) 800-3889                                |PHONE            |
|Common Stock                                  |TITLE_CLASS      |
|$0.000005                                     |TITLE_CLASS_VALUE|
|PD                                            |TICKER           |
|New York Stock Exchange                       |STOCK_EXCHANGE   |
+----------------------------------------------+-----------------+
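
Downstream code often wants these chunks as one record per filing. The following is a minimal sketch, assuming the (chunk, label) pairs have already been collected from the DataFrame into Python tuples; `chunks_to_record` is a hypothetical helper, not part of the library. Because a label can occur more than once on a page, each label maps to a list of chunks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def chunks_to_record(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (chunk_text, entity_label) pairs into {label: [chunk_text, ...]}."""
    record = defaultdict(list)
    for chunk, label in pairs:
        record[label].append(chunk)
    return dict(record)


# Pairs copied from the results table above.
pairs = [
    ("January 31, 2021", "FISCAL_YEAR"),
    ("001-38856", "CFN"),
    ("PAGERDUTY, INC", "ORG"),
    ("Delaware", "STATE"),
    ("27-2793871", "IRS"),
    ("600 Townsend St., Suite 200, San Francisco, CA", "ADDRESS"),
    ("(844) 800-3889", "PHONE"),
    ("Common Stock", "TITLE_CLASS"),
    ("$0.000005", "TITLE_CLASS_VALUE"),
    ("PD", "TICKER"),
    ("New York Stock Exchange", "STOCK_EXCHANGE"),
]

record = chunks_to_record(pairs)
print(record["ORG"], record["TICKER"], record["STOCK_EXCHANGE"])
```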

Model Information

Model Name: finner_sec_10k_summary
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.5 MB

References

Manual annotations on 10-K Filings

Benchmarking

label               precision    recall  f1-score   support
B-TITLE_CLASS            1.00      1.00      1.00        15
I-TITLE_CLASS            1.00      1.00      1.00        21
B-ORG                    0.84      0.66      0.74        62
I-ORG                    0.88      0.76      0.82        93
B-STOCK_EXCHANGE         0.86      0.86      0.86        14
I-STOCK_EXCHANGE         0.98      0.98      0.98        50
B-PHONE                  0.95      0.87      0.91        23
I-PHONE                  0.95      1.00      0.98        60
B-STATE                  0.89      0.85      0.87        20
B-IRS                    1.00      0.88      0.93        16
B-ADDRESS                0.94      0.83      0.88        18
I-ADDRESS                0.92      0.97      0.94       144
B-TICKER                 0.86      0.92      0.89        13
B-FISCAL_YEAR            0.96      0.88      0.92        50
I-FISCAL_YEAR            0.93      0.92      0.92       125
B-TITLE_CLASS_VALUE      1.00      0.93      0.97        15
B-CFN                    0.92      1.00      0.96        12
micro-avg                0.93      0.89      0.91       751
macro-avg                0.84      0.81      0.82       751
weighted-avg             0.92      0.89      0.91       751
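
For readers who want to check how the columns relate, the sketch below recomputes an f1-score from precision and recall and the support-weighted averages from the per-label rows. It assumes the standard definitions (F1 as the harmonic mean, weighted average as the support-weighted mean); `f1_score` and `weighted_average` are illustrative helper names, and the inputs are the rounded values from the table above, so results are only accurate to about two decimals.

```python
from typing import List


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_average(values: List[float], supports: List[int]) -> float:
    """Mean of per-label values, weighted by each label's support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# (label, precision, recall, support) copied from the table above.
ROWS = [
    ("B-TITLE_CLASS", 1.00, 1.00, 15), ("I-TITLE_CLASS", 1.00, 1.00, 21),
    ("B-ORG", 0.84, 0.66, 62), ("I-ORG", 0.88, 0.76, 93),
    ("B-STOCK_EXCHANGE", 0.86, 0.86, 14), ("I-STOCK_EXCHANGE", 0.98, 0.98, 50),
    ("B-PHONE", 0.95, 0.87, 23), ("I-PHONE", 0.95, 1.00, 60),
    ("B-STATE", 0.89, 0.85, 20), ("B-IRS", 1.00, 0.88, 16),
    ("B-ADDRESS", 0.94, 0.83, 18), ("I-ADDRESS", 0.92, 0.97, 144),
    ("B-TICKER", 0.86, 0.92, 13), ("B-FISCAL_YEAR", 0.96, 0.88, 50),
    ("I-FISCAL_YEAR", 0.93, 0.92, 125), ("B-TITLE_CLASS_VALUE", 1.00, 0.93, 15),
    ("B-CFN", 0.92, 1.00, 12),
]

supports = [row[3] for row in ROWS]
total_support = sum(supports)
weighted_precision = weighted_average([row[1] for row in ROWS], supports)
weighted_recall = weighted_average([row[2] for row in ROWS], supports)
b_org_f1 = f1_score(0.84, 0.66)

print(total_support, round(weighted_precision, 2), round(weighted_recall, 2))
print(round(b_org_f1, 2))
```

With the table's rounded inputs, this reproduces the listed support total (751), the weighted-avg precision and recall (0.92 and 0.89), and the B-ORG f1-score (0.74).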