Financial NER (Headers / Subheaders)

Description

This is a Named Entity Recognition (NER) model that helps you split long financial documents into smaller sections. It does so by detecting the Headers and Subheaders that open each section. You can then use the begin and end positions returned with each detected entity to retrieve the text between consecutive headers.
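The retrieval step described above can be sketched in plain Python. This is a minimal sketch rather than library code: `split_by_headers`, `locate`, and the sample string are hypothetical, and it assumes each detected chunk is available as a `(begin, end, label)` tuple whose `end` offset is inclusive, as in Spark NLP annotations.

```python
from typing import Dict, List, Tuple

# (begin, end, label) for one detected header chunk; `end` is assumed inclusive.
Chunk = Tuple[int, int, str]


def split_by_headers(text: str, chunks: List[Chunk]) -> List[Dict[str, str]]:
    """Return one section per header: its label, its title, and the body up to the next header."""
    ordered = sorted(chunks, key=lambda c: c[0])
    sections = []
    for i, (begin, end, label) in enumerate(ordered):
        # The body runs until the next header starts, or until the end of the text.
        next_begin = ordered[i + 1][0] if i + 1 < len(ordered) else len(text)
        sections.append({
            "label": label,
            "title": text[begin:end + 1],
            "body": text[end + 1:next_begin].strip(),
        })
    return sections


def locate(text: str, title: str, label: str) -> Chunk:
    """Hypothetical helper: build a chunk tuple by searching for the title in the text."""
    begin = text.index(title)
    return (begin, begin + len(title) - 1, label)


sample = "Item 1. Business We sell software. Item 2. Properties We lease two offices."
chunks = [
    locate(sample, "Item 1. Business", "HEADER"),
    locate(sample, "Item 2. Properties", "HEADER"),
]
sections = split_by_headers(sample, chunks)
for section in sections:
    print(section["label"], "|", section["title"], "|", section["body"])
```

In a real pipeline the tuples would come from the detected header chunks instead of `locate`; text that precedes the first header is not returned by this sketch.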

This model has been trained on 10-K filings, with the following HEADER and SUBHEADER annotation guidelines:

  • PART I, PART II, etc. are HEADERS
  • Item 1, Item 2, etc. are also HEADERS
  • Item 1A, 2B, etc. are SUBHEADERS
  • 1., 2., 2.1, etc. are SUBHEADERS
  • Other kinds of short section names are also SUBHEADERS
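To make the guidelines above concrete, here is a toy rule-based restatement of the first four bullets. It is only an illustration of the labeling scheme, not how the model works: the regular expressions and the `classify_heading` function are hypothetical, and the last bullet (other short section names) is exactly the case that rules like these cannot cover and the trained model is meant to handle.

```python
import re

# Hypothetical patterns restating the annotation guidelines.
HEADER_PATTERNS = [
    re.compile(r"^PART\s+[IVX]+\b", re.IGNORECASE),  # PART I, PART II, ...
    re.compile(r"^Item\s+\d+\b", re.IGNORECASE),     # Item 1, Item 2 (but not Item 1A)
]
SUBHEADER_PATTERNS = [
    re.compile(r"^(?:Item\s+)?\d+[A-Za-z]\b", re.IGNORECASE),  # Item 1A, 2B, ...
    re.compile(r"^\d+(?:\.\d+)+\b"),                           # 2.1, 3.4.2, ...
    re.compile(r"^\d+\.(?:\s|$)"),                             # 1., 2., ...
]


def classify_heading(line: str) -> str:
    """Return HEADER, SUBHEADER, or O (outside) for the start of a line."""
    stripped = line.strip()
    if any(p.search(stripped) for p in HEADER_PATTERNS):
        return "HEADER"
    if any(p.search(stripped) for p in SUBHEADER_PATTERNS):
        return "SUBHEADER"
    return "O"


examples = [
    "PART I",
    "Item 2. Properties",
    "Item 1A. Risk Factors",
    "2.1 Appointment.",
    "The Company hereby agrees.",
]
labels = [classify_heading(line) for line in examples]
for line, label in zip(examples, labels):
    print(f"{label:10s} {line}")
```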

For more information about long document splitting, see this workshop entry.

Predicted Entities

HEADER, SUBHEADER


How to use

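# Assumes the johnsnowlabs package and a valid Finance NLP license are installed
from johnsnowlabs import nlp, finance

# Start a Spark session with the licensed libraries loaded
spark = nlp.start()

documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")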
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained('finner_headers', 'en', 'finance/models')\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = ["""
2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.

2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements."""]

res = model.transform(spark.createDataFrame([text]).toDF("text"))

Results

+-----------+-----------+----------+
|      token|  ner_label|confidence|
+-----------+-----------+----------+
|          2|          O|     0.576|
|          .|          O|    0.9612|
|Definitions|B-SUBHEADER|    0.9993|
|          .|          O|    0.9755|
|        For|          O|    0.9966|
|   purposes|          O|    0.9863|
|         of|          O|    0.9878|
|       this|          O|    0.9974|
|  Agreement|          O|    0.9994|
|          ,|          O|    0.9999|
|        the|          O|       1.0|
|  following|          O|       1.0|
|      terms|          O|       1.0|
|       have|          O|       1.0|
|        the|          O|       1.0|
|   meanings|          O|       1.0|
|   ascribed|          O|       1.0|
|    thereto|          O|       1.0|
|         in|          O|       1.0|
|       this|          O|       1.0|
|    Section|          O|    0.9985|
|          1|          O|    0.9999|
|          .|          O|    0.9972|
|          2|          O|    0.9686|
|          .|          O|    0.9834|
|Appointment|B-SUBHEADER|     0.767|
|         as|I-SUBHEADER|    0.9479|
|   Reseller|I-SUBHEADER|    0.8429|
|          .|          O|    0.9944|
|        2.1|B-SUBHEADER|    0.6278|
|Appointment|I-SUBHEADER|    0.6599|
|          .|          O|    0.9972|
|        The|          O|    0.9987|
|    Company|          O|    0.9889|
|     hereby|          O|    0.9914|
|      [***]|          O|    0.9996|
|          .|          O|    0.9999|
| Allscripts|          O|    0.9843|
|        may|          O|    0.9989|
|       also|          O|    0.9967|
|   disclose|          O|    0.9949|
|  Company's|          O|    0.9976|
|    pricing|          O|    0.9999|
|information|          O|    0.9999|
|   relating|          O|    0.9999|
|         to|          O|    0.9998|
|        its|          O|    0.9992|
|   Merchant|          O|    0.9671|
| Processing|          O|    0.8411|
|   Services|          O|    0.9662|
+-----------+-----------+----------+
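The `B-` and `I-` prefixes in the table mark the first token and the continuation tokens of an entity, and `O` marks tokens outside any entity. In the pipeline, the `ner_converter` stage merges these tags into chunks. The sketch below is a simplified, hypothetical illustration of that merging (`bio_to_chunks` is not a library function), applied to a few rows taken from the table.

```python
from typing import List, Tuple


def bio_to_chunks(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (chunk_text, entity_type) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def flush() -> None:
        # Close the chunk being built, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_type):
            # A B- tag, or an I- tag without a matching open chunk, starts a new chunk.
            flush()
            current_tokens, current_type = [token], entity
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # An O tag ends whatever chunk was open.
            flush()
            current_tokens, current_type = [], None
    flush()
    return chunks


# Rows from the results table: "2 . Appointment as Reseller . 2.1 Appointment ."
tokens = ["2", ".", "Appointment", "as", "Reseller", ".", "2.1", "Appointment", "."]
labels = ["O", "O", "B-SUBHEADER", "I-SUBHEADER", "I-SUBHEADER", "O",
          "B-SUBHEADER", "I-SUBHEADER", "O"]
chunks = bio_to_chunks(tokens, labels)
print(chunks)
```

Joining tokens with a single space is a simplification; using the begin and end offsets of the first and last token recovers the exact original substring.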

Model Information

Model Name: finner_headers
Type: finance
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.3 MB

References

In-house annotations on 10-K filings

Benchmarking

label            tp      fp      fn      prec         rec          f1
I-HEADER         2835    9       8       0.996835     0.9971860    0.9970107
B-SUBHEADER      963     135     131     0.877049     0.8802559    0.87864965
I-SUBHEADER      2573    219     152     0.921561     0.9442202    0.9327533
B-HEADER         425     1       1       0.997652     0.9976526    0.9976526
Macro-average    6796    364     292     0.948274     0.9548287    0.95154047
Micro-average    6796    364     292     0.949162     0.9588036    0.9539584
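The prec, rec, and f1 columns follow from the tp, fp, and fn counts. The sketch below recomputes them with the standard definitions; the function and variable names are illustrative. It assumes, consistent with the figures in the table, that the macro-average f1 is the harmonic mean of the macro-averaged precision and recall, while the micro-average pools the counts across labels.

```python
from typing import Dict, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (tp, fp, fn) per label, copied from the benchmarking table.
counts: Dict[str, Tuple[int, int, int]] = {
    "I-HEADER": (2835, 9, 8),
    "B-SUBHEADER": (963, 135, 131),
    "I-SUBHEADER": (2573, 219, 152),
    "B-HEADER": (425, 1, 1),
}

per_label = {label: precision_recall_f1(*c) for label, c in counts.items()}

# Macro-average: mean of the per-label precision and recall, then F1 of those means.
macro_p = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

# Micro-average: pool the counts over all labels first.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_p, micro_r, micro_f1 = precision_recall_f1(total_tp, total_fp, total_fn)

for label, (p, r, f1) in per_label.items():
    print(f"{label:14s} {p:.4f} {r:.4f} {f1:.4f}")
print(f"{'Macro-average':14s} {macro_p:.4f} {macro_r:.4f} {macro_f1:.4f}")
print(f"{'Micro-average':14s} {micro_p:.4f} {micro_r:.4f} {micro_f1:.4f}")
```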