Extract Dates in Financial Documents

Description

This NER models uses bert_embeddings_sec_bert_base embeddings, trained on SEC documents and empowered with OntoNotes 2022, to extract DATES. This model is light but very accurate.

Predicted Entities

DATE

Copy S3 URI

How to use

document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokens = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

ner = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("label")

text  = "For the fiscal year ended December 31, 2021, Amazon reported a profit of ..."
df = spark.createDataFrame([text], StringType()).toDF('text')

pipeline = nlp.Pipeline(stages = [document, sentence, tokens, embeddings, ner])
fit_model = pipeline.fit(df)

res = fit_model.transform(df)

res.select(F.explode(F.arrays_zip(res.token.result, 
                                            res.label.result)).alias("cols")) \
             .select(F.expr("cols['0']").alias("token"),
                     F.expr("cols['1']").alias("label")).show(truncate=50)

Results

+--------+------+
|   token| label|
+--------+------+
|     For|     O|
|     the|B-DATE|
|  fiscal|I-DATE|
|    year|I-DATE|
|   ended|I-DATE|
|December|I-DATE|
|      31|I-DATE|
|       ,|I-DATE|
|    2021|I-DATE|
|       ,|     O|
|  Amazon|     O|
|reported|     O|
|       a|     O|
|  profit|     O|
|      of|     O|
|       .|     O|
|       .|     O|
|       .|     O|
+--------+------+

Model Information

Model Name: finner_sec_dates
Compatibility: Spark NLP for Finance 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.6 MB

References

In-house annotations on SEC 10K filings, Ontonotes 2012

Benchmarking

label	  tp	 fp	     fn	     prec	     rec	     f1
B-DATE	  3572	 278	 252	 0.9277922	 0.9341004	 0.9309356
I-DATE	  4300	 339	 245	 0.92692393	 0.94609463	 0.9364112
macro-avg 7872   617     497     0.92735803  0.9400975   0.93368435
micro-avg 7872   617     497     0.9273177   0.94061416  0.93391865