Financial NER for German Financial Statements

Description

This is a German NER model trained on German Financial Statements, aimed to extract the following entities from the documents.

Predicted Entities

financial_entity, financial_value

Copy S3 URI

How to use

documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_german_financial_statements_bert","de") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

tokenClassifier = finance.NerModel.pretrained("finner_german_financial_entities", "de", "finance/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
  documentAssembler,
    sentenceDetector,
      tokenizer,
    embeddings,
  tokenClassifier,
    ner_converter
    ]
)

import pandas as pd

p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))


text = 'Die Steuern vom Einkommen und Etrag in Hoehe von TEUR 11.621 (Vorjahr TEUR 8.915) betreffen das Umlaufvermoegen'

res = p_model.transform(spark.createDataFrame([[text]]).toDF("text"))

result_df = res.select(F.explode(F.arrays_zip(res.token.result,res.ner.result, res.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("label"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

Results

+---------------+------------------+----------+
|          token|             label|confidence|
+---------------+------------------+----------+
|            Die|                 O|    0.9998|
|        Steuern|B-financial_entity|    0.9999|
|            vom|I-financial_entity|       1.0|
|      Einkommen|I-financial_entity|       1.0|
|            und|I-financial_entity|    0.9999|
|          Etrag|I-financial_entity|       1.0|
|             in|                 O|    0.9998|
|          Hoehe|                 O|       1.0|
|            von|                 O|       1.0|
|           TEUR|                 O|       1.0|
|         11.621| B-financial_value|       1.0|
|              (|                 O|       1.0|
|        Vorjahr|                 O|       1.0|
|           TEUR|                 O|       1.0|
|          8.915|                 O|       1.0|
|              )|                 O|       1.0|
|      betreffen|                 O|       1.0|
|            das|                 O|    0.9999|
|Umlaufvermoegen|B-financial_entity|    0.9999|
+---------------+------------------+----------+

Model Information

Model Name: finner_german_financial_entities
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: de
Size: 16.3 MB

References

https://huggingface.co/datasets/fabianrausch/financial-entities-values-augmented and in house JSL corrections and data augmentation.

Benchmarking

             label  precision    recall  f1-score   support
B-financial_entity     0.9923    0.9983    0.9953      1813
 B-financial_value     1.0000    0.9920    0.9960      1369
I-financial_entity     0.9962    0.9998    0.9980      4148
                 O     0.9998    0.9989    0.9994     16197
          accuracy          -         -    0.9986     23527
         macro-avg     0.9971    0.9972    0.9971     23527
      weighted-avg     0.9986    0.9986    0.9986     23527