Financial Work Experience Section Binary Classifier

Description

This model is a Binary Classifier (True, False) for the work_experience item type of 10K Annual Reports. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only individual sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
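If you do want sentence-level predictions, a minimal sketch of the change is shown here (it assumes the pipeline and the `from johnsnowlabs import nlp` import from the "How to use" section below; the SentenceDetector stage is an illustration, not part of the published example):

# Hypothetical sentence-level variant: a SentenceDetector splits each
# document into sentences, and the encoder embeds every sentence
# separately, so each sentence gets its own work_experience / other label.
sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentence_embeddings")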

If you have large financial documents and you want to look for specific clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Finance Workshop Tokenization & Splitting Tutorial (link here), namely:

  • Paragraph splitting (by multiline);
  • Splitting by headers / subheaders;
  • etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
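As an illustration, a minimal paragraph-splitting sketch (plain Python splitting on blank lines, assuming an active Spark session; the file name is hypothetical and this is not the Workshop notebook itself) could look like:

# Split a long filing into paragraphs on blank lines so each piece
# stays within the 512-token limit of the embeddings.
with open("10k_item.txt") as f:   # hypothetical input file
    long_text = f.read()

paragraphs = [p.strip() for p in long_text.split("\n\n") if p.strip()]

# One row per paragraph, ready for the pipeline shown below.
df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")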

Predicted Entities

other, work_experience


How to use

from johnsnowlabs import nlp

# Turn the raw text column into Spark NLP documents
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Universal Sentence Encoder produces the sentence embeddings the classifier expects
useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

# Pre-trained binary classifier: work_experience vs other
docClassifier = nlp.ClassifierDLModel.pretrained("finclf_work_experience_item", "en", "finance/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    useEmbeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
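To inspect the predictions (the column name follows the setOutputCol("category") call above; this select is a usage sketch, not part of the original example):

# Show the predicted label for each input row.
result.select("text", "category.result").show(truncate=False)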

Results

+-----------------+
|           result|
+-----------------+
|[work_experience]|
|          [other]|
|          [other]|
|[work_experience]|
+-----------------+

Model Information

Model Name: finclf_work_experience_item
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [category]
Language: en
Size: 22.5 MB

References

Weak labelling on documents from the EDGAR database.

Benchmarking

          label   precision    recall  f1-score   support
          other        0.94      0.98      0.96       432
work_experience        0.91      0.79      0.85       130
       accuracy           -         -      0.93       562
      macro-avg        0.93      0.88      0.90       562
   weighted-avg        0.93      0.93      0.93       562