Description
This model is a binary classifier (`risk_factors` vs. `other`) for the Risk Factors
item type of 10-K annual reports. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only individual sentences instead of the whole text, so it is better to skip it, unless you want to do binary classification at sentence level.
If you have large financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link here), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into account that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
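As an illustration of the paragraph-splitting approach, here is a minimal plain-Python sketch (not the tutorial's exact code) that splits text on blank lines and further chunks long paragraphs under a rough word cap standing in for the 512-token limit:

```python
# Minimal sketch of paragraph splitting by multiline (blank lines).
# The 400-word cap is a rough stand-in for the 512-token embedding
# limit, not a precise token count.
def split_into_chunks(text: str, max_words: int = 400) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for paragraph in paragraphs:
        words = paragraph.split()
        # Break any paragraph that is still too long into word windows.
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

filing = "Item 1A. Risk Factors\n\nOur revenue depends on a small number of customers."
print(split_into_chunks(filing))
```

Each resulting chunk can then be passed to the pipeline below as a separate row of the input DataFrame.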
Predicted Entities
- other
- risk_factors
How to use
from johnsnowlabs import nlp

# Convert raw text into a `document` annotation
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Universal Sentence Encoder embeddings over the whole document
useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

# Pretrained binary classifier (risk_factors vs. other)
docClassifier = nlp.ClassifierDLModel.pretrained("finclf_risk_factors_item", "en", "finance/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    useEmbeddings,
    docClassifier])

# Assumes an active Spark session (`spark`)
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)
Results
+--------------+
|        result|
+--------------+
|[risk_factors]|
|       [other]|
|       [other]|
|[risk_factors]|
+--------------+
Model Information
| Model Name:    | finclf_risk_factors_item |
| Compatibility: | Finance NLP 1.0.0+       |
| License:       | Licensed                 |
| Edition:       | Official                 |
| Input Labels:  | [sentence_embeddings]    |
| Output Labels: | [category]               |
| Language:      | en                       |
| Size:          | 22.6 MB                  |
References
Weak labelling of documents from the EDGAR database
Benchmarking
label         precision  recall  f1-score  support
other         0.92       0.92    0.92      1277
risk_factors  0.91       0.92    0.91      1228
accuracy      -          -       0.92      2505
macro-avg     0.92       0.92    0.92      2505
weighted-avg  0.92       0.92    0.92      2505