10K Item Section Classifier

Description

This is a multiclass classification model that identifies which item (section) number of a 10-K filing a given passage of text belongs to.

Predicted Entities

section_1, section_2, section_3, section_7, section_8, section_10, section_12, section_13, section_14, section_15, section_1A, section_1B, section_7A, section_9A, section_9B
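
The labels above are compact identifiers, so a lookup table can make predictions easier to read. The sketch below assumes each `section_N` label corresponds to Item N of Form 10-K. The item titles are the standard Form 10-K headings, not something stated in this model card, and `ITEM_TITLES` and `describe_label` are hypothetical names introduced for illustration.

```python
# Assumed mapping from model labels to standard Form 10-K item titles.
# The model card only lists the labels; the titles are added here for readability.
ITEM_TITLES = {
    "section_1": "Item 1. Business",
    "section_1A": "Item 1A. Risk Factors",
    "section_1B": "Item 1B. Unresolved Staff Comments",
    "section_2": "Item 2. Properties",
    "section_3": "Item 3. Legal Proceedings",
    "section_7": "Item 7. Management's Discussion and Analysis",
    "section_7A": "Item 7A. Quantitative and Qualitative Disclosures About Market Risk",
    "section_8": "Item 8. Financial Statements and Supplementary Data",
    "section_9A": "Item 9A. Controls and Procedures",
    "section_9B": "Item 9B. Other Information",
    "section_10": "Item 10. Directors, Executive Officers and Corporate Governance",
    "section_12": "Item 12. Security Ownership of Certain Beneficial Owners and Management",
    "section_13": "Item 13. Certain Relationships and Related Transactions",
    "section_14": "Item 14. Principal Accountant Fees and Services",
    "section_15": "Item 15. Exhibits and Financial Statement Schedules",
}


def describe_label(label: str) -> str:
    """Return a readable item title for a predicted label, or a fallback string."""
    return ITEM_TITLES.get(label, f"Unknown label: {label}")


print(describe_label("section_1A"))
```

A helper like this would typically be applied to the predicted label after collecting results from Spark, rather than inside the pipeline itself.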

How to use

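from johnsnowlabs import nlp, finance

# Start a Spark session with the licensed Finance NLP libraries loaded
spark = nlp.start()

# Wrap the raw text column in a Spark NLP document annotation
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Pretrained BERT sequence classifier that predicts the 10-K item label
seq_classifier = finance.BertForSequenceClassification.pretrained("finclf_10k_items", "en", "finance/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, seq_classifier])

data = spark.createDataFrame([["These issues could negatively affect the timely collection of our U.S. government invoices."]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show the predicted label for each input row
result.select("class.result").show()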

Results

+------------+
|      result|
+------------+
|[section_10]|
+------------+

Model Information

Model Name:	finclf_10k_items
Compatibility:	Finance NLP 1.0.0+
License:	Licensed
Edition:	Official
Input Labels:	[document, token]
Output Labels:	[class]
Language:	en
Size:	412.2 MB
Case sensitive:	true
Max sentence length:	512
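
Because the maximum sentence length is 512, long passages may need to be split before classification. The following is a minimal sketch, assuming that input beyond the limit is truncated and that a conservative word budget keeps each chunk under 512 model tokens. `split_into_chunks` and its default `max_words=300` are hypothetical choices for illustration, since whitespace-separated words do not map one-to-one to BERT tokens.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words.

    max_words is a rough, assumed budget: the model's limit is counted in
    tokens, and one word can expand into several tokens.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


chunks = split_into_chunks("one two three four five", max_words=2)
print(chunks)
```

Each chunk could then be placed in its own row of the `text` column so that the pipeline classifies the chunks independently.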

References

Train dataset available here

Benchmarking

label         precision  recall  f1-score  support 
section_1     0.59       0.66    0.62      112     
section_10    0.73       0.72    0.72      137     
section_12    0.95       1.00    0.97      124     
section_13    0.93       0.94    0.94      212     
section_14    0.99       0.97    0.98      172     
section_15    0.91       0.84    0.87      139     
section_1A    0.85       0.86    0.85      92      
section_1B    0.70       0.64    0.67      233     
section_2     0.85       0.78    0.81      172     
section_3     0.60       0.69    0.64      224     
section_7     0.92       0.93    0.92      164     
section_7A    0.89       0.90    0.89      99      
section_8     0.80       0.97    0.88      72      
section_9A    0.91       0.93    0.92      75      
section_9B    0.77       0.63    0.69      147     
accuracy      -          -       0.81      2174    
macro-avg     0.83       0.83    0.83      2174    
weighted-avg  0.82       0.81    0.81      2174  
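
To show how the summary rows relate to the per-label rows, the sketch below recomputes the macro and weighted averages from the rounded values in the table. This is illustrative arithmetic only. Because the inputs are already rounded to two decimals, a recomputed average can differ from the reported one by about 0.01.

```python
# (label, precision, recall, f1, support), copied from the benchmark table
ROWS = [
    ("section_1", 0.59, 0.66, 0.62, 112),
    ("section_10", 0.73, 0.72, 0.72, 137),
    ("section_12", 0.95, 1.00, 0.97, 124),
    ("section_13", 0.93, 0.94, 0.94, 212),
    ("section_14", 0.99, 0.97, 0.98, 172),
    ("section_15", 0.91, 0.84, 0.87, 139),
    ("section_1A", 0.85, 0.86, 0.85, 92),
    ("section_1B", 0.70, 0.64, 0.67, 233),
    ("section_2", 0.85, 0.78, 0.81, 172),
    ("section_3", 0.60, 0.69, 0.64, 224),
    ("section_7", 0.92, 0.93, 0.92, 164),
    ("section_7A", 0.89, 0.90, 0.89, 99),
    ("section_8", 0.80, 0.97, 0.88, 72),
    ("section_9A", 0.91, 0.93, 0.92, 75),
    ("section_9B", 0.77, 0.63, 0.69, 147),
]

total_support = sum(row[4] for row in ROWS)

# Macro average: unweighted mean over labels
macro_f1 = sum(row[3] for row in ROWS) / len(ROWS)

# Weighted average: mean over labels, weighted by each label's support
weighted_f1 = sum(row[3] * row[4] for row in ROWS) / total_support

# Support-weighted recall equals overall accuracy (up to rounding of the inputs)
weighted_recall = sum(row[2] * row[4] for row in ROWS) / total_support

print(f"support={total_support}")
print(f"macro f1={macro_f1:.3f}, weighted f1={weighted_f1:.3f}, weighted recall={weighted_recall:.3f}")
```

The macro average treats every label equally, while the weighted average gives more influence to labels with larger support, such as section_1B and section_3.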
