Description
The legclf_supplemental_indenture_agreement model is a Longformer document classifier that determines whether or not a document belongs to the class supplemental-indenture (binary classification).
Longformer models are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the great majority of documents in legal corpora, provided they are clean and contain only the legal document with no extra content before it, 4096 tokens are enough to perform document classification.
If your documents are longer than 4096 tokens, you can try the following: split each document into chunks of 4096 tokens, average the chunk embeddings, and train on the averaged embeddings, so that the whole document is taken into account.
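The chunk-and-average idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the procedure used to train this model: `chunk_tokens`, `average_chunk_embeddings`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for whatever produces one pooled Longformer vector per chunk.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 4096  # Longformer input limit stated in the description


def chunk_tokens(tokens: Sequence[str], max_tokens: int = MAX_TOKENS) -> List[Sequence[str]]:
    """Split a token sequence into consecutive chunks of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def average_chunk_embeddings(
    tokens: Sequence[str],
    embed_chunk: Callable[[Sequence[str]], List[float]],
    max_tokens: int = MAX_TOKENS,
) -> List[float]:
    """Embed every chunk and return the element-wise mean of the chunk vectors."""
    chunks = chunk_tokens(tokens, max_tokens)
    if not chunks:
        raise ValueError("document has no tokens")
    vectors = [embed_chunk(chunk) for chunk in chunks]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_embed(chunk: Sequence[str]) -> List[float]:
    """Hypothetical placeholder: a real embedder would return the pooled
    Longformer embedding of the chunk instead of these toy features."""
    return [float(len(chunk)), 1.0]


# A 10,000-token document is split into chunks of 4096, 4096 and 1808 tokens.
tokens = ["tok"] * 10_000
doc_vector = average_chunk_embeddings(tokens, toy_embed)
```

Note that this sketch gives every chunk the same weight, so a short final chunk counts as much as a full one; weighting each chunk by its length is a possible alternative.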
Predicted Entities
supplemental-indenture, other
How to use
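# Assumes the johnsnowlabs package is installed with a valid Legal NLP license
from johnsnowlabs import nlp, legal

# Start (or reuse) a Spark session with the licensed libraries loaded
spark = nlp.start()

# Convert raw text into Spark NLP documents
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Token-level Longformer embeddings (only the first 4096 tokens are used)
embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

# Average the token embeddings into one vector per document
sentence_embeddings = nlp.SentenceEmbeddings()\
    .setInputCols(["document", "embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_supplemental_indenture_agreement", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    sentence_embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)

# Show the predicted label for each input row
result.select("category.result").show(truncate=False)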
Results
+------------------------+
|result                  |
+------------------------+
|[supplemental-indenture]|
|[other]                 |
|[other]                 |
|[supplemental-indenture]|
+------------------------+
Model Information
Model Name: legclf_supplemental_indenture_agreement
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: en
Size: 21.6 MB
References
Legal documents scraped from the Internet and classified in-house, plus SEC documents
Benchmarking
label                   precision  recall  f1-score  support
other                   0.97       0.95    0.96      221
supplemental-indenture  0.90       0.94    0.92      107
accuracy                -          -       0.95      328
macro-avg               0.94       0.95    0.94      328
weighted-avg            0.95       0.95    0.95      328
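As a reading aid, the two average rows can be approximately recomputed from the per-class rows. The sketch below assumes the table follows the usual classification-report convention (macro-avg is the unweighted mean over classes, weighted-avg weights each class by its support); the helper names `macro_avg` and `weighted_avg` are hypothetical. Because the per-class inputs are already rounded to two decimals, the recomputed values only agree with the table up to rounding.

```python
# Per-class metrics copied from the benchmarking table above
rows = {
    "other": {"precision": 0.97, "recall": 0.95, "f1": 0.96, "support": 221},
    "supplemental-indenture": {"precision": 0.90, "recall": 0.94, "f1": 0.92, "support": 107},
}


def macro_avg(rows: dict, metric: str) -> float:
    """Unweighted mean of a metric over all classes."""
    return sum(r[metric] for r in rows.values()) / len(rows)


def weighted_avg(rows: dict, metric: str) -> float:
    """Mean of a metric over all classes, weighted by class support."""
    total = sum(r["support"] for r in rows.values())
    return sum(r[metric] * r["support"] for r in rows.values()) / total


total_support = sum(r["support"] for r in rows.values())  # 221 + 107 = 328

for metric in ("precision", "recall", "f1"):
    print(metric, macro_avg(rows, metric), weighted_avg(rows, metric))
```

Since the "other" class has roughly twice the support of "supplemental-indenture", the weighted averages sit closer to the "other" row than the macro averages do.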