Legal Purchase Agreement Document Classifier (Longformer)

Description

The legclf_purchase_agreement model is a Legal Longformer document classifier that determines whether a document belongs to the class purchase-agreement or not (binary classification).

Longformer models are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the vast majority of documents in legal corpora, 4096 tokens are enough for document classification, provided the documents are clean and contain only the legal document itself, with no extra information before it.

If that is not the case, let us know and we can apply another approach for you: splitting the document into chunks of 4096 tokens, averaging the chunk embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, though, this should not be required.
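
The chunk-and-average approach mentioned above could look roughly like the following. This is only a minimal sketch in plain Python, not the training code behind this model: `chunk_tokens`, `average_vectors`, `embed_long_document` and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for whatever produces one embedding vector per 4096-token chunk.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 4096  # Longformer input limit stated in the description


def chunk_tokens(tokens: Sequence[str], size: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token sequence into consecutive windows of at most `size` tokens."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def average_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def embed_long_document(
    tokens: Sequence[str],
    embed_chunk: Callable[[List[str]], List[float]],
) -> List[float]:
    """Embed every chunk separately, then average the chunk embeddings."""
    chunks = chunk_tokens(tokens)
    return average_vectors([embed_chunk(chunk) for chunk in chunks])


def toy_embed(chunk: List[str]) -> List[float]:
    # Hypothetical stand-in for a real embedding model:
    # returns the chunk length and a constant, just to make the flow visible.
    return [float(len(chunk)), 1.0]


if __name__ == "__main__":
    tokens = ["tok"] * 10000  # a document longer than one 4096-token window
    print([len(c) for c in chunk_tokens(tokens)])
    print(embed_long_document(tokens, toy_embed))
```

Note that this sketch gives every chunk the same weight, so a short final chunk counts as much as a full one; weighting by chunk length would be an equally plausible choice.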

Predicted Entities

other, purchase-agreement

How to use

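from johnsnowlabs import nlp

# Start a Spark session with the licensed libraries loaded
spark = nlp.start()

# Wrap the raw text column into Spark NLP document annotations
documentAssembler = nlp.DocumentAssembler() \
     .setInputCol("text") \
     .setOutputCol("document")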

tokenizer = nlp.Tokenizer()\
     .setInputCols(["document"])\
     .setOutputCol("token")

embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

sembeddings = nlp.SentenceEmbeddings()\
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_purchase_agreement", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")
    
nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    tokenizer,
    embeddings,
    sembeddings,
    docClassifier])
 
df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")
model = nlpPipeline.fit(df)
result = model.transform(df)

Results

+--------------------+
|              result|
+--------------------+
|[purchase-agreement]|
|             [other]|
|             [other]|
|[purchase-agreement]|
+--------------------+

Model Information

Model Name: legclf_purchase_agreement
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: en
Size: 21.0 MB

References

Legal documents scraped from the Internet and classified in-house, plus SEC documents

Benchmarking

             label  precision    recall  f1-score   support
             other       0.97      1.00      0.98        62
purchase-agreement       1.00      0.93      0.97        30
          accuracy          -         -      0.98        92
         macro-avg       0.98      0.97      0.97        92
      weighted-avg       0.98      0.98      0.98        92
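
The rows of the table above can be cross-checked with a little arithmetic. The sketch below is only an illustration: the source reports rounded metrics, not a confusion matrix, so the true-positive and false-positive counts are assumptions inferred from the table (28 of 30 purchase agreements recovered, 2 of them predicted as other). All function and variable names are hypothetical.

```python
# Support values are taken from the table; TRUE_POS and FALSE_POS are
# assumed counts that are consistent with the rounded figures, not reported data.
SUPPORT = {"other": 62, "purchase-agreement": 30}
TRUE_POS = {"other": 62, "purchase-agreement": 28}
FALSE_POS = {"other": 2, "purchase-agreement": 0}


def per_class_metrics(label):
    """Return (precision, recall, f1) for one label."""
    tp, fp, support = TRUE_POS[label], FALSE_POS[label], SUPPORT[label]
    precision = tp / (tp + fp)
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def summary():
    """Return per-class rows, macro average, weighted average and accuracy."""
    total = sum(SUPPORT.values())
    rows = {label: per_class_metrics(label) for label in SUPPORT}
    # Macro average: plain mean over classes
    macro = tuple(sum(r[i] for r in rows.values()) / len(rows) for i in range(3))
    # Weighted average: mean over classes weighted by support
    weighted = tuple(
        sum(rows[label][i] * SUPPORT[label] for label in rows) / total
        for i in range(3)
    )
    accuracy = sum(TRUE_POS.values()) / total
    return rows, macro, weighted, accuracy


if __name__ == "__main__":
    rows, macro, weighted, accuracy = summary()
    for label, values in rows.items():
        print(label, [round(v, 2) for v in values])
    print("accuracy", round(accuracy, 2))
    print("macro-avg", [round(v, 2) for v in macro])
    print("weighted-avg", [round(v, 2) for v in weighted])
```

Under these assumed counts, every value rounds to the figure shown in the table, which suggests the reported averages are the usual unweighted (macro) and support-weighted means of the per-class scores.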