Legal Stock Option Agreement Document Binary Classifier (Longformer)

Description

The legclf_stock_option_agreement model is a Longformer Document Classifier that determines whether a document belongs to the class stock-option-agreement or not (Binary Classification).

Longformer models are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the large majority of documents in legal corpora, provided they are clean and contain only the legal document with no extra information before it, 4096 tokens are enough to perform Document Classification.

If your document is longer than 4096 tokens, you can try the following: split it into chunks of 4096 tokens, average the chunk embeddings, and train on the averaged embeddings, so that the whole document is taken into account.
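The chunk-and-average idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's API: chunk_tokens, average_vectors and embed_chunk are hypothetical helpers, and embed_chunk is a toy stand-in for whatever embedding model you actually use.

```python
from typing import List, Sequence

MAX_TOKENS = 4096  # Longformer input limit stated above


def chunk_tokens(tokens: Sequence[str], size: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def average_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors (one vector per chunk)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def embed_chunk(chunk: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for a real embedding model: returns a toy 2-d vector."""
    return [float(len(chunk)), float(len(set(chunk)))]


def embed_document(tokens: Sequence[str]) -> List[float]:
    """Embed every chunk, then average, so the whole document contributes."""
    chunks = chunk_tokens(tokens)
    return average_vectors([embed_chunk(c) for c in chunks])


if __name__ == "__main__":
    doc_tokens = ["tok"] * 10000  # a document longer than the 4096-token limit
    print([len(c) for c in chunk_tokens(doc_tokens)])
    print(embed_document(doc_tokens))
```

Note that a plain mean gives a short final chunk the same weight as a full one; weighting each chunk by its token count is a possible alternative.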

Predicted Entities

stock-option-agreement, other


How to use

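from johnsnowlabs import nlp, legal

# Start a Spark session with the licensed Legal NLP library loaded
spark = nlp.start()

# Wrap the raw text column into Spark NLP document annotations
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Token-level Longformer embeddings (only the first 4096 tokens are used)
embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")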

sentence_embeddings = nlp.SentenceEmbeddings()\
    .setInputCols(["document", "embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_stock_option_agreement", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    sentence_embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

model = nlpPipeline.fit(df)

result = model.transform(df)

Results

+------------------------+
|result                  |
+------------------------+
|[stock-option-agreement]|
|[other]                 |
|[other]                 |
|[stock-option-agreement]|
+------------------------+

Model Information

Model Name:	legclf_stock_option_agreement
Compatibility:	Legal NLP 1.0.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence_embeddings]
Output Labels:	[class]
Language:	en
Size:	21.4 MB

References

Legal documents scraped from the Internet and classified in-house, plus SEC documents

Benchmarking

                  label    precision    recall    f1-score    support 
                  other         0.99      0.99        0.99        202 
 stock-option-agreement         0.98      0.98        0.98        100 
               accuracy            -         -        0.99        302 
              macro-avg         0.99      0.99        0.99        302 
           weighted-avg         0.99      0.99        0.99        302
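
As a rough illustration of how the macro-avg and weighted-avg rows are defined, the sketch below recomputes them from the per-class rows of the table. The helper names macro_avg and weighted_avg are introduced here for illustration only, and because the inputs are already rounded to two decimals, the recomputed values (about 0.985 and 0.987 for f1-score) only approximate the reported figures.

```python
# Per-class scores copied from the table above: (precision, recall, f1-score, support)
scores = {
    "other": (0.99, 0.99, 0.99, 202),
    "stock-option-agreement": (0.98, 0.98, 0.98, 100),
}

total_support = sum(row[3] for row in scores.values())


def macro_avg(column: int) -> float:
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[column] for row in scores.values()) / len(scores)


def weighted_avg(column: int) -> float:
    """Mean over classes weighted by support (number of test documents)."""
    return sum(row[column] * row[3] for row in scores.values()) / total_support


macro_f1 = macro_avg(2)
weighted_f1 = weighted_avg(2)

if __name__ == "__main__":
    print(f"total support: {total_support}")
    print(f"macro-avg f1:    {macro_f1:.3f}")
    print(f"weighted-avg f1: {weighted_f1:.3f}")
```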
