Legal NER - License Grant Clauses (Md, Lighter version)

Description

This model detects license grants and permissions in agreements, given by a subject (PERMISSION_SUBJECT) to a recipient (PERMISSION_INDIRECT_OBJECT). The permission itself is tagged as PERMISSION.

This is the md (medium) version of the classifier, trained with more data and more resistant to false positives.

It also differs from other permission models in that it is lighter and not transformer-based.

Predicted Entities

PERMISSION, PERMISSION_SUBJECT, PERMISSION_OBJECT, PERMISSION_INDIRECT_OBJECT


How to use

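from johnsnowlabs import nlp, legal

# Start a Spark session with the licensed Legal NLP libraries loaded
spark = nlp.start()

documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")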
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained('legner_grants_md', 'en', 'legal/models')\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

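nlpPipeline = nlp.Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter])

import pandas as pd

# Fit on an empty DataFrame: all stages are pretrained or rule-based, so no training data is needed
p_model = nlpPipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))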

text = """Fox grants to Licensee a limited, exclusive (except as otherwise may be provided in this Agreement), 
non-transferable (except as permitted in Paragraph 17(d)) right and license"""

res = p_model.transform(spark.createDataFrame([[text]]).toDF("text"))

from pyspark.sql import functions as F

# Pair each token with its predicted tag; the NER stage writes to the "ner" column
res.select(F.explode(F.arrays_zip('token.result', 'ner.result')).alias("cols")) \
               .select(F.expr("cols['0']").alias("token"),
                       F.expr("cols['1']").alias("ner_label"))\
               .show(20, truncate=100)

Results

+----------------+----------------------------+
|           token|                   ner_label|
+----------------+----------------------------+
|             Fox|        B-PERMISSION_SUBJECT|
|          grants|                           O|
|              to|                           O|
|        Licensee|B-PERMISSION_INDIRECT_OBJECT|
|               a|                           O|
|         limited|                B-PERMISSION|
|               ,|                I-PERMISSION|
|       exclusive|                I-PERMISSION|
|               (|                I-PERMISSION|
|          except|                I-PERMISSION|
|              as|                I-PERMISSION|
|       otherwise|                I-PERMISSION|
|             may|                I-PERMISSION|
|              be|                I-PERMISSION|
|        provided|                I-PERMISSION|
|              in|                I-PERMISSION|
|            this|                I-PERMISSION|
|       Agreement|                I-PERMISSION|
|              ),|                I-PERMISSION|
|non-transferable|                I-PERMISSION|
+----------------+----------------------------+
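The token-level tags above follow the BIO scheme: a B- tag opens an entity and the following I- tags of the same type continue it. As a rough sketch of how such tags collapse into entity chunks, the plain-Python helper below groups them outside Spark. The function name `bio_to_chunks` is hypothetical and not part of the library, and joining tokens with single spaces is a simplification; in the pipeline itself the `ner_chunk` column produced by the NerConverter stage already holds the merged chunks.

```python
from typing import List, Optional, Tuple


def bio_to_chunks(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, chunk_text) pairs.

    Hypothetical helper for illustration only; tokens are joined with a
    single space, which does not restore the original spacing.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_entity: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if any
        if current_tokens and current_entity is not None:
            chunks.append((current_entity, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_entity):
            # A new entity starts (an orphan I- tag is treated as a start)
            flush()
            current_tokens, current_entity = [token], entity
        elif prefix == "I":
            # Continuation of the entity that is currently open
            current_tokens.append(token)
        else:
            # "O" tag: close whatever entity was open
            flush()
            current_tokens, current_entity = [], None
    flush()
    return chunks


# First eight rows of the results table
tokens = ["Fox", "grants", "to", "Licensee", "a", "limited", ",", "exclusive"]
labels = [
    "B-PERMISSION_SUBJECT", "O", "O", "B-PERMISSION_INDIRECT_OBJECT",
    "O", "B-PERMISSION", "I-PERMISSION", "I-PERMISSION",
]
chunks = bio_to_chunks(tokens, labels)
for entity, text in chunks:
    print(f"{entity}: {text}")
```

On these rows the helper yields one chunk each for the subject (Fox), the recipient (Licensee), and the start of the permission span.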

Model Information

Model Name: legner_grants_md
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.1 MB

References

Manual annotations on CUAD dataset

Benchmarking

label	 tp	 fp	 fn	 prec	 rec	 f1
I-PERMISSION	 111	 28	 37	 0.79856116	 0.75	 0.7735192
B-PERMISSION	 12	 3	 2	 0.8	 0.85714287	 0.82758623
B-PERMISSION_INDIRECT_OBJECT	 10	 1	 5	 0.90909094	 0.6666667	 0.7692308
B-PERMISSION_SUBJECT	 9	 1	 5	 0.9	 0.64285713	 0.74999994
Macro-average	 142	 33	 52	 0.68153036	 0.5833334	 0.72862015
Micro-average	 142	 33	 52	 0.81142855	 0.73195875	 0.76964766
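The prec, rec and f1 columns follow from the tp, fp and fn counts through the standard definitions: precision = tp / (tp + fp), recall = tp / (tp + fn), and F1 as their harmonic mean. The sketch below recomputes the per-label rows and the Micro-average row from the counts in the table. The helper name `prf` is an assumption made for illustration, and the micro-average totals are taken as reported in the table rather than re-summed.

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# (tp, fp, fn) per label, copied from the benchmarking table
counts: Dict[str, Tuple[int, int, int]] = {
    "I-PERMISSION": (111, 28, 37),
    "B-PERMISSION": (12, 3, 2),
    "B-PERMISSION_INDIRECT_OBJECT": (10, 1, 5),
    "B-PERMISSION_SUBJECT": (9, 1, 5),
}

per_label = {label: prf(*c) for label, c in counts.items()}

# Totals as reported in the Micro-average row
micro = prf(142, 33, 52)

for label, (p, r, f) in per_label.items():
    print(f"{label}\t{p:.4f}\t{r:.4f}\t{f:.4f}")
print("Micro-average\t{:.4f}\t{:.4f}\t{:.4f}".format(*micro))
```

Micro-averaging pools the counts before computing the ratios, so frequent tags such as I-PERMISSION weigh more heavily than in a macro-average, which averages the per-label scores.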