Legal NER (Parties, Dates, Alias, Former names, Document Type - lg)

Description

IMPORTANT: Don’t run this model on the whole legal agreement. Instead:

Split the agreement into paragraphs. You can use notebook 1 in Finance or Legal as inspiration;
Use the legclf_introduction_clause Text Classifier to select only the introduction paragraphs, and run this model on those.
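
The paragraph-splitting step can be sketched in plain Python. This is only an illustration: the `split_paragraphs` helper and the sample text are hypothetical, and it assumes paragraphs are separated by blank lines, which may not hold for every agreement.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split raw text on blank lines and normalise whitespace inside each piece."""
    pieces = re.split(r"\n\s*\n", text)
    # Collapse line breaks and repeated spaces; drop empty pieces
    return [" ".join(p.split()) for p in pieces if p.strip()]


# Hypothetical sample text, used only to exercise the helper
agreement = """SAMPLE AGREEMENT

This SAMPLE AGREEMENT is entered into
by and between Party A and Party B.

1. Definitions. Placeholder clause text.
"""

paragraphs = split_paragraphs(agreement)
for i, paragraph in enumerate(paragraphs):
    print(i, paragraph)
```

Each resulting paragraph would then be passed to the legclf_introduction_clause classifier, and only the paragraphs it selects would be sent to this NER model.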

This is a Legal NER model designed to process the first page of an agreement, where the following information can usually be found:

Parties of the contract/agreement;
Their former names;
Aliases of those parties, or how those parties will be called further on in the document;
Document Type;
Effective Date of the agreement;
Other organizations.

This model can be used together with its Relation Extraction model, legre_contract_doc_parties, to retrieve the relations between these entities.

Predicted Entities

PARTY, EFFDATE, DOC, ALIAS, ORG, FORMER_NAME


How to use

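from johnsnowlabs import nlp, legal

# Start a Spark session with the licensed John Snow Labs libraries loaded
spark = nlp.start()

# Convert raw text into Spark NLP document annotations
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
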
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
        .setInputCols("sentence", "token") \
        .setOutputCol("embeddings")\
        .setMaxSentenceLength(512)\
        .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties_lg', 'en', 'legal/models')\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = ["""
INTELLECTUAL PROPERTY AGREEMENT

This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""]

res = model.transform(spark.createDataFrame([text]).toDF("text"))
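
The NER model emits one IOB tag per token (for example B-PARTY, I-PARTY), and the converter stage merges them into chunks. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `iob_to_chunks` is a hypothetical helper, the token tags are made up for illustration, and tokens are joined with spaces, whereas a real converter would presumably rebuild the chunk from character offsets.

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk; close the open one first
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with the same label continues the open chunk
            current.append(token)
        else:
            # "O" or an I- tag without a matching open chunk ends the chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Illustrative tags only, not actual model output
tokens = ["Seller", "and", "AFI", "Licensing", "LLC"]
tags = ["B-ALIAS", "O", "B-PARTY", "I-PARTY", "I-PARTY"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```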

Results

+-----------------------------------+-----------+
|chunk                              |ner_label  |
+-----------------------------------+-----------+
|INTELLECTUAL PROPERTY AGREEMENT    |DOC        |
|December 31, 2018                  |EFFDATE    |
|Armstrong Flooring, Inc            |PARTY      |
|Seller                             |ALIAS      |
|AFI Licensing LLC                  |PARTY      |
|Licensing                          |ALIAS      |
|Seller                             |PARTY      |
|Arizona                            |ALIAS      |
|AHF Holding, Inc.                  |ORG        |
|Tarzan HoldCo, Inc                 |FORMER_NAME|
|Buyer                              |ALIAS      |
|Armstrong Hardwood Flooring Company|PARTY      |
|Company                            |ALIAS      |
|Buyer                              |PARTY      |
|Buyer Entities                     |ALIAS      |
|Arizona                            |PARTY      |
|Buyer Entities                     |PARTY      |
|Party                              |ALIAS      |
|Parties                            |ALIAS      |
+-----------------------------------+-----------+
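
For downstream use it is often convenient to group the extracted chunks by label. The following is a minimal sketch that assumes the (chunk, ner_label) pairs have already been collected into a Python list; the `rows` list copies a few rows from the table above, and `by_label` is a hypothetical name.

```python
from collections import defaultdict

# A subset of the (chunk, ner_label) rows from the results table
rows = [
    ("INTELLECTUAL PROPERTY AGREEMENT", "DOC"),
    ("December 31, 2018", "EFFDATE"),
    ("Armstrong Flooring, Inc", "PARTY"),
    ("Seller", "ALIAS"),
    ("AFI Licensing LLC", "PARTY"),
    ("AHF Holding, Inc.", "ORG"),
    ("Tarzan HoldCo, Inc", "FORMER_NAME"),
]

# Collect chunks under their label, preserving document order
by_label = defaultdict(list)
for chunk, label in rows:
    by_label[label].append(chunk)

for label, chunks in by_label.items():
    print(label, chunks)
```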

Model Information

Model Name:	legner_contract_doc_parties_lg
Compatibility:	Legal NLP 1.0.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	16.3 MB

References

Manual annotations on CUAD dataset

Benchmarking

label          precision  recall  f1-score  support 
B-ALIAS        0.95       0.95    0.95      193     
B-DOC          0.87       0.85    0.86      118     
I-DOC          0.92       0.83    0.87      245     
B-PARTY        0.83       0.79    0.81      246     
I-PARTY        0.90       0.88    0.89      630     
B-ORG          0.91       0.84    0.87      207     
I-ORG          0.93       0.87    0.90      355     
I-ALIAS        0.77       0.83    0.80      29      
B-EFFDATE      0.91       0.91    0.91      81      
I-EFFDATE      0.95       0.97    0.96      261     
B-FORMER_NAME  0.97       1.00    0.99      39      
I-FORMER_NAME  0.99       1.00    0.99      93      
micro-avg      0.91       0.88    0.90      2497    
macro-avg      0.91       0.89    0.90      2497    
weighted-avg   0.91       0.88    0.90      2497    
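
As a sanity check on how to read this table, the macro averages and the total support can be recomputed from the per-label rows. The sketch below uses the rounded values as printed, so a recomputed per-label F1 may differ from the table in the last digit; the variable names are illustrative.

```python
# (label, precision, recall, f1, support) rows copied from the table above
rows = [
    ("B-ALIAS", 0.95, 0.95, 0.95, 193),
    ("B-DOC", 0.87, 0.85, 0.86, 118),
    ("I-DOC", 0.92, 0.83, 0.87, 245),
    ("B-PARTY", 0.83, 0.79, 0.81, 246),
    ("I-PARTY", 0.90, 0.88, 0.89, 630),
    ("B-ORG", 0.91, 0.84, 0.87, 207),
    ("I-ORG", 0.93, 0.87, 0.90, 355),
    ("I-ALIAS", 0.77, 0.83, 0.80, 29),
    ("B-EFFDATE", 0.91, 0.91, 0.91, 81),
    ("I-EFFDATE", 0.95, 0.97, 0.96, 261),
    ("B-FORMER_NAME", 0.97, 1.00, 0.99, 39),
    ("I-FORMER_NAME", 0.99, 1.00, 0.99, 93),
]


def mean(values):
    """Unweighted mean, which is what a macro average is."""
    values = list(values)
    return sum(values) / len(values)


macro_precision = mean(r[1] for r in rows)
macro_recall = mean(r[2] for r in rows)
macro_f1 = mean(r[3] for r in rows)
total_support = sum(r[4] for r in rows)

# F1 is the harmonic mean of precision and recall, shown here for B-PARTY
p, r = 0.83, 0.79
f1_b_party = 2 * p * r / (p + r)

print(round(macro_precision, 2), round(macro_recall, 2), round(macro_f1, 2))
print(total_support, round(f1_b_party, 2))
```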
