NER on Legal Texts (CUAD, Silver corpus)

Description

This is a Legal Named Entity Recognition (NER) model trained on a Silver version of the CUAD dataset. We call a corpus “Silver” when its labels come from automatic labelling algorithms, rules, vocabularies, patterns, and some predefined annotations.

The entities included are:
PERSON: A person
LAW: A mentioned law
PARTY: A party signing the agreement
EFFDATE: The effective date of the agreement
LOC: A mentioned location
DATE: Any other date (not EFFDATE)
DOC: The type of the document
ORDINAL: An ordinal number
ROLE: A role of a person or party
PERCENT: A percentage
ORG: A generic tag for detecting organizations
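To make the idea of a "Silver" corpus concrete, here is a minimal sketch of rule-based labelling using a vocabulary and a regular expression. The vocabularies, the pattern, and the function name `silver_label` are hypothetical and chosen only for illustration; they are not the rules used to build the Silver CUAD corpus.

```python
import re

# Hypothetical rules for illustration only (not the actual Silver CUAD rules).
DOC_VOCAB = {"AGREEMENT", "CONTRACT", "LEASE"}   # assumed document-type words
ROLE_VOCAB = {"BORROWER", "LENDER"}              # assumed role words
PERCENT_RE = re.compile(r"^\d+(\.\d+)?%$")       # pattern rule for percentages


def silver_label(tokens):
    """Assign one label per token using rules only; 'O' means no rule fired."""
    labels = []
    for tok in tokens:
        if PERCENT_RE.match(tok):
            labels.append("PERCENT")
        elif tok.upper() in ROLE_VOCAB:
            labels.append("ROLE")
        elif tok.upper() in DOC_VOCAB:
            labels.append("DOC")
        else:
            labels.append("O")
    return labels


tokens = "The Borrower shall pay 5% interest under this Agreement".split()
labels = silver_label(tokens)
for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```

Labels produced this way are cheap to obtain but can be noisy, which is why models trained on expert-annotated (Golden) data are listed separately.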

You can find several models trained on Golden versions of this dataset (annotated by our JSL in-house domain experts) in Models Hub, under the Legal library.

Predicted Entities

PERSON, LAW, PARTY, EFFDATE, LOC, DATE, DOC, ORDINAL, ROLE, PERCENT, ORG

How to use

from johnsnowlabs import nlp, legal

# Start a Spark session (assumes the johnsnowlabs package is installed and licensed)
spark = nlp.start()

# Turn raw text into Spark NLP documents
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences, one row per sentence
sentencizer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")\
    .setExplodeSentences(True)

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("token")

# Word embeddings used as input features for the NER model
embeddings = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "en")\
    .setInputCols(["sentences", "token"])\
    .setOutputCol("embeddings")

jsl_ner = legal.NerModel.pretrained("legner_cuad_silver", "en", "legal/models")\
    .setInputCols(["sentences", "token", "embeddings"])\
    .setOutputCol("jsl_ner")

# Merge token-level tags into entity chunks
jsl_ner_converter = nlp.NerConverter()\
    .setInputCols(["sentences", "token", "jsl_ner"])\
    .setOutputCol("ner_chunk")

jsl_ner_pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    sentencizer,
    tokenizer,
    embeddings,
    jsl_ner,
    jsl_ner_converter])

text = """December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender"."""

df = spark.createDataFrame([[text]]).toDF("text")

model = jsl_ner_pipeline.fit(df)
res = model.transform(df)

Results

+------------+---------+----------+
|       token|ner_label|confidence|
+------------+---------+----------+
|    December|   B-DATE|    0.4111|
|        2007|   B-DATE|    0.7867|
|SUBORDINATED|        O|    0.5373|
|        LOAN|    B-DOC|    0.9998|
|   AGREEMENT|    I-DOC|    0.8615|
|           .|        O|    0.9695|
|        THIS|        O|    0.9977|
|        LOAN|    B-DOC|    0.9995|
|   AGREEMENT|    I-DOC|    0.9982|
|          is|        O|    0.8592|
|        made|        O|    0.9975|
|          on|        O|    0.9906|
|         7th|   B-DATE|    0.7804|
|    December|   B-DATE|    0.6701|
|           ,|   B-DATE|    0.5395|
|        2007|   B-DATE|    0.5327|
|     BETWEEN|        O|    0.9771|
|           :|        O|    0.9497|
|           (|        O|    0.7493|
|           1|        O|    0.9081|
|           )|        O|    0.4178|
|    SILICIUM|    B-ORG|    0.6731|
|          DE|    B-ORG|    0.3681|
|    PROVENCE|    B-ORG|    0.5065|
|       S.A.S|    B-ORG|    0.8924|
|          .,|        O|    0.7006|
|           a|        O|    0.9722|
|     private|        O|    0.9938|
|     company|        O|    0.9982|
|        with|        O|    0.9958|
|     limited|        O|     0.981|
|   liability|        O|    0.9994|
|           ,|        O|    0.9933|
|incorporated|        O|    0.9997|
|       under|        O|    0.9597|
|         the|        O|    0.9833|
|        laws|        O|    0.9969|
|          of|        O|    0.7129|
|      France|    B-LOC|    0.8789|
+------------+---------+----------+
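The labels in the table use the BIO scheme: a `B-` prefix opens an entity and an `I-` prefix continues it. The sketch below shows one strict way to merge such tags into chunks, using a few rows copied from the table. The function name `decode_bio` is hypothetical, and this is a plain-Python illustration of the scheme, not the implementation behind `NerConverter`.

```python
def decode_bio(tokens, tags):
    """Merge token-level BIO tags into (text, label) chunks.

    A B- tag always starts a new chunk; an I- tag extends the open chunk
    only if the label matches. A stray I- tag is treated as a chunk start.
    """
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and current and name == label:
            current.append(tok)          # continue the open chunk
            continue
        if current:                      # close whatever chunk was open
            chunks.append((" ".join(current), label))
            current, label = [], None
        if prefix in ("B", "I"):         # open a new chunk
            current, label = [tok], name
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Rows copied from the results table above
tokens = ["THIS", "LOAN", "AGREEMENT", "is", "made", "on", "7th", "December"]
tags = ["O", "B-DOC", "I-DOC", "O", "O", "O", "B-DATE", "B-DATE"]

chunks = decode_bio(tokens, tags)
print(chunks)
```

Under this strict reading, consecutive `B-` tags such as those on "7th" and "December" yield separate one-token chunks rather than a single date.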

Model Information

Model Name: legner_cuad_silver
Type: legal
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 15.0 MB

References

Manual rules, patterns, weak labelling, and pre-annotations from in-house models and from the CUAD dataset.

Benchmarking

label             tp       fp    fn    prec        rec          f1
B-PERSON          89       11    11    0.89        0.89         0.89
B-LAW             759      111   148   0.8724138   0.8368247    0.8542487
I-PARTY           8632     47    23    0.9945846   0.9973426    0.9959617
B-EFFDATE         9        1     4     0.9         0.6923077    0.7826087
B-LOC             372      76    61    0.83035713  0.8591224    0.8444949
B-DATE            1020     104   102   0.9074733   0.90909094   0.9082814
B-DOC             1370     36    12    0.97439545  0.9913169    0.9827834
I-EFFDATE         14       0     0     1.0         1.0          1.0
I-DOC             2227     49    0     0.978471    1.0          0.98911834
B-ORDINAL         99       11    15    0.9         0.8684211    0.8839286
B-ROLE            228      6     0     0.974359    1.0          0.987013
B-PERCENT         34       4     0     0.8947368   1.0          0.9444445
B-ORG             1992     478   624   0.8064777   0.7614679    0.7833268
B-PARTY           2275     39    82    0.9831461   0.96521      0.97409546
Macro-average     19120    973   1082  0.92188674  0.9122217    0.9170287
Micro-average     19120    973   1082  0.95157516  0.94644094   0.9490011
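The precision, recall, and F1 columns follow from the tp, fp, and fn counts. The sketch below recomputes them from the counts in the table; the helper name `prf` is hypothetical. The macro F1 is taken here as the harmonic mean of macro precision and macro recall, an assumption that happens to reproduce the table's value.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts, guarding zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (tp, fp, fn) per label, copied from the benchmarking table
counts = {
    "B-PERSON": (89, 11, 11),
    "B-LAW": (759, 111, 148),
    "I-PARTY": (8632, 47, 23),
    "B-EFFDATE": (9, 1, 4),
    "B-LOC": (372, 76, 61),
    "B-DATE": (1020, 104, 102),
    "B-DOC": (1370, 36, 12),
    "I-EFFDATE": (14, 0, 0),
    "I-DOC": (2227, 49, 0),
    "B-ORDINAL": (99, 11, 15),
    "B-ROLE": (228, 6, 0),
    "B-PERCENT": (34, 4, 0),
    "B-ORG": (1992, 478, 624),
    "B-PARTY": (2275, 39, 82),
}

per_label = {label: prf(*c) for label, c in counts.items()}

# Micro-average: pool the counts over all labels, then compute the metrics once
totals = [sum(c[i] for c in counts.values()) for i in range(3)]
micro = prf(*totals)

# Macro-average: unweighted mean of per-label precision and recall
macro_p = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

print("micro:", micro)
print("macro:", (macro_p, macro_r, macro_f1))
```

The micro-average is dominated by frequent labels such as I-PARTY, while the macro-average weights every label equally, which is why rarer, harder labels like B-EFFDATE and B-ORG pull the macro figures lower.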