Legal NER for NDA (Definition of Confidential Information Clauses)

Description

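This NER model is intended to run only on clauses that a suitable classifier has already identified as DEF_OF_CONF_INFO (for example, the legmulticlf_mnda_sections_paragraph_other model). It extracts two entity types: CONF_INFO_FORM (the form in which the information is conveyed, such as "written" or "oral") and CONF_INFO_TYPE (the kind of information covered, such as "reports" or "trade secrets").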
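The classify-then-extract workflow described above can be sketched in a library-agnostic way. This is only an illustration: `extract_from_definition_clauses`, `toy_classify`, and `toy_extract` are hypothetical names, the toy functions are keyword-based stand-ins for the real classifier and NER pipelines, and the second paragraph is made-up sample text.

```python
from typing import Callable, Dict, List

# Clause label named in the description above
TARGET_LABEL = "DEF_OF_CONF_INFO"


def extract_from_definition_clauses(
    paragraphs: List[str],
    classify: Callable[[str], str],
    extract: Callable[[str], List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Run `extract` only on paragraphs that `classify` labels as TARGET_LABEL."""
    entities: List[Dict[str, str]] = []
    for paragraph in paragraphs:
        if classify(paragraph) != TARGET_LABEL:
            continue  # skip clauses outside the NER model's intended scope
        entities.extend(extract(paragraph))
    return entities


# Hypothetical stand-ins so the sketch runs without the licensed models.
def toy_classify(paragraph: str) -> str:
    marker = '"Confidential Information" shall mean'
    return TARGET_LABEL if marker in paragraph else "OTHER"


def toy_extract(paragraph: str) -> List[Dict[str, str]]:
    forms = ("written", "oral")
    words = paragraph.split()
    return [{"chunk": w, "ner_label": "CONF_INFO_FORM"} for w in forms if w in words]


paragraphs = [
    '"Confidential Information" shall mean all written or oral information of a proprietary nature.',
    "This Agreement may be executed in counterparts.",  # made-up non-definition clause
]

found = extract_from_definition_clauses(paragraphs, toy_classify, toy_extract)
print(found)
```

In a real deployment the two callables would wrap the fitted classifier and NER pipelines; the gating logic itself stays the same.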

Predicted Entities

CONF_INFO_FORM, CONF_INFO_TYPE


How to use

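from johnsnowlabs import nlp, legal

# Start a Spark session with the Legal NLP libraries loaded
# (assumes the johnsnowlabs package is installed and a valid license is configured)
spark = nlp.start()

# Convert raw text into document annotations
document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")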
        
sentence_detector = nlp.SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")\
        .setMaxSentenceLength(512)\
        .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained("legner_nda_def_conf_info", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter
])

# Every stage is pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = [""""Confidential Information" shall mean all written or oral information of a proprietary, intellectual, or similar nature relating to GT Solar's business, projects, operations, activities, or affairs whether of a technical or financial nature or otherwise (including, without limitation, reports, financial information, business plans and proposals, ideas, concepts, trade secrets, know-how, processes, and other technical or business information, whether concerning GT Solar' businesses or otherwise) which has not been publicly disclosed and which the Recipient acquires directly or indirectly from GT Solar, its officers, employees, affiliates, agents or representatives."""]

result = model.transform(spark.createDataFrame([text]).toDF("text"))


Results

+-------------+--------------+
|chunk        |ner_label     |
+-------------+--------------+
|written      |CONF_INFO_FORM|
|oral         |CONF_INFO_FORM|
|reports      |CONF_INFO_TYPE|
|information  |CONF_INFO_TYPE|
|plans        |CONF_INFO_TYPE|
|proposals    |CONF_INFO_TYPE|
|ideas        |CONF_INFO_TYPE|
|concepts     |CONF_INFO_TYPE|
|trade secrets|CONF_INFO_TYPE|
|know-how     |CONF_INFO_TYPE|
|processes    |CONF_INFO_TYPE|
|information  |CONF_INFO_TYPE|
+-------------+--------------+
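For downstream use it is often handy to group the extracted chunks by label. The following is a minimal sketch that works on the (chunk, ner_label) pairs from the table above, copied by hand into a plain Python list; `group_by_label` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict

# (chunk, ner_label) pairs copied from the results table above
rows = [
    ("written", "CONF_INFO_FORM"),
    ("oral", "CONF_INFO_FORM"),
    ("reports", "CONF_INFO_TYPE"),
    ("information", "CONF_INFO_TYPE"),
    ("plans", "CONF_INFO_TYPE"),
    ("proposals", "CONF_INFO_TYPE"),
    ("ideas", "CONF_INFO_TYPE"),
    ("concepts", "CONF_INFO_TYPE"),
    ("trade secrets", "CONF_INFO_TYPE"),
    ("know-how", "CONF_INFO_TYPE"),
    ("processes", "CONF_INFO_TYPE"),
    ("information", "CONF_INFO_TYPE"),
]


def group_by_label(pairs):
    """Group chunks by entity label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        if chunk not in grouped[label]:
            grouped[label].append(chunk)
    return dict(grouped)


grouped = group_by_label(rows)
for label, chunks in grouped.items():
    print(f"{label}: {', '.join(chunks)}")
```

Dropping repeats is a design choice for this sketch: "information" is extracted twice from the sample clause, and keeping both occurrences may be preferable when character offsets matter.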

Model Information

Model Name: legner_nda_def_conf_info
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 16.3 MB

References

In-house annotations on non-disclosure agreements.

Benchmarking

label           precision  recall  f1-score  support 
CONF_INFO_FORM  1.00       0.95    0.97      20      
CONF_INFO_TYPE  0.87       0.93    0.90      163     
micro-avg       0.88       0.93    0.90      183     
macro-avg       0.93       0.94    0.94      183     
weighted-avg    0.88       0.93    0.90      183
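As a sanity check on how the columns relate, the sketch below recomputes F1 and the support-weighted averages from the per-label precision, recall, and support values in the table. Because the table values are rounded to two decimals, the recomputed numbers can differ from the reported ones in the last digit. The variable and function names are illustrative only.

```python
# label: (precision, recall, support), copied from the benchmark table above
per_label = {
    "CONF_INFO_FORM": (1.00, 0.95, 20),
    "CONF_INFO_TYPE": (0.87, 0.93, 163),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_scores = {label: f1(p, r) for label, (p, r, _) in per_label.items()}

# Weighted averages weight each label by its support (number of gold entities)
total_support = sum(s for _, _, s in per_label.values())
weighted_precision = sum(p * s for p, _, s in per_label.values()) / total_support
weighted_recall = sum(r * s for _, r, s in per_label.values()) / total_support

# Macro average treats every label equally, regardless of support
macro_recall = sum(r for _, r, _ in per_label.values()) / len(per_label)

for label, score in f1_scores.items():
    print(f"{label}: f1 = {score:.3f}")
print(f"total support      = {total_support}")
print(f"weighted precision = {weighted_precision:.3f}")
print(f"weighted recall    = {weighted_recall:.3f}")
print(f"macro recall       = {macro_recall:.3f}")
```

The weighted averages sit close to the CONF_INFO_TYPE row because that label accounts for 163 of the 183 test entities.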