Legal NER on Subpoenas (Small)

Description

This is a Legal NER model, trained with custom legal word embeddings, that extracts 19 entities from subpoenas. It is called a small version because it was trained on more generic labels; larger versions of this model will be available on Models Hub.

Predicted Entities

COURT, APPOINTMENT_DATE, DEADLINE_DATE, DOCUMENT_DATE_FROM, ADDRESS, APPOINTMENT_HOUR, DOCUMENT_DATE_TO, DOCUMENT_PERSON, DOCUMENT_DATE_YEAR, STATE, MATTER_VS, CASE, COUNTY, DOCUMENT_TOPIC, MATTER, SUBPOENA_DATE, SIGNER, RECEIVER, DOCUMENT_TYPE

How to use

from johnsnowlabs import nlp, legal

# Start a Spark session with the licensed legal libraries (assumes a valid license)
spark = nlp.start()

documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("legal_word_embeddings", "en", "legal/models")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_bert_subpoenas_sm_le", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)
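
Once fitted, the pipeline can be applied to text with transform. The snippet below is a minimal sketch; sample_text is a made-up placeholder, and the actual predictions depend on the input.

from pyspark.sql import functions as F

sample_text = "Please produce the summary disposition filed in the above-captioned matter."

result = model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))

# Explode the ner_chunk annotations into one row per chunk,
# keeping the chunk text and its predicted entity label
result.select(F.explode("ner_chunk").alias("c"))\
      .select(F.col("c.result").alias("chunk"),
              F.col("c.metadata").getItem("entity").alias("label"))\
      .show(truncate=False)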

Results

+-------------------+-------------+
|chunk              |label        |
+-------------------+-------------+
|summary disposition|DOCUMENT_TYPE|
+-------------------+-------------+
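
For quick ad-hoc tests without building a Spark DataFrame, the fitted model can also be wrapped in a LightPipeline. This is a minimal sketch; the input string is again a made-up placeholder.

light_model = nlp.LightPipeline(model)

# fullAnnotate returns one dictionary of annotations per input string
annotations = light_model.fullAnnotate("Please produce the summary disposition filed in this matter.")[0]

for chunk in annotations["ner_chunk"]:
    print(chunk.result, "->", chunk.metadata["entity"])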

Model Information

Model Name: legner_subpoenas_sm
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.7 MB

References

In-house annotated dataset

Benchmarking

                    precision    recall  f1-score   support
           ADDRESS       0.82      0.88      0.85        42
  APPOINTMENT_DATE       0.50      1.00      0.67         3
  APPOINTMENT_HOUR       1.00      1.00      1.00         2
              CASE       0.74      0.89      0.81        19
            COUNTY       0.33      0.50      0.40         2
             COURT       0.44      0.40      0.42        10
     DEADLINE_DATE       0.50      1.00      0.67         2
DOCUMENT_DATE_FROM       0.67      0.86      0.75         7
  DOCUMENT_DATE_TO       0.71      0.83      0.77         6
DOCUMENT_DATE_YEAR       0.50      0.50      0.50         4
   DOCUMENT_PERSON       0.82      0.79      0.81      1307
    DOCUMENT_TOPIC       0.63      0.62      0.62        94
     DOCUMENT_TYPE       0.87      0.89      0.88       783
            MATTER       0.92      0.86      0.89        94
         MATTER_VS       0.93      0.78      0.85        54
          RECEIVER       0.50      0.30      0.37        20
            SIGNER       0.62      0.65      0.63        20
             STATE       0.60      0.86      0.71        14
     SUBPOENA_DATE       0.24      0.57      0.33         7
         micro-avg       0.82      0.81      0.82      2490
         macro-avg       0.65      0.75      0.68      2490
      weighted-avg       0.82      0.81      0.82      2490