Description
This is a Legal NER model aimed to extract 19 entities from subpoenas. This is called a small version because it has been trained on more generic labels. The larger versions of this model will be available on models hub.
Predicted Entities
COURT
, APPOINTMENT_DATE
, DEADLINE_DATE
, DOCUMENT_DATE_FROM
, ADDRESS
, APPOINTMENT_HOUR
, DOCUMENT_DATE_TO
, DOCUMENT_PERSON
, DOCUMENT_DATE_YEAR
, STATE
, MATTER_VS
, CASE
, COUNTY
, DOCUMENT_TOPIC
, MATTER
, SUBPOENA_DATE
, SIGNER
, RECEIVER
, DOCUMENT_TYPE
How to use
from pyspark.sql import functions as F
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")\
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")\
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
ner_model = legal.BertForTokenClassification.pretrained("legner_bert_subpoenas_sm", "en", "legal/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
text = """In addition , in an earlier motion for summary disposition in which all Respondents joined , and which this Court denied in its Order of April30 , 2013 , Respondent Deloitte Touche Tohmatsu Certified Public Accountants Ltd ."""
data = spark.createDataFrame([[text]]).toDF("text")
result = model.transform(data)
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("label")).show(50, truncate = False)
Results
+------------------------+---------------+
|chunk |label |
+------------------------+---------------+
|summary disposition |DOCUMENT_TYPE |
|Deloitte Touche Tohmatsu|DOCUMENT_PERSON|
+------------------------+---------------+
Model Information
Model Name: | legner_bert_subpoenas_sm |
Compatibility: | Legal NLP 1.0.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [document, token] |
Output Labels: | [ner] |
Language: | en |
Size: | 401.1 MB |
Case sensitive: | true |
Max sentence length: | 128 |
References
In House annotated dataset
Benchmarking
label precision recall f1-score support
B-COURT 1.00 0.60 0.75 30
I-APPOINTMENT_DATE 0.57 0.65 0.60 20
I-COURT 0.93 0.89 0.91 166
B-APPOINTMENT_DATE 0.67 0.44 0.53 9
I-DEADLINE_DATE 0.83 0.26 0.40 19
B-DOCUMENT_DATE_FROM 0.80 1.00 0.89 16
I-ADDRESS 0.87 0.94 0.90 1046
B-APPOINTMENT_HOUR 0.43 0.92 0.59 13
B-DOCUMENT_DATE_TO 0.88 1.00 0.93 7
I-APPOINTMENT_HOUR 1.00 0.15 0.26 20
B-DOCUMENT_PERSON 0.79 0.84 0.82 2919
B-DOCUMENT_DATE_YEAR 0.00 0.00 0.00 5
B-STATE 0.59 0.79 0.68 24
I-MATTER_VS 0.65 0.79 0.71 150
I-CASE 0.00 0.00 0.00 11
I-COUNTY 0.00 0.00 0.00 0
B-DOCUMENT_TOPIC 0.64 0.77 0.70 208
B-COUNTY 0.00 0.00 0.00 0
B-MATTER 0.85 0.86 0.86 328
I-DOCUMENT_DATE_FROM 0.87 1.00 0.93 48
I-SUBPOENA_DATE 0.56 0.28 0.38 53
I-SIGNER 0.56 0.46 0.50 59
I-DOCUMENT_DATE_TO 0.83 1.00 0.91 25
I-RECEIVER 0.71 0.52 0.60 98
B-SIGNER 0.76 0.49 0.59 39
I-DOCUMENT_TOPIC 0.83 0.80 0.81 725
I-STATE 0.67 0.29 0.40 14
B-MATTER_VS 0.78 0.82 0.80 136
I-DOCUMENT_TYPE 0.83 0.87 0.85 621
B-DEADLINE_DATE 0.00 0.00 0.00 6
I-MATTER 0.88 0.82 0.85 479
B-DOCUMENT_TYPE 0.87 0.90 0.88 1714
B-ADDRESS 0.81 0.83 0.82 101
B-SUBPOENA_DATE 0.42 0.28 0.33 18
B-CASE 0.91 0.97 0.94 312
I-DOCUMENT_PERSON 0.80 0.83 0.81 3672
B-RECEIVER 0.76 0.63 0.69 46
micro-avg 0.82 0.84 0.83 13157
macro-avg 0.66 0.61 0.61 13157
weighted-avg 0.82 0.84 0.83 13157