Legal Health Document Classifier (EURLEX)

Description

European Union (EU) legislation is published on the EUR-Lex portal. All EU laws are annotated by the EU’s Publications Office with multiple concepts from EuroVoc, a multilingual thesaurus that the Office maintains.

Given a document, the legclf_health_bert model classifies whether it belongs to the Health class or not (binary classification), according to EuroVoc labels. The model is a document classifier built on BERT sentence embeddings.

Predicted Entities

Health, Other

How to use


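# Assumes the johnsnowlabs package is installed with a valid Legal NLP license
from johnsnowlabs import nlp, legal

# Start (or reuse) a Spark session with the licensed libraries loaded
spark = nlp.start()

# Wrap the raw text column in a Spark NLP document annotation
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Embed each document with a BERT sentence-embedding model
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Binary classifier: predicts Health or Other from the embeddings
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_health_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

# All stages are pretrained, so fit() only assembles the pipeline model
model = nlpPipeline.fit(df)

result = model.transform(df)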
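
To inspect the predictions, the `category` column produced above can be read through its `result` field, which Spark NLP annotations expose as a list of label strings per row. The sketch below shows one possible way to flatten those lists once they are collected to the driver. `first_label` is a hypothetical helper, not part of the library, and the sample rows are made up so the snippet runs without Spark.

```python
from typing import Optional, Sequence


def first_label(labels: Sequence[str]) -> Optional[str]:
    """Return the first predicted label, or None when the list is empty."""
    return labels[0] if labels else None


# With the pipeline output from above, the rows would come from:
#   rows = [r.result for r in result.select("category.result").collect()]
# Illustrative stand-in data (an assumption, not real model output):
rows = [["Health"], ["Other"], []]

labels = [first_label(r) for r in rows]
print(labels)
```

Returning `None` for an empty list keeps a missing prediction distinguishable from an actual `Other` prediction.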

Results


+--------+
|  result|
+--------+
|[Health]|
| [Other]|
| [Other]|
|[Health]|
+--------+

Model Information

Model Name: legclf_health_bert
Compatibility: Legal NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: en
Size: 21.8 MB

References

Train dataset available here

Benchmarking


       label precision recall  f1-score  support
      Health      0.92   0.93      0.93      635
       Other      0.92   0.91      0.91      536
    accuracy         -      -      0.92     1171
   macro-avg      0.92   0.92      0.92     1171
weighted-avg      0.92   0.92      0.92     1171
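
As a rough consistency check, the macro and weighted averages in the table can be recomputed from the two per-class rows. The sketch below uses only the rounded figures shown above, so it confirms that the summary rows agree with the per-class rows to two decimals. It does not reproduce the original evaluation, and the helper names are illustrative.

```python
# Per-class figures copied from the benchmark table:
# (precision, recall, f1-score, support)
per_class = {
    "Health": (0.92, 0.93, 0.93, 635),
    "Other": (0.92, 0.91, 0.91, 536),
}


def macro_avg(rows, idx):
    """Unweighted mean of one metric across classes."""
    return sum(r[idx] for r in rows) / len(rows)


def weighted_avg(rows, idx):
    """Mean of one metric across classes, weighted by class support."""
    total = sum(r[3] for r in rows)
    return sum(r[idx] * r[3] for r in rows) / total


rows = list(per_class.values())
total_support = sum(r[3] for r in rows)

# Index 0 = precision, 1 = recall, 2 = f1-score
macro = [round(macro_avg(rows, i), 2) for i in range(3)]
weighted = [round(weighted_avg(rows, i), 2) for i in range(3)]

print("support:", total_support)
print("macro-avg:", macro)
print("weighted-avg:", weighted)
```

Because the two classes have similar support (635 and 536), the macro and weighted averages come out nearly the same.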