Deidentify (Large)

Description

Deidentify (Large) is a deidentification model. It identifies instances of protected health information (PHI) in text documents, and it can either obfuscate them (e.g., replacing names with different, fake names) or mask them (e.g., replacing “2020-06-04” with “&lt;DATE&gt;”). This model is useful for maintaining HIPAA compliance when dealing with text documents that contain protected health information.
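
The “How to use” snippets below configure the obfuscation mode; switching to masking is a one-parameter change. A minimal sketch, assuming the same upstream “sentence”, “token”, and “ner_chunk” columns shown below:

# Mask mode: replace each detected entity with a placeholder for its type
# rather than a realistic surrogate value (sketch; column names as in the snippets below).
deid_masker = DeIdentificationModel.pretrained("deidentify_large", "en") \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("masked") \
    .setMode("mask")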


How to use


from sparknlp_jsl.annotator import DeIdentificationModel

deid = DeIdentificationModel.pretrained("deidentify_large", "en") \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")

val deid = DeIdentificationModel.pretrained("deidentify_large", "en")
    .setInputCols(Array("sentence", "token", "ner_chunk"))
    .setOutputCol("obfuscated")
    .setMode("obfuscate")

Model Information

Model Name: deidentify_large
Type: deid
Compatibility: Spark NLP for Healthcare 2.5.5
License: Licensed
Edition: Official
Input Labels: [sentence, token, ner_chunk]
Output Labels: [obfuscated]
Language: en
Case sensitive: false

Data Source

The model was trained on data from https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/