Named Entity Recognition Profiling (deidentification)

Description

This pipeline is designed for profiling and benchmarking various de-identification models applied to clinical texts. It integrates multiple NER models and rule-based components that are commonly used for detecting and anonymizing protected health information (PHI). The pipeline includes models trained with embeddings_clinical, zero-shot NER models, regex matchers, text matchers, and contextual parsers. By consolidating these diverse approaches, it allows comprehensive evaluation and comparison of different de-identification strategies across clinical datasets.

The following models are included in this pipeline: ner_deid_enriched, ner_deid_sd, ner_deid_subentity_augmented_langtest, ner_deid_generic_augmented_allUpperCased_langtest, ner_deid_subentity_augmented_v2, ner_deid_subentity_augmented, ner_deid_enriched_langtest, ner_deid_subentity_augmented_i2b2, ner_deid_subentity_augmented_docwise, ner_deid_large, ner_deid_large_langtest, ner_deid_augmented, ner_deid_generic_docwise, ner_deid_subentity_docwise, ner_deid_synthetic, ner_deidentify_dl, ner_deid_aipii, ner_deid_generic_augmented_langtest, ner_deid_generic_augmented, ner_deid_sd_large, plate_parser, date_of_death_parser, date_of_birth_parser, vin_parser, account_parser, ssn_parser, phone_parser, medical_record_parser, zip_parser, license_parser, age_parser, drug_parser, dln_parser, url_matcher, date_matcher, phone_matcher, state_matcher, zip_matcher, ip_matcher, email_matcher, country_matcher, zeroshot_ner_deid_subentity_merged_medium

Predicted Entities

ACCOUNT, AGE, BIOID, CITY, CONTACT, COUNTRY, DATE, DEVICE, DLN, DOCTOR, EMAIL, FAX, HEALTHPLAN, HOSPITAL, ID, IDNUM, LICENSE, LOCATION, LOCATION_OTHER, MEDICALRECORD, NAME, ORGANIZATION, PATIENT, PHONE, PROFESSION, SSN, STATE, STREET, URL, USERNAME, ZIP

Copy S3 URI

How to use

from sparknlp.pretrained import PretrainedPipeline

ner_profiling_pipeline = PretrainedPipeline("ner_profiling_deidentification", "en", "clinical/models")

text = """Name : Hendrickson, Ora, Record date: 2093-01-13, Age: 25, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco."""

ner_profiling_pipeline_result = ner_profiling_pipeline.fullAnnotate(text)[0]

ner_profiling_pipeline = nlp.PretrainedPipeline("ner_profiling_deidentification", "en", "clinical/models")

text = """Name : Hendrickson, Ora, Record date: 2093-01-13, Age: 25, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco."""

ner_profiling_pipeline_result = ner_profiling_pipeline.fullAnnotate(text)[0]

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_deidentification", "en", "clinical/models")

val text = """Name : Hendrickson, Ora, Record date: 2093-01-13, Age: 25, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco."""

val ner_profiling_pipeline_result = ner_profiling_pipeline.fullAnnotate(text)[0]

Results


******************** ner_deid_aipii Model Results ******************** 

[('Hendrickson', 'NAME'), ('2093-01-13', 'SSN'), ('John Green', 'STREET'), ('1231511863', 'IDNUM'), ('203.120.223.13', 'SSN'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]

******************** ip_matcher Model Results ******************** 

[('203.120.223.13', 'IP')]

******************** ner_deid_large_langtest Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('no:A334455B', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deid_enriched_langtest Model Results ******************** 

[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]

******************** ner_deid_subentity_augmented_langtest Model Results ******************** 

[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('203.120.223.13', 'PHONE'), ('60-year-old', 'AGE'), ('Day Hospital', 'HOSPITAL'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('#333-44-6666', 'PHONE'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]

******************** ner_deid_sd_large Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'ID'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deid_subentity_augmented Model Results ******************** 

[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'DEVICE'), ('60-year-old', 'AGE'), ('Day Hospital', 'HOSPITAL'), ('01/13/93', 'DATE'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'STATE')]

******************** ner_deid_generic_augmented_langtest Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('60-year-old', 'AGE'), ('01/13/93', 'DATE'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deid_augmented Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('IP 203.120.223.13', 'CONTACT'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deid_subentity_augmented_i2b2 Model Results ******************** 

[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'ZIP'), ('John Green', 'DOCTOR'), (': 1231511863', 'IDNUM'), ('203.120.223.13', 'PHONE'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('#333-44-6666', 'PHONE'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]

******************** ner_deid_subentity_augmented_v2 Model Results ******************** 

[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('203.120.223.13', 'USERNAME'), ('60-year-old', 'AGE'), ('Day Hospital', 'HOSPITAL'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('#333-44-6666', 'IDNUM'), ('no:A334455B', 'PHONE'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'STATE')]

******************** ner_deid_sd Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('203.120.223.13', 'CONTACT'), ('01/13/93', 'DATE'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deid_generic_docwise Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('IP', 'NAME'), ('203.120.223.13', 'DATE'), ('60-year-old', 'AGE'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('#333-44-6666', 'ID'), ('no:A334455B', 'ID'), ('(302) 786-5227, 0295', 'CONTACT'), ('Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deid_large Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deid_generic_augmented Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('60-year-old', 'AGE'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('#333-44-6666', 'ID'), ('no:A334455B', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deidentify_dl Model Results ******************** 

[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('Day Hospital', 'HOSPITAL'), ('01/13/93', 'DATE'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco.', 'CITY')]

******************** ner_deid_enriched Model Results ******************** 

[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('01/13/93', 'DATE'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]

******************** ner_deid_generic_augmented_allUpperCased_langtest Model Results ******************** 

[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'ID'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('60-year-old', 'AGE'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286, SSN', 'NAME'), ('#333-44-6666', 'ID'), ('no:A334455B', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]

******************** ner_deid_subentity_docwise Model Results ******************** 

[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'DEVICE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('203.120.223.13', 'DATE'), ('60-year-old', 'AGE'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]

Model Information

Model Name: ner_profiling_deidentification
Type: pipeline
Compatibility: Healthcare NLP 6.0.2+
License: Licensed
Edition: Official
Language: en
Size: 2.7 GB

Included Models

  • DocumentAssembler
  • SentenceDetectorDLModel
  • TokenizerModel
  • WordEmbeddingsModel
  • MedicalNerModel x 20
  • NerConverter x 20
  • ContextualParserModel x 12
  • RegexMatcherInternalModel x 3
  • TextMatcherInternalModel
  • RegexMatcherInternalModel x 3
  • TextMatcherInternalModel
  • PretrainedZeroShotNER
  • NerConverterInternalModel
  • Finisher