Description
This pipeline profiles and benchmarks de-identification models on clinical text. It combines multiple NER models and rule-based components commonly used to detect and anonymize protected health information (PHI): models trained with embeddings_clinical, zero-shot NER models, regex matchers, text matchers, and contextual parsers. Running these approaches side by side on the same input makes it possible to evaluate and compare de-identification strategies across clinical datasets.
The following models are included in this pipeline:
ner_deid_enriched, ner_deid_sd, ner_deid_subentity_augmented_langtest, ner_deid_generic_augmented_allUpperCased_langtest, ner_deid_subentity_augmented_v2, ner_deid_subentity_augmented, ner_deid_enriched_langtest, ner_deid_subentity_augmented_i2b2, ner_deid_subentity_augmented_docwise, ner_deid_large, ner_deid_large_langtest, ner_deid_augmented, ner_deid_generic_docwise, ner_deid_subentity_docwise, ner_deid_synthetic, ner_deidentify_dl, ner_deid_aipii, ner_deid_generic_augmented_langtest, ner_deid_generic_augmented, ner_deid_sd_large, plate_parser, date_of_death_parser, date_of_birth_parser, vin_parser, account_parser, ssn_parser, phone_parser, medical_record_parser, zip_parser, license_parser, age_parser, drug_parser, dln_parser, url_matcher, date_matcher, phone_matcher, state_matcher, zip_matcher, ip_matcher, email_matcher, country_matcher, zeroshot_ner_deid_subentity_merged_medium
Predicted Entities
ACCOUNT, AGE, BIOID, CITY, CONTACT, COUNTRY, DATE, DEVICE, DLN, DOCTOR, EMAIL, FAX, HEALTHPLAN, HOSPITAL, ID, IDNUM, LICENSE, LOCATION, LOCATION_OTHER, MEDICALRECORD, NAME, ORGANIZATION, PATIENT, PHONE, PROFESSION, SSN, STATE, STREET, URL, USERNAME, ZIP
How to use
from sparknlp.pretrained import PretrainedPipeline
ner_profiling_pipeline = PretrainedPipeline("ner_profiling_deidentification", "en", "clinical/models")
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, Age: 25, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco."""
ner_profiling_pipeline_result = ner_profiling_pipeline.fullAnnotate(text)[0]
# Alternative: the same call through the johnsnowlabs library
from johnsnowlabs import nlp
ner_profiling_pipeline = nlp.PretrainedPipeline("ner_profiling_deidentification", "en", "clinical/models")
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, Age: 25, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco."""
ner_profiling_pipeline_result = ner_profiling_pipeline.fullAnnotate(text)[0]
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_deidentification", "en", "clinical/models")
val text = """Name : Hendrickson, Ora, Record date: 2093-01-13, Age: 25, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco."""
val ner_profiling_pipeline_result = ner_profiling_pipeline.fullAnnotate(text)
Results
******************** ner_deid_aipii Model Results ********************
[('Hendrickson', 'NAME'), ('2093-01-13', 'SSN'), ('John Green', 'STREET'), ('1231511863', 'IDNUM'), ('203.120.223.13', 'SSN'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]
******************** ip_matcher Model Results ********************
[('203.120.223.13', 'IP')]
******************** ner_deid_large_langtest Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('no:A334455B', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deid_enriched_langtest Model Results ********************
[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]
******************** ner_deid_subentity_augmented_langtest Model Results ********************
[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('203.120.223.13', 'PHONE'), ('60-year-old', 'AGE'), ('Day Hospital', 'HOSPITAL'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('#333-44-6666', 'PHONE'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]
******************** ner_deid_sd_large Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'ID'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deid_subentity_augmented Model Results ********************
[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'DEVICE'), ('60-year-old', 'AGE'), ('Day Hospital', 'HOSPITAL'), ('01/13/93', 'DATE'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'STATE')]
******************** ner_deid_generic_augmented_langtest Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('60-year-old', 'AGE'), ('01/13/93', 'DATE'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deid_augmented Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('IP 203.120.223.13', 'CONTACT'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deid_subentity_augmented_i2b2 Model Results ********************
[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'ZIP'), ('John Green', 'DOCTOR'), (': 1231511863', 'IDNUM'), ('203.120.223.13', 'PHONE'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('#333-44-6666', 'PHONE'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]
******************** ner_deid_subentity_augmented_v2 Model Results ********************
[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('203.120.223.13', 'USERNAME'), ('60-year-old', 'AGE'), ('Day Hospital', 'HOSPITAL'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('#333-44-6666', 'IDNUM'), ('no:A334455B', 'PHONE'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'STATE')]
******************** ner_deid_sd Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('203.120.223.13', 'CONTACT'), ('01/13/93', 'DATE'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deid_generic_docwise Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('IP', 'NAME'), ('203.120.223.13', 'DATE'), ('60-year-old', 'AGE'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('#333-44-6666', 'ID'), ('no:A334455B', 'ID'), ('(302) 786-5227, 0295', 'CONTACT'), ('Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deid_large Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deid_generic_augmented Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'CONTACT'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('60-year-old', 'AGE'), ('Day Hospital', 'LOCATION'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'ID'), ('#333-44-6666', 'ID'), ('no:A334455B', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deidentify_dl Model Results ********************
[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('Day Hospital', 'HOSPITAL'), ('01/13/93', 'DATE'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco.', 'CITY')]
******************** ner_deid_enriched Model Results ********************
[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'PHONE'), ('John Green', 'DOCTOR'), ('01/13/93', 'DATE'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]
******************** ner_deid_generic_augmented_allUpperCased_langtest Model Results ********************
[('Hendrickson, Ora', 'NAME'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'ID'), ('John Green', 'NAME'), ('1231511863', 'ID'), ('60-year-old', 'AGE'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286, SSN', 'NAME'), ('#333-44-6666', 'ID'), ('no:A334455B', 'ID'), ('(302) 786-5227', 'CONTACT'), ('0295 Keats Street', 'LOCATION'), ('San Francisco', 'LOCATION')]
******************** ner_deid_subentity_docwise Model Results ********************
[('Hendrickson, Ora', 'PATIENT'), ('2093-01-13', 'DATE'), ('25', 'AGE'), ('719435', 'DEVICE'), ('John Green', 'DOCTOR'), ('1231511863', 'IDNUM'), ('203.120.223.13', 'DATE'), ('60-year-old', 'AGE'), ('01/13/93', 'DATE'), ('1HGBH41JXMN109286', 'IDNUM'), ('no:A334455B', 'IDNUM'), ('(302) 786-5227', 'PHONE'), ('0295 Keats Street', 'STREET'), ('San Francisco', 'CITY')]
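Since the purpose of the pipeline is to compare models, one simple way to read the results above is to tally which labels each chunk received across models. The following is a minimal sketch in plain Python, not part of the pipeline itself. It assumes the per-model output has already been collected into a dictionary of `(chunk, label)` tuples like the lists printed above. The helper names `labels_per_chunk` and `disagreements` are hypothetical, and the data is a small excerpt of the results shown above.

```python
from collections import Counter, defaultdict

# Excerpt of the results above: model name -> [(chunk, label), ...].
# The dictionary layout is an assumption made for this illustration.
results = {
    "ip_matcher": [("203.120.223.13", "IP")],
    "ner_deid_subentity_augmented_v2": [
        ("John Green", "DOCTOR"),
        ("203.120.223.13", "USERNAME"),
        ("San Francisco", "STATE"),
    ],
    "ner_deid_subentity_docwise": [
        ("John Green", "DOCTOR"),
        ("203.120.223.13", "DATE"),
        ("San Francisco", "CITY"),
    ],
    "ner_deid_enriched": [
        ("John Green", "DOCTOR"),
        ("San Francisco", "CITY"),
    ],
}


def labels_per_chunk(results):
    """Count how many models assigned each label to each chunk text."""
    tally = defaultdict(Counter)
    for chunks in results.values():
        for chunk_text, label in chunks:
            tally[chunk_text][label] += 1
    return tally


def disagreements(tally):
    """Keep only the chunks that received more than one distinct label."""
    return {chunk: dict(counts) for chunk, counts in tally.items() if len(counts) > 1}


tally = labels_per_chunk(results)
for chunk, counts in disagreements(tally).items():
    print(chunk, counts)
```

Chunks are matched by exact text here, so spans with different boundaries (for example `'1231511863'` and `': 1231511863'`) count as different chunks. Comparing fine-grained labels such as PATIENT with generic ones such as NAME would additionally need a label mapping, which this sketch does not attempt.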
Model Information
| Model Name: | ner_profiling_deidentification |
| Type: | pipeline |
| Compatibility: | Healthcare NLP 6.0.2+ |
| License: | Licensed |
| Edition: | Official |
| Language: | en |
| Size: | 2.7 GB |
Included Models
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel x 20
- NerConverter x 20
- ContextualParserModel x 12
- RegexMatcherInternalModel x 3
- TextMatcherInternalModel
- RegexMatcherInternalModel x 3
- TextMatcherInternalModel
- PretrainedZeroShotNER
- NerConverterInternalModel
- Finisher