Protect Sensitive Data While Preserving Value
Working with sensitive documents containing Personally Identifiable Information (PII) or Protected Health Information (PHI) doesn’t have to be complicated. Generative AI Lab’s De-Identification feature helps you automatically detect and anonymize sensitive data—ensuring compliance with privacy regulations like HIPAA and GDPR while keeping your data useful for analysis and model training.
Why De-Identification Matters
- Stay Compliant: Meet HIPAA, GDPR, and other privacy regulations without manual redaction
- Share Data Safely: Collaborate with external teams or researchers without exposing sensitive information
- Accelerate AI Development: Train models on realistic data without privacy risks
- Reduce Risk: Minimize data breaches by removing PII/PHI before sharing or storing documents
How It Works: Three Simple Steps
1. Create a De-Identification Project
Start by selecting the De-identification template when creating a new project. The system automatically:
- Downloads the state-of-the-art clinical_deidentification pipeline
- Sets up optimized labels for common sensitive entities (names, dates, IDs, etc.)
- Configures privacy-preserving export settings

Quick Start: Import the Demo De-Identification Project from the Import menu to explore the feature with sample data before working on real documents.
2. Choose Your De-Identification Strategy
\ Select how you want to handle sensitive information—you can even use different strategies for different entity types: \
| Strategy | What It Does | Example | Best For |
|---|---|---|---|
| Mask with Entity Labels | Replaces text with label names | <Patient> is a <Age> y.o. | Maintaining document structure |
| Mask with Characters | Replaces each character with * | **** ****** is a ** y.o. | Preserving text length |
| Fixed Length Mask | Replaces with fixed **** | **** **** is a **** y.o. | Uniform anonymization |
| Obfuscation | Replaces with realistic fake data | Mark Doe is a 48 y.o. | Maintaining data realism |
Example: For “John Davies is a 62 y.o. patient admitted to ICU after an MVA”:
- Entity Labels → “<Patient> is a <Age> y.o. patient admitted to ICU after an MVA”
- Obfuscation → “Mark Doe is a 48 y.o. patient admitted to ICU after an MVA”

Pro Tip: You can upload custom obfuscation rules in JSON format to reuse configurations across projects.

3. De-Identify and Export
Once your strategy is configured:
- Import your documents - They’ll be tagged as NonDeidentified for easy tracking
- Click “De-identify” - The AI automatically detects and labels sensitive entities
- Review and adjust - Make any corrections on the labeling page
- Submit completions - Mark reviewed tasks with a star
- Export safely - Only de-identified, human-verified data is exported

Advanced Features
Multi-Source Detection
Combine multiple approaches for maximum accuracy:
- Pre-trained models for common entity types
- Custom rules for domain-specific patterns
- Prompts for contextual detection
- John Snow Labs De-Identification Pipeline for state-of-the-art performance

Compare Original and De-Identified
View side-by-side comparisons of original and anonymized text to verify results before export.

Re-Import for Further Processing
Export de-identified documents and re-import them into any text project for:
- Additional annotation by team members
- Model training on privacy-safe data
- Further analysis without compliance concerns
Best Practices
✅ Target Only Sensitive Entities: Only label data that needs protection—everything labeled will be transformed during export
✅ Verify Before Export: Always review AI predictions to ensure accuracy. Only starred (submitted) completions are exported
✅ Use Entity-Specific Strategies: Apply obfuscation to names for realism, but mask IDs with fixed characters for simplicity
✅ Test with Demo Project: Familiarize yourself with workflows using the built-in demo before processing real data
✅ Leverage Custom Rules: Combine pre-trained models with custom rules to catch domain-specific patterns
Quick Reference
Creating a Project
From the Import menu:
- Select Import Demo Projects for a pre-configured example, OR
- Create a new project and select De-identification template under the TEXT tab
Tip: Look for the de-identification icon in the bottom-left corner of project cards to quickly identify de-identification projects.
Export Workflow
The export process automatically handles prioritization:
- Submitted completions (starred by annotators) are exported first
- If no submissions exist, de-identified predictions are exported
- With multiple submissions, the highest priority user’s completion is used (configured on the Teams page)

Tags and Status
- NonDeidentified tag: Automatically added to new imports for tracking
- Status indicators: Green (success), red (error), grey (pending) next to each task
- (De-identified) label: Shown next to completion IDs after de-identification

Important Notes
License Required: A valid John Snow Labs license is needed to access de-identification features.
Project Type is Permanent: Projects must be designated as de-identification projects during creation—you cannot convert existing projects to de-identification projects later.
Human Review Required: Pre-annotations alone aren’t exported. Each task must be reviewed and starred by a human to ensure quality.
Unified Workflow: The De-identify button works consistently across all project types (pipelines, models, rules) for a predictable experience.
Need Help?
- Getting Started: Import the Demo De-Identification Project from the Import menu
- Custom Configurations: Upload JSON obfuscation rules via the Customize Labels page
- Pipeline Management: View all downloaded pipelines on the dedicated Pipelines page
- Support: Contact support@johnsnowlabs.com for assistance