PDF Obfuscate Multilingual Name Plus

Description

The Clinical Obfuscation for PDF Pipeline is a powerful solution for transforming sensitive PDF documents into safe, shareable assets. It enables organizations to unlock the value of clinical data while ensuring strict compliance with HIPAA, GDPR, and institutional privacy standards. The pipeline is designed to obfuscate Personally Identifiable Information (PII) in input PDF documents, while preserving readability and format integrity. Its key features include:

Entity-Level Obfuscation

Detected entities (e.g., names, dates, IDs) are individually replaced with synthetic but realistic alternatives, ensuring sensitive information is protected.

Layout-Aware Replacement

Replacement entities are carefully chosen to match the visual space of the originals, avoiding issues like text overflow or underfitting.

Supported Entity Types

The pipeline can obfuscate the following types of information: HOSPITAL, NAME, PATIENT, ID, MEDICALRECORD, IDNUM, COUNTRY, LOCATION, STREET, STATE, ZIP, CONTACT, PHONE, DATE.

Preserved Document Appearance

The output is a PDF document visually similar to the original, with obfuscated text rendered on top of the original positions of sensitive entities.

Document-Wide Consistency

Replacement is consistent across the document. For example, if “Lilian Clarke” is replaced by “Nelly Huffman” on page 1, all subsequent instances of “Lilian Clarke” are replaced with the same synthetic name throughout the document.

Copy S3 URI

How to use

from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("pdf_obfuscate_multilingual_name_plus", "en", "clinical/ocr")

deid_pipeline.transform(input_pdfs)

Model Information

Model Name: pdf_obfuscate_multilingual_name_plus
Type: pipeline
Compatibility: Healthcare NLP 6.0.0+
License: Licensed
Edition: Official
Language: en
Size: 3.8 GB

Included Models

The following models are included in the pipeline,

  • PdfToImage
  • ImageToText
  • DocumentAssembler
  • SentenceDetectorDLModel
  • RegexTokenizer
  • PretrainedZeroShotNER
  • NerConverter
  • WordEmbeddingsModel
  • MedicalNerModel
  • NerConverter
  • XLMRobertaEmbeddings
  • MedicalNerModel
  • NerConverter
  • ContextualParser
  • ChunkConverter
  • Merge
  • DeIdentification
  • NerOutputCleaner
  • PositionFinder
  • ImageDrawRegions
  • ImageToPdf