Description
This pipeline masks protected health information (PHI) in PDF documents. The output is a PDF that matches the input, except that each targeted entity is covered by a black bounding box.
Predicted Entities
AGE, BIOID, CITY, COUNTRY, DATE, DEVICE, DOCTOR, EMAIL, FAX, HEALTHPLAN, HOSPITAL, IDNUM, LOCATION, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, URL, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR.
How to use
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("pdf_deid_subentity_context_augmented_pipeline", "en", "clinical/models")
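Once loaded, the pipeline can be applied to a DataFrame of PDF files. The following is a minimal sketch, not the official usage: it assumes an already-started, licensed Spark session with Healthcare NLP and Visual NLP available, and it reads the PDFs with Spark's built-in `binaryFile` source. The helper name `deidentify_pdfs` and the `pdf_dir` argument are hypothetical, and the names of the output columns are not documented here, so inspect the result schema before relying on them.

```python
def deidentify_pdfs(spark, pdf_dir):
    """Run the pretrained de-identification pipeline over the PDFs in pdf_dir.

    `spark` is assumed to be a SparkSession that was started with the
    licensed Healthcare NLP and Visual NLP libraries already configured.
    """
    # Imported lazily so that this module can be loaded without Spark NLP installed.
    from sparknlp.pretrained import PretrainedPipeline

    deid_pipeline = PretrainedPipeline(
        "pdf_deid_subentity_context_augmented_pipeline", "en", "clinical/models"
    )

    # Spark's binaryFile source yields one row per file, with the columns
    # path, modificationTime, length, and content (the raw bytes).
    pdf_df = spark.read.format("binaryFile").load(pdf_dir)

    # Assumption: the pipeline's first stage reads the raw bytes from `content`.
    return deid_pipeline.transform(pdf_df)
```

Calling `deidentify_pdfs(spark, "path/to/pdfs").printSchema()` is one way to check which column holds the masked PDF before writing it out.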
Example
Input:

Output:

Model Information
| Model Name: | pdf_deid_subentity_context_augmented_pipeline |
| Type: | pipeline |
| Compatibility: | Healthcare NLP 5.3.2+ |
| License: | Licensed |
| Edition: | Official |
| Language: | en |
| Size: | 1.7 GB |
Included Models
- PdfToImage
- ImageToText
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- TextMatcherModel
- ContextualParserModel
- RegexMatcherModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- RegexMatcherInternalModel
- ChunkMergeModel
- ChunkMergeModel
- PositionFinder
- ImageDrawRegions
- ImageToPdf
- PdfAssembler
Speed Benchmarks
- Dataset: 1000 scanned PDF pages.
- Instances:
  - m5n.4xlarge (16 vCPUs, 64 GiB memory)
  - m5n.8xlarge (32 vCPUs, 128 GiB memory)
- AMI: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240411
- Versions:
  - spark-nlp Version: v5.4.0
  - visual-nlp Version: v5.3.2
  - spark-nlp-jsl Version: v5.3.2
  - Spark Version: v3.4.1
- Visual NLP Pipeline: 'pdf_deid_subentity_context_augmented_pipeline'
Benchmark Table
| Instance | Memory | Cores | Input pages | Partitions | Seconds per page | Total time |
|---|---|---|---|---|---|---|
| m5n.4xlarge | 64 GiB | 16 | 1000 | 10 | 0.24 | 4 mins |
| m5n.8xlarge | 128 GiB | 32 | 1000 | 32 | 0.15 | 2.5 mins |
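The total time in each row is the per-page time multiplied by the page count. The short sketch below reproduces that arithmetic from the table's figures; the function name `total_minutes` is illustrative only. Applying it to other page counts assumes that throughput scales linearly with document size, which the benchmark does not verify.

```python
def total_minutes(pages: int, seconds_per_page: float) -> float:
    """Convert a per-page processing time into a total run time in minutes."""
    return pages * seconds_per_page / 60.0


# Figures copied from the benchmark table.
benchmarks = {
    "m5n.4xlarge": {"pages": 1000, "seconds_per_page": 0.24},
    "m5n.8xlarge": {"pages": 1000, "seconds_per_page": 0.15},
}

for instance, row in benchmarks.items():
    minutes = total_minutes(row["pages"], row["seconds_per_page"])
    print(f"{instance}: {minutes:.1f} min")
```

Per page, the larger instance is 0.24 / 0.15 = 1.6 times faster, with twice the cores and memory and 32 partitions instead of 10.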