Description
This pipeline masks protected health information (PHI) in PDF documents. The output is a PDF that matches the input, with black bounding boxes drawn over the targeted entities.
Predicted Entities
AGE, BIOID, CITY, COUNTRY, DATE, DEVICE, DOCTOR, EMAIL, FAX, HEALTHPLAN, HOSPITAL, IDNUM, LOCATION, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, URL, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR.
How to use
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("pdf_deid_subentity_context_augmented_pipeline", "en", "clinical/models")
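Once the pipeline is loaded, the PDFs still have to be read into a Spark DataFrame and passed through it. The helper below is a minimal sketch of that step, not the documented workflow: `deidentify_pdfs` and the `input_path` value are hypothetical names, and it assumes the pipeline's first stage accepts the `content` column produced by Spark's built-in `binaryFile` source. The name of the output column holding the masked PDF is not stated here, so inspect the result schema rather than relying on a guessed name.

```python
def deidentify_pdfs(spark, pipeline, input_path):
    """Read PDFs as binary files and run them through a loaded pipeline.

    spark      -- an active SparkSession with the licensed libraries attached
    pipeline   -- the PretrainedPipeline loaded above (deid_pipeline)
    input_path -- hypothetical path or glob, for example "data/*.pdf"
    """
    # Spark's binaryFile source yields one row per file with the columns
    # path, modificationTime, length and content (the raw bytes).
    pdf_df = spark.read.format("binaryFile").load(input_path)

    # PretrainedPipeline.transform applies every pipeline stage in order
    # and returns a new DataFrame with the stage outputs appended.
    return pipeline.transform(pdf_df)
```

A call would look like `result = deidentify_pdfs(spark, deid_pipeline, "data/*.pdf")`, followed by `result.printSchema()` to find the column that carries the de-identified PDF bytes.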
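Example
Input: (image not reproduced: a page of the original PDF)
Output: (image not reproduced: the same page with the targeted entities covered by black boxes)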
Model Information
Model Name: | pdf_deid_subentity_context_augmented_pipeline |
Type: | pipeline |
Compatibility: | Healthcare NLP 5.3.2+ |
License: | Licensed |
Edition: | Official |
Language: | en |
Size: | 1.7 GB |
Included Models
- PdfToImage
- ImageToText
- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- TextMatcherModel
- ContextualParserModel
- RegexMatcherModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- RegexMatcherInternalModel
- ChunkMergeModel
- ChunkMergeModel
- PositionFinder
- ImageDrawRegions
- ImageToPdf
- PdfAssembler
Speed Benchmarks
- Dataset: 1000 scanned PDF pages.
- Instances:
  - m5n.4xlarge (16 vCPUs, 64 GiB memory)
  - m5n.8xlarge (32 vCPUs, 128 GiB memory)
- AMI: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240411
- Versions:
  - spark-nlp Version: v5.4.0
  - visual-nlp Version: v5.3.2
  - spark-nlp-jsl Version: v5.3.2
  - Spark Version: v3.4.1
- Visual NLP Pipeline: pdf_deid_subentity_context_augmented_pipeline
Benchmark Table
Instance | Memory | Cores | Input pages | Partitions | Seconds per page | Total time |
---|---|---|---|---|---|---|
m5n.4xlarge | 64 GB | 16 | 1000 | 10 | 0.24 | 4 mins |
m5n.8xlarge | 128 GB | 32 | 1000 | 32 | 0.15 | 2.5 mins |
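The per-page and total timings in the table are consistent with each other, and the same arithmetic gives a rough capacity estimate. The sketch below recomputes the total time from the per-page figure and extrapolates to pages per hour. The function names are illustrative, and the hourly figure assumes throughput stays linear beyond the 1000-page benchmark, which was not measured.

```python
def total_minutes(pages, seconds_per_page):
    """Wall-clock minutes needed to process `pages` at a fixed per-page rate."""
    return pages * seconds_per_page / 60


def pages_per_hour(seconds_per_page):
    """Linear extrapolation of throughput from the per-page rate."""
    return 3600 / seconds_per_page


# Per-page timings taken from the benchmark table above.
benchmarks = {"m5n.4xlarge": 0.24, "m5n.8xlarge": 0.15}

for instance, spp in benchmarks.items():
    minutes = total_minutes(1000, spp)
    hourly = pages_per_hour(spp)
    print(f"{instance}: {minutes:.1f} min per 1000 pages, about {hourly:.0f} pages/hour")
```

This reproduces the 4 and 2.5 minute totals. Doubling the cores from 16 to 32 cuts the per-page time by a factor of 0.24 / 0.15 = 1.6, so scaling on this workload is sublinear.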