Description
This pipeline masks protected health information (PHI) in PDFs. The output is a PDF that matches the input document, with black bounding boxes drawn over the targeted entities. The pipeline also removes signatures.
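To make the masking idea concrete, here is a minimal pure-Python sketch of drawing black boxes over regions of a page image. It is an illustration only, not the pipeline's implementation: the `Box` format, the `mask_regions` function, and the tiny grayscale "page" are all hypothetical names and data chosen for this example.

```python
from typing import List, Tuple

# (x, y, width, height) in pixel coordinates; a hypothetical box format for this sketch.
Box = Tuple[int, int, int, int]

BLACK = 0
WHITE = 255


def mask_regions(image: List[List[int]], boxes: List[Box]) -> List[List[int]]:
    """Return a copy of a grayscale image with every box filled in black."""
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [row[:] for row in image]  # leave the input image untouched
    for x, y, w, h in boxes:
        # Clip each box to the image so out-of-range coordinates are ignored.
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                masked[row][col] = BLACK
    return masked


# A blank 6x4 "page" with one detected entity region to hide.
page = [[WHITE] * 6 for _ in range(4)]
redacted = mask_regions(page, [(1, 1, 3, 2)])
```

In the real pipeline the regions come from the detected entities and signatures rather than from hand-written coordinates.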
Predicted Entities
HOSPITAL, NAME, PATIENT, ID, MEDICALRECORD, IDNUM, COUNTRY, LOCATION, STREET, STATE, ZIP, CONTACT, PHONE, DATE.
How to use
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("pdf_deid_multilingual_name_plus_signature_aware", "en", "clinical/ocr")
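Once the pipeline is loaded, it still has to be applied to some PDFs. The sketch below shows one plausible way to do that, assuming an active Spark session (`spark`) with the licensed libraries available and PDFs read through Spark's built-in `binaryFile` source. The helper name `deidentify_pdfs` and the example path are hypothetical, and the name of the output column that holds the masked PDF is not stated on this page, so inspect the result's columns rather than relying on a guess.

```python
def deidentify_pdfs(spark, pipeline, input_path):
    """Load PDFs as binary files and run them through a pretrained pipeline.

    `spark` is an active SparkSession and `pipeline` is the loaded
    PretrainedPipeline; both are passed in so this sketch has no
    hidden dependencies.
    """
    # Spark's "binaryFile" source yields one row per file, including
    # `path` and `content` columns.
    pdf_df = spark.read.format("binaryFile").load(input_path)
    # PretrainedPipeline.transform applies every stage to the DataFrame.
    return pipeline.transform(pdf_df)


# Usage (assumes `spark` and `deid_pipeline` already exist):
# result = deidentify_pdfs(spark, deid_pipeline, "path/to/pdfs/*.pdf")
# print(result.columns)  # check which column carries the masked PDF
```

Passing the session and pipeline as arguments keeps the helper easy to reuse across notebooks and jobs.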
Example
Input:
Output:
Model Information
| Model Name: | pdf_deid_multilingual_name_plus_signature_aware |
| Type: | pipeline |
| Compatibility: | Healthcare NLP 6.0.0+ |
| License: | Licensed |
| Edition: | Official |
| Language: | en |
| Size: | 4.0 GB |
Included Models
- PdfToImage
- ImageToText
- DocumentAssembler
- SentenceDetectorDLModel
- RegexTokenizer
- PretrainedZeroShotNER
- NerConverter
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- XLMRobertaEmbeddings
- MedicalNerModel
- NerConverter
- ContextualParser
- ChunkConverter
- Merge
- DeIdentification
- NerOutputCleaner
- PositionFinder
- ImageDrawRegions
- HW_Signature_Detector
- ImageDrawRegions
- ImageToPdf
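The stage list shows several entity detectors feeding a single Merge step, so the same piece of text can be flagged more than once. The following pure-Python sketch shows one common way to reconcile overlapping chunks, by keeping the longest span on conflict. This is an assumption made for illustration: the page does not describe the Merge stage's actual conflict rule, and `Chunk`, `merge_chunks`, and the `source` tags are hypothetical names.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    begin: int   # inclusive character offset
    end: int     # inclusive character offset
    label: str
    source: str  # hypothetical tag for the stage that produced the chunk


def merge_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Keep non-overlapping chunks, preferring longer spans on conflict."""
    # Consider longer spans first; ties go to the earlier start, then input order.
    by_length = sorted(chunks, key=lambda c: (-(c.end - c.begin), c.begin))
    kept: List[Chunk] = []
    for cand in by_length:
        overlaps = any(cand.begin <= k.end and k.begin <= cand.end for k in kept)
        if not overlaps:
            kept.append(cand)
    # Return the survivors in document order.
    return sorted(kept, key=lambda c: c.begin)


chunks = [
    Chunk(0, 7, "NAME", "zero_shot"),
    Chunk(0, 3, "PATIENT", "clinical_ner"),
    Chunk(20, 29, "DATE", "clinical_ner"),
    Chunk(20, 29, "DATE", "multilingual_ner"),
]
merged = merge_chunks(chunks)
```

Here the shorter PATIENT chunk is dropped in favor of the overlapping NAME chunk, and the duplicate DATE chunk collapses to one, so each region is masked only once.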