PDF Deidentification Multilingual Name Plus Signature Aware

Description

This pipeline masks protected health information (PHI) in PDF documents. The output is a PDF that mirrors the input, with black bounding boxes drawn over the targeted entities and with detected signatures removed.

Predicted Entities

HOSPITAL, NAME, PATIENT, ID, MEDICALRECORD, IDNUM, COUNTRY, LOCATION, STREET, STATE, ZIP, CONTACT, PHONE, DATE.

How to use

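# Requires an active Spark session started with valid Healthcare NLP and Visual NLP licenses.
from sparknlp.pretrained import PretrainedPipeline

# Download (on first use) and load the pretrained de-identification pipeline.
deid_pipeline = PretrainedPipeline("pdf_deid_multilingual_name_plus_signature_aware", "en", "clinical/ocr")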
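
Once the pipeline is loaded, it still has to be applied to a DataFrame of PDF files. The helper below is a minimal sketch of that step, not the official usage: it assumes the pipeline's first stage (PdfToImage) accepts a DataFrame read with Spark's built-in `binaryFile` source, and the function name `deidentify_pdfs` and the path shown are hypothetical.

```python
def deidentify_pdfs(spark, pipeline, pdf_path):
    """Read PDFs as binary files and run the pretrained pipeline on them.

    `spark` is an active SparkSession and `pipeline` is the loaded
    PretrainedPipeline. Both are passed in, so this sketch makes no
    assumption about how the licensed session was started.
    """
    # Spark's "binaryFile" source yields one row per file with the columns
    # `path`, `modificationTime`, `length`, and `content`.
    pdf_df = spark.read.format("binaryFile").load(pdf_path)
    # PretrainedPipeline.transform applies every stage to the DataFrame.
    return pipeline.transform(pdf_df)


# Hypothetical usage (the path is a placeholder):
# result = deidentify_pdfs(spark, deid_pipeline, "path/to/pdfs/*.pdf")
# result.printSchema()  # inspect the schema to locate the masked PDF column
```

The name of the output column is not stated on this page, so inspecting the schema of the result is the safest way to find the column that holds the de-identified PDF bytes.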

Example

Input:

[Screenshot: input PDF page]

Output:

[Screenshot: de-identified output PDF page]

Model Information

Model Name: pdf_deid_multilingual_name_plus_signature_aware
Type: pipeline
Compatibility: Healthcare NLP 6.0.0+
License: Licensed
Edition: Official
Language: en
Size: 4.0 GB

Included Models

  • PdfToImage
  • ImageToText
  • DocumentAssembler
  • SentenceDetectorDLModel
  • RegexTokenizer
  • PretrainedZeroShotNER
  • NerConverter
  • WordEmbeddingsModel
  • MedicalNerModel
  • NerConverter
  • XLMRobertaEmbeddings
  • MedicalNerModel
  • NerConverter
  • ContextualParser
  • ChunkConverter
  • Merge
  • DeIdentification
  • NerOutputCleaner
  • PositionFinder
  • ImageDrawRegions
  • HW_Signature_Detector
  • ImageDrawRegions
  • ImageToPdf
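
The final stages listed above locate the detected entities on the page image and cover them with filled boxes before the image is converted back to PDF. The pure-Python sketch below only illustrates that masking idea on a toy grayscale grid. It is not the implementation used by PositionFinder or ImageDrawRegions, and the `Box` layout and the `mask_regions` helper are hypothetical names introduced for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


def mask_regions(image: List[List[int]], boxes: List[Box], fill: int = 0) -> List[List[int]]:
    """Return a copy of a grayscale image with each box filled with `fill` (0 = black)."""
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [row[:] for row in image]  # copy so the input image is left untouched
    for x, y, w, h in boxes:
        # Clip the box to the image so out-of-range coordinates are ignored.
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                masked[row][col] = fill
    return masked


page = [[255] * 6 for _ in range(4)]           # a blank white 6x4 "page"
redacted = mask_regions(page, [(1, 1, 3, 2)])  # one box over a detected entity
```

Filling the box with a solid color, rather than blurring, means the covered pixels carry no trace of the original text in the rendered image.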