PDF Deidentification Multi Model Context Signature Aware

Description

This pipeline masks protected health information (PHI) in PDFs. Masked entities include ‘AGE’, ‘CITY’, ‘COUNTRY’, ‘DATE’, ‘DOCTOR’, ‘EMAIL’, ‘HOSPITAL’, ‘IDNUM’, ‘ORGANIZATION’, ‘PATIENT’, ‘PHONE’, ‘PROFESSION’, ‘STATE’, ‘STREET’, ‘USERNAME’, ‘ZIP’. The output is a PDF document similar to the input, with black bounding boxes drawn over the targeted entities. Detected signatures are also removed.

How to use

from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("pdf_deid_multi_model_context_signature_aware_pipeline", "en", "clinical/ocr")
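
Once loaded, the pipeline still has to be applied to actual PDF files. The following is a minimal sketch of one way to do that, not the official usage. It assumes an active, licensed Spark session is passed in as `spark`, that the PDFs are read with Spark's built-in `binaryFile` source, and that the final ImageToPdf stage writes the redacted bytes to a column named `pdf` with one row per document. The column name, the row layout, and the helper names `output_path_for` and `deidentify_pdfs` are assumptions made for illustration; inspect `result.columns` before relying on them.

```python
import os


def output_path_for(input_path: str, out_dir: str) -> str:
    """Map an input PDF path to the path of its de-identified copy."""
    stem, _ = os.path.splitext(os.path.basename(input_path))
    return os.path.join(out_dir, f"{stem}_deid.pdf")


def deidentify_pdfs(spark, input_glob: str, out_dir: str, pdf_col: str = "pdf") -> list:
    """Run the pretrained pipeline over PDFs and save the redacted copies."""
    # Imported lazily so output_path_for works without the licensed libraries.
    from sparknlp.pretrained import PretrainedPipeline

    deid_pipeline = PretrainedPipeline(
        "pdf_deid_multi_model_context_signature_aware_pipeline", "en", "clinical/ocr"
    )

    # Spark's binaryFile source yields `path` and `content` columns.
    pdf_df = spark.read.format("binaryFile").load(input_glob)
    result = deid_pipeline.transform(pdf_df)

    os.makedirs(out_dir, exist_ok=True)
    written = []
    # ASSUMPTION: `pdf_col` holds the redacted PDF bytes, one row per document,
    # and the `path` column is carried through to the output.
    for row in result.select("path", pdf_col).collect():
        target = output_path_for(row["path"], out_dir)
        with open(target, "wb") as fh:
            fh.write(row[pdf_col])
        written.append(target)
    return written
```

A call such as `deidentify_pdfs(spark, "pdfs/*.pdf", "deid_out")` would then return the list of files written. Collecting to the driver is only reasonable for small batches; for larger volumes, writing from the executors is likely the better choice.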

Model Information

Model Name: pdf_deid_multi_model_context_signature_aware_pipeline
Type: pipeline
Compatibility: Healthcare NLP 6.0.0+
License: Licensed
Edition: Official
Language: en
Size: 4.7 GB

Included Models

  • PdfToImage
  • ImageToText
  • DocumentAssembler
  • SentenceDetectorDLModel
  • Regex
  • WordEmbeddingsModel
  • MedicalNerModel
  • NerConverter
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • EntityExtractor
  • ContextualParserModel
  • RegexMatcher
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • RegexMatcher
  • ChunkMergeModel
  • ChunkMergeModel
  • XLMRobertaEmbeddings
  • MedicalNerModel
  • NerConverter
  • PretrainedZeroShotNER
  • NerConverter
  • PretrainedZeroShotNER
  • NerConverter
  • ChunkMergeModel
  • PositionFinder
  • ImageDrawRegions
  • HW_Signature_Detector
  • ImageDrawRegions
  • ImageToPdf
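
The list above repeats several annotators, so a per-class count can make it easier to check a loaded pipeline against it. Here is a small sketch, assuming `PretrainedPipeline` exposes the underlying Spark ML `PipelineModel` as `.model`. The helper names `stage_names_of` and `summarize_stages` are hypothetical, and the Python class names reported at runtime may not match the labels in the list exactly.

```python
from collections import Counter


def stage_names_of(pretrained_pipeline) -> list:
    """Return the class name of every stage, in pipeline order."""
    # ASSUMPTION: `.model` is the wrapped PipelineModel with a `.stages` list.
    return [type(stage).__name__ for stage in pretrained_pipeline.model.stages]


def summarize_stages(stage_names) -> Counter:
    """Count how often each annotator class appears."""
    return Counter(stage_names)
```

With the pipeline loaded as `deid_pipeline`, `summarize_stages(stage_names_of(deid_pipeline)).most_common()` lists the annotators from most to least frequent.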