Packages

package fhir

Type Members

  1. abstract class BaseFhirDeIdentification extends Transformer with HasFeatures with LightDeIdentificationParams with DeidModelParams with CheckLicense with HasInputCol with HasOutputAnnotationCol with ParamsAndFeaturesWritable
  2. class CdaDeIdentification extends BaseFhirDeIdentification

    A Spark Transformer for de-identifying HL7 CDA (Clinical Document Architecture) documents using javax.xml DOM parsing.

    Overview

    Performs field-level and free-text de-identification on CDA R2 XML documents using XPath expressions. Supports both structured field obfuscation and NLP-based free-text de-identification.

    Key Features

    - **XPath-based de-identification**: Target specific CDA elements using dot notation
    - **Attribute support**: Access XML attributes using @ notation (e.g., "telecom/@value")
    - **Namespace-aware**: Handles HL7 v3 namespace automatically
    - **Structured field obfuscation**: Replace names, dates, addresses, phone numbers, IDs
    - **Free-text de-identification**: Process narrative sections using NLP pipelines

    Path Notation

    Paths accept either dot (.) or slash (/) as the separator. Attributes can be written with or without the @ prefix. Free-text container paths (e.g. section.text) apply de-identification recursively to all descendant TEXT_NODEs.

    // Element paths (both notations work)
    "recordTarget.patientRole.patient.name.given"
    "recordTarget/patientRole/patient/name/given"
    
    // Attribute paths (all formats work)
    "recordTarget.patientRole.telecom/@value"     // explicit @
    "recordTarget.patientRole.telecom.value"      // auto-detected
    "recordTarget/patientRole/telecom/@value"     // with slashes
    "recordTarget.patientRole.id.extension"       // known attribute
    
    // Nested structures
    "component.structuredBody.component.section.text"
    "component/structuredBody/component/section/text"
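
To illustrate why the spellings above are interchangeable, here is a minimal sketch of how such paths could be normalized. It is not the library's implementation: `CdaPathSketch`, `ParsedPath`, and the `knownAttributes` set are hypothetical names and values chosen for this example, and the real auto-detection rules may differ.

```scala
// Hypothetical helper for illustration only; not part of the documented API.
object CdaPathSketch {
  // Assumed list of attribute names that are recognized without an '@' prefix.
  val knownAttributes: Set[String] = Set("value", "extension", "root")

  // Element steps of the path plus an optional trailing attribute name.
  final case class ParsedPath(elements: List[String], attribute: Option[String])

  // Splits on '.' or '/' and decides whether the last segment names an attribute.
  def parse(path: String): ParsedPath = {
    val segments = path.split("[./]").toList.filter(_.nonEmpty)
    segments.lastOption match {
      case Some(last) if last.startsWith("@") =>
        ParsedPath(segments.init, Some(last.drop(1))) // explicit @ prefix
      case Some(last) if knownAttributes.contains(last) =>
        ParsedPath(segments.init, Some(last))         // auto-detected attribute
      case _ =>
        ParsedPath(segments, None)                    // plain element path
    }
  }
}
```

Under these assumptions, `"recordTarget.patientRole.telecom/@value"`, `"recordTarget.patientRole.telecom.value"`, and `"recordTarget/patientRole/telecom/@value"` all parse to the same `ParsedPath`.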

    Usage Example

    // `spark` (an active SparkSession) and `deidPipeline` (a de-identification
    // pipeline) are assumed to be defined elsewhere.
    val deid = new CdaDeIdentification()
      .setInputCol("cda_xml")
      .setOutputCol("deidentified_cda")
      .setMode("obfuscate")
      // Structured fields: CDA path -> obfuscation type
      .setMappingRules(Map(
        "recordTarget.patientRole.patient.name.given" -> "first_name",
        "recordTarget.patientRole.patient.name.family" -> "last_name",
        "recordTarget.patientRole.addr.streetAddressLine" -> "Address",
        "recordTarget.patientRole.telecom/@value" -> "Phone",
        "author.assignedAuthor.assignedPerson.name.given" -> "first_name",
        "custodian.assignedCustodian.representedCustodianOrganization.name" -> "Organization"
      ))
      // Narrative sections processed by the NLP pipeline
      .setFreeTextPaths(Array(
        "component.structuredBody.component.section.text"
      ))
      .setPipeline(spark, deidPipeline, "deidentified")
    
    // De-identify a single CDA document held in `cdaXmlString`
    val result = deid.deidentify(cdaXmlString)
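
Free-text paths are applied recursively to every descendant TEXT_NODE. The sketch below shows one way such a traversal could look using the JDK's DOM API. It is an illustration under assumptions, not the transformer's code: `FreeTextWalkSketch` and its methods are hypothetical, and the `excludedTags` parameter only mimics the behaviour documented for `excludeFreeTextTags` (skip by local name, case insensitive).

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Element, Node}

// Hypothetical helper for illustration only; not part of the documented API.
object FreeTextWalkSketch {
  // Parses an XML string with a namespace-aware parser so getLocalName is populated.
  def parse(xml: String): Element = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    factory.newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
      .getDocumentElement
  }

  // Collects non-blank text nodes under `root`, skipping excluded elements entirely.
  def textNodes(root: Node, excludedTags: Set[String]): List[Node] = {
    val skip = excludedTags.map(_.toLowerCase)
    def walk(node: Node): List[Node] = {
      if (node.getNodeType == Node.TEXT_NODE) {
        if (node.getNodeValue.trim.nonEmpty) List(node) else Nil
      } else if (node.getNodeType == Node.ELEMENT_NODE &&
                 skip.contains(node.getLocalName.toLowerCase)) {
        Nil // excluded element: do not descend
      } else {
        val kids = node.getChildNodes
        (0 until kids.getLength).toList.flatMap(i => walk(kids.item(i)))
      }
    }
    walk(root)
  }
}
```

Each returned node could then be rewritten in place with `setNodeValue`, which keeps the surrounding markup untouched.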

    Structured Narrative Handling (Tables, Definition Lists)

    When tableHandling is true (the default), structured elements inside free-text blocks are processed with header-aware context so the NLP pipeline receives a longer, more informative input instead of a single short cell. The following structures are recognized:

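    - `<table>`: column headers (`<th>` inside the header row) are preserved verbatim and each data cell (`<td>`) is sent to the pipeline as `"header: value"`. The obfuscated cell value is then extracted from the pipeline output and written back into the document. When a data row begins with a row-label `<th>` (e.g. "Blood Pressure"), that label is used as the per-row context instead of the column header. For tables without a strict `<th>`-only header row, the first row of a multi-row, multi-column table is automatically promoted to column headers.
    - Inline cell labels: when a `<td>` (or `<th>`) starts with a styling element wrapping a label such as `Title:`, that element is treated as the cell's own label (overriding any external header) and is preserved verbatim; only the text after it is obfuscated.
    - `<br/>` line splitting: a cell's value text is split on top-level `<br/>` elements and each line is processed independently, so multi-paragraph narrative cells keep their line layout and never produce concatenated tokens like `Hgb12.5`.
    - `<dl>` / `<dt>` / `<dd>`: each `<dd>` is processed with its preceding `<dt>` as context.

    Header and label elements are never obfuscated under this mode, which prevents column/row labels from being mutated and improves accuracy on cell values. Set `tableHandling` to `false` to fall back to the legacy recursive obfuscation that walks every descendant text node uniformly. Use `excludeFreeTextTags` to fully skip any additional element by local name (case insensitive) during free-text recursion, e.g. `Array("sup", "footnote")`.
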
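
The header-aware context described in this section can be pictured with a small sketch. It assumes an HTML-style table whose first `<tr>` holds `<th>` column headers and whose remaining rows hold `<td>` cells, and it pairs each cell with its column header in a `"header: value"` string. `TableContextSketch` and `cellInputs` are hypothetical names; row labels, inline labels, header promotion, and the write-back step are left out.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Element, Node}

// Hypothetical helper for illustration only; not part of the documented API.
object TableContextSketch {
  // Direct child elements of `parent` with the given tag name (case insensitive).
  private def childElements(parent: Node, name: String): List[Element] = {
    val kids = parent.getChildNodes
    (0 until kids.getLength).toList.map(i => kids.item(i)).collect {
      case e: Element if e.getTagName.equalsIgnoreCase(name) => e
    }
  }

  // Returns one "header: value" string per data cell, in document order.
  def cellInputs(tableXml: String): List[String] = {
    val table = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(tableXml.getBytes(StandardCharsets.UTF_8)))
      .getDocumentElement
    val rowNodes = table.getElementsByTagName("tr")
    val rows = (0 until rowNodes.getLength).toList.map(i => rowNodes.item(i))
    rows match {
      case headerRow :: dataRows =>
        // Assumption: the first row is the header row.
        val headers = childElements(headerRow, "th").map(_.getTextContent.trim)
        for {
          row <- dataRows
          (cell, header) <- childElements(row, "td").zip(headers)
        } yield s"$header: ${cell.getTextContent.trim}"
      case Nil => Nil
    }
  }
}
```

Giving the pipeline the cell together with its header provides context that the bare cell text lacks, which is the motivation stated above for the default `tableHandling` mode.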
      See also

      HL7 CDA Standard

      BaseFhirDeIdentification for base de-identification functionality

  3. class FhirDeIdentification extends BaseFhirDeIdentification

      A Spark Transformer for de-identifying FHIR resources according to configurable privacy rules.

      Overview

      Performs field-level obfuscation on FHIR documents using FHIRPath expressions. Supports the R4, R5, and DSTU3 FHIR versions with type-aware de-identification strategies, and accepts resources through different parser types (JSON, XML).

      Example:
      1. Basic Pipeline Usage

        import org.apache.spark.ml.Pipeline

        val deid = new FhirDeIdentification()
          .setInputCol("raw_fhir")
          .setOutputCol("deidentified")
          .setMode("obfuscate")
          .setMappingRules(Map("Patient.birthDate" -> "Date"))
        
        // Use the transformer as a stage in a Spark ML pipeline
        val pipeline = new Pipeline().setStages(Array(deid))

      See also

      FHIR Specification

  4. trait PretrainedReadableCdaDeIdentification extends ParamsAndFeaturesReadable[CdaDeIdentification] with HasPretrained[CdaDeIdentification]
  5. trait PretrainedReadableFhirDeIdentification extends ParamsAndFeaturesReadable[FhirDeIdentification] with HasPretrained[FhirDeIdentification]

Value Members

      1. object CdaDeIdentification extends PretrainedReadableCdaDeIdentification with Serializable

        This is the companion object of CdaDeIdentification. Please refer to that class for the documentation.

      2. object FhirDeIdentification extends PretrainedReadableFhirDeIdentification with Serializable

        This is the companion object of FhirDeIdentification. Please refer to that class for the documentation.

      3. object FhirParserTypes
        Attributes
        protected
      4. object FhirUtil
        Attributes
        protected
      5. object FhirVersions
        Attributes
        protected
