package fhir
Type Members
- abstract class BaseFhirDeIdentification extends Transformer with HasFeatures with LightDeIdentificationParams with DeidModelParams with CheckLicense with HasInputCol with HasOutputAnnotationCol with ParamsAndFeaturesWritable
- class CdaDeIdentification extends BaseFhirDeIdentification
A Spark Transformer for de-identifying HL7 CDA (Clinical Document Architecture) documents using javax.xml DOM parsing.
Overview
Performs field-level and free-text de-identification on CDA R2 XML documents using XPath expressions. Supports both structured field obfuscation and NLP-based free-text de-identification.
Key Features
- **XPath-based de-identification**: Target specific CDA elements using dot notation
- **Attribute support**: Access XML attributes using @ notation (e.g., "telecom/@value")
- **Namespace-aware**: Handles the HL7 v3 namespace automatically
- **Structured field obfuscation**: Replace names, dates, addresses, phone numbers, IDs
- **Free-text de-identification**: Process narrative sections using NLP pipelines
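To make the namespace and attribute features concrete, the following is a minimal sketch of how a dot-notation path could be resolved against a namespace-aware javax.xml DOM. It is not the library's implementation: `CdaXPathSketch`, `toXPath`, and `replaceAll` are hypothetical names, and the sketch assumes that paths are relative to the document's root element and that every element lives in the HL7 v3 namespace `urn:hl7-org:v3`.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.namespace.NamespaceContext
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.{Document, NodeList}
import org.xml.sax.InputSource

// Hypothetical helper for illustration only; not part of the library.
object CdaXPathSketch {
  val Hl7Ns = "urn:hl7-org:v3"

  // Parse with namespace awareness so element names resolve against Hl7Ns.
  def parseXml(xml: String): Document = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)))
  }

  // Assumption: "a.b/@c" is relative to the root element,
  // so it becomes "/*/hl7:a/hl7:b/@c".
  def toXPath(path: String): String =
    path.split("[./]").filter(_.nonEmpty)
      .map(seg => if (seg.startsWith("@")) seg else s"hl7:$seg")
      .mkString("/*/", "/", "")

  // Overwrite every matching element or attribute; returns the match count.
  def replaceAll(doc: Document, path: String, replacement: String): Int = {
    val xpath = XPathFactory.newInstance().newXPath()
    xpath.setNamespaceContext(new NamespaceContext {
      def getNamespaceURI(prefix: String): String =
        if (prefix == "hl7") Hl7Ns else XMLConstants.NULL_NS_URI
      def getPrefix(uri: String): String = null
      def getPrefixes(uri: String): java.util.Iterator[String] =
        java.util.Collections.emptyIterator[String]()
    })
    val nodes = xpath
      .evaluate(toXPath(path), doc, XPathConstants.NODESET)
      .asInstanceOf[NodeList]
    (0 until nodes.getLength).foreach(i => nodes.item(i).setTextContent(replacement))
    nodes.getLength
  }
}
```

Calling `CdaXPathSketch.replaceAll(doc, "recordTarget.patientRole.telecom/@value", "tel:000")` on a parsed CDA document would overwrite the `value` attribute of each matching `telecom` element. The real transformer chooses replacements according to the mapped entity type, which this sketch does not attempt.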
Path Notation
Paths support both dot (.) and slash (/) as separators. Attributes can be specified with or without @ prefix. Free-text container paths (e.g. section.text) apply de-identification recursively to all descendant TEXT_NODEs.
// Element paths (both notations work)
"recordTarget.patientRole.patient.name.given"
"recordTarget/patientRole/patient/name/given"

// Attribute paths (all formats work)
"recordTarget.patientRole.telecom/@value"   // explicit @
"recordTarget.patientRole.telecom.value"    // auto-detected
"recordTarget/patientRole/telecom/@value"   // with slashes
"recordTarget.patientRole.id.extension"     // known attribute

// Nested structures
"component.structuredBody.component.section.text"
"component/structuredBody/component/section/text"
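The equivalence of these notations can be sketched as a small normalization step. The helper below is hypothetical and only illustrates the idea: `CdaPathSketch`, `ParsedPath`, and the contents of `knownAttributes` are assumptions, and the library's actual auto-detection may work differently (for example, by inspecting the document).

```scala
// Hypothetical helper, not part of the library: reduces the documented
// path notations to one canonical (elements, attribute) form.
object CdaPathSketch {
  // Assumption: a small sample of attribute names treated as "known".
  val knownAttributes: Set[String] = Set("value", "extension", "root")

  final case class ParsedPath(elements: List[String], attribute: Option[String])

  def parse(path: String): ParsedPath = {
    // Accept both '.' and '/' as separators and drop empty segments.
    val segments = path.split("[./]").toList.filter(_.nonEmpty)
    segments.lastOption match {
      // Explicit attribute: "telecom/@value"
      case Some(last) if last.startsWith("@") =>
        ParsedPath(segments.init, Some(last.drop(1)))
      // Attribute without '@': detected from the known-name list
      case Some(last) if knownAttributes.contains(last) =>
        ParsedPath(segments.init, Some(last))
      case _ =>
        ParsedPath(segments, None)
    }
  }
}
```

Under these assumptions, `"recordTarget.patientRole.telecom/@value"`, `"recordTarget.patientRole.telecom.value"`, and `"recordTarget/patientRole/telecom/@value"` all parse to the same `ParsedPath`. A fixed name list is the simplest choice but would misclassify an element that happens to be named `value`, so a production implementation would likely need to consult the document.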
Usage Example
val deid = new CdaDeIdentification()
  .setInputCol("cda_xml")
  .setOutputCol("deidentified_cda")
  .setMode("obfuscate")
  .setMappingRules(Map(
    "recordTarget.patientRole.patient.name.given" -> "first_name",
    "recordTarget.patientRole.patient.name.family" -> "last_name",
    "recordTarget.patientRole.addr.streetAddressLine" -> "Address",
    "recordTarget.patientRole.telecom/@value" -> "Phone",
    "author.assignedAuthor.assignedPerson.name.given" -> "first_name",
    "custodian.assignedCustodian.representedCustodianOrganization.name" -> "Organization"
  ))
  .setFreeTextPaths(Array(
    "component.structuredBody.component.section.text"
  ))
  .setPipeline(spark, deidPipeline, "deidentified")

val result = deid.deidentify(cdaXmlString)
Structured Narrative Handling (Tables, Definition Lists)
When `tableHandling` is `true` (the default), structured elements inside free-text blocks are processed with header-aware context, so the NLP pipeline receives a longer, more informative input instead of a single short cell. The following structures are recognized:
- `<table>`: column headers (`<th>` inside the header row) are preserved verbatim and each data cell (`<td>`) is sent to the pipeline as `"<header>: <value>"`. The obfuscated cell value is then extracted from the pipeline output and written back into the document. When a data row begins with a row-label `<th>` (e.g. "Blood Pressure"), that label is used as the per-row context instead of the column header. For tables without a strict `<th>`-only header row, the first row of a multi-row, multi-column table is automatically promoted to column headers.
- Inline cell labels: when a `<td>` (or `<dd>`) starts with a styling element (for example, one wrapping the label `Title:`), that element is treated as the cell's own label (overriding any external header) and is preserved verbatim; only the text after it is obfuscated.
- `<br/>` line splitting: a cell's value text is split on top-level `<br/>` elements and each line is processed independently, so multi-paragraph narrative cells keep their line layout and never produce concatenated tokens like `Hgb12.5`.
- `<dl>` / `<dt>` / `<dd>`: each `<dd>` is processed with its preceding `<dt>` as context.
`<th>` and `<dt>` (and the rest of `<thead>`) are never obfuscated under this mode, which prevents column and row labels from being mutated and improves accuracy on cell values. Set `tableHandling` to `false` to fall back to the legacy recursive obfuscation, which walks every descendant text node uniformly. Use `excludeFreeTextTags` to fully skip any additional element by local name (case-insensitive) during free-text recursion, e.g. `Array("sup", "footnote")`.
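The header-aware context described above can be sketched roughly as follows. This is an illustrative assumption rather than the library's code: `TableContextSketch` and `cellInputs` are hypothetical names, the sketch assumes narrative table markup built from `thead`, `tbody`, `tr`, `th`, and `td` elements, and the exact "label: value" string format is a guess. It only builds the strings that would be handed to an NLP pipeline and does not run one.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import org.xml.sax.InputSource

// Hypothetical helper for illustration only; not part of the library.
object TableContextSketch {
  // Direct child elements of `parent` with the given tag name.
  private def children(parent: Element, name: String): List[Element] = {
    val nodes = parent.getChildNodes
    (0 until nodes.getLength).map(i => nodes.item(i)).collect {
      case e: Element if e.getTagName == name => e
    }.toList
  }

  // One "label: value" string per data cell. A row-label <th> takes
  // precedence over the column header, mirroring the documented behavior.
  def cellInputs(tableXml: String): List[String] = {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new InputSource(new StringReader(tableXml)))
    val table = doc.getDocumentElement
    val headers = children(table, "thead")
      .flatMap(children(_, "tr"))
      .flatMap(children(_, "th"))
      .map(_.getTextContent.trim)
    for {
      tbody <- children(table, "tbody")
      row <- children(tbody, "tr")
      rowLabel = children(row, "th").headOption.map(_.getTextContent.trim)
      (cell, idx) <- children(row, "td").zipWithIndex
    } yield {
      val value = cell.getTextContent.trim
      rowLabel.orElse(headers.lift(idx)) match {
        case Some(label) if label.nonEmpty => s"$label: $value"
        case _ => value // no usable context: send the bare value
      }
    }
  }
}
```

For a table whose header row is `Name | Date`, with one plain data row and one row labeled "Blood Pressure", this yields `"Name: John Smith"`, `"Date: 2020-01-05"`, and `"Blood Pressure: 120/80"`. Only the part after the label would then be replaced in the document, leaving the header and label cells untouched.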
- See also
BaseFhirDeIdentification for base de-identification functionality
- class FhirDeIdentification extends BaseFhirDeIdentification
A Spark Transformer for de-identifying FHIR resources according to configurable privacy rules.
Overview
Performs field-level obfuscation on FHIR resources using FHIRPath-style expressions (e.g. `Patient.birthDate`). Supports FHIR versions R4, R5, and DSTU3 with type-aware de-identification strategies, and can parse resources serialized as either JSON or XML.
Basic Pipeline Usage
val deid = new FhirDeIdentification()
  .setInputCol("raw_fhir")
  .setOutputCol("deidentified")
  .setMode("obfuscate")
  .setMappingRules(Map("Patient.birthDate" -> "Date"))

val pipeline = new Pipeline().setStages(Array(deid))
- trait PretrainedReadableCdaDeIdentification extends ParamsAndFeaturesReadable[CdaDeIdentification] with HasPretrained[CdaDeIdentification]
- trait PretrainedReadableFhirDeIdentification extends ParamsAndFeaturesReadable[FhirDeIdentification] with HasPretrained[FhirDeIdentification]
Value Members
- object CdaDeIdentification extends PretrainedReadableCdaDeIdentification with Serializable
This is the companion object of CdaDeIdentification. Please refer to that class for the documentation.
- object FhirDeIdentification extends PretrainedReadableFhirDeIdentification with Serializable
This is the companion object of FhirDeIdentification. Please refer to that class for the documentation.
- object FhirParserTypes
- Attributes
- protected
- object FhirUtil
- Attributes
- protected
- object FhirVersions
- Attributes
- protected