class DateShiftFiller extends Serializable
Utility class to fill missing or empty shift values in a DataFrame column using deterministic, ID-based pseudo-random values.
This is particularly useful in de-identification tasks where date shift values must be:
- Consistent across all rows with the same identifier
- Present even when the original shift value is missing or empty
Logic
For each row:
- If another row with the same ID has a known (non-empty) shift value, reuse it.
- If not, generate a fallback shift value deterministically using the ID and a seed. The generated value will always fall in the range [1, maxShiftDays].
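The fallback step can be sketched in plain Scala. This is a minimal illustration, not the library's implementation: the helper name `fallbackShift`, the use of `MurmurHash3.stringHash`, and the modulo mapping are all assumptions, since the page does not state which hash function is used. The sketch only shows how an ID and a seed can yield a repeatable value in the range [1, maxShiftDays].

```scala
import scala.util.hashing.MurmurHash3

object FallbackShiftSketch {
  // Hypothetical helper, not part of DateShiftFiller's public API.
  // Assumption: the ID is hashed together with the seed; the real class
  // may use a different hash function or mapping.
  def fallbackShift(id: String, seed: Int, maxShiftDays: Int): Int = {
    require(maxShiftDays >= 1, "maxShiftDays must be at least 1")
    // Same id and seed always give the same hash, so the result is repeatable.
    val hash = MurmurHash3.stringHash(id, seed)
    // floorMod keeps the remainder in [0, maxShiftDays - 1] even when the
    // hash is negative; adding 1 moves it into [1, maxShiftDays].
    Math.floorMod(hash, maxShiftDays) + 1
  }
}
```

Calling `FallbackShiftSketch.fallbackShift("note-17", seed = 42, maxShiftDays = 60)` repeatedly returns the same number each time, which is the property that keeps fallback shifts consistent across rows sharing an ID.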
Example usage
val filler = new DateShiftFiller(spark, seed = 42, maxShiftDays = 60)
val resultDf = filler.fillMissingShifts(df, "note_id", "date_shift", "_filled")
Instance Constructors
-
new
DateShiftFiller(spark: SparkSession, seed: Int = 42, maxShiftDays: Int = 60)
- spark
The active SparkSession
- seed
Seed used for deterministic hashing, ensuring repeatable fallback values
- maxShiftDays
Maximum number of days used in fallback shift generation (inclusive upper bound)
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
fillMissingShifts(df: DataFrame, idCol: String, shiftCol: String, suffix: String = "_filled", resolvedMode: String = "first"): DataFrame
Fills missing or empty values in a date-shift column using the following logic:
- If other rows with the same ID have a valid value, reuse it.
- If not, generate a deterministic pseudo-random value based on ID and seed.
The result is written to a new column using the given suffix, keeping the original column untouched.
- df
Input DataFrame
- idCol
ID column name (grouping key)
- shiftCol
Column with optional shift values
- suffix
Suffix appended to shiftCol to name the new output column. Default is "_filled".
- resolvedMode
How to resolve conflicts when multiple rows have the same ID. Options: "first" (default), "all". The "all" option will duplicate rows with the same ID.
- returns
DataFrame with a new shift column: shiftCol + suffix
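The reuse-then-fallback logic can be modelled on ordinary Scala collections, without Spark. This is a rough sketch under stated assumptions: `FillShiftsSketch`, `fillShifts`, and the `fallback` function parameter are hypothetical names, shift values are treated as optional strings, and picking the first known value per ID is only one possible reading of the "first" mode. The real method operates on a DataFrame and may differ in detail.

```scala
object FillShiftsSketch {
  // Hypothetical row shape: (id, optional raw shift value).
  type Row = (String, Option[String])

  // Returns (id, original shift, filled shift). The original value is kept
  // untouched, mirroring the separate output column described above.
  // `fallback` stands in for the deterministic ID-and-seed generator.
  def fillShifts(rows: Seq[Row], fallback: String => Int): Seq[(String, Option[String], String)] = {
    // Collect known (non-empty) shifts and keep the first one seen per ID.
    // Assumption: this is how "first" resolves several known values.
    val knownById: Map[String, String] =
      rows
        .collect { case (id, Some(s)) if s.trim.nonEmpty => id -> s.trim }
        .groupBy(_._1)
        .map { case (id, hits) => id -> hits.head._2 }

    rows.map { case (id, shift) =>
      // Reuse a known shift for this ID if any row has one, else fall back.
      val filled = knownById.getOrElse(id, fallback(id).toString)
      (id, shift, filled)
    }
  }
}
```

In this model, every row of an ID that has at least one known shift receives that shift, and rows of an ID with no known shift all receive the same generated value, provided `fallback` is deterministic.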
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()