class DateShiftFiller extends Serializable

Utility class to fill missing or empty shift values in a DataFrame column using deterministic, ID-based pseudo-random values.

This is particularly useful in de-identification tasks where date shift values must be:

  • Consistent across all rows with the same identifier
  • Present even when the original shift value is missing or empty

Logic

For each row:

  • If another row with the same ID has a known (non-empty) shift value, reuse it.
  • If not, generate a fallback shift value deterministically using the ID and a seed. The generated value will always fall in the range [1, maxShiftDays].
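
The fallback step can be illustrated with a minimal sketch. The hash function used internally by DateShiftFiller is not documented here, so MurmurHash3 and the helper name fallbackShift are assumptions chosen only to show how an ID and a seed can map deterministically into [1, maxShiftDays].

```scala
import scala.util.hashing.MurmurHash3

object FallbackShiftSketch {
  // Hypothetical helper, not part of DateShiftFiller's API.
  // Assumption: MurmurHash3 stands in for whatever hash the library really uses.
  def fallbackShift(id: String, seed: Int = 42, maxShiftDays: Int = 60): Int = {
    require(maxShiftDays >= 1, "maxShiftDays must be at least 1")
    // Same id and seed always produce the same hash, hence the same shift.
    val hash = MurmurHash3.stringHash(id, seed)
    // floorMod keeps the remainder in [0, maxShiftDays - 1] even for negative
    // hashes; adding 1 moves the result into [1, maxShiftDays].
    Math.floorMod(hash, maxShiftDays) + 1
  }
}
```

Calling FallbackShiftSketch.fallbackShift("note_1") twice returns the same value, and changing the seed generally changes the value assigned to each ID.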

Example usage

val filler = new DateShiftFiller(spark, seed = 42, maxShiftDays = 60)
val resultDf = filler.fillMissingShifts(df, "note_id", "date_shift", "_filled")
Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new DateShiftFiller(spark: SparkSession, seed: Int = 42, maxShiftDays: Int = 60)

    spark

    The active SparkSession

    seed

    Seed used for deterministic hashing, ensuring repeatable fallback values

    maxShiftDays

    Maximum number of days used in fallback shift generation (inclusive upper bound)

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  8. def fillMissingShifts(df: DataFrame, idCol: String, shiftCol: String, suffix: String = "_filled", resolvedMode: String = "first"): DataFrame

    Fills missing or empty values in a date-shift column using the following logic:

      • If other rows with the same ID have a valid value, reuse it.
      • If not, generate a deterministic pseudo-random value based on the ID and seed.

    The result is written to a new column using the given suffix, keeping the original column untouched.

    df

    Input DataFrame

    idCol

    ID column name (grouping key)

    shiftCol

    Column with optional shift values

    suffix

    Suffix appended to shiftCol to name the new output column. Default is "_filled".

    resolvedMode

    How to resolve conflicts when multiple rows have the same ID. Options: "first" (default) or "all". The "all" option will duplicate rows with the same ID.

    returns

    DataFrame with a new shift column: shiftCol + suffix

  9. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  10. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  11. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  13. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  14. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  15. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  17. def toString(): String
    Definition Classes
    AnyRef → Any
  18. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  19. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  20. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
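
The fill logic described for fillMissingShifts can be sketched on an in-memory collection, without Spark. This is an illustration only: the NoteRow type, the fillFirst helper, and the rule that rows which already hold a value keep it are assumptions, and only a "first"-style resolution is shown (the first non-empty value seen for an ID is reused). The fallback is passed in as a function so that any deterministic generator can be plugged in.

```scala
object FillShiftsSketch {
  // Hypothetical in-memory row: an ID plus an optional raw shift value.
  final case class NoteRow(id: String, shift: Option[String])

  // Treats None, "" and whitespace-only strings as missing.
  private def nonEmpty(shift: Option[String]): Option[String] =
    shift.map(_.trim).filter(_.nonEmpty)

  // Returns each input row paired with its filled shift value.
  def fillFirst(rows: Seq[NoteRow], fallback: String => Int): Seq[(NoteRow, String)] = {
    // First known (non-empty) value per ID, in input order.
    val known: Map[String, String] =
      rows.foldLeft(Map.empty[String, String]) { (acc, r) =>
        nonEmpty(r.shift) match {
          case Some(v) if !acc.contains(r.id) => acc + (r.id -> v)
          case _                              => acc
        }
      }
    rows.map { r =>
      val filled = nonEmpty(r.shift)           // assumption: keep the row's own value
        .orElse(known.get(r.id))               // else reuse a value from the same ID
        .getOrElse(fallback(r.id).toString)    // else generate a fallback
      r -> filled
    }
  }
}
```

In this sketch the original shift field is left untouched and the filled value is returned alongside it, mirroring how the method writes to a new column named shiftCol + suffix.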
