sparknlp_jsl.utils.date_shift_filler
Module Contents
Classes
DateShiftFiller | This class is used to fill missing or empty values in a date shift column.
- class DateShiftFiller(spark: pyspark.sql.SparkSession, seed: int = 42, max_shift_days: int = 60)
This class is used to fill missing or empty values in a date shift column using a deterministic, ID-based pseudo-random fallback approach.
Useful in de-identification pipelines where:
- Shift values must be consistent for the same ID.
- Some rows may be missing shift data.
Logic:
- If another row with the same ID has a non-empty shift, reuse it.
- Otherwise, compute a fallback shift using a deterministic hash function based on the ID and seed.
- Fallback values are always in the range [1, max_shift_days].
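The fill logic described above can be sketched in plain Python, without Spark. This is an illustration under stated assumptions, not the library's implementation: the helper names `fallback_shift` and `fill_shifts` are hypothetical, and SHA-256 stands in for the hash function, which this reference does not specify.

```python
import hashlib
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[int]]  # (id, shift); None marks a missing shift


def fallback_shift(record_id: str, seed: int = 42, max_shift_days: int = 60) -> int:
    """Map (seed, record_id) to a stable integer in [1, max_shift_days].

    Assumption: SHA-256 is used only for illustration; the hash that
    DateShiftFiller applies internally is not documented here.
    """
    digest = hashlib.sha256(f"{seed}:{record_id}".encode("utf-8")).digest()
    # Reduce the first 8 bytes to 0..max_shift_days-1, then shift to 1..max.
    return int.from_bytes(digest[:8], "big") % max_shift_days + 1


def fill_shifts(rows: List[Row], seed: int = 42, max_shift_days: int = 60) -> List[Row]:
    """Fill missing shifts: reuse a same-ID shift if one exists, else hash."""
    # First pass: remember the first non-empty shift seen for each ID.
    known = {}
    for record_id, shift in rows:
        if shift is not None and record_id not in known:
            known[record_id] = shift

    # Second pass: keep existing shifts, reuse a same-ID shift, or fall back.
    filled = []
    for record_id, shift in rows:
        if shift is None:
            shift = known.get(record_id)
        if shift is None:
            shift = fallback_shift(record_id, seed, max_shift_days)
        filled.append((record_id, shift))
    return filled


rows = [("note_1", 5), ("note_1", None), ("note_2", None), ("note_2", None)]
filled = fill_shifts(rows)
```

In this sketch, both `note_1` rows end up with the shift 5, and both `note_2` rows share one hash-derived value between 1 and 60 that stays the same across runs with the same seed.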
- Parameters:
spark (SparkSession) – The active Spark session.
seed (int, optional) – Seed value used in deterministic fallback hashing (default is 42).
max_shift_days (int, optional) – The maximum number of days to shift when generating fallback values (default is 60).
Example
>>> filler = DateShiftFiller(spark, seed=42, max_shift_days=60)
>>> result_df = filler.fill_missing_shifts(df, id_col="note_id", shift_col="date_shift", suffix="_filled")
- instance
- max_shift_days = 60
- seed = 42
- spark
- fill_missing_shifts(df: pyspark.sql.DataFrame, id_col: str, shift_col: str, suffix: str = '_filled', resolved_mode: str = 'first') → pyspark.sql.DataFrame
Applies shift-filling logic to the given DataFrame.
- Parameters:
df (pyspark.sql.DataFrame) – The input DataFrame containing the shift column and ID.
id_col (str) – The name of the column containing the grouping ID.
shift_col (str) – The name of the date shift column to process.
suffix (str) – The suffix to append to the output column (e.g., ‘_filled’).
resolved_mode (str) – How to resolve conflicts when multiple rows have the same ID (default: “first”). Options: “first”, “all”. The “all” option duplicates rows with the same ID.
- Returns:
A new DataFrame with a filled shift column named shift_col + suffix.
- Return type: