`sparknlp_jsl.utils.date_shift_filler`

Module Contents

Classes

DateShiftFiller

Fills missing or empty values in a date shift column.

class DateShiftFiller(spark: pyspark.sql.SparkSession, seed: int = 42, max_shift_days: int = 60)

Fills missing or empty values in a date shift column using a deterministic, ID-based pseudo-random fallback.

Useful in de-identification pipelines where:

- Shift values must be consistent for the same ID.
- Some rows may be missing shift data.

Logic:

- If another row with the same ID has a non-empty shift, reuse it.
- Otherwise, compute a fallback shift using a deterministic hash function based on the ID and seed.
- Fallback values are always in the range [1, max_shift_days].
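The logic above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper names `fallback_shift` and `fill_shifts`, the use of SHA-256, and the way the seed is combined with the ID are all assumptions made for illustration. It only shows how a hash of the ID and seed can yield a stable value in [1, max_shift_days], and how a known shift for the same ID takes precedence.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def fallback_shift(record_id: str, seed: int = 42, max_shift_days: int = 60) -> int:
    """Hypothetical helper: map (seed, ID) to a stable value in [1, max_shift_days]."""
    # SHA-256 is an assumption here; the hash used by the library is not documented.
    digest = hashlib.sha256(f"{seed}:{record_id}".encode("utf-8")).digest()
    # Take 8 bytes as an unsigned integer, then fold it into the target range.
    return int.from_bytes(digest[:8], "big") % max_shift_days + 1


def fill_shifts(
    rows: List[Tuple[str, Optional[int]]],
    seed: int = 42,
    max_shift_days: int = 60,
) -> List[Tuple[str, int]]:
    """Hypothetical helper: fill missing shifts (None) in a list of (id, shift) pairs."""
    # First pass: remember the first non-empty shift seen for each ID.
    known: Dict[str, int] = {}
    for record_id, shift in rows:
        if shift is not None and record_id not in known:
            known[record_id] = shift

    # Second pass: keep existing values, reuse a known shift, or fall back to the hash.
    filled: List[Tuple[str, int]] = []
    for record_id, shift in rows:
        if shift is not None:
            filled.append((record_id, shift))
        elif record_id in known:
            filled.append((record_id, known[record_id]))
        else:
            filled.append((record_id, fallback_shift(record_id, seed, max_shift_days)))
    return filled
```

Because the fallback depends only on the ID and the seed, every row with the same ID receives the same value across runs, which is the property a de-identification pipeline needs.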

Parameters:

spark (SparkSession) – The active Spark session.
seed (int, optional) – Seed value used in the deterministic fallback hash. Default is 42.
max_shift_days (int, optional) – The maximum number of days to shift when generating fallback values. Default is 60.

Example

>>> filler = DateShiftFiller(spark, seed=42, max_shift_days=60)
>>> result_df = filler.fill_missing_shifts(df, id_col="note_id", shift_col="date_shift", suffix="_filled")

instance

max_shift_days = 60

seed = 42

spark

fill_missing_shifts(df: pyspark.sql.DataFrame, id_col: str, shift_col: str, suffix: str = '_filled', resolved_mode: str = 'first') → pyspark.sql.DataFrame

Applies shift-filling logic to the given DataFrame.

Parameters:

df (pyspark.sql.DataFrame) – The input DataFrame containing the shift column and ID.
id_col (str) – The name of the column containing the grouping ID.
shift_col (str) – The name of the date shift column to process.
suffix (str) – The suffix to append to the name of the output column (default: ‘_filled’).
resolved_mode (str) – How to resolve conflicts when multiple rows have the same ID (default: “first”). Options: “first”, “all”. The “all” option duplicates rows with the same ID.
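The exact semantics of resolved_mode are not spelled out here. One plausible reading, sketched below in plain Python, is that when an ID carries several distinct non-empty shifts, “first” fills a missing row with the first candidate, while “all” emits one output row per candidate. The `resolve` helper and this interpretation are assumptions for illustration, not the library's implementation; fallback hashing is deliberately left out.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, Optional[int]]


def resolve(rows: List[Row], resolved_mode: str = "first") -> List[Row]:
    """Hypothetical helper: fill missing shifts from other rows with the same ID."""
    if resolved_mode not in ("first", "all"):
        raise ValueError(f"unsupported resolved_mode: {resolved_mode!r}")

    # Collect the distinct non-empty shifts per ID, in order of first appearance.
    candidates: Dict[str, List[int]] = {}
    for record_id, shift in rows:
        if shift is not None:
            seen = candidates.setdefault(record_id, [])
            if shift not in seen:
                seen.append(shift)

    resolved: List[Row] = []
    for record_id, shift in rows:
        if shift is not None:
            # Rows that already have a shift are passed through unchanged.
            resolved.append((record_id, shift))
            continue
        options = candidates.get(record_id, [])
        if not options:
            # No shift known for this ID; left as None in this sketch.
            resolved.append((record_id, None))
        elif resolved_mode == "first":
            resolved.append((record_id, options[0]))
        else:
            # "all": one output row per candidate shift, so the row is duplicated.
            resolved.extend((record_id, option) for option in options)
    return resolved
```

Under this reading, “all” can return more rows than the input, so downstream row counts should not be assumed to match.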

Returns:

A new DataFrame with a filled shift column named shift_col + suffix.

Return type:

pyspark.sql.DataFrame
