sparknlp_jsl.utils.database_deidentification#

Module Contents#

Classes#

RelationalDBDeidentification

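Class to handle de-identification for relational databases, including date shifting, age obfuscation, consistent obfuscation for primary and foreign keys, and masking other sensitive columns.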

Functions#

create_obfuscate_age_udf(use_hipaa, age_groups_broadcast)

Creates a Spark UDF for age obfuscation.

obfuscate_age(age, use_hipaa, age_groups)

Obfuscates the age based on HIPAA rules or predefined age groups.

class RelationalDBDeidentification(spark: pyspark.sql.SparkSession, config: dict)#

Class to handle de-identification for relational databases, including date shifting, age obfuscation, consistent obfuscation for primary and foreign keys, and masking other sensitive columns.

age_groups#
config#
days_to_shift#
phi_keywords#
pk_fk_shift_value#
spark#
use_hipaa#
connect_to_db()#

Connects to the MySQL database and returns the connection object.

Returns:

pymysql connection object.

Return type:

Connection

deidentify()#

Performs the complete de-identification process:
  • Fetches all tables.

  • Iterates through each table.

  • Retrieves schema information.

  • Detects sensitive columns.

  • Applies obfuscation and masking.

  • Saves de-identified data as CSV files.

detect_sensitive_columns(df: pyspark.sql.DataFrame) → dict#

Detects sensitive columns in the DataFrame based on predefined keywords.

Parameters:

df (DataFrame) – Spark DataFrame to analyze.

Returns:

Dictionary categorizing sensitive columns into date, age, and other sensitive columns.

Return type:

dict
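As an illustration of keyword-based detection, here is a minimal plain-Python sketch that works on column names only. The keyword lists, the dictionary keys (`date`, `age`, `other`), and the token-splitting rule are all assumptions made for this example; the actual keywords come from the class's `phi_keywords` attribute and the real method takes a Spark DataFrame.

```python
import re

# Hypothetical keyword lists, for illustration only.
DATE_KEYWORDS = {"date", "dob", "birth"}
AGE_KEYWORDS = {"age"}
OTHER_KEYWORDS = {"name", "address", "phone", "email", "ssn"}


def detect_sensitive_columns(column_names):
    """Group column names by the kind of sensitive data their name suggests."""
    result = {"date": [], "age": [], "other": []}
    for name in column_names:
        # Split "patient_age" into {"patient", "age"} so that "age" does not
        # accidentally match inside unrelated words such as "dosage".
        tokens = set(re.split(r"[^a-z0-9]+", name.lower()))
        if tokens & DATE_KEYWORDS:
            result["date"].append(name)
        elif tokens & AGE_KEYWORDS:
            result["age"].append(name)
        elif tokens & OTHER_KEYWORDS:
            result["other"].append(name)
    return result


columns = ["patient_id", "admission_date", "patient_age", "first_name", "dosage"]
detected = detect_sensitive_columns(columns)
```

Matching on whole tokens rather than substrings is a design choice of this sketch; whether the library matches substrings or tokens is not stated in this reference.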

get_all_tables()#

Retrieves all table names from the configured database.

Returns:

List of table names.

Return type:

list

get_schema_info(table_name: str) → dict#

Retrieves schema information for a specific table including date columns, primary keys, and foreign keys.

Parameters:

table_name (str) – The name of the table to retrieve schema information for.

Returns:

Dictionary containing lists of date columns, primary keys, and foreign keys for the specified table.

Return type:

dict

mask_other_sensitive_columns(df: pyspark.sql.DataFrame, other_columns: list) → pyspark.sql.DataFrame#

Masks other sensitive columns by replacing their values with asterisks.

Parameters:
  • df (DataFrame) – Spark DataFrame containing the data.

  • other_columns (list) – List of column names to mask.

Returns:

DataFrame with masked sensitive columns.

Return type:

DataFrame
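The masking step can be sketched in plain Python on a list of row dictionaries. This is not the library's code: the helper names are hypothetical, and the sketch assumes a length-preserving mask and that nulls stay null, neither of which this reference confirms.

```python
def mask_value(value, mask_char="*"):
    """Replace every character of a value with mask_char; None stays None."""
    if value is None:
        return None
    return mask_char * len(str(value))


def mask_columns(rows, other_columns):
    """Return new row dicts in which the listed columns are masked."""
    return [
        {key: (mask_value(val) if key in other_columns else val)
         for key, val in row.items()}
        for row in rows
    ]


rows = [{"id": 1, "name": "Alice", "phone": None}]
masked = mask_columns(rows, ["name", "phone"])
```

A length-preserving mask still reveals the length of the original value; a fixed-width mask avoids that at the cost of losing the hint.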

obfuscate_ages(df: pyspark.sql.DataFrame, age_columns: list, use_hipaa: bool) → pyspark.sql.DataFrame#

Obfuscates age columns using either HIPAA rules or predefined age groups.

Parameters:
  • df (DataFrame) – Spark DataFrame containing the data.

  • age_columns (list) – List of age column names to obfuscate.

  • use_hipaa (bool) – Flag to apply HIPAA rules.

Returns:

DataFrame with obfuscated age columns.

Return type:

DataFrame

obfuscate_dates(df: pyspark.sql.DataFrame, date_columns: list) → pyspark.sql.DataFrame#

Shifts date columns by a specified number of days using Spark’s built-in functions.

Parameters:
  • df (DataFrame) – Spark DataFrame containing the data.

  • date_columns (list) – List of date column names to shift.

Returns:

DataFrame with shifted date columns.

Return type:

DataFrame
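To show why a fixed shift keeps intervals between dates intact, here is a small plain-Python sketch using `datetime`. The helper names and the sample rows are hypothetical; in Spark, a built-in such as `pyspark.sql.functions.date_add` performs this kind of shift, though this reference does not say which function the method uses.

```python
from datetime import date, timedelta


def shift_date(value, days_to_shift):
    """Shift one date by a fixed number of days; None stays None."""
    if value is None:
        return None
    return value + timedelta(days=days_to_shift)


def shift_date_columns(rows, date_columns, days_to_shift):
    """Return new row dicts with every listed date column shifted."""
    return [
        {key: (shift_date(val, days_to_shift) if key in date_columns else val)
         for key, val in row.items()}
        for row in rows
    ]


rows = [{"admitted": date(2024, 2, 28), "discharged": date(2024, 3, 4)}]
shifted = shift_date_columns(rows, ["admitted", "discharged"], days_to_shift=2)

# The length of stay is the same before and after the shift.
stay_before = rows[0]["discharged"] - rows[0]["admitted"]
stay_after = shifted[0]["discharged"] - shifted[0]["admitted"]
```

Because every date moves by the same offset, durations and event ordering survive de-identification.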

obfuscate_primary_foreign_keys(df: pyspark.sql.DataFrame, pk_fk_columns: list) → pyspark.sql.DataFrame#

Obfuscates primary and foreign key columns by shifting their numeric values with a fixed value.

Parameters:
  • df (DataFrame) – Spark DataFrame containing the data.

  • pk_fk_columns (list) – List of column names for primary and foreign keys.

Returns:

DataFrame with obfuscated primary and foreign keys.

Return type:

DataFrame
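The point of using one fixed value for all keys is that a foreign key still matches its primary key after obfuscation. The plain-Python sketch below illustrates this with two hypothetical tables; the function name, the sample data, and the offset `1000` are assumptions for the example only.

```python
def shift_keys(rows, pk_fk_columns, shift_value):
    """Return new row dicts with the listed numeric key columns shifted."""
    return [
        {key: (val + shift_value
               if key in pk_fk_columns and val is not None else val)
         for key, val in row.items()}
        for row in rows
    ]


SHIFT = 1000  # hypothetical value; the class keeps its own in pk_fk_shift_value

patients = [{"patient_id": 1, "age": 40}, {"patient_id": 2, "age": 71}]
visits = [{"visit_id": 10, "patient_id": 2}]

shifted_patients = shift_keys(patients, ["patient_id"], SHIFT)
shifted_visits = shift_keys(visits, ["visit_id", "patient_id"], SHIFT)

# The visit still joins to exactly one patient after shifting.
matches = [p for p in shifted_patients
           if p["patient_id"] == shifted_visits[0]["patient_id"]]
```

A constant offset can be undone by anyone who learns the value, so the shift value should be handled as a secret.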

setup_logging()#

Sets up logging to output to both console and a log file.
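Logging to both the console and a file can be done with two handlers on one logger from Python's standard `logging` module. The sketch below is one possible arrangement; the logger name, log file name, format string, and the returned logger are assumptions, not details taken from the class.

```python
import logging


def setup_logging(log_path="deidentification.log", logger_name="deidentification"):
    """Attach a console handler and a file handler to a named logger."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    # Only add handlers once, so repeated calls do not duplicate output.
    if not logger.handlers:
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

Calling `setup_logging().info("Processing table patients")` would then write the same line to the console (stderr by default) and to the log file.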

create_obfuscate_age_udf(use_hipaa, age_groups_broadcast)#

Creates a Spark UDF for age obfuscation.

Parameters:
  • use_hipaa (bool) – Flag to apply HIPAA rules.

  • age_groups_broadcast (Broadcast) – Broadcasted age groups dictionary.

Returns:

Spark UDF for age obfuscation.

Return type:

UserDefinedFunction

obfuscate_age(age, use_hipaa, age_groups)#

Obfuscates the age based on HIPAA rules or predefined age groups.

Parameters:
  • age (int) – Original age.

  • use_hipaa (bool) – Flag to apply HIPAA rules.

  • age_groups (dict) – Dictionary defining age groups.

Returns:

Obfuscated age or None if input is None.

Return type:

int or None
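The two modes can be illustrated with a short plain-Python sketch. It rests on several assumptions: under the HIPAA Safe Harbor rule, ages over 89 are aggregated into a single "90 or older" category, represented here as `90`, and younger ages are left unchanged; `age_groups` is assumed to map a label to an inclusive `(min, max)` pair; and the midpoint of the matching group is returned. The library's real dictionary layout and its choice of replacement age may differ.

```python
def obfuscate_age(age, use_hipaa, age_groups):
    """Return an obfuscated age, or None when the input is None."""
    if age is None:
        return None
    if use_hipaa:
        # Safe Harbor: collapse every age above 89 into one category (90).
        return 90 if age > 89 else age
    # Assumed layout: {"label": (min_age, max_age)}, bounds inclusive.
    for low, high in age_groups.values():
        if low <= age <= high:
            return (low + high) // 2  # deterministic stand-in for the group
    return age  # no group matched; leave the value unchanged


# Hypothetical age groups, for illustration only.
groups = {"child": (0, 12), "teen": (13, 19), "adult": (20, 64), "senior": (65, 120)}
```

With these groups, `obfuscate_age(15, False, groups)` returns `16`, the midpoint of the `teen` range, while `obfuscate_age(95, True, groups)` returns `90`.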