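sparknlp_jsl.utils.database_deidentification
Module Contents
Classes
RelationalDBDeidentification | Class to handle de-identification for relational databases, including date shifting, age obfuscation, consistent obfuscation for primary and foreign keys, and masking other sensitive columns.
Functions
create_obfuscate_age_udf | Creates a Spark UDF for age obfuscation.
obfuscate_age | Obfuscates the age based on HIPAA rules or predefined age groups.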
- class RelationalDBDeidentification(spark: pyspark.sql.SparkSession, config: dict)
Class to handle de-identification for relational databases, including date shifting, age obfuscation, consistent obfuscation for primary and foreign keys, and masking other sensitive columns.
- age_groups
- config
- days_to_shift
- phi_keywords
- pk_fk_shift_value
- spark
- use_hipaa
- connect_to_db()
Connects to the MySQL database and returns the connection object.
- Returns:
pymysql connection object.
- Return type:
Connection
- deidentify()
Performs the complete de-identification process:
  - Fetches all tables.
  - Iterates through each table.
  - Retrieves schema information.
  - Detects sensitive columns.
  - Applies obfuscation and masking.
  - Saves de-identified data as CSV files.
- detect_sensitive_columns(df: pyspark.sql.DataFrame) -> dict
Detects sensitive columns in the DataFrame based on predefined keywords.
- Parameters:
df (DataFrame) – Spark DataFrame to analyze.
- Returns:
Dictionary categorizing sensitive columns into date, age, and other sensitive columns.
- Return type:
dict
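The keyword-based categorization described above can be illustrated with a minimal plain-Python sketch that works on column names only. The keyword sets, the function name `categorize_columns`, and the token-matching rule are assumptions made for illustration; the actual `phi_keywords` configuration and matching logic of the class may differ.

```python
# Hypothetical keyword sets; the library's phi_keywords may be different.
DATE_KEYWORDS = {"date", "dob", "birth"}
AGE_KEYWORDS = {"age"}
OTHER_KEYWORDS = {"name", "ssn", "phone", "email", "address"}


def categorize_columns(column_names):
    """Sort column names into date, age, and other sensitive buckets."""
    result = {"date": [], "age": [], "other": []}
    for name in column_names:
        # Match on underscore-separated tokens so that a column such as
        # "message" is not mistaken for an age column.
        tokens = set(name.lower().split("_"))
        if tokens & DATE_KEYWORDS:
            result["date"].append(name)
        elif tokens & AGE_KEYWORDS:
            result["age"].append(name)
        elif tokens & OTHER_KEYWORDS:
            result["other"].append(name)
    return result


columns = ["patient_id", "first_name", "birth_date", "age", "diagnosis"]
categories = categorize_columns(columns)
print(categories)
```

With a Spark DataFrame, the same idea would apply to `df.columns`, which is a plain list of column names.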
- get_all_tables()
Retrieves all table names from the configured database.
- Returns:
List of table names.
- Return type:
list
- get_schema_info(table_name: str) -> dict
Retrieves schema information for a specific table including date columns, primary keys, and foreign keys.
- Parameters:
table_name (str) – The name of the table to retrieve schema information for.
- Returns:
Dictionary containing lists of date columns, primary keys, and foreign keys for the specified table.
- Return type:
dict
- mask_other_sensitive_columns(df: pyspark.sql.DataFrame, other_columns: list) -> pyspark.sql.DataFrame
Masks other sensitive columns by replacing their values with asterisks.
- Parameters:
df (DataFrame) – Spark DataFrame containing the data.
other_columns (list) – List of column names to mask.
- Returns:
DataFrame with masked sensitive columns.
- Return type:
DataFrame
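As a rough illustration of asterisk masking, the sketch below operates on rows represented as plain dictionaries rather than on a Spark DataFrame. The helper names `mask_value` and `mask_columns` are hypothetical, and keeping the mask the same length as the original value is an assumption; the library may emit a fixed-length mask instead.

```python
def mask_value(value):
    """Replace a value with asterisks; None stays None."""
    if value is None:
        return None
    # Assumption: the mask keeps the original length.
    return "*" * len(str(value))


def mask_columns(rows, columns_to_mask):
    """Return new rows with the listed columns masked."""
    masked_rows = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        for column in columns_to_mask:
            if column in new_row:
                new_row[column] = mask_value(new_row[column])
        masked_rows.append(new_row)
    return masked_rows


rows = [{"id": 1, "name": "Alice", "phone": None}]
masked = mask_columns(rows, ["name", "phone"])
print(masked)
```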
- obfuscate_ages(df: pyspark.sql.DataFrame, age_columns: list, use_hipaa: bool) -> pyspark.sql.DataFrame
Obfuscates age columns using either HIPAA rules or predefined age groups.
- Parameters:
df (DataFrame) – Spark DataFrame containing the data.
age_columns (list) – List of age column names to obfuscate.
use_hipaa (bool) – Flag to apply HIPAA rules.
- Returns:
DataFrame with obfuscated age columns.
- Return type:
DataFrame
- obfuscate_dates(df: pyspark.sql.DataFrame, date_columns: list) -> pyspark.sql.DataFrame
Shifts date columns by a specified number of days using Spark’s built-in functions.
- Parameters:
df (DataFrame) – Spark DataFrame containing the data.
date_columns (list) – List of date column names to shift.
- Returns:
DataFrame with shifted date columns.
- Return type:
DataFrame
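The effect of date shifting can be shown with a small standard-library sketch. It is not the Spark implementation: in Spark, a built-in such as `pyspark.sql.functions.date_add` would play the role of the `timedelta` addition here. The function name `shift_date` and the sample values are hypothetical.

```python
from datetime import date, timedelta


def shift_date(value, days_to_shift):
    """Shift a date by a fixed number of days; None stays None."""
    if value is None:
        return None
    return value + timedelta(days=days_to_shift)


# Shifting every date in a record by the same offset hides the real
# dates but keeps the intervals between events intact.
admission = date(2024, 1, 10)
discharge = date(2024, 1, 15)
days_to_shift = 30  # hypothetical offset

shifted_admission = shift_date(admission, days_to_shift)
shifted_discharge = shift_date(discharge, days_to_shift)
print(shifted_admission, shifted_discharge)
print((shifted_discharge - shifted_admission).days)
```

Because the offset is constant, the length of stay is the same before and after shifting.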
- obfuscate_primary_foreign_keys(df: pyspark.sql.DataFrame, pk_fk_columns: list) -> pyspark.sql.DataFrame
Obfuscates primary and foreign key columns by adding a fixed value to their numeric values.
- Parameters:
df (DataFrame) – Spark DataFrame containing the data.
pk_fk_columns (list) – List of column names for primary and foreign keys.
- Returns:
DataFrame with obfuscated primary and foreign keys.
- Return type:
DataFrame
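The reason a fixed shift gives consistent obfuscation across tables can be sketched in plain Python: when the same offset is added to a primary key and to every foreign key that references it, the rows still join. The table contents, the helper name `shift_keys`, and the offset below are hypothetical.

```python
def shift_keys(rows, key_columns, shift_value):
    """Add a fixed offset to the listed key columns of each row."""
    shifted = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        for column in key_columns:
            if new_row.get(column) is not None:
                new_row[column] = new_row[column] + shift_value
        shifted.append(new_row)
    return shifted


patients = [{"patient_id": 1}, {"patient_id": 2}]
visits = [{"visit_id": 10, "patient_id": 2}]
shift_value = 1000  # hypothetical stand-in for pk_fk_shift_value

new_patients = shift_keys(patients, ["patient_id"], shift_value)
new_visits = shift_keys(visits, ["visit_id", "patient_id"], shift_value)

# The foreign key in visits still points at an existing patient row.
patient_ids = {row["patient_id"] for row in new_patients}
print(all(row["patient_id"] in patient_ids for row in new_visits))
```

Note that a constant numeric shift is easy to reverse for anyone who learns the offset, so the offset should be treated as a secret.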
- setup_logging()
Sets up logging to output to both console and a log file.
- create_obfuscate_age_udf(use_hipaa, age_groups_broadcast)
Creates a Spark UDF for age obfuscation.
- Parameters:
use_hipaa (bool) – Flag to apply HIPAA rules.
age_groups_broadcast (Broadcast) – Broadcasted age groups dictionary.
- Returns:
Spark UDF for age obfuscation.
- Return type:
UserDefinedFunction
- obfuscate_age(age, use_hipaa, age_groups)
Obfuscates the age based on HIPAA rules or predefined age groups.
- Parameters:
age (int) – Original age.
use_hipaa (bool) – Flag to apply HIPAA rules.
age_groups (dict) – Dictionary defining age groups.
- Returns:
Obfuscated age or None if input is None.
- Return type:
int or None
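The following is a sketch of one possible reading of this behavior, not the library's implementation. It assumes that "HIPAA rules" refers to the Safe Harbor convention of aggregating ages of 90 and over into a single category, that `age_groups` maps a label to an inclusive `(low, high)` range, and that a grouped age is replaced by the lower bound of its range. The function name `obfuscate_age_sketch` and the sample groups are hypothetical.

```python
def obfuscate_age_sketch(age, use_hipaa, age_groups):
    """Return an obfuscated age, or None if the input is None."""
    if age is None:
        return None
    if use_hipaa:
        # Assumption: ages of 90 and over collapse to 90; others pass through.
        return 90 if age >= 90 else age
    # Assumption: age_groups maps label -> (low, high), both inclusive,
    # and the lower bound stands in for every age in the range.
    for low, high in age_groups.values():
        if low <= age <= high:
            return low
    return age  # no matching group: leave the value unchanged


# Hypothetical grouping used only for this example.
example_groups = {
    "child": (0, 12),
    "teen": (13, 19),
    "adult": (20, 64),
    "senior": (65, 120),
}

print(obfuscate_age_sketch(95, True, example_groups))
print(obfuscate_age_sketch(45, False, example_groups))
```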