sparknlp_jsl.utils.deidentification_utils#

Module Contents#

Functions#

dict_to_annotation_converter(spark, data[, ...])

This method is used to convert a list of dictionaries to a Spark DataFrame. The returned DataFrame will contain chunk and document columns that are compatible with Spark NLP and can be used for deidentification.

structured_deidentifier(spark[, input_file_path, ...])

This method is used to deidentify structured data. It takes a file path as input and returns a deidentified DataFrame, also writing the result to a CSV/JSON file.

dict_to_annotation_converter(spark: pyspark.sql.SparkSession, data: list, document_col_name: str = 'document', chunk_col_name: str = 'chunk', adjust_end: bool = False, chunk_validation_options: dict = None)#

This method is used to convert a list of dictionaries to a Spark DataFrame. The returned DataFrame will contain chunk and document columns that are compatible with Spark NLP and can be used for deidentification.

The input data should be a list of dictionaries. Each dictionary should have a “text” key and a “chunks” key. The “text” key should contain the text of the document and the “chunks” key should contain a list of dictionaries.

Each dictionary in the “chunks” list should have “begin”, “end”, “result”, “entity” and “metadata” keys.

The “begin” key should contain the start index of the chunk, the “end” key should contain the end index of the chunk, the “result” key should contain the text of the chunk, the “entity” key should contain the entity type of the chunk and the “metadata” key should contain the metadata of the chunk.

Parameters:
  • spark (SparkSession) – The Spark session object.

  • data (list) – A list of dictionaries.

  • document_col_name (str) – The name of the document column. Default: “document”.

  • chunk_col_name (str) – The name of the chunk column. Default: “chunk”.

  • adjust_end (bool) – If True, 1 is subtracted from the end values of the chunks. Default: False.

  • chunk_validation_options (dict) –

    A dictionary specifying options for chunk validation. Default is None (no validation). Possible keys include:

    • “validation_mode” (str): Determines what happens if a chunk cannot be validated:

      • “filter” (default): Invalid chunks are removed from the chunks.

      • “error”: A ValueError is raised if any chunk cannot be validated.

    • “scope_window” (list of int): A list of two integers [before, after] that define the number of characters before the chunk’s begin index and after its end index to include in the search window. Default is [0, 0].

    • “case_sensitive” (bool): If False, chunk matching is case-insensitive. Default is True.

    See the sketch after this parameter list for an illustrative usage.
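A minimal sketch of enabling validation is given below; the option values (a 5-character search window, case-insensitive matching) and the list_of_dict input (as built in the Example further down) are illustrative assumptions, not required settings.

>>> validation_options = {
...     "validation_mode": "filter",   # drop chunks whose result cannot be found in the text
...     "scope_window": [5, 5],        # assumed: search 5 characters around the given begin/end
...     "case_sensitive": False,       # assumed: match chunk text regardless of case
... }
>>> validated_df = dict_to_annotation_converter(
...     spark, list_of_dict, chunk_validation_options=validation_options
... )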

Notes

  • The provided begin, end, and result values are not validated for accuracy. It is the user’s responsibility to ensure these values are correct.

Validation can be enabled by setting the chunk_validation_options parameter.

  • Spark NLP annotators use an inclusive end index, i.e. the index of the last character of the annotation, which is 1 less than the exclusive (Python-style) end index.

The end values of the chunks should follow this convention. If your chunks use an exclusive end index instead, set the adjust_end parameter to True so that 1 is subtracted from each end value (see the snippet after these notes).

  • document_col_name and chunk_col_name should not be the same as the column names in the input data.

  • The name ‘document_metadata’ is reserved for the metadata column of the document.

‘document_metadata’ values should be a dictionary. If ‘document_metadata’ is provided in one of the dictionaries, all dictionaries must include a ‘document_metadata’ key.

  • ‘text’ and ‘chunks’ keys are reserved for the text and chunks of the document.

‘text’ values should be a string and ‘chunks’ values should be a list of dictionaries.

  • ‘begin’, ‘end’, ‘result’, ‘entity’ and ‘metadata’ keys are reserved for the chunk annotations.

‘begin’ and ‘end’ values should be integers, ‘result’ values should be a string, ‘entity’ values should be a string and ‘metadata’ values should be a dictionary.
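As a quick illustration of the end-index convention mentioned above (plain Python, no Spark required):

>>> text = "My name is George"
>>> chunk = "George"
>>> begin = text.index(chunk)             # 11
>>> end_exclusive = begin + len(chunk)    # 17, the usual exclusive (Python-style) end
>>> end_sparknlp = end_exclusive - 1      # 16, the inclusive end expected in the chunks
>>> text[begin:end_sparknlp + 1]
'George'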

Returns:

  • Spark DataFrame (A Spark DataFrame with document and chunk columns that are compatible with Spark NLP and ready for deidentification.)

  • Additionally, the returned DataFrame contains all the columns from the input data except the “chunks” column.

Raises:

ValueError – If validation_mode is set to “error” and a chunk cannot be validated, or if the given parameters and options are not valid.

Example

>>> list_of_dict = [
...        {
...            "text": "My name is George, and I was born on 12/11/1995. I have the pleasure of working at John Snow Labs.",
...            "chunks": [
...                {
...                    "begin": 11,
...                    "end": 16,
...                    "result": "George",
...                    "entity": "PERSON",
...                    "metadata": {"confidence": "1", "ner_source": "llm_output"}
...                },
...                {
...                    "begin": 37,
...                    "end": 46,
...                    "result": "12/11/1995",
...                    "entity": "DATE",
...                    "metadata": {"confidence": "0.9", "ner_source": "llm_output"}
...                },
...                {
...                    "begin": 83,
...                    "end": 96,
...                    "result": "John Snow Labs",
...                    "entity": "ORG",
...                    "metadata": {"confidence": "0.87", "ner_source": "llm_output"}
...                }
...                ],
...            "doc_id": "1",
...            "document_metadata": {"dateshift": "10"},
...            "file_path": "/path/to/file1"
...        },
...        {
...            "text": "I am Bush, and English is my native language. You can reach me at my email: bush@example.com.",
...            "chunks": [
...                {
...                    "begin": 5,
...                    "end": 8,
...                    "result": "Bush",
...                    "entity": "PERSON",
...                    "metadata": {"confidence": "1", "ner_source": "ner_dl"}
...                },
...                {
...                    "begin": 15,
...                    "end": 21,
...                    "result": "English",
...                    "entity": "LANGUAGE",
...                    "metadata": {"confidence": "0.98", "ner_source": "ner_dl"}
...                },
...                {
...                    "begin": 76,
...                    "end": 91,
...                    "result": "bush@example.com",
...                    "entity": "EMAIL",
...                    "metadata": {"confidence": "0.87", "ner_source": "ner_dl"}
...                }
...            ],
...            "doc_id": "2",
...            "document_metadata": {"dateshift": "5"},
...            "file_path": "/path/to/file2"
...        }
...    ]
>>> from sparknlp.annotator import *
>>> from sparknlp_jsl.utils import *
>>> result_df = dict_to_annotation_converter(spark, list_of_dict, document_col_name="document", chunk_col_name="chunk", adjust_end=False)
>>> light_deIdentification = (
...        LightDeIdentification()
...        .setInputCols(["document", "chunk"]) # The document and chunk columns created by the dict_to_annotation_converter function.
...        .setOutputCol("deidentified")
...        .setMode("mask")
...    )
>>> light_deIdentification.transform(result_df).selectExpr("deidentified.result").show(truncate=False)

result

[My name is <PERSON>, and I was born on <DATE>. I have the pleasure of working at <ORG>.]

[I am <PERSON>, and <LANGUAGE> is my native language. You can reach me at my email: <EMAIL>.]

structured_deidentifier(spark, input_file_path=None, output_file_path='deidentified.csv', separator=',', columns_dict=None, ref_source='faker', obfuscateRefFile=None, columns_seed=None, shift_days=None, date_formats=None, language='en', id_column='')#

This method is used to deidentify structured data. It takes a file path as input and returns a deidentified DataFrame, also writing the result to a CSV/JSON file.

Parameters:
  • spark (SparkSession) – The Spark session object.

  • input_file_path (str) – Path of the input file to deidentify. Default: None.

  • output_file_path (str) – Path of the deidentified output file. Default: “deidentified.csv”.

  • separator (str) – Column separator of the input file. Default: “,”.

  • columns_dict (dict) – Dictionary of the columns to deidentify. Default: None.

  • ref_source (str) – Source of the obfuscation references. Default: “faker”.

  • obfuscateRefFile (str) – Path of the file containing obfuscation references. Default: None.

  • columns_seed (int) – Seed used for the columns. Default: None.

  • shift_days (int) – Number of days to shift the dates. Default: None.

  • date_formats (list) – List of date formats. Default: None.

  • language (str) – Language of the data. Default: “en”.

  • id_column (str) – Name of the ID column. Default: “”.

Returns:

  • DataFrame (Spark) – The deidentified DataFrame.

  • file (csv/json) – The deidentified output file written to output_file_path.
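A minimal usage sketch is given below; the input path, the column names, and the columns_dict mapping of column names to entity labels are illustrative assumptions about the data, not values required by the method.

>>> from sparknlp_jsl.utils import structured_deidentifier
>>> deid_df = structured_deidentifier(
...     spark,
...     input_file_path="data/patients.csv",       # assumed input file
...     output_file_path="deidentified.csv",
...     separator=",",
...     columns_dict={"NAME": "PATIENT", "DOB": "DATE", "ADDRESS": "LOCATION"},  # assumed columns
...     ref_source="faker",
...     shift_days=5,
...     id_column="PATIENT_ID",                     # assumed ID column
... )
>>> deid_df.show(truncate=False)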