Utility & Helper Modules

 

NLP Lab (Annotation Lab) Interface Module

Spark NLP for Healthcare provides functionality to interact with the NLP Lab using easy-to-use functions. NLP Lab is a tool for multi-modal data annotation. It allows annotation teams to collaborate efficiently to generate training data for ML models and/or to validate automatic annotations generated by those models.


The NLP Lab interaction module provides programmatic access to the NLP Lab. Detailed usage examples can be found in the Complete NLP Lab Module SparkNLP JSL notebook and in the Python API documentation. The module supports the following functionalities:

  • Generating a CoNLL formatted file from the annotation JSON for training an NER model.
  • Generating a CSV/Excel formatted file from the annotation JSON for training classification, assertion, and relation extraction models.
  • Building a pre-annotation JSON file using Spark NLP pipelines, saving it as JSON, and uploading the pre-annotations to a project.
  • Interacting with the NLP Lab instance and setting up projects for NLP Lab.
  • Getting the list of all projects in the NLP Lab instance.
  • Creating new projects.
  • Deleting projects.
  • Setting and editing project configurations.
  • Accessing the configuration of any existing project.
  • Uploading tasks to a project.
  • Deleting tasks of a project.

Start Module

# import the module
from sparknlp_jsl.alab import AnnotationLab
alab = AnnotationLab()

Generate Data for Training a Classification Model

alab.get_classification_data(

# required: path to NLP Lab JSON export
input_json_path='alab_demo.json',

# optional: set to True to select ground truth completions, False to select latest completions,
# defaults to False
# ground_truth=False,
)

Converting the JSON Export into a CoNLL Format Suitable for Training an NER Model

alab.get_conll_data(

# required: Spark session with spark-nlp-jsl jar
spark=spark,

# required: path to NLP Lab JSON export
input_json_path="alab_demo.json",

# required: name of the CoNLL file to save
output_name="conll_demo",

# optional: path for CoNLL file saving directory, defaults to 'exported_conll'
# save_dir="exported_conll",

# optional: set to True to select ground truth completions, False to select latest completions,
# defaults to False
# ground_truth=False,

# optional: labels to exclude from CoNLL; these are all assertion labels and irrelevant NER labels,
# defaults to empty list
# excluded_labels=['ABSENT'],

# optional: set a pattern to use regex tokenizer, defaults to regular tokenizer if pattern not defined
# regex_pattern="\\s+|(?=[-.:;*+,$&%\\[\\]])|(?<=[-.:;*+,$&%\\[\\]])"

# optional: list of NLP Lab task IDs to exclude from CoNLL, defaults to empty list
# excluded_task_ids = [2, 3],

# optional: list of NLP Lab task titles to exclude from CoNLL, defaults to None
# excluded_task_titles = ['Note 1'],
)

Converting the JSON Export into a DataFrame Suitable for Training an Assertion Model

alab.get_assertion_data(

# required: SparkSession with spark-nlp-jsl jar
spark=spark,

# required: path to NLP Lab JSON export
input_json_path = 'alab_demo.json',

# required: annotated assertion labels to train on
assertion_labels = ['ABSENT'],

# required: relevant NER labels that are assigned assertion labels
relevant_ner_labels = ['PROBLEM', 'TREATMENT'],

# optional: set to True to select ground truth completions, False to select latest completions,
# defaults to False
# ground_truth = False,

# optional: assertion label to assign to entities that have no assertion labels, defaults to None
# unannotated_label = 'PRESENT',

# optional: set a pattern to use regex tokenizer, defaults to regular tokenizer if pattern not defined
# regex_pattern = "\\s+|(?=[-.:;*+,$&%\\[\\]])|(?<=[-.:;*+,$&%\\[\\]])",

# optional: set the strategy to control the number of occurrences of the unannotated assertion label
# in the output dataframe, options are 'weighted' or 'counts', 'weighted' allows to sample using a
# fraction, 'counts' allows to sample using absolute counts, defaults to None
# unannotated_label_strategy = None,

# optional: dictionary in the format {'ENTITY_LABEL': sample_weight_or_counts} to control the number of
# occurrences of the unannotated assertion label in the output dataframe, where 'ENTITY_LABEL' are the
# NER labels that are assigned the unannotated assertion label, and sample_weight_or_counts should be
# between 0 and 1 if `unannotated_label_strategy` is 'weighted' or between 0 and the max number of
# occurrences of that NER label if `unannotated_label_strategy` is 'counts'
# unannotated_label_strategy_dict = {'PROBLEM': 0.5, 'TREATMENT': 0.5},

# optional: list of NLP Lab task IDs to exclude from output dataframe, defaults to None
# excluded_task_ids = [2, 3],

# optional: list of NLP Lab task titles to exclude from output dataframe, defaults to None
# excluded_task_titles = ['Note 1'],
)

Converting the JSON Export into a DataFrame Suitable for Training a Relation Extraction Model

alab.get_relation_extraction_data(

# required: Spark session with spark-nlp-jsl jar
spark=spark,

# required: path to NLP Lab JSON export
input_json_path='alab_demo.json',

# optional: set to True to select ground truth completions, False to select latest completions,
# defaults to False
ground_truth=True,

# optional: set to True to assign a relation label between entities where no relation was annotated,
# defaults to False
negative_relations=True,

# optional: all assertion labels that were annotated in the NLP Lab, defaults to None
assertion_labels=['ABSENT'],

# optional: plausible pairs of entities for relations, separated by a '-', use the same casing as the
# annotations, include only one relation direction, defaults to all possible pairs of annotated entities
relation_pairs=['DATE-PROBLEM','TREATMENT-PROBLEM','TEST-PROBLEM'],

# optional: set the strategy to control the number of occurrences of the negative relation label
# in the output dataframe, options are 'weighted' or 'counts', 'weighted' allows to sample using a
# fraction, 'counts' allows to sample using absolute counts, defaults to None
negative_relation_strategy='weighted',

# optional: dictionary in the format {'ENTITY1-ENTITY2': sample_weight_or_counts} to control the number of
# occurrences of negative relations in the output dataframe for each entity pair, where 'ENTITY1-ENTITY2'
# represent the pairs of entities for relations separated by a `-` (include only one relation direction),
# and sample_weight_or_counts should be between 0 and 1 if `negative_relation_strategy` is 'weighted' or
# between 0 and the max number of occurrences of negative relations if `negative_relation_strategy` is
# 'counts', defaults to None
negative_relation_strategy_dict = {'DATE-PROBLEM': 0.1, 'TREATMENT-PROBLEM': 0.5, 'TEST-PROBLEM': 0.2},

# optional: list of NLP Lab task IDs to exclude from output dataframe, defaults to None
# excluded_task_ids = [2, 3],

# optional: list of NLP Lab task titles to exclude from output dataframe, defaults to None
# excluded_task_titles = ['Note 1'],
)

Generate JSON Containing Pre-annotations Using a Spark NLP Pipeline

pre_annotations, summary = alab.generate_preannotations(

# required: list of results.
all_results = results,

# required: output column name of the 'DocumentAssembler' stage - used to get the original document string.
document_column = 'document',

# required: column name(s) of NER model(s). Note: multiple NER models can be used, but make sure their results don't overlap.
# Or use 'ChunkMergeApproach' to combine results from multiple NER models.
ner_columns = ['ner_chunk'],

# optional: column name(s) of assertion model(s). Note: multiple assertion models can be used, but make sure their results don't overlap.
# assertion_columns = ['assertion_res'],

# optional: column name(s) of relation extraction model(s). Note: multiple relation extraction models can be used, but make sure their results don't overlap.
# relations_columns = ['relations_clinical', 'relations_pos'],

# optional: This can be defined to identify which pipeline/user/model was used to get predictions.
# Default: 'model'
# user_name = 'model',

# optional: Option to assign custom titles to tasks. By default, tasks will be titled as 'task_#'
# titles_list = [],

# optional: if there are already tasks in the project, this ID offset can be used to make sure default titles 'task_#' do not overlap.
# When uploading a batch after the first one, this can be set to the number of tasks currently present in the project.
# This number is added to each task's ID and title.
# id_offset=0,
)

Interacting with NLP Lab

alab = AnnotationLab()

username=''
password=''
client_secret=''
annotationlab_url=''

alab.set_credentials(

# required: username
username=username,

# required: password
password=password,

# required: client secret for your NLP Lab instance (every installation has a different secret)
client_secret=client_secret,

# required: http(s) URL of your NLP Lab instance
annotationlab_url=annotationlab_url)

Get All Visible Projects

alab.get_all_projects()

Create a New Project

alab.create_project(

# required: unique name of project
project_name = 'alab_demo',

# optional: other details about project. Default: Empty string
project_description='',

# optional: Sampling option of tasks. Default: random
project_sampling='',

# optional: Annotation Guidelines of project
project_instruction='')

Delete a Project

alab.delete_project(

# required: unique name of project
project_name = 'alab_demo',

# optional: confirmation for deletion. Default: False - will ask for confirmation. If set to True, will delete directly.
confirm=False)

Upload Tasks to a Project

alab.upload_tasks(

# required: name of project to upload tasks to
project_name='alab_demo',

# required: list of examples / tasks as string (One string is one task).
task_list=task_list,

# optional: Option to assign custom titles to tasks. By default, tasks will be titled as 'task_#'
title_list = [],

# optional: if there are already tasks in the project, this ID offset can be used to make sure default titles 'task_#' do not overlap.
# When uploading a batch after the first one, this can be set to the number of tasks currently present in the project.
# This number is added to each task's ID and title.
id_offset=0)

Delete Tasks from a Project

alab.delete_tasks(

# required: name of project to delete tasks from
project_name='alab_demo',

# required: list of ids of tasks.
# note: you can get task ids from the above step. Look for 'task_ids' key.
task_ids=[1, 2],

# optional: confirmation for deletion. Default: False - will ask for confirmation. If set to True, will delete directly.
confirm=False)

Upload Pre-annotations to NLP Lab

alab.upload_preannotations(

# required: name of project to upload annotations to
project_name = 'alab_demo',

# required: preannotation JSON
preannotations = pre_annotations)

Deidentification Module

Spark NLP for Healthcare provides deidentification functionality through an easy-to-use module named Deid.

The Deid module is a tool for deidentifying Personal Health Information in data read from a file path. It can be used with a custom Spark NLP NER pipeline or without specifying any pipeline. It returns the deidentification results as a PySpark DataFrame as well as a CSV or JSON file. The module also includes functionality for applying the Structured Deidentification task to data from a file path.

The deidentify() function can be used with or without a custom pipeline, and the structured_deidentifier() function can be used for the Structured Deidentification task.

Apply Deidentification With a Custom Pipeline

from sparknlp_jsl import Deid

deid_implementor = Deid(
# required: Spark session with spark-nlp-jsl jar
spark,

# required: The path of the input file. Default is None. File type must be 'csv' or 'json'.
input_file_path="data.csv",

#optional:  The path of the output file. Default is 'deidentified.csv'. File type must be 'csv' or 'json'.
output_file_path="deidentified.csv",

#optional: The separator of the input csv file. Default is "\t".
separator=",",

#optional: A custom pipeline model to be used for deidentification. If not specified, the default is None.
custom_pipeline=nlpModel,

#optional: Fields to be deidentified and their deidentification modes, by default {"text": "mask"}
fields={"text": "mask", "text_1": "obfuscate"},

#optional:  The masking policy. Default is "entity_labels".
masking_policy="fixed_length_chars",

#optional: The fixed mask length. Default is 4.
fixed_mask_length=4,

#optional: The final chunk column name of the custom pipeline that will be deidentified, if specified. Default is "ner_chunk".
ner_chunk="ner_chunk",

#optional: The corresponding document column name of the custom pipeline, if specified. Default is "document"
document="document",

#optional: The corresponding sentence column name of the custom pipeline, if specified. Default is "sentence"
sentence="sentence",

#optional: The corresponding token column name of the custom pipeline, if specified. Default is "token"
token="token",

#optional: The source of the reference file for obfuscation. Default is "faker".
#obfuscate_ref_source="both",

#optional: The path of the reference file for obfuscation. Default is None.
#obfuscate_ref_file_path="obfuscation.txt",

#optional: Obfuscate date. Default is True.
#obfuscate_date=True,

#optional: The document hash coder column name. Default is "documentHash".
#documentHashCoder_col_name= "documentHash"

#optional: ID column name. Default is "id".
#id_column_name= "ID"

#optional: Date shift column name. Default is "date_shift".
#date_shift_column_name= "date_shift"

#optional: Json file path for multi-mode Deid. Default is None.
#multi_mode_file_path= "multi_mode_file_path.json"

#optional: The date tag. Default is "DATE".
#date_tag="DATE"

#optional: Language. Default is "en"
#language="en"

#optional: Region. Default is "us"
#region="us"

#optional: Age group obfuscation. Default is False.
#age_group_obfuscation=True

#optional: Age ranges for obfuscation. Default is [1, 4, 12, 20, 40, 60, 80].
#age_ranges=[1, 4, 12, 20, 40, 60, 80]

#optional:  Shift days. Default is False.
#shift_days=False

#optional: The number of days to shift. Default is None.
#number_of_days=5

#optional: Use unnormalized date. Default is False.
#unnormalized_date=True

#optional: The unnormalized mode. Default is "mask".
#unnormalized_mode="obfuscate"
)

res = deid_implementor.deidentify()

+---+----------------------------------------------------------------------+----------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                             text_deidentified|                                                                text_1|                                                   text_1_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0|Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson ...|Record date : ** , ** , M.D . , Name : ** MR .|Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-...|Date : 10-16-1991 PCP : Alveda Castles , 26 years-old , Record date...|
+---+----------------------------------------------------------------------+----------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+

Apply Deidentification With No Custom Pipeline


from sparknlp_jsl import Deid

deid_implementor = Deid(
# required: Spark session with spark-nlp-jsl jar
spark,

# required: The path of the input file. Default is None. File type must be 'csv' or 'json'.
input_file_path="data.csv",

#optional:  The path of the output file. Default is 'deidentified.csv'. File type must be 'csv' or 'json'.
output_file_path="deidentified.csv",

#optional: The separator of the input csv file. Default is "\t".
separator=",",

#optional: Fields to be deidentified and their deidentification modes, by default {"text": "mask"}
fields={"text": "mask"},

#optional: The masking policy. Default is "entity_labels".
masking_policy="entity_labels",

#optional: Json file path for multi-mode Deid. Default is None.
#multi_mode_file_path= "multi_mode_file_path.json",

#optional: Age group obfuscation. Default is False.
#age_group_obfuscation=True

#optional: Age ranges for obfuscation. Default is [1, 4, 12, 20, 40, 60, 80].
#age_ranges=[1, 4, 12, 20, 40, 60, 80]

#optional:  Shift days. Default is False.
#shift_days=False

#optional: The number of days to shift. Default is None.
#number_of_days=5

#optional: Use unnormalized date. Default is False.
#unnormalized_date=True

#optional: The unnormalized mode. Default is "mask".
#unnormalized_mode="obfuscate"
)

res = deid_implementor.deidentify()

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                         text_original|                                                             text_deid|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0|                                                                     "|                                                                     "|
|  1|Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson ...|Record date : <DATE> , <DOCTOR> , M.D . , Name : <PATIENT> , MR # <...|
|  2|                                                                     "|                                                                     "|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+

Apply Structured Deidentification

StructuredDeidentification is a helper class that allows you to obfuscate a structured DataFrame.

Parameters:

  • columns: A dictionary mapping DataFrame columns to entity names. The keys represent column names, and the values define the corresponding entity.
  • columnsSeed: A dictionary specifying a seed for obfuscating particular columns. The seed is used to randomly select entities during obfuscation.
  • obfuscateRefFile: (Optional) A file containing custom terms for obfuscation. The file should have entity names as keys and obfuscation terms as values.
  • days: The number of days to shift date entities. If not provided, a random integer between 1 and 60 is used.
  • useRandomDateDisplacement: If True, applies a random displacement to date entities. If False, uses the days parameter. Default is False.
  • dateFormats: A list of date formats to consider, e.g., ["dd-MM-yyyy", "dd/MM/yyyy", "d/M/yyyy"].
  • language: The language for selecting faker entities. Options: 'en', 'de', 'es', 'fr', 'ar', 'ro'. Default is 'en'.
  • idColumn: The column containing unique row IDs. If provided, ensures consistent obfuscation per ID, particularly for date entities.
  • region: Specifies regional date formats. Options: 'eu' (European Union), 'us' (USA). Default is an empty string (''), meaning dateFormats are used.
  • keepYear: If True, retains the original year in date obfuscation. Default is False.
  • keepMonth: If True, retains the original month in date obfuscation. Default is False.
  • unnormalizedDateMode: Defines behavior for unformatted dates. Options: 'mask', 'obfuscate', 'skip'. Default is 'obfuscate'.
  • keepTextSizeForObfuscation: If True, maintains the same character length in obfuscation output. Default is False.
  • fakerLengthOffset: Defines acceptable length deviation in obfuscation when keepTextSizeForObfuscation is enabled. Must be greater than 0. Default is 3.
  • genderAwareness: If True, applies gender-aware name obfuscation. May reduce performance. Default is False.
  • ageRangesByHipaa: If True, obfuscates ages based on HIPAA Privacy Rule. Default is False.
  • consistentAcrossNameParts: Indicates whether consistency should be enforced across different parts of a name (e.g., first name, middle name, last name). When set to True, the same transformation or obfuscation is applied consistently to all parts of the same name entity, even if those parts appear separately. For example, if “John Smith” is obfuscated as “Liam Brown”, then:
    • when the full name “John Smith” appears, it is replaced with “Liam Brown”;
    • when “John” or “Smith” appear individually, they are still obfuscated as “Liam” and “Brown” respectively, ensuring consistency in name transformation.
    Default: True.
  • selectiveObfuscateRefSource: Dict[str, str] A dictionary of entity names to their obfuscation modes. This is used to selectively apply different obfuscation methods to specific entities. The keys are entity names and the values are the obfuscation sources. If an entity is not specified in this map, the obfuscateRefSource param is used to determine the obfuscation source. Possible values in dict for the obfuscation source are: ‘custom’, ‘faker’, ‘both’, ‘file’.

Functions :

  • obfuscateColumns(self, df: DataFrame, outputAsArray: bool = True, overwrite: bool = True, suffix: str = "_obfuscated"):
    Obfuscates the columns of a DataFrame.

    • df: The input DataFrame to be obfuscated.
    • outputAsArray: If True, the output will be an array of strings, otherwise it will be a string. Default: True.
    • overwrite: If True, the columns will be overwritten, otherwise the obfuscated columns will be added to the DataFrame. Default: True.
    • suffix: The suffix to add to the obfuscated columns if overwrite is False. Default: "_obfuscated".

from sparknlp_jsl.structured_deidentification import StructuredDeidentification

obfuscator = StructuredDeidentification(spark,
                                        columns={"PATIENT":"PATIENT","DOB":"DATE","TEL":"PHONE"},
                                        columnsSeed={"PATIENT": 23, "DOB": 23, "TEL": 23},
                                        obfuscateRefSource = "faker",
                                        days=60,
                                        region="eu",
                                        keepYear=True,
                                        keepTextSizeForObfuscation=True
                                        )
obfuscator_df = obfuscator.obfuscateColumns(df, outputAsArray=False, overwrite=False, suffix="_obfuscated")
obfuscator_df.show(truncate=False)

Result

+---------------+----------+---+--------------+--------------+------------------+--------------+
|PATIENT        |DOB       |AGE|TEL           |DOB_obfuscated|PATIENT_obfuscated|TEL_obfuscated|
+---------------+----------+---+--------------+--------------+------------------+--------------+
|Cecilia Chapman|04/02/1935|83 |(257) 563-7401|05/04/1935    |Madlyn Schirmer   |(013) 149-3287|
|Iris Watson    |03/10/2009|9  |(372) 587-2335|02/12/2009    |Linard Reno       |(607) 830-7668|
|Bryar Pitts    |11/01/1921|98 |(717) 450-4729|12/03/1921    |Vila Grayer       |(606) 541-5638|
|Theodore Lowe  |13/02/2002|16 |(793) 151-6230|14/04/2002    |Simone Dubois     |(804) 262-9543|
|Calista Wise   |20/08/1942|76 |(492) 709-6392|19/10/1942    |Arne Langdon      |(250) 385-4950|
+---------------+----------+---+--------------+--------------+------------------+--------------+

Apply Date Shift Filler

DateShiftFiller is a helper class that fills missing or empty values in a date shift column using a deterministic, ID-based fallback approach.

This is especially useful in de-identification pipelines where:

  • Shift values must be consistent for the same ID.
  • Some rows may be missing or have null shift values.

Logic:

  • If another row with the same ID has a non-empty shift value, reuse it.
  • Otherwise, compute a fallback shift using a deterministic hash function based on ID and seed.
  • Fallback values are always in the range [1, maxShiftDays].

Parameters

  • spark: The active SparkSession.
  • seed: Seed value used for deterministic fallback hashing. Default is 42.
  • max_shift_days: The maximum number of days to shift when generating fallback values. Default is 60.

Functions

  • fill_missing_shifts(df, id_col, shift_col, suffix)
    Applies shift-filling logic to the given DataFrame.

    • df: The input DataFrame containing the shift column and ID.
    • id_col: The name of the column containing the grouping ID.
    • shift_col: The name of the date shift column to process.
    • suffix: The suffix to append to the output column (e.g., _filled).

import pandas as pd
from sparknlp_jsl.utils import DateShiftFiller

data = pd.DataFrame(
  {
    "patientID": ["A001", "A002", "A001", "A002", "A003", "A003"],
    "text": [
      "Chris Brown was discharged on 10/02/2022",
      "Mark White was discharged on 03/01/2020",
      "Chris Brown was born on 05/10/1982",
      "Mark White was born on 10/04/2000",
      "John was discharged on 03/15/2022",
      "John Moore was born on 12/31/2002",
    ],
    "dateshift": ["10", "-2", None, None, None, 5],
  }
)

input_df = spark.createDataFrame(data)

filler = DateShiftFiller(spark, seed=42, max_shift_days=60)
result_df = filler.fill_missing_shifts(
  input_df,
  id_col="patientID",
  shift_col="dateshift",
  suffix="_filled"
)

result_df.show(truncate=False)

Result

+---------+----------------------------------------+---------+------------------+
|patientID|text                                    |dateshift|dateshift_filled  |
+---------+----------------------------------------+---------+------------------+
|A002     |Mark White was discharged on 03/01/2020 |-2       |-2                |
|A001     |Chris Brown was discharged on 10/02/2022|10       |10                |
|A001     |Chris Brown was born on 05/10/1982      |NULL     |10                |
|A003     |John was discharged on 03/15/2022       |NULL     |5                 |
|A003     |John Moore was born on 12/31/2002       |5        |5                 |
|A002     |Mark White was born on 10/04/2000       |NULL     |-2                |
+---------+----------------------------------------+---------+------------------+

Compatibility

This module helps you find appropriate model versions depending on your distribution of John Snow Labs products.

By searching our vast repository of models available at the NLP Model Hub, it can return a JSON-like list with the models' information (using the .findVersion() method) or print the models that match a given query (using the .showVersion() method).

To use it, simply run the following:

from johnsnowlabs import medical
# Or: from sparknlp_jsl.compatibility import Compatibility

compatibility = medical.Compatibility()

# Returns a list of dict objects
found_models = compatibility.findVersion('ner_clinical')

To tabulate and visualize all retrieved models, you can:

import pandas as pd

models_df = pd.DataFrame([dict(x) for x in list(found_models)])
models_df

|    | name                                                |   sparkVersion | version   | language   | date                    | readyToUse   |
|---:|:----------------------------------------------------|---------------:|:----------|:-----------|:------------------------|:-------------|
|  0 | ner_clinical_noncontrib                             |            2.4 | 2.3.0     | en         | 2019-11-14T17:07:35.434 | true         |
|  1 | ner_clinical_large                                  |            2.4 | 2.5.0     | en         | 2020-05-21T00:35:02.624 | true         |
|  2 | ner_clinical                                        |            3   | 3.0.0     | en         | 2021-01-27T12:52:59.087 | true         |
|  3 | ner_clinical_large_en                               |            3   | 3.0.0     | en         | 2021-03-31T12:32:55.357 | true         |
|  4 | ner_clinical                                        |            3   | 3.0.0     | en         | 2021-03-31T16:33:39.368 | true         |
|  5 | ner_clinical_large                                  |            3   | 3.0.0     | en         | 2021-03-31T15:55:14.650 | true         |
|  6 | ner_clinical_biobert                                |            3   | 3.0.0     | en         | 2021-04-01T07:06:52.919 | true         |
|  7 | ner_clinical                                        |            2.3 | 3.0.0     | en         | 2021-03-31T16:33:39.368 | true         |
|  8 | ner_clinical_biobert                                |            2.3 | 3.0.0     | en         | 2021-04-01T07:06:52.919 | true         |
|  9 | ner_clinical                                        |            2.3 | 3.0.0     | en         | 2021-01-27T12:52:59.087 | true         |
| 10 | ner_clinical                                        |            2.3 | 3.0.0     | en         | 2021-03-31T16:33:39.368 | true         |
| 11 | ner_clinical_large                                  |            2.3 | 3.0.0     | en         | 2021-03-31T15:55:14.650 | true         |
| 12 | bert_token_classifier_ner_clinical                  |            2.4 | 3.2.0     | en         | 2021-08-28T15:51:44.492 | true         |
| 13 | bert_token_classifier_ner_clinical                  |            2.4 | 3.3.4     | en         | 2022-01-06T12:42:21.908 | true         |
| 14 | bert_token_classifier_ner_clinical_pipeline         |            3   | 3.4.1     | en         | 2022-03-15T12:08:50.209 | true         |
| 15 | bert_token_classifier_ner_clinical_pipeline         |            2.4 | 3.4.1     | en         | 2022-03-15T12:56:42.874 | true         |
| 16 | ner_clinical_biobert_pipeline                       |            3   | 3.4.1     | en         | 2022-03-21T15:06:54.361 | true         |
| 17 | ner_clinical_large_pipeline                         |            3   | 3.4.1     | en         | 2022-03-21T14:29:11.545 | true         |
| 18 | ner_clinical_pipeline                               |            3   | 3.4.1     | en         | 2022-03-21T14:32:59.531 | true         |
| 19 | bert_token_classifier_ner_clinical_pipeline         |            3   | 3.4.1     | en         | 2022-03-21T18:51:36.583 | true         |
| 20 | ner_clinical_trials_abstracts                       |            3   | 3.5.3     | en         | 2022-06-22T15:26:56.789 | true         |
| 21 | ner_clinical_trials_abstracts_pipeline              |            3   | 3.5.3     | en         | 2022-06-27T07:07:17.828 | true         |
| 22 | bert_token_classifier_ner_clinical_trials_abstracts |            3   | 3.5.3     | en         | 2022-06-29T04:10:29.985 | true         |
| 23 | ner_clinical_bert                                   |            3   | 4.0.0     | ro         | 2022-06-30T21:36:31.573 | true         |
| 24 | ner_clinical                                        |            3   | 4.0.0     | ro         | 2022-07-01T14:55:02.322 | true         |
| 25 | ner_clinical_bert                                   |            3   | 4.0.2     | ro         | 2022-08-12T09:12:00.992 | true         |
| 26 | bert_token_classifier_ner_clinical_trials_abstracts |            3   | 4.0.2     | es         | 2022-08-11T14:45:17.151 | true         |
| 27 | ner_clinical_trials_abstracts                       |            3   | 4.0.2     | es         | 2022-08-12T21:19:27.613 | true         |
| 28 | ner_clinical_bert                                   |            3   | 4.2.2     | ro         | 2022-11-22T13:33:53.852 | true         |

Or simply run the showVersion() method instead:

compatibility.showVersion('ner_clinical')
+-----------------------------------------------------+------+---------+
| Pipeline/Model                                      | lang | version |
+-----------------------------------------------------+------+---------+
| ner_clinical_noncontrib                             |  en  | 2.3.0   |
| ner_clinical_large                                  |  en  | 2.5.0   |
| ner_clinical                                        |  en  | 3.0.0   |
| ner_clinical_large_en                               |  en  | 3.0.0   |
| ner_clinical                                        |  en  | 3.0.0   |
| ner_clinical_large                                  |  en  | 3.0.0   |
| ner_clinical_biobert                                |  en  | 3.0.0   |
| ner_clinical                                        |  en  | 3.0.0   |
| ner_clinical_biobert                                |  en  | 3.0.0   |
| ner_clinical                                        |  en  | 3.0.0   |
| ner_clinical                                        |  en  | 3.0.0   |
| ner_clinical_large                                  |  en  | 3.0.0   |
| bert_token_classifier_ner_clinical                  |  en  | 3.2.0   |
| bert_token_classifier_ner_clinical                  |  en  | 3.3.4   |
| bert_token_classifier_ner_clinical_pipeline         |  en  | 3.4.1   |
| bert_token_classifier_ner_clinical_pipeline         |  en  | 3.4.1   |
| ner_clinical_biobert_pipeline                       |  en  | 3.4.1   |
| ner_clinical_large_pipeline                         |  en  | 3.4.1   |
| ner_clinical_pipeline                               |  en  | 3.4.1   |
| bert_token_classifier_ner_clinical_pipeline         |  en  | 3.4.1   |
| ner_clinical_trials_abstracts                       |  en  | 3.5.3   |
| ner_clinical_trials_abstracts_pipeline              |  en  | 3.5.3   |
| bert_token_classifier_ner_clinical_trials_abstracts |  en  | 3.5.3   |
| ner_clinical_bert                                   |  ro  | 4.0.0   |
| ner_clinical                                        |  ro  | 4.0.0   |
| ner_clinical_bert                                   |  ro  | 4.0.2   |
| bert_token_classifier_ner_clinical_trials_abstracts |  es  | 4.0.2   |
| ner_clinical_trials_abstracts                       |  es  | 4.0.2   |
| ner_clinical_bert                                   |  ro  | 4.2.2   |
+-----------------------------------------------------+------+---------+

InternalResourceDownloader

This module provides extended functionalities to list and download models from John Snow Labs repositories. It is an auxiliary module for finding and downloading models for studies and analysis.

As with the Compatibility module, InternalResourceDownloader is also capable of displaying the available models. The difference is that this module can filter the results by the Python class name of the annotator, while Compatibility searches by model name.

Displaying available models

To display the pipelines or models, you can use the .showPrivateModels(), .showPrivatePipelines(), .returnPrivateModels(), or .returnPrivatePipelines() methods, which either return the results as a list or print them directly.

For example, to list all models with class MedicalNerModel, just run (some results were omitted for brevity):

medical_ner_models = medical.InternalResourceDownloader.returnPrivateModels("MedicalNerModel")

medical_ner_models[0]
['nerdl_tumour_demo', 'en', '1.7.3']
medical.InternalResourceDownloader.showPrivateModels("MedicalNerModel")
+----------------------------------------+------+---------+
| Model                                  | lang | version |
+----------------------------------------+------+---------+
| ner_deid_subentity_bert                |  ro  | 4.0.0   |
| ner_deid_subentity                     |  ro  | 4.0.0   |
| ner_pathogen                           |  en  | 4.0.0   |
| ner_clinical_bert                      |  ro  | 4.0.0   |
| ner_clinical                           |  ro  | 4.0.0   |
| ner_ade_binary                         |  en  | 4.0.0   |
| ner_living_species_300                 |  es  | 4.0.0   |
| ner_clinical_bert                      |  ro  | 4.0.2   |
| ner_clinical_trials_abstracts          |  es  | 4.0.2   |
| ner_pharmacology                       |  es  | 4.0.2   |
| ner_negation_uncertainty               |  es  | 4.0.2   |
| disease_mentions_tweet                 |  es  | 4.0.2   |
| ner_deid_generic_bert                  |  ro  | 4.0.2   |
| ner_oncology_unspecific_posology_wip   |  en  | 4.0.0   |
| ner_oncology_wip                       |  en  | 4.0.0   |
| ner_oncology_therapy_wip               |  en  | 4.0.0   |
| ner_oncology_posology_wip              |  en  | 4.0.0   |
| ner_oncology_anatomy_general_wip       |  en  | 4.0.0   |
| ner_oncology_tnm_wip                   |  en  | 4.0.0   |
| ner_oncology_demographics_wip          |  en  | 4.0.0   |
| ner_oncology_biomarker_wip             |  en  | 4.0.0   |
| ner_oncology_anatomy_granular_wip      |  en  | 4.0.0   |
| ner_oncology_test_wip                  |  en  | 4.0.0   |
| ner_oncology_diagnosis_wip             |  en  | 4.0.0   |
| ner_oncology_response_to_treatment_wip |  en  | 4.0.0   |
| ner_jsl                                |  en  | 4.2.0   |
| ner_covid_trials                       |  en  | 4.2.0   |
| ner_oncology_unspecific_posology       |  en  | 4.0.0   |
| ner_oncology                           |  en  | 4.0.0   |
| ner_oncology_tnm                       |  en  | 4.0.0   |
| ner_oncology_anatomy_general           |  en  | 4.0.0   |
| ner_oncology_therapy                   |  en  | 4.0.0   |
| ner_oncology_test                      |  en  | 4.0.0   |
| ner_oncology_diagnosis                 |  en  | 4.0.0   |
| ner_oncology_demographics              |  en  | 4.0.0   |
| ner_oncology_anatomy_granular          |  en  | 4.0.0   |
| ner_oncology_response_to_treatment     |  en  | 4.0.0   |
| ner_oncology_posology                  |  en  | 4.0.0   |
| ner_oncology_biomarker                 |  en  | 4.0.0   |
| ner_sdoh_slim_wip                      |  en  | 4.2.1   |
| ner_clinical_bert                      |  ro  | 4.2.2   |
| ner_living_species_300                 |  es  | 4.2.2   |
| ner_deid_generic_bert                  |  ro  | 4.2.2   |
| ner_oncology_biomarker                 |  en  | 4.2.2   |
| ner_oncology_response_to_treatment     |  en  | 4.2.2   |
| ner_oncology_demographics              |  en  | 4.2.2   |
| ner_oncology_therapy                   |  en  | 4.2.2   |
| ner_oncology                           |  en  | 4.2.2   |
| ner_oncology_anatomy_granular          |  en  | 4.2.2   |
| ner_oncology_anatomy_general           |  en  | 4.2.2   |
| ner_oncology_diagnosis                 |  en  | 4.2.2   |
| ner_oncology_tnm                       |  en  | 4.2.2   |
| ner_oncology_posology                  |  en  | 4.2.2   |
| ner_oncology_unspecific_posology       |  en  | 4.2.2   |
| ner_oncology_test                      |  en  | 4.2.2   |
+----------------------------------------+------+---------+
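The pipelines counterpart works in the same way through the .showPrivatePipelines() and .returnPrivatePipelines() methods mentioned above. A minimal sketch, assuming these methods mirror their model counterparts (exact filtering arguments may vary by version):

# print the available private pretrained pipelines as a table
medical.InternalResourceDownloader.showPrivatePipelines()

# or collect them as a list of [name, language, version] entries
private_pipelines = medical.InternalResourceDownloader.returnPrivatePipelines()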

ModelTracer

This module adds columns to the data to help track the UID and timestamp of each stage of the pipeline.

Given the following pipeline for Medical NER:

# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")\
    .setLabelCasing("upper") #decide if we want to return the tags in upper or lower case

ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter
        ])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

To add the UID and timestamp of each pipeline step, simply use

from sparknlp_jsl.modelTracer import ModelTracer

df = model.transform(empty_data)
tracer_result = ModelTracer().addUidCols(pipeline = nlpPipeline, df = df)
tracer_result.show(truncate=False)
+----+------------------------------------------+--------+-----+----------+---+---------+----------------------------------------------------------------------+----------------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------------------------------+--------------------------------------------------------------------+-----------------------------------------------------------------+
|text|document                                  |sentence|token|embeddings|ner|ner_chunk|documentassembler_model_uid                                           |sentencedetectordlmodel_model_uid                                           |tokenizer_model_uid                                           |word_embeddings_model_model_uid                                           |medicalnermodel_model_uid                                           |nerconverter_model_uid                                           |
+----+------------------------------------------+--------+-----+----------+---+---------+----------------------------------------------------------------------+----------------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------------------------------+--------------------------------------------------------------------+-----------------------------------------------------------------+
|    |[{document, 0, -1, , {sentence -> 0}, []}]|[]      |[]   |[]        |[] |[]       |{uid -> DocumentAssembler_3e110f5ce3dc, timestamp -> 2022-10-21_22:58}|{uid -> SentenceDetectorDLModel_6bafc4746ea5, timestamp -> 2022-10-21_22:58}|{uid -> Tokenizer_bd74fe5f5860, timestamp -> 2022-10-21_22:58}|{uid -> WORD_EMBEDDINGS_MODEL_9004b1d00302, timestamp -> 2022-10-21_22:58}|{uid -> MedicalNerModel_1a8637089929, timestamp -> 2022-10-21_22:58}|{uid -> NerConverter_643c903e9161, timestamp -> 2022-10-21_22:58}|
+----+------------------------------------------+--------+-----+----------+---+---------+----------------------------------------------------------------------+----------------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------------------------------+--------------------------------------------------------------------+-----------------------------------------------------------------+

Pipeline Tracer

PipelineTracer is a class that allows you to trace the stages of a pipeline and get information about them.

The PipelineTracer class provides functionality for tracing and retrieving information about the various stages of a pipeline. It can be used to obtain detailed insights into the entities, assertions, and relations utilized within the pipeline, and it is compatible with both PipelineModel and PretrainedPipeline.

Additionally, it can be used to create a parser dictionary that can be used to build a PipelineOutputParser.

Functions

  • printPipelineSchema: Prints the schema of the pipeline.
  • createParserDictionary: Returns a parser dictionary that can be used to create a PipelineOutputParser.
  • getPossibleEntities: Returns a list of possible entities that the pipeline can include.
  • getPossibleAssertions: Returns a list of possible assertions that the pipeline can include.
  • getPossibleRelations: Returns a list of possible relations that the pipeline can include.
  • getPipelineStages: Returns a list of PipelineStage objects that represent the stages of the pipeline.
  • getParserDictDirectly: Returns a parser dictionary that can be used to create a PipelineOutputParser. This method gets the parser dictionary directly without creating a PipelineTracer object.
  • listAvailableModels: Returns a list of available models for a given language and source.
  • showAvailableModels: Prints a list of available models for a given language and source.

from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser

showAvailableModels

PipelineTracer.showAvailableModels(language="en", source="clinical/models")
clinical_deidentification
explain_clinical_doc_ade
explain_clinical_doc_biomarker
explain_clinical_doc_generic
explain_clinical_doc_granular
explain_clinical_doc_medication
explain_clinical_doc_oncology
explain_clinical_doc_public_health
explain_clinical_doc_radiology
explain_clinical_doc_risk_factors
explain_clinical_doc_vop
icd10cm_resolver_pipeline
icd10cm_rxnorm_resolver_pipeline
rxnorm_resolver_pipeline
snomed_resolver_pipeline
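Beyond listing the available pipelines, a PipelineTracer can be built from a loaded pipeline to inspect its stages. A minimal sketch, assuming the constructor accepts a PretrainedPipeline (or PipelineModel) as described above; the pipeline name comes from the listing shown earlier:

from sparknlp.pretrained import PretrainedPipeline

# load a pretrained pipeline to trace
pipeline = PretrainedPipeline("explain_clinical_doc_granular", "en", "clinical/models")

# build the tracer and query the pipeline structure
tracer = PipelineTracer(pipeline)
tracer.printPipelineSchema()

entities = tracer.getPossibleEntities()      # NER labels the pipeline can produce
assertions = tracer.getPossibleAssertions()  # assertion statuses the pipeline can produce
relations = tracer.getPossibleRelations()    # relation labels the pipeline can produce

# parser dictionary that can be fed to a PipelineOutputParser (see the next section)
parser_dict = tracer.createParserDictionary()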

PipelineOutputParser

The output parser module integrates seamlessly with existing systems to return clear, easy-to-read and easy-to-process results in dictionary format from pretrained pipelines. It is designed to enhance API integration and user understanding, and to streamline data analysis workflows. This class is used to parse the output of a Spark NLP pipeline; it provides methods for extracting named entities, assertions, code mapping, relations, summaries, and deidentification results from the output.


import sparknlp
import sparknlp_jsl
from sparknlp.pretrained import PretrainedPipeline
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser

## Load the pipeline and the text

text = [
    "Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.",
    "The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago." +
    "The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later." +
    "He underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."
]

oncology_pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
annotations = oncology_pipeline.fullAnnotate(text)

## Define the column_maps dictionary

column_maps = {
    'document_identifier': 'XYZ_123',
    'document_text': 'document',
    'entities': ["merged_chunk"],
    'assertions': ["assertion_chunk"],
    'resolver': ["icd10_code"],
    'relations': [],
    'summaries': [],
    "deidentifications": [],
    "classifications": []
}

## Initialize the parser from the dictionary

pipeline_parser = PipelineOutputParser(column_maps)

## Run the parser on the output of a Spark NLP pipeline

parsed_result = pipeline_parser.run(annotations)
print(parsed_result)

RelationalDBDeidentification

The RelationalDBDeidentification class provides a robust solution for de-identifying sensitive data in relational databases. It supports a variety of obfuscation techniques and integrates seamlessly with database systems.

Key features include:

  • End-to-End De-Identification:
    • deidentify(): Automates the de-identification process by:
      • Fetching tables.
      • Extracting schema information.
      • Detecting sensitive columns.
      • Applying obfuscation and masking techniques.
      • Exporting de-identified data as CSV files.
  • Database Connectivity:
    • connect_to_db(): Establishes a connection to the MySQL database.
    • get_all_tables(): Retrieves all table names from the connected database.
  • Schema and Data Processing:
    • get_schema_info(table_name): Extracts schema details, including date columns, primary keys, and foreign keys, for a specified table.
  • Data Obfuscation:
    • obfuscate_dates(df, date_columns): Shifts dates by a specified number of days.
    • obfuscate_ages(df, age_columns, use_hipaa): Obfuscates age columns using HIPAA rules or predefined age groups.
    • mask_other_sensitive_columns(df, other_columns): Masks sensitive columns by replacing their values with asterisks.

This class provides a complete framework for protecting sensitive information while maintaining data integrity for relational databases.


Example Usage

from sparknlp_jsl.utils.database_deidentification import RelationalDBDeidentification

config = {
    "db_config": {
        "host": "localhost",
        "user": "root",
        "password": "root",
        "database": "healthcare_db"
    },
    "deid_options": {
        "days_to_shift": 10,
        "age_groups": {
            "child": (0, 12),
            "teen": (13, 19),
            "adult": (20, 64),
            "senior": (65, 90)
        },
        "pk_fk_shift_value": 100,
        "use_hipaa": False,
        "output_path": "deidentified_output/"
    },
    "logging": {
        "level": "INFO",
        "file": "deidentification.log"
    }
}

deidentifier = RelationalDBDeidentification(spark, config)
deidentifier.deidentify()

Example for appointments

| appointment_id | patient_id | doctor_name       | appointment_date | reason             |
|---------------:|-----------:|:------------------|:-----------------|:-------------------|
| 1              | 1          | Dr. Emily Carter  | 2024-01-15       | Annual Checkup     |
| 2              | 2          | Dr. Sarah Johnson | 2024-02-10       | Flu Symptoms       |
| 3              | 1          | Dr. Emily Carter  | 2024-02-15       | Follow-up Visit    |
| 4              | 1          | Dr. James Wilson  | 2024-03-20       | Routine Blood Test |

Result for appointments (De-identified table):

| appointment_id | patient_id | doctor_name | appointment_date | reason             |
|---------------:|-----------:|:------------|:-----------------|:-------------------|
| 101            | 101        | *****       | 2024-01-25       | Annual Checkup     |
| 102            | 102        | *****       | 2024-02-20       | Flu Symptoms       |
| 103            | 101        | *****       | 2024-02-25       | Follow-up Visit    |
| 104            | 101        | *****       | 2024-03-30       | Routine Blood Test |

Example for patients

| patient_id | name       | address                  | ssn         | email                  | dob        | age |
|-----------:|:-----------|:-------------------------|:------------|:-----------------------|:-----------|----:|
| 1          | John Doe   | 123 Main St, Springfield | 123-45-6789 | john.doe@example.com   | 1985-04-15 | 38  |
| 2          | Jane Smith | 456 Elm St, Shelbyville  | 987-65-4321 | jane.smith@example.com | 1990-07-20 | 33  |

Result for patients (De-identified table):

| patient_id | name  | address | ssn   | email | dob        | age |
|-----------:|:------|:--------|:------|:------|:-----------|----:|
| 101        | ***** | *****   | ***** | ***** | 1985-04-25 | 39  |
| 102        | ***** | *****   | ***** | ***** | 1990-07-30 | 62  |

Please check the 4.8.Clinical_Deidentification_for_Structured_Data Notebook for more information.

Apply Exception Handling

The apply_exception_handling utility sets setDoExceptionHandling(True) for specified or all stages in a Spark NLP pipeline, making your pipeline more robust to errors in supported stages.

Parameters:

  • pipeline: Spark NLP Pipeline or PipelineModel object
  • stage: List of stage indices to set exception handling (if parameter exists)
  • all: If True, applies to all stages with the parameter. (Default: True)

Import & Usage

from sparknlp_jsl.utils import apply_exception_handling

Checking Stage Exception Handling

from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
stage = oncology_pipeline.model.stages[15]
stage.getDoExceptionHandling()

Apply Exception Handling to Pipeline

handled_pretrainedPipeline = apply_exception_handling(oncology_pipeline)

Output:

Total modified stages: 15

  • Stage 4: MedicalNerModel
  • Stage 5: NerConverterInternalModel
  • Stage 6: MedicalNerModel
  • Stage 7: NerConverterInternalModel
  • Stage 8: MedicalNerModel
  • Stage 9: NerConverterInternalModel
  • Stage 10: MedicalNerModel
  • Stage 11: NerConverterInternalModel
  • Stage 13: ChunkMergeModel
  • Stage 14: ChunkMergeModel
  • Stage 15: AssertionDLModel
  • Stage 16: ChunkFilterer
  • Stage 17: AssertionDLModel
  • Stage 21: RelationExtractionModel
  • Stage 22: RelationExtractionModel

Total skipped stages: 9

  • Stage 0: DocumentAssembler (No exception handling support)
  • Stage 1: SentenceDetectorDLModel (No exception handling support)
  • Stage 2: TokenizerModel (No exception handling support)
  • Stage 3: WordEmbeddingsModel (No exception handling support)
  • Stage 12: TextMatcherInternalModel (No exception handling support)
  • Stage 18: AssertionMerger (No exception handling support)
  • Stage 19: PerceptronModel (No exception handling support)
  • Stage 20: DependencyParserModel (No exception handling support)
  • Stage 23: AnnotationMerger (No exception handling support)

Confirm Exception Handling Setting

stage = handled_pretrainedPipeline.stages[15]
stage.getDoExceptionHandling()

Output:

True
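To restrict the change to specific stages rather than all of them, the stage and all parameters described above can be combined. A hedged sketch (the exact interplay of stage and all may vary by version):

# apply exception handling only to the stages at indices 15 and 17
handled_selected = apply_exception_handling(oncology_pipeline, stage=[15, 17], all=False)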

Using with LightPipeline

from sparknlp.base import LightPipeline

light_model = LightPipeline(handled_pretrainedPipeline)
result = light_model.fullAnnotate("The patient is a 55-years old. He has a history of smoking and lung cancer.")

Annotation2Training

The Annotation2Training utility converts annotation results from JSON or CSV files into a DataFrame suitable for NER training.
Input files must have a structure similar to those produced by John Snow Labs’ Generative AI annotation tool.


from sparknlp_jsl.training import Annotation2Training
from sparknlp_jsl.annotator import *
from sparknlp.annotator import *
from sparknlp.base import *
from pyspark.ml import Pipeline

# Create a base pipeline to extract sentences and tokens
document_assembler = DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")
tokenizer = Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

pipeline = Pipeline(stages=[
  document_assembler,
  sentence_detector,
  tokenizer,
])

pipeline_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

# Load the annotation2training utility class
annotation2training = Annotation2Training(spark)

# Convert a json file to a training dataframe
training_df = annotation2training.convertJson2NerDF(
  json_path="path/to/annotations.json",  # Path to GenAI json file
  pipeline_model=pipeline_model,
  repartition=32,                        # Number of partitions to use when creating the DataFrame
  token_output_col="token",
  ner_label_col="label"
)

training_df.show(5)
# Returns a Spark DataFrame to train NER models.
# Ready to go for MedicalNerApproach
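For reference, a minimal training sketch that consumes training_df, assuming clinical word embeddings and typical MedicalNerApproach settings (adjust epochs, batch size, and embeddings to your use case):

# add clinical embeddings and train an NER model on the converted DataFrame
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

ner_approach = MedicalNerApproach() \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setLabelColumn("label") \
  .setOutputCol("ner") \
  .setMaxEpochs(10) \
  .setBatchSize(8)

ner_pipeline_model = Pipeline(stages=[embeddings, ner_approach]).fit(training_df)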

dict_to_annotation_converter

This method is used to convert a list of dictionaries to a Spark DataFrame. The returned DataFrame will contain chunk and document columns that are compatible with Spark NLP and can be used for deidentification.

Key Points

  • The input data should be a list of dictionaries.
  • Each dictionary must have a "text" key (document text) and a "chunks" key (list of chunk dictionaries).
  • Each chunk dictionary should have "begin", "end", "result", "entity", and "metadata" keys.
  • Optionally, "document_metadata" and other custom columns can be included.
  • Returns a Spark DataFrame compatible with Spark NLP (ready for deidentification).
  • Use adjust_end=True if your chunk end indices are exclusive (i.e., not already decreased by 1 to match Spark NLP's inclusive end convention).
  • Flexible validation options via chunk_validation_options.

Parameters:

  • spark: SparkSession
  • data: list of dictionaries
  • document_col_name: Name for document column (default: "document")
  • chunk_col_name: Name for chunk column (default: "chunk")
  • adjust_end: If True, chunk end indices will be decreased by 1 (default: False)
  • chunk_validation_options: Dict for validation options, e.g.:
    • "validation_mode": "filter" (default) or "error"
    • "scope_window": [before, after] chars to check chunk span
    • "case_sensitive": Boolean for matching case

Example

list_of_dict = [
  {
    "text": "My name is George, and I was born on 12/11/1995. I have the pleasure of working at John Snow Labs.",
    "chunks": [
      {
        "begin": 11,
        "end": 16, 
        "result": "George",
        "entity": "PERSON",
        "metadata": {"confidence": "1", "ner_source": "llm_output"}
      },
      {
        "begin": 37,
        "end": 46,
        "result": "12/11/1995",
        "entity": "DATE",
        "metadata": {"confidence": "0.9", "ner_source": "llm_output"}
      },
      {
        "begin": 83,
        "end": 96,
        "result": "John Snow labs",
        "entity": "ORG",
        "metadata": {"confidence": "0.87", "ner_source": "llm_output"}
      }
    ],
    "doc_id": "1",
    "document_metadata": {"dateshift": "10"},
    "file_path": "/path/to/file1"
  },
  {
    "text": "I am Bush, and English is my native language. You can reach me at my email: bush@example.com.",
    "chunks": [
      {
        "begin": 5,
        "end": 8,
        "result": "Bush",
        "entity": "PERSON",
        "metadata": {"confidence": "1", "ner_source": "ner_dl"}
      },
      {
        "begin": 15,
        "end": 21,
        "result": "English",
        "entity": "LANGUAGE",
        "metadata": {"confidence": "0.98", "ner_source": "ner_dl"}
      },
      {
        "begin": 76,
        "end": 91,
        "result": "bush@example.com",
        "entity": "EMAIL",
        "metadata": {"confidence": "0.87", "ner_source": "ner_dl"}
      }
    ],
    "doc_id": "2",
    "document_metadata": {"dateshift": "5"},
    "file_path": "/path/to/file2"
  }
]

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *   # provides LightDeIdentification
from sparknlp_jsl.utils import *

result_df = dict_to_annotation_converter(
  spark, 
  list_of_dict, 
  document_col_name="document", 
  chunk_col_name="chunk", 
  adjust_end=False
)

light_deIdentification = (
  LightDeIdentification()
  .setInputCols(["document", "chunk"])
  .setOutputCol("deidentified")
  .setMode("mask")
)
light_deIdentification.transform(result_df).selectExpr("deidentified.result").show(truncate=False)
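The chunk_validation_options parameter can be used to verify that each chunk's result actually matches the span given by its begin/end offsets. A minimal sketch using the option keys listed above (values are illustrative):

validated_df = dict_to_annotation_converter(
  spark,
  list_of_dict,
  document_col_name="document",
  chunk_col_name="chunk",
  chunk_validation_options={
    "validation_mode": "filter",  # drop chunks that fail validation ('error' would raise instead)
    "scope_window": [5, 5],       # characters to check before/after the chunk span
    "case_sensitive": False
  }
)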

llm_df_preprocessor

Preprocesses text data in a DataFrame by adding prefix and/or suffix prompts for LLM usage.

This function takes a PySpark DataFrame containing text data and creates prompts suitable for Large Language Model (LLM) processing by concatenating a prefix prompt, the original text, and optionally a suffix prompt. This is particularly useful for batch processing of text data in distributed computing environments.

The function supports both in-place column updates and creation of new columns, making it flexible for different use cases. It performs comprehensive input validation to ensure data integrity and provides clear error messages for troubleshooting.


Arguments

  • data_frame (DataFrame): A PySpark DataFrame containing the text data to be processed. Must be a valid PySpark DataFrame with at least one column.
  • text_col_name (str): The name of the column containing the text data that will be processed. This column must exist in the DataFrame and be of StringType.
  • prefix_prompt (str, optional): The prompt text to be added at the beginning of each text entry. Can be empty string if only suffix is needed. Defaults to "".
  • suffix_prompt (str, optional): The prompt text to be added at the end of each text entry. Can be empty string if only prefix is needed. Defaults to "".
  • new_text_col_name (str, optional): The name of the new column to create with the processed prompts. If None or same as text_col_name, the original column will be updated in-place. Defaults to None.

Returns

  • DataFrame: A new PySpark DataFrame with the processed prompt column. The DataFrame contains all original columns plus the new/updated prompt column with the concatenated prefix + text + suffix format.

Example Usage

from sparknlp_jsl.utils import llm_df_preprocessor  # assumed import path; adjust to your version

df = spark.createDataFrame([
    ("The weather is nice today",),
    ("It will rain tomorrow",),
    ("I love sunny days",)
], ["text"])

result_df = llm_df_preprocessor(
    data_frame=df,
    text_col_name="text",
    prefix_prompt="Analyze the sentiment of this text: ",
    new_text_col_name="prompt"
)
result_df.show(truncate=False)

Output:

+-------------------------+-------------------------------------------------------------+
|text                     |prompt                                                       |
+-------------------------+-------------------------------------------------------------+
|The weather is nice today|Analyze the sentiment of this text: The weather is nice today|
|It will rain tomorrow    |Analyze the sentiment of this text: It will rain tomorrow    |
|I love sunny days        |Analyze the sentiment of this text: I love sunny days        |
+-------------------------+-------------------------------------------------------------+

Notes:

  • At least one of prefix_prompt or suffix_prompt must be provided (non-empty).
  • The function preserves all original columns in the DataFrame.
  • For large datasets, consider caching the input DataFrame before calling this function multiple times: data_frame.cache().
  • If the text column contains null values, they will be treated as empty strings in the concatenation operation.
  • The function is designed to work efficiently with Spark’s distributed computing model and can handle large-scale text preprocessing tasks.
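A suffix prompt can be combined with, or used instead of, a prefix. A short sketch based on the parameters above (the prompt texts are illustrative):

qa_df = llm_df_preprocessor(
    data_frame=df,
    text_col_name="text",
    prefix_prompt="### Text:\n",
    suffix_prompt="\n### Question: What is the overall sentiment?",
    new_text_col_name="prompt"
)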

vision_llm_preprocessor

Loads images from a specified path as raw bytes and adds a prompt column for Vision LLM processing.

This function is specifically designed for MedicalVisionLLM and other vision-language models that require images in raw byte format rather than OpenCV-compatible format. It loads images from a directory path, preserves them as raw bytes along with their metadata, and adds a user-defined prompt that will be associated with each image for downstream processing.

Supports common image formats and is optimized for distributed processing of large image datasets in medical and computer vision applications.


Arguments

  • spark (SparkSession): An active SparkSession instance.
  • images_path (str): The file system path to the directory containing images or a specific image file pattern.
    Supports both local file system and distributed file systems (HDFS, S3, etc.).
    Supported formats: JPEG, PNG, GIF, and BMP.
    Examples: "/path/to/images/", "s3://bucket/images/*.jpg"
  • prompt (str): The text prompt to be associated with each image.
    This prompt will be stored in the specified output column and can be used for vision-language model instructions, descriptions, or queries.
    Must be a non-empty string.
  • output_col_name (str, optional): The name of the column where the prompt will be stored in the resulting DataFrame.
    Must be a valid column name. Defaults to "text".

Returns

  • DataFrame: A PySpark DataFrame with image raw bytes, metadata, and prompt column.

Example Usage

from sparknlp_jsl.utils import vision_llm_preprocessor  # assumed import path; adjust to your version

df = vision_llm_preprocessor(
    spark=spark,
    images_path="/path/to/medical/images/",
    prompt="Analyze this medical image for abnormalities",
    output_col_name="medical_prompt"
)
df.show(5, truncate=False)

Technical Requirements:

  • File system permissions must allow read access to the specified path.

See also:

  • MedicalVisionLLM: For processing the output DataFrame with vision-language models.
  • ImageAssembler: The underlying Spark NLP component used for image loading.