sparknlp_jsl.utils.llm_utils

Module Contents

Functions

llm_df_preprocessor(→ pyspark.sql.DataFrame)

Preprocesses text data in a DataFrame by adding prefix and/or suffix prompts for LLM usage.

vision_llm_preprocessor(→ pyspark.sql.DataFrame)

Loads images from a specified path as raw bytes and adds a prompt column for Vision LLM processing.

llm_df_preprocessor(data_frame: pyspark.sql.DataFrame, text_col_name: str, prefix_prompt: str = '', suffix_prompt: str = '', new_text_col_name: str = None) → pyspark.sql.DataFrame

Preprocesses text data in a DataFrame by adding prefix and/or suffix prompts for LLM usage.

This function takes a PySpark DataFrame containing text data and builds prompts suitable for Large Language Model (LLM) processing by concatenating an optional prefix prompt, the original text, and an optional suffix prompt. This is particularly useful for batch processing of text data in distributed computing environments.

The function supports both in-place column updates and creation of new columns, making it flexible for different use cases. It performs comprehensive input validation to ensure data integrity and provides clear error messages for troubleshooting.

Parameters:
  • data_frame (DataFrame) – The input DataFrame. Must be a valid PySpark DataFrame with at least one column.

  • text_col_name (str) – The name of the column containing the text to be processed. This column must exist in the DataFrame and be of StringType.

  • prefix_prompt (str, optional) – The text to prepend to each text entry. Can be an empty string if only a suffix is needed. Defaults to “”.

  • suffix_prompt (str, optional) – The text to append to each text entry. Can be an empty string if only a prefix is needed. Defaults to “”.

  • new_text_col_name (str, optional) – The name of the new column to create with the processed prompts. If None or the same as text_col_name, the original column will be updated in-place. Defaults to None.

Returns:

A new PySpark DataFrame with the processed prompt column. The DataFrame contains all original columns plus the new/updated prompt column with the concatenated prefix + text + suffix format.

Return type:

DataFrame

Examples

>>> df = spark.createDataFrame([
...     ("The weather is nice today",),
...     ("It will rain tomorrow",),
...     ("I love sunny days",)
... ], ["text"])
>>>
>>> result_df = llm_df_preprocessor(
...     data_frame=df,
...     text_col_name="text",
...     prefix_prompt="Analyze the sentiment of this text: ",
...     new_text_col_name="prompt"
... )
>>> result_df.show(truncate=False)
+-------------------------+-------------------------------------------------------------+
|text                     |prompt                                                       |
+-------------------------+-------------------------------------------------------------+
|The weather is nice today|Analyze the sentiment of this text: The weather is nice today|
|It will rain tomorrow    |Analyze the sentiment of this text: It will rain tomorrow    |
|I love sunny days        |Analyze the sentiment of this text: I love sunny days        |
+-------------------------+-------------------------------------------------------------+

Notes

  • At least one of prefix_prompt or suffix_prompt must be provided (non-empty).

  • The function preserves all original columns in the DataFrame.

  • For large datasets, consider caching the input DataFrame before calling this function multiple times: data_frame.cache().

  • If the text column contains null values, they will be treated as empty strings in the concatenation operation.

  • The function is designed to work efficiently with Spark’s distributed computing model and can handle large-scale text preprocessing tasks.
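
The per-row behavior described in the notes above can be modeled in plain Python. This is only a sketch of the documented semantics, not the library's implementation: build_prompt is a hypothetical helper, and raising ValueError for two empty prompts is an assumption, since the page does not name the exception type.

```python
from typing import Optional


def build_prompt(text: Optional[str], prefix_prompt: str = "", suffix_prompt: str = "") -> str:
    """Plain-Python model of the documented per-row result: prefix + text + suffix."""
    # Documented rule: at least one of the two prompts must be non-empty.
    # (The exception type is an assumption made for this sketch.)
    if not prefix_prompt and not suffix_prompt:
        raise ValueError("Provide a non-empty prefix_prompt or suffix_prompt.")
    # Documented rule: a null text value is treated as an empty string.
    body = "" if text is None else text
    return f"{prefix_prompt}{body}{suffix_prompt}"


rows = ["The weather is nice today", None]
prompts = [
    build_prompt(
        t,
        prefix_prompt="Analyze the sentiment of this text: ",
        suffix_prompt="\nAnswer:",
    )
    for t in rows
]
```

With both prompts set, each entry in prompts is the prefix, then the row text (or nothing for a null row), then the suffix.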

vision_llm_preprocessor(spark: pyspark.sql.SparkSession, images_path: str, prompt: str, output_col_name: str = 'text') → pyspark.sql.DataFrame

Loads images from a specified path as raw bytes and adds a prompt column for Vision LLM processing.

This function is specifically designed for MedicalVisionLLM and other vision-language models that require images in raw byte format rather than OpenCV-compatible format. It loads images from a directory path, preserves them as raw bytes along with their metadata, and adds a user-defined prompt that will be associated with each image for downstream processing.

The function supports common image formats and is optimized for distributed processing of large image datasets in medical and computer vision applications.

Parameters:
  • spark (SparkSession) – An active SparkSession instance.

  • images_path (str) – The file system path to the directory containing images or a specific image file pattern. Supports both local file system and distributed file systems (HDFS, S3, etc.). Supported formats: JPEG, PNG, GIF, and BMP. Examples: “/path/to/images/”, “s3://bucket/images/*.jpg”

  • prompt (str) – The text prompt to be associated with each image. This prompt will be stored in the specified output column and can be used for vision-language model instructions, descriptions, or queries. Must be a non-empty string.

  • output_col_name (str, optional) – The name of the column where the prompt will be stored in the resulting DataFrame. Must be a valid column name. Defaults to “text”.

Returns:

A PySpark DataFrame

Return type:

DataFrame

Examples

>>> df = vision_llm_preprocessor(
...     spark=spark,
...     images_path="/path/to/medical/images/",
...     prompt="Analyze this medical image for abnormalities",
...     output_col_name="medical_prompt"
... )
>>> df.show(5, truncate=False)

Technical Requirements:
  • File system permissions must allow read access to the specified path.
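
To make the "raw bytes plus prompt" idea concrete, here is a local, non-distributed stand-in written in plain Python. It is a sketch under stated assumptions and does not reproduce the library's behavior or output schema: load_images_with_prompt and the "path" and "content" keys are hypothetical, and files are selected by extension only.

```python
import tempfile
from pathlib import Path

# Extensions for the formats the documentation lists (JPEG, PNG, GIF, BMP).
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def load_images_with_prompt(images_dir, prompt, output_col_name="text"):
    """Hypothetical local stand-in: one dict per image with raw bytes and the prompt."""
    # Documented rule: the prompt must be a non-empty string.
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    rows = []
    for path in sorted(Path(images_dir).iterdir()):
        # Keep the file contents undecoded, as raw bytes.
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            rows.append({
                "path": str(path),             # assumed key name
                "content": path.read_bytes(),  # assumed key name
                output_col_name: prompt,
            })
    return rows


# Small demonstration on a temporary directory with one fake image file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "scan_a.png").write_bytes(b"\x89PNG fake bytes")
    (Path(tmp) / "notes.txt").write_text("not an image")
    rows = load_images_with_prompt(
        tmp, "Analyze this medical image for abnormalities", "medical_prompt"
    )
```

Every returned row carries the same prompt under the chosen column name, which mirrors how the documented function associates one prompt with each loaded image.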

See also

  • MedicalVisionLLM: For processing the output DataFrame with vision-language models

  • ImageAssembler: The underlying Spark NLP component used for image loading