sparknlp_jsl.annotator.annotation_converter#

Contains classes for the AnnotationConverter.

Module Contents#

Classes#

AnnotationConverter

A flexible converter for transforming annotations in a DataFrame using custom logic.

class AnnotationConverter(f=None, classname='com.johnsnowlabs.annotator.AnnotationConverter')#

Bases: sparknlp.internal.AnnotatorTransformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol

A flexible converter for transforming annotations in a DataFrame using custom logic.

This class allows users to define custom conversion functions (f) to modify annotations, enabling transformations such as:

  • Assertion outputs → Chunk outputs

  • LLM outputs → Document outputs

  • Rule-based outputs → Updated outputs

The converter plugs into Spark NLP pipelines alongside annotators such as DocumentAssembler and Tokenizer, but the conversion logic runs purely in Python (not Scala).

Input Annotation Type: ANY

Output Annotation Type: ANY

Parameters:
  • f (function) – User-defined function to transform annotations.

  • inputCol (str) – Name of the input column containing annotations.

  • outputCol (str) – Name of the output column for converted annotations.

  • outputAnnotatorType (str) – Type of the output annotations (e.g., “token”).

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document')
>>> tokenizer = Tokenizer().setInputCols('document').setOutputCol('token')
>>> import re
>>> from sparknlp.annotation import Annotation
>>> def tokenize(annotations):
...     """Split each incoming token on lowercase-to-uppercase (camelCase) boundaries."""
...     new_annotations = []
...     pattern = r"(?<=[a-z])(?=[A-Z])"
...     for annotation in annotations:
...         parts = re.split(pattern, annotation.result)
...         begin = annotation.begin
...         for part in parts:
...             end = begin + len(part) - 1
...             new_annotations.append(
...                 Annotation(
...                     annotatorType="token",
...                     begin=begin,
...                     end=end,
...                     result=part,
...                     metadata=annotation.metadata,
...                     embeddings=annotation.embeddings,
...                 )
...             )
...             begin = end + 1
...     return new_annotations
>>> camel_case_tokenizer = AnnotationConverter(f=tokenize) \
...     .setInputCol("token") \
...     .setOutputCol("camel_case_token") \
...     .setOutputAnnotatorType("token")
>>> pipeline = Pipeline(stages=[document_assembler, tokenizer, camel_case_tokenizer])
>>> text = "I like SparkNLP annotators such as MedicalBertForSequenceClassification and Chunk2Doc."
>>> test_data = spark.createDataFrame([[text]]).toDF("text")
>>> result = pipeline.fit(test_data).transform(test_data)
>>> result.selectExpr("explode (camel_case_token) as cct").select("cct.result", "cct.begin", "cct.end").show(truncate=False)
+--------------+-----+---+
|result        |begin|end|
+--------------+-----+---+
|I             |0    |0  |
|like          |2    |5  |
|Spark         |7    |11 |
|NLP           |12   |14 |
|annotators    |16   |25 |
|such          |27   |30 |
|as            |32   |33 |
|Medical       |35   |41 |
|Bert          |42   |45 |
|For           |46   |48 |
|Sequence      |49   |56 |
|Classification|57   |70 |
|and           |72   |74 |
|Chunk2Doc     |76   |84 |
|.             |85   |85 |
+--------------+-----+---+
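
The same mechanism covers the other conversion directions listed above, e.g. re-typing assertion outputs as chunk outputs. The sketch below is illustrative only: it assumes an upstream assertion column named "assertion" and a metadata key "ner_chunk" holding the original chunk text, neither of which is guaranteed by this page; adapt it to whatever your upstream annotators actually emit.

>>> def assertions_to_chunks(annotations):
...     """Hedged sketch: re-type assertion annotations as chunk annotations."""
...     new_annotations = []
...     for annotation in annotations:
...         metadata = dict(annotation.metadata)
...         # Keep the assertion label; the "ner_chunk" key is an assumption
...         # about the upstream metadata, not a documented contract.
...         metadata["assertion"] = annotation.result
...         new_annotations.append(
...             Annotation(
...                 annotatorType="chunk",
...                 begin=annotation.begin,
...                 end=annotation.end,
...                 result=metadata.get("ner_chunk", annotation.result),
...                 metadata=metadata,
...                 embeddings=annotation.embeddings,
...             )
...         )
...     return new_annotations
>>> assertion_to_chunk = AnnotationConverter(f=assertions_to_chunks) \
...     .setInputCol("assertion") \
...     .setOutputCol("assertion_chunk") \
...     .setOutputAnnotatorType("chunk")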
functionStr#
getter_attrs = []#
inputCol :Param[str]#
name = 'AnnotationConverter'#
outputAnnotatorType#
outputCol :Param[str]#
uid = ''#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams
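
For example, a hedged sketch reusing camel_case_tokenizer from the Examples section (the new column name is illustrative): the extra param map overrides the corresponding params on the copy while the original instance is left untouched.

>>> converter_copy = camel_case_tokenizer.copy(
...     {camel_case_tokenizer.outputCol: "camel_case_token_v2"}
... )
>>> converter_copy.getOutputCol()
'camel_case_token_v2'
>>> camel_case_tokenizer.getOutputCol()  # the original is left unchanged
'camel_case_token'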

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict
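
A hedged sketch of the precedence (column names are illustrative): explicitly set values override defaults, and values passed in extra override both.

>>> converter = AnnotationConverter(f=tokenize).setInputCol("token")
>>> params = converter.extractParamMap({converter.outputCol: "override_out"})
>>> params[converter.inputCol]
'token'
>>> params[converter.outputCol]
'override_out'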

getInputCol() str#

Gets the value of inputCol or its default value.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputAnnotatorType()#
getOutputCol() str#

Gets the value of outputCol or its default value.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of write().save(path).
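
A hedged persistence sketch (the path is illustrative, and whether the Python conversion function f fully round-trips through the serialized functionStr is not covered by this page):

>>> camel_case_tokenizer.write().overwrite().save("/tmp/camel_case_converter")
>>> loaded_converter = AnnotationConverter.load("/tmp/camel_case_converter")
>>> loaded_converter.getOutputCol()
'camel_case_token'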

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setInputCol(value)#

Sets the value of inputCol.

setOutputAnnotatorType(value)#

Sets the value of outputAnnotatorType.

setOutputCol(value)#

Sets the value of outputCol.

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame
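
As a standalone sketch reusing result from the Examples section above (which already contains the token input column), the optional param map overrides the output column for this call only; the column names are illustrative.

>>> again = camel_case_tokenizer.transform(
...     result, {camel_case_tokenizer.outputCol: "camel_case_token_2"}
... )
>>> "camel_case_token_2" in again.columns
True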

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.