sparknlp_jsl.annotator.annotation_converter
Contains classes for the AnnotationConverter.
Module Contents#
Classes#
AnnotationConverter: A flexible converter for transforming annotations in a DataFrame using custom logic.
- class AnnotationConverter(f=None, classname='com.johnsnowlabs.annotator.AnnotationConverter')#
Bases: sparknlp.internal.AnnotatorTransformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol
A flexible converter for transforming annotations in a DataFrame using custom logic.
This class allows users to define a custom conversion function (f) to modify annotations, enabling transformations such as:
- Assertion outputs → Chunk outputs
- LLM outputs → Document outputs
- Rule-based outputs → Updated outputs
The converter integrates with Spark NLP-style pipelines (e.g., DocumentAssembler, Tokenizer), but the conversion function runs purely in Python (not Scala).
Input Annotation Type: ANY
Output Annotation Type: ANY
- Parameters:
f (function) – User-defined function to transform annotations.
inputCol (str) – Name of the input column containing annotations.
outputCol (str) – Name of the output column for converted annotations.
outputAnnotatorType (str) – Type of the output annotations (e.g., “token”).
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document')
>>> tokenizer = Tokenizer().setInputCols('document').setOutputCol('token')
>>> def tokenize(annotations):
...     import re
...     new_annotations = []
...     pattern = r"(?<=[a-z])(?=[A-Z])"
...     for annotation in annotations:
...         text = annotation.result
...         parts = re.split(pattern, text)
...         begin = annotation.begin
...         for part in parts:
...             end = begin + len(part) - 1
...             new_annotations.append(
...                 Annotation(
...                     annotatorType="token",
...                     begin=begin,
...                     end=end,
...                     result=part,
...                     metadata=annotation.metadata,
...                     embeddings=annotation.embeddings,
...                 )
...             )
...             begin = end + 1
...     return new_annotations
>>> camel_case_tokenizer = (
...     AnnotationConverter(f=tokenize)
...     .setInputCol("token")
...     .setOutputCol("camel_case_token")
...     .setOutputAnnotatorType("token")
... )
>>> pipeline = Pipeline(stages=[document_assembler, tokenizer, camel_case_tokenizer])
>>> text = "I like SparkNLP annotators such as MedicalBertForSequenceClassification and Chunk2Doc."
>>> test_data = spark.createDataFrame([[text]]).toDF("text")
>>> result = pipeline.fit(test_data).transform(test_data)
>>> result.selectExpr("explode(camel_case_token) as cct") \
...     .select("cct.result", "cct.begin", "cct.end") \
...     .show(truncate=False)
+--------------+-----+---+
|result        |begin|end|
+--------------+-----+---+
|I             |0    |0  |
|like          |2    |5  |
|Spark         |7    |11 |
|NLP           |12   |14 |
|annotators    |16   |25 |
|such          |27   |30 |
|as            |32   |33 |
|Medical       |35   |41 |
|Bert          |42   |45 |
|For           |46   |48 |
|Sequence      |49   |56 |
|Classification|57   |70 |
|and           |72   |74 |
|Chunk2Doc     |76   |84 |
|.             |85   |85 |
+--------------+-----+---+
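As an additional illustration of the conversion types listed above (Assertion outputs → Chunk outputs), the following is a minimal, hypothetical sketch; the "assertion" input column and any upstream assertion annotator are assumptions for illustration only and are not provided by this class.
>>> def assertion_to_chunk(annotations):
...     # Re-emit each incoming annotation as a chunk annotation,
...     # keeping the original offsets, result, metadata, and embeddings.
...     return [
...         Annotation(
...             annotatorType="chunk",
...             begin=a.begin,
...             end=a.end,
...             result=a.result,
...             metadata=a.metadata,
...             embeddings=a.embeddings,
...         )
...         for a in annotations
...     ]
>>> assertion_as_chunk = (
...     AnnotationConverter(f=assertion_to_chunk)
...     .setInputCol("assertion")
...     .setOutputCol("assertion_chunk")
...     .setOutputAnnotatorType("chunk")
... )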
- functionStr#
- getter_attrs = []#
- inputCol: Param[str]#
- name = 'AnnotationConverter'#
- outputAnnotatorType#
- outputCol: Param[str]#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra (see the sketch below this entry).
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
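A brief sketch of this precedence ordering, using the outputCol param of an AnnotationConverter (the column names are arbitrary, for illustration only): a value passed in extra wins over the user-supplied value.
>>> converter = AnnotationConverter().setOutputCol("user_value")
>>> merged = converter.extractParamMap({converter.outputCol: "extra_value"})
>>> merged[converter.outputCol]
'extra_value'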
- getInputCol() str #
Gets the value of inputCol or its default value.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputAnnotatorType()#
Gets the value of outputAnnotatorType.
- getOutputCol() str #
Gets the value of outputCol or its default value.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setOutputAnnotatorType(value)#
Sets the value of outputAnnotatorType.
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
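A minimal sketch of calling transform directly with a param override, reusing camel_case_tokenizer from the example above; token_df stands in for a hypothetical DataFrame that already contains the "token" annotation column.
>>> # The override applies only to this call; the annotator instance itself is unchanged.
>>> converted_df = camel_case_tokenizer.transform(
...     token_df, {camel_case_tokenizer.outputCol: "cc_tokens"}
... )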
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.