sparknlp_jsl.annotator.annotation_converter
Contains classes for the AnnotationConverter.
Module Contents#
Classes#
AnnotationConverter: A flexible converter for transforming annotations in a DataFrame using custom logic.
- class AnnotationConverter(f=None, classname='com.johnsnowlabs.annotator.AnnotationConverter')#
Bases: sparknlp.internal.AnnotatorTransformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol
A flexible converter for transforming annotations in a DataFrame using custom logic.
This class allows users to define a custom conversion function (f) to modify annotations, enabling transformations such as:
- Assertion outputs → Chunk outputs
- LLM outputs → Document outputs
- Rule-based outputs → Updated outputs
The converter integrates with Spark NLP-style pipelines (e.g., DocumentAssembler, Tokenizer), but the conversion function runs purely in Python (not Scala).
Input Annotation Type: ANY
Output Annotation Type: ANY
- Parameters:
f (function) – User-defined function to transform annotations.
inputCol (str) – Name of the input column containing annotations.
outputCol (str) – Name of the output column for converted annotations.
outputAnnotatorType (str) – Type of the output annotations (e.g., “token”).
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document')
>>> tokenizer = Tokenizer().setInputCols('document').setOutputCol('token')
>>> def tokenize(annotations):
...     import re
...     new_annotations = []
...     pattern = r"(?<=[a-z])(?=[A-Z])"
...     for annotation in annotations:
...         text = annotation.result
...         parts = re.split(pattern, text)
...         begin = annotation.begin
...         for part in parts:
...             end = begin + len(part) - 1
...             new_annotations.append(
...                 Annotation(
...                     annotatorType="token",
...                     begin=begin,
...                     end=end,
...                     result=part,
...                     metadata=annotation.metadata,
...                     embeddings=annotation.embeddings,
...                 )
...             )
...             begin = end + 1
...     return new_annotations
>>> camel_case_tokenizer = (
...     AnnotationConverter(f=tokenize)
...     .setInputCol("token")
...     .setOutputCol("camel_case_token")
...     .setOutputAnnotatorType("token")
... )
>>> pipeline = Pipeline(stages=[document_assembler, tokenizer, camel_case_tokenizer])
>>> text = "I like SparkNLP annotators such as MedicalBertForSequenceClassification and Chunk2Doc."
>>> test_data = spark.createDataFrame([[text]]).toDF("text")
>>> result = pipeline.fit(test_data).transform(test_data)
>>> result.selectExpr("explode(camel_case_token) as cct") \
...     .select("cct.result", "cct.begin", "cct.end") \
...     .show(truncate=False)
+--------------+-----+---+
|result        |begin|end|
+--------------+-----+---+
|I             |0    |0  |
|like          |2    |5  |
|Spark         |7    |11 |
|NLP           |12   |14 |
|annotators    |16   |25 |
|such          |27   |30 |
|as            |32   |33 |
|Medical       |35   |41 |
|Bert          |42   |45 |
|For           |46   |48 |
|Sequence      |49   |56 |
|Classification|57   |70 |
|and           |72   |74 |
|Chunk2Doc     |76   |84 |
|.             |85   |85 |
+--------------+-----+---+
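As an additional illustration of the conversion types listed above (Assertion outputs → Chunk outputs), the following is a minimal, hypothetical sketch; the "assertion" input column and any upstream assertion annotator are assumptions for illustration only and are not provided by this class.
>>> def assertion_to_chunk(annotations):
...     # Re-emit each incoming annotation as a chunk annotation,
...     # keeping the original offsets, result, metadata, and embeddings.
...     return [
...         Annotation(
...             annotatorType="chunk",
...             begin=a.begin,
...             end=a.end,
...             result=a.result,
...             metadata=a.metadata,
...             embeddings=a.embeddings,
...         )
...         for a in annotations
...     ]
>>> assertion_as_chunk = (
...     AnnotationConverter(f=assertion_to_chunk)
...     .setInputCol("assertion")
...     .setOutputCol("assertion_chunk")
...     .setOutputAnnotatorType("chunk")
... )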
- functionStr#
- getter_attrs = []#
- inputCol: Param[str]#
- name = 'AnnotationConverter'#
- outputAnnotatorType#
- outputCol: Param[str]#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if conflicts exist, i.e., with ordering: default param values < user-supplied values < extra (see the sketch below this entry).
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
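A brief sketch of this precedence ordering, using the outputCol param of an AnnotationConverter (the column names are arbitrary, for illustration only): a value passed in extra wins over the user-supplied value.
>>> converter = AnnotationConverter().setOutputCol("user_value")
>>> merged = converter.extractParamMap({converter.outputCol: "extra_value"})
>>> merged[converter.outputCol]
'extra_value'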
- getInputCol() str #
Gets the value of inputCol or its default value.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputAnnotatorType()#
Gets the value of outputAnnotatorType.
- getOutputCol() str #
Gets the value of outputCol or its default value.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setOutputAnnotatorType(value)#
Sets the value of outputAnnotatorType.
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
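A minimal sketch of calling transform directly with a param override, reusing camel_case_tokenizer from the example above; token_df stands in for a hypothetical DataFrame that already contains the "token" annotation column.
>>> # The override applies only to this call; the annotator instance itself is unchanged.
>>> converted_df = camel_case_tokenizer.transform(
...     token_df, {camel_case_tokenizer.outputCol: "cc_tokens"}
... )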
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.