sparknlp.functions.map_annotations_col

sparknlp.functions.map_annotations_col(dataframe: pyspark.sql.dataframe.DataFrame, f, column: str, output_column: str, annotatyon_type: str, output_type: pyspark.sql.types.DataType = Annotation.arrayType())

Creates a Spark UDF to map over a column of Annotation results.

Parameters
dataframe : DataFrame

Input DataFrame

f : function

Function to apply to the column

column : str

Name of the input column

output_column : str

Name of the output column

annotatyon_type : str

Annotator type

output_type : DataType, optional

Output type, by default Annotation.arrayType()

Returns
pyspark.sql.DataFrame

Transformed DataFrame

Examples

>>> from sparknlp.pretrained import PretrainedPipeline
>>> from sparknlp.functions import *
>>> from sparknlp.annotation import Annotation
>>> explain_document_pipeline = PretrainedPipeline("explain_document_dl")
>>> data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
>>> result = explain_document_pipeline.transform(data)
>>> chunks_df = map_annotations_col(
...     result,
...     lambda x: [
...         Annotation("chunk", a.begin, a.end, a.result, a.metadata, a.embeddings)
...         for a in x
...     ],
...     "pos",
...     "pos_chunk",
...     "chunk",
... )
>>> chunks_df.selectExpr("explode(pos_chunk)").show()
+--------------------+
|                 col|
+--------------------+
|[chunk, 0, 2, NNP...|
|[chunk, 3, 3, ., ...|
|[chunk, 5, 12, JJ...|
|[chunk, 14, 18, N...|
|[chunk, 20, 24, V...|
|[chunk, 26, 28, I...|
|[chunk, 30, 36, N...|
|[chunk, 37, 37, ....|
+--------------------+
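The mapping function does not have to re-map every annotation; it can also return a subset of them. The sketch below is an illustrative addition rather than part of the upstream documentation: it filters the same "pos" column down to proper-noun tags, keeping the default output_type so the new column again holds Annotation rows. The output column name "pos_nnp" is an arbitrary choice.

>>> nnp_df = map_annotations_col(
...     result,
...     lambda annotations: [a for a in annotations if a.result == "NNP"],
...     "pos",
...     "pos_nnp",
...     "pos",
... )

Because the returned objects are still Annotations, "pos_nnp" can be exploded and inspected in the same way as "pos_chunk" above.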