Helper functions

Spark NLP Annotation functions

The functions presented here help users manipulate annotations, by providing both UDFs and dataframe utilities to deal with them more easily

Python

In python, the functions are straight forward and have both UDF and Dataframe applications

map_annotations(f, output_type: DataType) UDF that applies f(). Requires output DataType from pyspark.sql.types
map_annotations_strict(f) UDF that apples an f() method that returns a list of Annotations
map_annotations_col(dataframe: DataFrame, f, column: str, output_column: str, annotatyon_type: str, output_type: DataType = Annotation.arrayType()) applies f() to column from dataframe
map_annotations_cols(dataframe: DataFrame, f, columns: str, output_column: str, annotatyon_type: str, output_type: DataType = Annotation.arrayType()) applies f() to columns from dataframe
filter_by_annotations_col(dataframe, f, column) applies a boolean filter f() to column from dataframe
explode_annotations_col(dataframe: DataFrame, column, output_column) explodes annotation column from dataframe

Scala

In Scala, importing inner functions brings implicits that allow these functions to be applied directly on top of the dataframe

mapAnnotations(function: Seq[Annotation] => T, outputType: DataType)
mapAnnotationsStrict(function: Seq[Annotation] => Seq[Annotation])
mapAnnotationsCol[T: TypeTag](column: String, outputCol: String,annotatorType: String, function: Seq[Annotation] => T)
mapAnnotationsCol[T: TypeTag](cols: Seq[String], outputCol: String,annotatorType: String, function: Seq[Annotation] => T)
eachAnnotationsCol[T: TypeTag](column: String, function: Seq[Annotation] => Unit)
def explodeAnnotationsCol[T: TypeTag](column: String, outputCol: String)

Imports:

from sparknlp.functions import *
from sparknlp.annotation import Annotation

import com.johnsnowlabs.nlp.functions._
import com.johnsnowlabs.nlp.Annotation

Examples:

Complete usage examples can be seen here: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/234-release-candidate/jupyter/annotation/english/spark-nlp-basics/spark-nlp-basics-functions.ipynb

def my_annoation_map_function(annotations):
    return list(map(lambda a: Annotation(
        'my_own_type',
        a.begin,
        a.end,
        a.result,
        {'my_key': 'custom_annotation_data'},
        []), annotations))
        
result.select(
    map_annotations(my_annoation_map_function, Annotation.arrayType())('token')
).toDF("my output").show(truncate=False)

val modified = data.mapAnnotationsCol("pos", "mod_pos","pos" ,(_: Seq[Annotation]) => {
      "hello world"
    })

PREVIOUSGPU vs CPU Training

NEXTScala API (Scaladoc)