sparknlp.base.LightPipeline

class sparknlp.base.LightPipeline(pipelineModel, parse_embeddings=False)[source]

Bases: object

Creates a LightPipeline from a Spark PipelineModel.

LightPipeline is a Spark NLP specific Pipeline class, equivalent to the Spark ML Pipeline. The difference is that its execution does not adhere to Spark principles: instead, it computes everything locally (but in parallel) in order to achieve fast results when dealing with small amounts of data. This means that the input is not a Spark DataFrame, but a string or a list of strings to be annotated. To create a LightPipeline, you need to input an already trained (fit) Spark ML Pipeline.

Its transform() now has an alternative, annotate(), which directly outputs the results.

Parameters
pipelineModel : pyspark.ml.PipelineModel

The PipelineModel containing Spark NLP Annotators

parse_embeddings : bool, optional

Whether to parse embeddings, by default False

Notes

Use fullAnnotate() to also output the results as Annotation objects, including their metadata.

Examples

>>> from sparknlp.base import LightPipeline
>>> light = LightPipeline(pipeline.fit(data))
>>> light.annotate("We are very happy about Spark NLP")
{
    'document': ['We are very happy about Spark NLP'],
    'lemmas': ['We', 'be', 'very', 'happy', 'about', 'Spark', 'NLP'],
    'pos': ['PRP', 'VBP', 'RB', 'JJ', 'IN', 'NNP', 'NNP'],
    'sentence': ['We are very happy about Spark NLP'],
    'spell': ['We', 'are', 'very', 'happy', 'about', 'Spark', 'NLP'],
    'stems': ['we', 'ar', 'veri', 'happi', 'about', 'spark', 'nlp'],
    'token': ['We', 'are', 'very', 'happy', 'about', 'Spark', 'NLP']
}
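The dictionary returned by annotate() maps each output column name to a list of plain strings, so the results can be processed with ordinary Python. A minimal sketch, reusing the output shown above (no Spark session required; the `result` dict simply stands in for the return value of annotate()):

```python
# Result dictionary as returned by LightPipeline.annotate(), copied from the
# example above; each key is an output column, each value a list of strings.
result = {
    'document': ['We are very happy about Spark NLP'],
    'lemmas': ['We', 'be', 'very', 'happy', 'about', 'Spark', 'NLP'],
    'pos': ['PRP', 'VBP', 'RB', 'JJ', 'IN', 'NNP', 'NNP'],
    'token': ['We', 'are', 'very', 'happy', 'about', 'Spark', 'NLP'],
}

# Pair each token with its part-of-speech tag; the lists align positionally.
tagged = list(zip(result['token'], result['pos']))
print(tagged[:3])  # [('We', 'PRP'), ('are', 'VBP'), ('very', 'RB')]

# Reconstruct a lemmatized version of the sentence.
lemmatized = ' '.join(result['lemmas'])
print(lemmatized)  # We be very happy about Spark NLP
```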

Methods

__init__(pipelineModel[, parse_embeddings])

annotate(target)

Annotates the data provided, extracting the results.

fullAnnotate(target)

Annotates the data provided into Annotation type results.

getIgnoreUnsupported()

Gets whether to ignore unsupported AnnotatorModels.

setIgnoreUnsupported(value)

Sets whether to ignore unsupported AnnotatorModels.

transform(dataframe)

Transforms a dataframe provided with the stages of the LightPipeline.

annotate(target)[source]

Annotates the data provided, extracting the results.

The data should be either a list or a str.

Parameters
target : list or str

The data to be annotated

Returns
List[dict] or dict

The result of the annotation

Examples

>>> from sparknlp.pretrained import PretrainedPipeline
>>> explain_document_pipeline = PretrainedPipeline("explain_document_dl")
>>> result = explain_document_pipeline.annotate('U.N. official Ekeus heads for Baghdad.')
>>> result.keys()
dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])
>>> result["ner"]
['B-ORG', 'O', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']
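Because annotate() aligns its output lists positionally, token-level tags such as the NER labels above can be combined with the tokens directly. A sketch, assuming the token list the pipeline would return under result["token"] for this sentence (the token list itself is not shown in the output above, so it is a hypothetical here):

```python
# NER tags as shown in the example above.
ner = ['B-ORG', 'O', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']

# Tokens assumed to align one-to-one with the tags (hypothetical; annotate()
# would return these under result['token']).
tokens = ['U.N', '.', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']

# Keep only tokens that begin a named entity (IOB tags starting with 'B-').
entities = [(tok, tag[2:]) for tok, tag in zip(tokens, ner) if tag.startswith('B-')]
print(entities)  # [('U.N', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```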

fullAnnotate(target)[source]

Annotates the data provided into Annotation type results.

The data should be either a list or a str.

Parameters
target : list or str

The data to be annotated

Returns
List[dict]

The result of the annotation, as a list of dictionaries mapping each output column to its list of Annotation objects

Examples

>>> from sparknlp.pretrained import PretrainedPipeline
>>> explain_document_pipeline = PretrainedPipeline("explain_document_dl")
>>> result = explain_document_pipeline.fullAnnotate('U.N. official Ekeus heads for Baghdad.')
>>> result[0].keys()
dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])
>>> result[0]["ner"]
[Annotation(named_entity, 0, 2, B-ORG, {'word': 'U.N'}),
Annotation(named_entity, 3, 3, O, {'word': '.'}),
Annotation(named_entity, 5, 12, O, {'word': 'official'}),
Annotation(named_entity, 14, 18, B-PER, {'word': 'Ekeus'}),
Annotation(named_entity, 20, 24, O, {'word': 'heads'}),
Annotation(named_entity, 26, 28, O, {'word': 'for'}),
Annotation(named_entity, 30, 36, B-LOC, {'word': 'Baghdad'}),
Annotation(named_entity, 37, 37, O, {'word': '.'})]
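Unlike annotate(), fullAnnotate() preserves character offsets, so the entity surface forms can be sliced back out of the original text. A sketch using plain tuples that mirror the (begin, end, result) fields of the annotations shown above (Annotation objects expose these as attributes; the tuples here just stand in for them):

```python
text = 'U.N. official Ekeus heads for Baghdad.'

# (begin, end, label) triples mirroring the entity annotations above;
# Spark NLP character offsets are inclusive on both ends.
spans = [(0, 2, 'B-ORG'), (14, 18, 'B-PER'), (30, 36, 'B-LOC')]

# Slice the entity surface forms back out of the original text.
entities = [(text[begin:end + 1], label) for begin, end, label in spans]
print(entities)  # [('U.N', 'B-ORG'), ('Ekeus', 'B-PER'), ('Baghdad', 'B-LOC')]
```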

getIgnoreUnsupported()[source]

Gets whether to ignore unsupported AnnotatorModels.

Returns
bool

Whether to ignore unsupported AnnotatorModels.

setIgnoreUnsupported(value)[source]

Sets whether to ignore unsupported AnnotatorModels.

Parameters
value : bool

Whether to ignore unsupported AnnotatorModels.

Returns
LightPipeline

The current LightPipeline

transform(dataframe)[source]

Transforms a dataframe provided with the stages of the LightPipeline.

Parameters
dataframe : pyspark.sql.DataFrame

The DataFrame to be transformed

Returns
pyspark.sql.DataFrame

The transformed DataFrame