sparknlp.annotator.pos.perceptron

Contains classes for the Perceptron Annotator.

Module Contents

Classes

PerceptronApproach

Trains an averaged Perceptron model to tag words with their part of speech. Sets a POS tag to each word within a sentence.

PerceptronModel

Averaged Perceptron model to tag words with their part of speech. Sets a POS tag to each word within a sentence.

class PerceptronApproach

Trains an averaged Perceptron model to tag words with their part of speech. Sets a POS tag to each word within a sentence.

For pretrained models, see PerceptronModel.

The training data must be a Spark DataFrame with a column of Annotations of type POS. Each Annotation must have its result member set to the POS tag and a "word" entry in its metadata member that maps to the tagged word. Such a DataFrame can easily be created with the helper class POS.

>>> POS().readDataset(spark, datasetPath) \
...     .selectExpr("explode(tags) as tags").show(truncate=False)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
                        ...
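
The rows above can be related back to a line of delimited word and tag pairs. The following is a minimal pure-Python sketch of that mapping, not the implementation of the POS helper. The "|" delimiter, the sample line, and the function name to_pos_annotations are assumptions made for illustration.

```python
def to_pos_annotations(line, delimiter="|"):
    """Turn 'word|TAG word|TAG ...' into annotation-like dicts.

    Hypothetical helper for illustration only; it mirrors the layout
    shown above (annotator type, begin, end, result, metadata).
    """
    annotations = []
    begin = 0
    for pair in line.split():
        # Split on the last delimiter so a word may itself contain it.
        word, tag = pair.rsplit(delimiter, 1)
        end = begin + len(word) - 1  # offset of the word's last character
        annotations.append({
            "annotatorType": "pos",
            "begin": begin,
            "end": end,
            "result": tag,               # the POS tag
            "metadata": {"word": word},  # the word the tag belongs to
        })
        begin = end + 2  # skip the single space between words
    return annotations


# Assumed sample line, chosen to be consistent with the offsets shown above.
sample = ("Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ ,|, "
          "will|MD join|VB the|DT board|NN")
annotations = to_pos_annotations(sample)
for a in annotations[:3]:
    print(a["annotatorType"], a["begin"], a["end"], a["result"], a["metadata"])
```

Under these assumptions, the computed begin and end values line up with the offsets in the table, and each tag lands in result while its word lands under the "word" key of metadata.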

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types: TOKEN, DOCUMENT

Output Annotation type: POS

Parameters:
posCol

Column name for Array of POS tags that match tokens

nIterations

Number of training iterations; more iterations generally converge to better accuracy. By default 5
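
To illustrate what the number of iterations controls, here is a minimal sketch of averaged-perceptron training in plain Python. It is a generic textbook version, not Spark NLP's implementation. The tiny feature set (the word and the previous tag), the toy sentences, and the names features, predict, train, and tag are all assumptions made for illustration.

```python
from collections import defaultdict


def features(word, prev_tag):
    # Deliberately tiny feature set; real taggers use many more features.
    return ["bias", "word=" + word.lower(), "prev=" + prev_tag]


def predict(weights, feats, tags):
    # Score each tag as the sum of its feature weights; pick the best.
    scores = {t: sum(weights[(f, t)] for f in feats) for t in tags}
    return max(tags, key=lambda t: scores[t])


def train(sentences, n_iterations=5):
    tags = sorted({t for sent in sentences for _, t in sent})
    weights = defaultdict(float)  # current weights
    totals = defaultdict(float)   # running sum of the weights over all steps
    steps = 0
    for _ in range(n_iterations):  # one iteration = one pass over the data
        for sent in sentences:
            prev = "<s>"
            for word, gold in sent:
                feats = features(word, prev)
                guess = predict(weights, feats, tags)
                if guess != gold:
                    # Perceptron update: reward the gold tag, penalize the guess.
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
                steps += 1
                for key, value in weights.items():
                    totals[key] += value
                prev = gold  # condition on the gold previous tag while training
    # Averaging over every step damps the effect of late, noisy updates.
    averaged = defaultdict(float, {k: v / steps for k, v in totals.items()})
    return averaged, tags


def tag(weights, tags, words):
    prev, out = "<s>", []
    for word in words:
        prev = predict(weights, features(word, prev), tags)
        out.append(prev)
    return out


toy = [
    [("the", "DT"), ("board", "NN"), ("will", "MD"), ("join", "VB")],
    [("the", "DT"), ("dog", "NN"), ("will", "MD"), ("run", "VB")],
]
model, tagset = train(toy, n_iterations=5)
print(tag(model, tagset, ["the", "board", "will", "run"]))
# ['DT', 'NN', 'MD', 'VB']
```

In this sketch each iteration is one full pass over the training sentences. Additional passes give the weights more chances to correct mistakes, and the averaging step keeps the final model from being dominated by the last few updates.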

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
>>> trainingPerceptronDF = POS().readDataset(spark, datasetPath)
>>> trainedPos = PerceptronApproach() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos") \
...     .setPosColumn("tags") \
...     .fit(trainingPerceptronDF)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     trainedPos
... ])
>>> data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
posCol
nIterations
setPosColumn(self, value)

Sets column name for Array of POS tags that match tokens.

Parameters:
value : str

Name of column for Array of POS tags

setIterations(self, value)

Sets number of iterations in training, by default 5.

Parameters:
value : int

Number of iterations in training

getNIterations(self)

Gets number of iterations in training, by default 5.

Returns:
int

Number of iterations in training

class PerceptronModel(classname='com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel', java_model=None)

Averaged Perceptron model to tag words with their part of speech. Sets a POS tag to each word within a sentence.

This is the instantiated model of the PerceptronApproach. To train your own model, see the documentation of that class.

Pretrained models can be loaded with the static method pretrained():

>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos")

If no name is provided, the default model is "pos_anc".

For available pretrained models, see the Models Hub. Pretrained pipelines are also available for this module; see Pipelines.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types: TOKEN, DOCUMENT

Output Annotation type: POS

Parameters:
None

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     posTagger
... ])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(pos) as pos").show(truncate=False)
+-------------------------------------------+
|pos                                        |
+-------------------------------------------+
|[pos, 0, 4, NNP, [word -> Peter], []]      |
|[pos, 6, 11, NNP, [word -> Pipers], []]    |
|[pos, 13, 21, NNS, [word -> employees], []]|
|[pos, 23, 25, VBP, [word -> are], []]      |
|[pos, 27, 33, VBG, [word -> picking], []]  |
|[pos, 35, 39, NNS, [word -> pecks], []]    |
|[pos, 41, 42, IN, [word -> of], []]        |
|[pos, 44, 50, JJ, [word -> pickled], []]   |
|[pos, 52, 58, NNS, [word -> peppers], []]  |
+-------------------------------------------+
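
When consuming these annotations outside Spark, note that the end offset is inclusive. The following plain-Python sketch copies the rows above into tuples by hand (an assumption for illustration; in practice they would come from collecting the DataFrame), checks each offset pair against the input text, and groups the words by tag.

```python
from collections import defaultdict

text = "Peter Pipers employees are picking pecks of pickled peppers"

# (begin, end, tag, word) tuples copied by hand from the output above.
rows = [
    (0, 4, "NNP", "Peter"),
    (6, 11, "NNP", "Pipers"),
    (13, 21, "NNS", "employees"),
    (23, 25, "VBP", "are"),
    (27, 33, "VBG", "picking"),
    (35, 39, "NNS", "pecks"),
    (41, 42, "IN", "of"),
    (44, 50, "JJ", "pickled"),
    (52, 58, "NNS", "peppers"),
]

words_by_tag = defaultdict(list)
for begin, end, tag, word in rows:
    # `end` is inclusive, so the Python slice needs end + 1.
    assert text[begin:end + 1] == word
    words_by_tag[tag].append(word)

print(dict(words_by_tag))
```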
name = PerceptronModel
static pretrained(name='pos_anc', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “pos_anc”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. If not set, Spark NLP's repositories are used.

Returns:
PerceptronModel

The restored model