`sparknlp.annotator.pos.perceptron`

Contains classes for the Perceptron Annotator.

Module Contents

Classes

`PerceptronApproach`	Trains an averaged Perceptron model to tag words with their part of speech.
`PerceptronModel`	Averaged Perceptron model to tag words with their part of speech.

class PerceptronApproach

Trains an averaged Perceptron model to tag words with their part of speech. Assigns a POS tag to each word within a sentence.

For pretrained models, see PerceptronModel.

The training data needs to be in a Spark DataFrame with a column that consists of Annotations of type POS. Each Annotation needs to have its result member set to the POS tag and a "word" entry in its metadata member that maps to the tagged word. Such a training DataFrame can easily be created with the helper class POS.

>>> POS().readDataset(spark, datasetPath) \
...     .selectExpr("explode(tags) as tags").show(truncate=False)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
                        ...
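
To make the record layout concrete, the following plain-Python sketch builds rows with the same shape as the output above from a line of `word|TAG` pairs. It is not the library's reader: the helper name `parse_tagged_line`, the `|` delimiter, the dict representation, and the sample input line are assumptions made for illustration. Offsets are computed on the space-joined words with an inclusive end, which agrees with the rows shown.

```python
def parse_tagged_line(line, delimiter="|"):
    """Turn 'Pierre|NNP Vinken|NNP' into POS-annotation-like dicts.

    Hypothetical helper for illustration only; it mimics the layout
    [annotatorType, begin, end, result, metadata] shown above.
    """
    annotations = []
    cursor = 0  # character offset of the next word in the untagged text
    for item in line.split():
        # Split on the last delimiter so words containing it stay intact.
        word, tag = item.rsplit(delimiter, 1)
        annotations.append({
            "annotatorType": "pos",
            "begin": cursor,
            "end": cursor + len(word) - 1,  # end offset is inclusive
            "result": tag,
            "metadata": {"word": word},
        })
        cursor += len(word) + 1  # skip the single space between words
    return annotations


# Illustrative sample line (assumed input, not read from the corpus file).
rows = parse_tagged_line(
    "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ ,|, will|MD")
for row in rows:
    print(row["begin"], row["end"], row["result"], row["metadata"]["word"])
```

The sketch only shows how the tag ends up in `result` and the word in `metadata`; the real training DataFrame still comes from `POS().readDataset`.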

For extended examples of usage, see the Examples.

Input Annotation types	Output Annotation type
`TOKEN, DOCUMENT`	`POS`

Parameters:

posCol: Column name for Array of POS tags that match tokens
nIterations: Number of iterations in training; more iterations tend to converge to better accuracy, by default 5
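
As a rough illustration of what the number of iterations controls, here is a minimal averaged perceptron tagger in plain Python. It is a sketch of the general technique, not Spark NLP's implementation: the class name `TinyAveragedPerceptron`, the tiny feature set (current word and previous tag), and the toy training data are all assumptions made for the example. Each iteration is one pass over the training sentences, and the final weights are the average of the weights seen after every training step.

```python
from collections import defaultdict


class TinyAveragedPerceptron:
    """Toy averaged perceptron tagger (illustrative, not Spark NLP's code)."""

    def __init__(self):
        self.weights = defaultdict(float)  # (feature, tag) -> weight
        self.tags = set()

    @staticmethod
    def features(word, prev_tag):
        # Hypothetical, deliberately tiny feature set.
        return ["bias", "word=" + word.lower(), "prev=" + prev_tag]

    def predict(self, feats):
        # Score each tag as the sum of its active feature weights.
        # Sorting the tags makes tie-breaking deterministic.
        return max(
            sorted(self.tags),
            key=lambda tag: sum(self.weights.get((f, tag), 0.0) for f in feats),
        )

    def train(self, sentences, n_iterations=5):
        self.tags = {tag for sent in sentences for _, tag in sent}
        totals = defaultdict(float)  # running sum of weights after each step
        steps = 0
        for _ in range(n_iterations):  # one pass over the data per iteration
            for sent in sentences:
                prev = "<s>"
                for word, gold in sent:
                    feats = self.features(word, prev)
                    guess = self.predict(feats)
                    if guess != gold:
                        # Perceptron update: reward gold tag, penalize guess.
                        for f in feats:
                            self.weights[(f, gold)] += 1.0
                            self.weights[(f, guess)] -= 1.0
                    for key, value in self.weights.items():
                        totals[key] += value
                    steps += 1
                    prev = gold  # use the gold tag as history while training
        # Replace the final weights with their average over all steps.
        self.weights = defaultdict(
            float, {key: total / steps for key, total in totals.items()})

    def tag(self, words):
        prev, out = "<s>", []
        for word in words:
            prev = self.predict(self.features(word, prev))
            out.append(prev)
        return out


# Toy training data (assumed for the example).
train_data = [
    [("the", "DT"), ("board", "NN")],
    [("will", "MD"), ("join", "VB"), ("the", "DT"), ("board", "NN")],
]
tagger = TinyAveragedPerceptron()
tagger.train(train_data, n_iterations=5)
print(tagger.tag(["will", "join", "the", "board"]))
```

Averaging damps the effect of late, noisy updates, which is why extra iterations usually help up to a point. On real corpora the feature set would be far richer than the two features used here.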

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
>>> trainingPerceptronDF = POS().readDataset(spark, datasetPath)
>>> trainedPos = PerceptronApproach() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos") \
...     .setPosColumn("tags") \
...     .fit(trainingPerceptronDF)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     trainedPos
... ])
>>> data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+

setPosColumn(value)

Sets column name for Array of POS tags that match tokens.

Parameters:

value (str): Name of column for Array of POS tags

setIterations(value)

Sets number of iterations in training, by default 5.

Parameters:

value (int): Number of iterations in training

getNIterations()

Gets number of iterations in training, by default 5.

Returns:

int: Number of iterations in training

class PerceptronModel(classname='com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel', java_model=None)

Averaged Perceptron model to tag words with their part of speech. Assigns a POS tag to each word within a sentence.

This is the instantiated model of the PerceptronApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with the static method pretrained():

>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos")

The default model is "pos_anc", if no name is provided.

For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module, see Pipelines.

For extended examples of usage, see the Examples.

Input Annotation types	Output Annotation type
`TOKEN, DOCUMENT`	`POS`

Parameters:

None

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     posTagger
... ])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(pos) as pos").show(truncate=False)
+-------------------------------------------+
|pos                                        |
+-------------------------------------------+
|[pos, 0, 4, NNP, [word -> Peter], []]      |
|[pos, 6, 11, NNP, [word -> Pipers], []]    |
|[pos, 13, 21, NNS, [word -> employees], []]|
|[pos, 23, 25, VBP, [word -> are], []]      |
|[pos, 27, 33, VBG, [word -> picking], []]  |
|[pos, 35, 39, NNS, [word -> pecks], []]    |
|[pos, 41, 42, IN, [word -> of], []]        |
|[pos, 44, 50, JJ, [word -> pickled], []]   |
|[pos, 52, 58, NNS, [word -> peppers], []]  |
+-------------------------------------------+
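
Once such rows are collected, pairing each word with its tag takes only a few lines. The sketch below works on plain tuples copied from the output above rather than on live Spark rows; the tuple layout and the helper name `word_tag_pairs` are assumptions made for illustration.

```python
from collections import Counter

# Rows copied from the output above as (begin, end, tag, word) tuples.
annotations = [
    (0, 4, "NNP", "Peter"),
    (6, 11, "NNP", "Pipers"),
    (13, 21, "NNS", "employees"),
    (23, 25, "VBP", "are"),
    (27, 33, "VBG", "picking"),
    (35, 39, "NNS", "pecks"),
    (41, 42, "IN", "of"),
    (44, 50, "JJ", "pickled"),
    (52, 58, "NNS", "peppers"),
]


def word_tag_pairs(rows):
    """Return (word, tag) pairs ordered by begin offset (hypothetical helper)."""
    return [(word, tag) for _, _, tag, word in sorted(rows)]


pairs = word_tag_pairs(annotations)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs[:3])
print(tag_counts.most_common(1))
```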

static pretrained(name='pos_anc', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:

name (str, optional): Name of the pretrained model, by default "pos_anc"
lang (str, optional): Language of the pretrained model, by default "en"
remote_loc (str, optional): Remote address of the resource, by default None, in which case Spark NLP's repositories are used.

Returns:

PerceptronModel: The restored model
