sparknlp.training.POS

class sparknlp.training.POS

Bases: sparknlp.internal.ExtendedJavaWrapper

Helper class for creating DataFrames for training a part-of-speech tagger.

The dataset must contain one sentence per line, with each word joined to its part-of-speech tag by a delimiter.

Input File Format:

A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN

The sentence can then be parsed with readDataset() into a column with annotations of type POS.
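To illustrate the format, each whitespace-separated token splits at the delimiter into a (word, tag) pair. A minimal sketch in plain Python (illustration only, not the parser Spark NLP uses internally):

```python
def parse_tagged_line(line, delimiter="|"):
    """Split one training line into (word, tag) pairs."""
    # rsplit with maxsplit=1 keeps any earlier delimiter characters in the word
    return [tuple(token.rsplit(delimiter, 1)) for token in line.split()]

line = "A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN"
print(parse_tagged_line(line))
# → [('A', 'DT'), ('few', 'JJ'), ('months', 'NNS'), ('ago', 'RB'),
#    ('you', 'PRP'), ('received', 'VBD'), ('a', 'DT'), ('letter', 'NN')]
```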

Can be used to train a PerceptronApproach.
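Because the DataFrame returned by readDataset() already carries text, document and tags columns, it can feed a training pipeline directly. A sketch of such a pipeline, assuming a running SparkSession spark initialized with Spark NLP and a tagged corpus at path (this snippet requires a live Spark environment, so it is not runnable standalone):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, PerceptronApproach
from sparknlp.training import POS

# Read the word|TAG corpus; yields "text", "document" and "tags" columns.
trainingData = POS().readDataset(spark, path, "|", "tags")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
posTagger = PerceptronApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos") \
    .setPosColumn("tags")  # column produced by readDataset() above

model = Pipeline(stages=[documentAssembler, tokenizer, posTagger]).fit(trainingData)
```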

Examples

In this example, the file test-training.txt contains tagged sentences in the format shown above; its first sentence is the well-known "Pierre Vinken" sentence, as reflected in the output below.

>>> from sparknlp.training import POS
>>> pos = POS()
>>> path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
>>> posDf = pos.readDataset(spark, path, "|", "tags")
>>> posDf.selectExpr("explode(tags) as tags").show(truncate=False)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 16, 17, CD, [word -> 61], []]          |
|[pos, 19, 23, NNS, [word -> years], []]      |
|[pos, 25, 27, JJ, [word -> old], []]         |
|[pos, 29, 29, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
|[pos, 51, 52, IN, [word -> as], []]          |
|[pos, 54, 54, DT, [word -> a], []]           |
|[pos, 56, 67, JJ, [word -> nonexecutive], []]|
|[pos, 69, 76, NN, [word -> director], []]    |
|[pos, 78, 81, NNP, [word -> Nov.], []]       |
|[pos, 83, 84, CD, [word -> 29], []]          |
|[pos, 86, 86, ., [word -> .], []]            |
+---------------------------------------------+
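The second and third fields of each row are begin and end character offsets into the assembled sentence text. A minimal sketch (plain Python, illustration only, not part of the Spark NLP API) of how such offsets arise for tokens joined by single spaces:

```python
def token_offsets(tokens):
    """Begin/end character offsets for tokens joined by single spaces."""
    offsets, start = [], 0
    for tok in tokens:
        # end offset is inclusive, matching the annotation rows above
        offsets.append((tok, start, start + len(tok) - 1))
        start += len(tok) + 1  # skip the token and the following space
    return offsets

sentence = ("Pierre Vinken , 61 years old , will join the board as a "
            "nonexecutive director Nov. 29 .").split()
for tok, begin, end in token_offsets(sentence)[:3]:
    print(tok, begin, end)
# → Pierre 0 5
#   Vinken 7 12
#   , 14 14
```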

Methods

__init__()

apply()

new_java_array(pylist, java_class)

ToDo: Inspired by Spark 2.0. Review if Spark changes.

new_java_array_integer(pylist)

new_java_array_string(pylist)

new_java_obj(java_class, *args)

readDataset(spark, path[, delimiter, ...])

Reads the dataset from an external resource.

readDataset(spark, path, delimiter='|', outputPosCol='tags', outputDocumentCol='document', outputTextCol='text')

Reads the dataset from an external resource.

Parameters

spark : pyspark.sql.SparkSession
    Initiated Spark Session with Spark NLP
path : str
    Path to the resource
delimiter : str, optional
    Delimiter of word and POS, by default "|"
outputPosCol : str, optional
    Name of the output POS column, by default "tags"
outputDocumentCol : str, optional
    Name of the output document column, by default "document"
outputTextCol : str, optional
    Name of the output text column, by default "text"

Returns

pyspark.sql.DataFrame
    Spark DataFrame with the data