sparknlp.training.CoNLLU

class sparknlp.training.CoNLLU(explodeSentences=True)

Bases: sparknlp.internal.ExtendedJavaWrapper

Instantiates the class to read a CoNLL-U dataset.

The dataset must be in CoNLL-U format. Pass its path to readDataset(), which loads the data into a Spark DataFrame.

Can be used to train a DependencyParserApproach.

Input File Format:

# sent_id = 1
# text = They buy and sell books.
1   They     they    PRON    PRP    Case=Nom|Number=Plur               2   nsubj   2:nsubj|4:nsubj   _
2   buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0   root    0:root            _
3   and      and     CONJ    CC     _                                  4   cc      4:cc              _
4   sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2   conj    0:root|2:conj     _
5   books    book    NOUN    NNS    Number=Plur                        2   obj     2:obj|4:obj       SpaceAfter=No
6   .        .       PUNCT   .      _                                  2   punct   2:punct           _
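To make the layout above concrete, here is a minimal plain-Python sketch that splits one sentence block into its comment lines and ten token columns. It is not how CoNLLU reads files internally; the helper name parse_conllu_sentence and the constant CONLLU_FIELDS are hypothetical, chosen only for this illustration. The column names follow the CoNLL-U format definition.

```python
# The ten columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu_sentence(block):
    """Split one sentence block into (comments, tokens).

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    comments = {}
    tokens = []
    for line in block.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment lines look like "# key = value".
            key, _, value = line[1:].partition("=")
            comments[key.strip()] = value.strip()
            continue
        # Real CoNLL-U files are tab-separated. The sample on this page is
        # rendered with runs of spaces, so this sketch splits on any
        # whitespace, which would break on forms that contain spaces.
        columns = line.split()
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    return comments, tokens


SAMPLE = """
# sent_id = 1
# text = They buy and sell books.
1   They     they    PRON    PRP    Case=Nom|Number=Plur               2   nsubj   2:nsubj|4:nsubj   _
2   buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0   root    0:root            _
3   and      and     CONJ    CC     _                                  4   cc      4:cc              _
4   sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2   conj    0:root|2:conj     _
5   books    book    NOUN    NNS    Number=Plur                        2   obj     2:obj|4:obj       SpaceAfter=No
6   .        .       PUNCT   .      _                                  2   punct   2:punct           _
"""

comments, tokens = parse_conllu_sentence(SAMPLE)
print(comments["text"])
print([(t["form"], t["upos"], t["head"], t["deprel"]) for t in tokens])
```

The head column holds the id of each token's syntactic parent, with 0 marking the root, and an underscore marks an empty field.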

Examples

>>> from sparknlp.training import CoNLLU
>>> conlluFile = "src/test/resources/conllu/en.test.conllu"
>>> conllDataSet = CoNLLU(False).readDataset(spark, conlluFile)
>>> conllDataSet.selectExpr(
...     "text",
...     "form.result as form",
...     "upos.result as upos",
...     "xpos.result as xpos",
...     "lemma.result as lemma"
... ).show(1, False)
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|text                                   |form                                          |upos                                         |xpos                          |lemma                                       |
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|What if Google Morphed Into GoogleOS?  |[What, if, Google, Morphed, Into, GoogleOS, ?]|[PRON, SCONJ, PROPN, VERB, ADP, PROPN, PUNCT]|[WP, IN, NNP, VBD, IN, NNP, .]|[what, if, Google, morph, into, GoogleOS, ?]|
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
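The row above holds one array per annotation column. As a rough sketch, assuming those arrays are index-aligned per token (the output suggests this, but it is an assumption here), they can be zipped back into per-token records once collected to Python lists. The helper align_tokens is hypothetical and the lists are copied by hand from the row shown above.

```python
# Values copied from the example output row above.
form = ["What", "if", "Google", "Morphed", "Into", "GoogleOS", "?"]
upos = ["PRON", "SCONJ", "PROPN", "VERB", "ADP", "PROPN", "PUNCT"]
xpos = ["WP", "IN", "NNP", "VBD", "IN", "NNP", "."]
lemma = ["what", "if", "Google", "morph", "into", "GoogleOS", "?"]


def align_tokens(form, upos, xpos, lemma):
    """Zip parallel annotation arrays into one dict per token.

    Hypothetical helper; assumes all arrays share the same token order.
    """
    columns = (form, upos, xpos, lemma)
    if len({len(column) for column in columns}) != 1:
        raise ValueError("annotation arrays differ in length")
    return [
        {"form": f, "upos": u, "xpos": x, "lemma": l}
        for f, u, x, l in zip(*columns)
    ]


aligned = align_tokens(form, upos, xpos, lemma)
for token in aligned:
    print(token)
```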

Methods

__init__([explodeSentences])

apply()

new_java_array(pylist, java_class)

ToDo: Inspired by Spark 2.0.

new_java_array_integer(pylist)

new_java_array_string(pylist)

new_java_obj(java_class, *args)

readDataset(spark, path[, read_as])

Reads the dataset from an external resource.

new_java_array(pylist, java_class)

ToDo: Inspired by Spark 2.0. Review if Spark changes.

readDataset(spark, path, read_as='TEXT')

Reads the dataset from an external resource.

Parameters
spark : pyspark.sql.SparkSession

Spark session started with Spark NLP

path : str

Path to the resource

read_as : str, optional

How to read the resource, by default ReadAs.TEXT

Returns
pyspark.sql.DataFrame

Spark DataFrame with the data