sparknlp.training.CoNLL

class sparknlp.training.CoNLL(documentCol='document', sentenceCol='sentence', tokenCol='token', posCol='pos', conllLabelIndex=3, conllPosIndex=1, textCol='text', labelCol='label', explodeSentences=True, delimiter=' ')[source]

Bases: sparknlp.internal.ExtendedJavaWrapper

Instantiates the class to read a CoNLL dataset.

The dataset should be in the format of CoNLL 2003 and is loaded with readDataset(), which creates a Spark DataFrame with the data.

The resulting DataFrame can be used to train a NerDLApproach (see the training sketch after the Examples below).

Input File Format:

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
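
Each non-empty line holds one token followed by its annotations, separated by delimiter. With the defaults, zero-based column 1 (conllPosIndex) carries the POS tag and column 3 (conllLabelIndex) carries the NER label. A minimal plain-Python sketch of that column mapping (illustrative only, not part of the API):

>>> line = "EU NNP B-NP B-ORG"
>>> cols = line.split(" ")  # delimiter=' '
>>> cols[0], cols[1], cols[3]  # token, conllPosIndex=1, conllLabelIndex=3
('EU', 'NNP', 'B-ORG')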
Parameters

documentCol : str, optional

Name of the DocumentAssembler column, by default ‘document’

sentenceCol : str, optional

Name of the SentenceDetector column, by default ‘sentence’

tokenCol : str, optional

Name of the Tokenizer column, by default ‘token’

posCol : str, optional

Name of the PerceptronModel column, by default ‘pos’

conllLabelIndex : int, optional

Index of the label column in the dataset, by default 3

conllPosIndex : int, optional

Index of the POS tags in the dataset, by default 1

textCol : str, optional

Name of the text column, by default ‘text’

labelCol : str, optional

Name of the label column, by default ‘label’

explodeSentences : bool, optional

Whether to explode sentences to separate rows, by default True

delimiter : str, optional

Delimiter used to separate the columns in the CoNLL file, by default ‘ ’
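
The non-default parameters can be set at construction time. A hedged sketch (not from the official examples; the corpus path is a placeholder):

>>> from sparknlp.training import CoNLL
>>> conll = CoNLL(
...     explodeSentences=False,  # keep each document as a single row
...     delimiter=" "            # columns in the file are space-separated
... )
>>> trainingData = conll.readDataset(spark, "path/to/corpus.train")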

Examples

>>> from sparknlp.training import CoNLL
>>> trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> trainingData.selectExpr(
...     "text",
...     "token.result as tokens",
...     "pos.result as pos",
...     "label.result as label"
... ).show(3, False)
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|text                                            |tokens                                                    |pos                                  |label                                    |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|EU rejects German call to boycott British lamb .|[EU, rejects, German, call, to, boycott, British, lamb, .]|[NNP, VBZ, JJ, NN, TO, VB, JJ, NN, .]|[B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]|
|Peter Blackburn                                 |[Peter, Blackburn]                                        |[NNP, NNP]                           |[B-PER, I-PER]                           |
|BRUSSELS 1996-08-22                             |[BRUSSELS, 1996-08-22]                                    |[NNP, CD]                            |[B-LOC, O]                               |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
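
As noted above, the loaded DataFrame can be used to train a NerDLApproach. A minimal training sketch, assuming a pretrained WordEmbeddingsModel is available for download; the corpus path is a placeholder:

>>> from pyspark.ml import Pipeline
>>> from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
>>> from sparknlp.training import CoNLL
>>> trainingData = CoNLL().readDataset(spark, "path/to/eng.train")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> nerTagger = NerDLApproach() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setMaxEpochs(1)
>>> pipeline = Pipeline(stages=[embeddings, nerTagger])
>>> pipelineModel = pipeline.fit(trainingData)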

Methods

__init__([documentCol, sentenceCol, ...])

apply()

new_java_array(pylist, java_class)

ToDo: Inspired by Spark 2.0.

new_java_array_integer(pylist)

new_java_array_string(pylist)

new_java_obj(java_class, *args)

readDataset(spark, path[, read_as])

Reads the dataset from an external resource.

new_java_array(pylist, java_class)

ToDo: Inspired by Spark 2.0. Review if Spark changes.

readDataset(spark, path, read_as='TEXT')[source]

Reads the dataset from an external resource.

Parameters

spark : pyspark.sql.SparkSession

Initiated Spark Session with Spark NLP

path : str

Path to the resource

read_as : str, optional

How to read the resource, by default ReadAs.TEXT

Returns

pyspark.sql.DataFrame

Spark DataFrame with the data
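
A short usage sketch, assuming ReadAs is imported from sparknlp.common (read_as=ReadAs.TEXT is already the default):

>>> from sparknlp.common import ReadAs
>>> from sparknlp.training import CoNLL
>>> trainingData = CoNLL().readDataset(
...     spark,
...     "src/test/resources/conll2003/eng.train",
...     read_as=ReadAs.TEXT
... )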