Contains classes for CoNLL.

Module Contents




class CoNLL(documentCol='document', sentenceCol='sentence', tokenCol='token', posCol='pos', conllLabelIndex=3, conllPosIndex=1, textCol='text', labelCol='label', explodeSentences=True, delimiter=' ')

Instantiates the class to read a CoNLL dataset.

The dataset must be in the CoNLL 2003 format. Load it with readDataset(), which returns a dataframe with the data.

Can be used to train a NerDLApproach.

Input File Format:

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
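
To make the column layout concrete, here is a minimal pure-Python sketch of how rows in this format split into tokens, POS tags, and NER labels. It is an illustration only, not Spark NLP's implementation. The function name parse_conll_rows and its parameter names are hypothetical; the defaults mirror conllPosIndex=1, conllLabelIndex=3, and delimiter=' ' from the class signature, and the sketch assumes the token always sits in column 0.

```python
def parse_conll_rows(rows, delimiter=" ", pos_index=1, label_index=3):
    """Split CoNLL rows into parallel token, POS, and label lists.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    tokens, pos_tags, labels = [], [], []
    for row in rows:
        row = row.strip()
        if not row:
            continue  # skip blank lines, which separate sentences
        columns = row.split(delimiter)
        tokens.append(columns[0])             # assumption: token is column 0
        pos_tags.append(columns[pos_index])   # POS tag column
        labels.append(columns[label_index])   # NER label column
    return tokens, pos_tags, labels


rows = [
    "rejects VBZ B-VP O",
    "call NN I-NP O",
    "British JJ B-NP B-MISC",
]
tokens, pos_tags, labels = parse_conll_rows(rows)
print(tokens)
print(pos_tags)
print(labels)
```

The third column (chunk tags such as B-VP) is ignored here because neither index points at it.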

Parameters:

documentCol : str, optional

Name of the DocumentAssembler column, by default 'document'

sentenceCol : str, optional

Name of the SentenceDetector column, by default 'sentence'

tokenCol : str, optional

Name of the Tokenizer column, by default 'token'

posCol : str, optional

Name of the PerceptronModel column, by default 'pos'

conllLabelIndex : int, optional

Index of the label column in the dataset, by default 3

conllPosIndex : int, optional

Index of the POS tags in the dataset, by default 1

textCol : str, optional

Name of the text column, by default 'text'

labelCol : str, optional

Name of the label column, by default 'label'

explodeSentences : bool, optional

Whether to explode sentences to separate rows, by default True

delimiter : str, optional

Delimiter used to separate columns inside the CoNLL file, by default ' '
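
The effect of explodeSentences can be pictured with a small pure-Python sketch. It is not the library's code, and it rests on two assumptions: that blank lines separate sentences and -DOCSTART- lines separate documents (the CoNLL 2003 convention), and that disabling the option keeps all sentences of a document in one row. The helper name group_rows and the sample lines are hypothetical.

```python
def group_rows(lines, explode_sentences=True):
    """Group CoNLL lines into rows, each row being a list of sentence texts.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    documents, sentence = [[]], []
    for line in list(lines) + [""]:          # trailing "" flushes the last sentence
        token = line.strip().split(" ")[0]
        if token in ("", "-DOCSTART-"):
            if sentence:                      # close the sentence in progress
                documents[-1].append(" ".join(sentence))
                sentence = []
            if token == "-DOCSTART-":         # start collecting a new document
                documents.append([])
        else:
            sentence.append(token)
    documents = [doc for doc in documents if doc]
    if explode_sentences:
        # one row per sentence
        return [[text] for doc in documents for text in doc]
    # one row per document, holding all of its sentences
    return documents


sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "Peter NNP B-NP B-PER",
    "Blackburn NNP I-NP I-PER",
    "",
    "BRUSSELS NNP B-NP B-LOC",
    "1996-08-22 CD I-NP O",
]
print(group_rows(sample, explode_sentences=True))
print(group_rows(sample, explode_sentences=False))
```

Under these assumptions the first call yields two rows of one sentence each, while the second yields a single row containing both sentences.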


Examples:

>>> from sparknlp.training import CoNLL
>>> trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> trainingData.selectExpr(
...     "text",
...     "token.result as tokens",
...     "pos.result as pos",
...     "label.result as label"
... ).show(3, False)
|text                                            |tokens                                                    |pos                                  |label                                    |
|EU rejects German call to boycott British lamb .|[EU, rejects, German, call, to, boycott, British, lamb, .]|[NNP, VBZ, JJ, NN, TO, VB, JJ, NN, .]|[B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]|
|Peter Blackburn                                 |[Peter, Blackburn]                                        |[NNP, NNP]                           |[B-PER, I-PER]                           |
|BRUSSELS 1996-08-22                             |[BRUSSELS, 1996-08-22]                                    |[NNP, CD]                            |[B-LOC, O]                               |

readDataset(self, spark, path, read_as=ReadAs.TEXT, partitions=8, storage_level=pyspark.StorageLevel.DISK_ONLY)

Reads the dataset from an external resource.


Parameters:

spark : pyspark.sql.SparkSession

Initiated Spark Session with Spark NLP

path : str

Path to the resource. It can take two forms: a path to a single CoNLL file, or a path to a folder containing multiple CoNLL files. When the path points to a folder, it must end in '*'. Examples:

"/path/to/single/file.conll"
"/path/to/folder/containing/multiple/files/*"

read_as : str, optional

How to read the resource, by default ReadAs.TEXT

partitions : int, optional

Minimum number of partitions to use when loading multiple files in parallel into a single dataframe, by default 8

storage_level : pyspark.StorageLevel, optional

Persistence level, as defined by PySpark, by default StorageLevel.DISK_ONLY. Applies only when loading multiple files.

Returns:

pyspark.sql.DataFrame

Spark Dataframe with the data
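
As a rough sketch of the folder form, the snippet below builds a path ending in '*' and passes it to readDataset() together with the partitions and storage_level arguments from the signature above. The helper names conll_folder_glob and read_conll_folder are hypothetical, and the example path is made up. The Spark-related imports are deferred into the function so that the path helper can be tried without a Spark installation.

```python
import posixpath


def conll_folder_glob(folder):
    """Return a path ending in '*' for a folder of CoNLL files.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    return posixpath.join(folder.rstrip("/"), "*")


def read_conll_folder(spark, folder, partitions=8):
    """Load every CoNLL file under `folder` into a single DataFrame.

    `spark` is expected to be a session already started with Spark NLP.
    """
    # Deferred imports: only needed when the dataset is actually read.
    from pyspark import StorageLevel
    from sparknlp.training import CoNLL

    return CoNLL().readDataset(
        spark,
        conll_folder_glob(folder),
        partitions=partitions,
        storage_level=StorageLevel.DISK_ONLY,
    )


print(conll_folder_glob("/data/conll2003/"))
```

With a running session, read_conll_folder(spark, "/data/conll2003") would be the corresponding call; the partitions and storage_level arguments matter only in this multi-file case.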