Training

Training Datasets

POS Dataset

In order to train a Part-of-Speech (POS) tagger annotator, we need the corpus data as a Spark DataFrame. A dedicated component does this for us: it reads a plain-text file and transforms it into a Spark dataset.

Input File Format:

A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN

Available parameters are:

  • spark: Spark session
  • path(string): Path to file with corpus data for training POS
  • delimiter(string): Delimiter between each token and its POS tag. Defaults to |
  • outputPosCol(string): Name of the column with the POS values. Defaults to "tags".

Example:

Refer to the POS Scala docs for more details on the API.

Python:

from sparknlp.training import POS
train_pos = POS().readDataset(spark, "./src/main/resources/anc-pos-corpus")

Scala:

import com.johnsnowlabs.nlp.training.POS
val trainPOS = POS().readDataset(spark, "./src/main/resources/anc-pos-corpus")
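
Once read, the DataFrame can be fed straight into a POS trainer. The following is a minimal Python sketch, not a canonical recipe: it assumes the reader's output already contains a document column along with the default "tags" column, and uses Spark NLP's PerceptronApproach with an illustrative iteration count.

from pyspark.ml import Pipeline
from sparknlp.annotator import Tokenizer, PerceptronApproach

# The reader already provides "document" and "tags" columns,
# so only tokenization is still needed before training.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pos_tagger = PerceptronApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos") \
    .setPosColumn("tags") \
    .setNIterations(2)  # illustrative; tune for your corpus

pos_model = Pipeline(stages=[tokenizer, pos_tagger]).fit(train_pos)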

CoNLL Dataset

In order to train a Named Entity Recognition (NER) DL annotator, we need CoNLL-format data as a Spark DataFrame. A dedicated component does this for us: it reads a plain-text file and transforms it into a Spark dataset.

Constructor parameters:

  • documentCol: String = "document"
  • sentenceCol: String = "sentence"
  • tokenCol: String = "token"
  • posCol: String = "pos"
  • conllLabelIndex: Int = 3
  • conllPosIndex: Int = 1
  • conllTextCol: String = "text"
  • labelCol: String = "label"
  • explodeSentences: Boolean = false

Available parameters are:

  • spark: Spark session
  • path(string): Path to a CoNLL 2003 IOB NER file.
  • readAs(string): Can be LINE_BY_LINE or SPARK_DATASET, with additional options if the latter is used (defaults to LINE_BY_LINE)

Example:

Refer to the CoNLL Scala docs for more details on the API.

Python:

from sparknlp.training import CoNLL
training_conll = CoNLL().readDataset(spark, "./src/main/resources/conll2003/eng.train")

Scala:

import com.johnsnowlabs.nlp.training.CoNLL
val trainingConll = CoNLL().readDataset(spark, "./src/main/resources/conll2003/eng.train")
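
From here the DataFrame can be used to train an NER model. The sketch below is one possible setup under stated assumptions: it relies on pretrained GloVe embeddings (fetched on first use by WordEmbeddingsModel.pretrained()) and trains NerDLApproach on the reader's default label column.

from pyspark.ml import Pipeline
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach

# Pretrained 100-dimensional GloVe embeddings; downloaded on first use.
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_trainer = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(1)  # illustrative; real training needs more epochs

ner_model = Pipeline(stages=[embeddings, ner_trainer]).fit(training_conll)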

Spell Checkers Dataset

In order to train the Norvig or Symmetric spell checkers, we need corpus data as a Spark DataFrame. We can read a plain-text file and transform it into a Spark dataset.

Example:

Python:

train_corpus = spark.read.text("./sherlockholmes.txt") \
    .withColumnRenamed("value", "text")

Scala:

val trainCorpus = spark.read.text("./sherlockholmes.txt")
  .withColumnRenamed("value", "text")
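
As a sketch of how this corpus might be used, the pipeline below trains NorvigSweetingApproach; the dictionary path ./words.txt is a placeholder for a plain word-list file you must supply yourself.

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, NorvigSweetingApproach

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

spell_checker = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("checked") \
    .setDictionary("./words.txt")  # placeholder path to a word-list dictionary

spell_model = Pipeline(stages=[document, tokenizer, spell_checker]).fit(train_corpus)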

Vivekn Sentiment Analysis Dataset

To train a ViveknSentimentApproach, the annotator needs DOCUMENT and TOKEN input columns, plus a string column, indicated with setSentimentCol, whose values are either positive or negative.
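
A minimal sketch of such a setup follows; both the tiny hand-made training data and the sentiment_label column name are illustrative, not part of the API.

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, ViveknSentimentApproach

# Illustrative training data; real corpora are of course much larger.
training = spark.createDataFrame([
    ("I really liked this movie!", "positive"),
    ("The last movie I watched was boring.", "negative"),
], ["text", "sentiment_label"])

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

sentiment = ViveknSentimentApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("sentiment") \
    .setSentimentCol("sentiment_label")

sentiment_model = Pipeline(stages=[document, tokenizer, sentiment]).fit(training)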
