Spark NLP - Documentation and Reference

Last updated:

Concepts

Spark NLP Imports: Getting ready to work

Since version 1.5.0, the necessary imports are easy to reach: sparknlp.base includes general Spark NLP transformers and concepts, while sparknlp.annotator includes all annotators that we currently provide. We also need Spark ML pipelines.
Example:

from sparknlp.base import *
from sparknlp.annotator import *

Since version 1.5.0, the necessary imports are easy to reach: com.johnsnowlabs.nlp.base._ includes general Spark NLP transformers and concepts, while com.johnsnowlabs.nlp.annotator._ includes all annotators that we currently provide. We also need Spark ML pipelines.
Example:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._

Spark ML Pipelines

Spark ML Pipelines are a uniform structure that helps create and tune practical machine learning pipelines. Spark NLP integrates with them seamlessly, so it is important to have this concept handy. Once a Pipeline is trained with fit(), it becomes a PipelineModel.
Example:

from pyspark.ml import Pipeline
pipeline = Pipeline().setStages([...])

Spark ML Pipelines are a uniform structure that helps create and tune practical machine learning pipelines. Spark NLP integrates with them seamlessly, so it is important to have this concept handy. Once a Pipeline is trained with fit(), it becomes a PipelineModel.
Example:

import org.apache.spark.ml.Pipeline
new Pipeline().setStages(Array(...))
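
As a minimal sketch (assuming documentAssembler and tokenizer stages like the ones defined later in this reference, plus trainingData and testData DataFrames with a text column), fitting a Pipeline yields a PipelineModel that can transform new data:

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))
val pipelineModel = pipeline.fit(trainingData)      // training step: Pipeline -> PipelineModel
val annotated = pipelineModel.transform(testData)   // apply the fitted pipeline to new data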

LightPipeline: A super-fast Spark-NLP pipeline for small data

LightPipelines are Spark ML pipelines converted into a single-machine, multithreaded task, becoming more than 10x faster for smaller amounts of data (50k lines of text or below). To use them, simply plug in a trained (fitted) pipeline.
Example:

from sparknlp.base import LightPipeline
LightPipeline(someTrainedPipeline).annotate(someStringOrArray)
Functions:
  • annotate(string or string[]): returns a dictionary (or a list of dictionaries) with the annotation results
  • fullAnnotate(string or string[]): returns a dictionary (or a list of dictionaries) with the entire annotation content

LightPipelines are Spark ML pipelines converted into a single-machine, multithreaded task, becoming more than 10x faster for smaller amounts of data (50k lines of text or below). To use them, simply plug in a trained (fitted) pipeline.
Example:

import com.johnsnowlabs.nlp.LightPipeline
new LightPipeline(somePipelineModel).annotate(someStringOrArray)
Functions:
  • annotate(string or string[]): returns a dictionary (or a list of dictionaries) with the annotation results
  • fullAnnotate(string or string[]): returns a dictionary (or a list of dictionaries) with the entire annotation content
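
For example (a sketch in Scala; the available output columns such as "token" depend on the fitted pipeline used), annotate returns one map per input string, keyed by output column name:

val light = new LightPipeline(somePipelineModel)
val results = light.annotate(Array("Hello world", "Spark NLP is fast"))
// each element maps an output column name (e.g. "token") to its annotation results
results.foreach(result => println(result("token")))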

RecursivePipeline: A smarter Spark-NLP pipeline

Recursive pipelines are Spark NLP specific pipelines that allow a Spark ML Pipeline to know about itself at every Pipeline Stage, letting annotators use this same pipeline against external resources and process them in the same way the user decides. Only some of our annotators take advantage of this. RecursivePipeline behaves exactly the same as normal Spark ML pipelines, so they can be used with the same intention.
Example:

from sparknlp.annotator import *
recursivePipeline = RecursivePipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        lemmatizer,
        finisher
      ])

Recursive pipelines are Spark NLP specific pipelines that allow a Spark ML Pipeline to know about itself at every Pipeline Stage, letting annotators use this same pipeline against external resources and process them in the same way the user decides. Only some of our annotators take advantage of this. RecursivePipeline behaves exactly the same as normal Spark ML pipelines, so they can be used with the same intention.
Example:

import com.johnsnowlabs.nlp.RecursivePipeline
val recursivePipeline = new RecursivePipeline()
      .setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        lemmatizer,
        finisher
      ))

ExternalResource: Data properties outside the pipeline

ExternalResource represents the properties of external data to be read, usually by the ResourceHelper (which is explained below). It contains information on how such an external source may be read, and allows different protocols (hdfs, s3, etc.) and formats (csv, text, parquet, etc.). Users do not usually need to create External Resources explicitly, but function parameters often ask for the elements it uses.
Example:

regex_matcher = RegexMatcher() \
    .setStrategy("MATCH_ALL") \
    .setExternalRules(path="/some/path", delimiter=",", read_as=ReadAs.LINE_BY_LINE, options={"format": "parquet"}) \
    .setOutputCol("regex")
Arguments:
  • path -> Takes a path (with protocol) to the destination file or folder
  • ReadAs -> "LINE_BY_LINE" or "SPARK_DATASET" tells Spark NLP whether or not to use Spark for this file or folder
  • options -> Contains information passed to Spark reader (e.g. format: "text") and other useful options for annotators (e.g. delimiter)

ExternalResource represents the properties of external data to be read, usually by the ResourceHelper (which is explained below). It contains information on how such an external source may be read, and allows different protocols (hdfs, s3, etc.) and formats (csv, text, parquet, etc.). Users do not usually need to create External Resources explicitly, but function parameters often ask for the elements it uses.
Example:

val posTagger = new PerceptronApproach()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")
    .setNIterations(1)
    .setCorpus(ExternalResource("/some/path/110CYL067.txt", ReadAs.LINE_BY_LINE, Map("delimiter" -> "|")))
Arguments:
  • path -> Takes a path (with protocol) to the destination file or folder
  • ReadAs -> "LINE_BY_LINE" or "SPARK_DATASET" tells Spark NLP whether or not to use Spark for this file or folder
  • options -> Contains information passed to Spark reader (e.g. format: "text") and other useful options for annotators (e.g. delimiter)

ResourceHelper: Deal with data outside the pipeline

When working with external resources, like training data that is not part of the pipeline process, our annotators use the ResourceHelper to efficiently parse and extract data into specific formats. This class may be utilized for other purposes by the user (Only in Scala)

When working with external resources, like training data that is not part of the pipeline process, our annotators use the ResourceHelper to efficiently parse and extract data into specific formats. This class may be utilized for other purposes by the user.

Functions (not all of them listed):
  • createDatasetFromText(path, includeFilename, includeRowNumber, aggregateByFile) -> Takes file or folder and builds up an aggregated dataset
  • parseKeyValueText(externalResource) -> Parses delimited text with delimiter
  • parseLines(externalResource) -> Parses line by line text
  • parseTupleText(externalResource) -> Parses a text as a delimited tuple
  • parseTupleSentences(externalResource) -> Parses tagged tokens with a specific delimiter
  • wordCount(externalResources) -> Counts appearance of each word in text
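
A minimal Scala sketch using parseKeyValueText from the list above (the import path and the sample file path and delimiter are assumptions):

import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs, ResourceHelper}
// Parse a delimited key-value text file (e.g. "word|lemma" per line) into a Map
val keyValues = ResourceHelper.parseKeyValueText(
  ExternalResource("/some/path/lemmas.txt", ReadAs.LINE_BY_LINE, Map("delimiter" -> "|"))
)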

EmbeddingsHelper: Deal with word embeddings

Allows loading, saving and setting word embeddings for annotators

Functions (not all of them listed):
  • loadEmbeddings(path, spark, format, dims, caseSensitive) -> Loads embeddings from disk in any format possible: 'TEXT', 'BINARY', 'SPARKNLP'
  • setEmbeddingsRef(ref, embeddings) -> Makes previously loaded embeddings available to annotators through the user-provided 'ref' name
  • saveEmbeddings(path, embeddings, spark) -> Saves provided embeddings to path, using current SparkSession
  • clearCache() -> clears reference and cached embeddings

Allows loading, saving and setting word embeddings for annotators

Functions (not all of them listed):
  • loadEmbeddings(path, spark, format, dims, caseSensitive) -> Loads embeddings from disk in any format possible: 'TEXT', 'BINARY', 'SPARKNLP'
  • setEmbeddingsRef(ref, embeddings) -> Makes previously loaded embeddings available to annotators through the user-provided 'ref' name
  • saveEmbeddings(path, embeddings, spark) -> Saves provided embeddings to path, using current SparkSession
  • clearCache() -> clears reference and cached embeddings
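
A minimal Scala sketch based on the functions listed above (the import path is an assumption and signatures may vary between versions):

import com.johnsnowlabs.nlp.embeddings.EmbeddingsHelper
// Load GloVe-style text embeddings (100 dimensions, case-insensitive)
// and register them under a reference name for annotators to reuse
val embeddings = EmbeddingsHelper.loadEmbeddings("/some/path/glove.6B.100d.txt", spark, "TEXT", 100, false)
EmbeddingsHelper.setEmbeddingsRef("glove_100d", embeddings)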

Annotator with Word Embeddings

Some annotators use word embeddings. This is a common functionality within them.

Functions (not all of them listed):
  • setIncludeEmbeddings(bool) -> Param to define whether or not to include word embeddings when saving this annotator to disk (single or within pipeline)
  • setEmbeddingsRef(ref) -> Sets the embeddings to use by the provided reference name. These embeddings will be looked up from the cache by that name, which allows multiple annotators to share the same word embeddings.

Some annotators use word embeddings. This is a common functionality within them.

Functions (not all of them listed):
  • setIncludeEmbeddings(bool) -> Param to define whether or not to include word embeddings when saving this annotator to disk (single or within pipeline)
  • setEmbeddingsRef(ref) -> Sets the embeddings to use by the provided reference name. These embeddings will be looked up from the cache by that name, which allows multiple annotators to share the same word embeddings.
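
A minimal Scala sketch (assuming a NerDLModel, which uses word embeddings, exposes the setters listed above):

import com.johnsnowlabs.nlp.annotator.NerDLModel
val ner = NerDLModel.pretrained()
  .setEmbeddingsRef("glove_100d")   // look up embeddings cached under this reference name
  .setIncludeEmbeddings(false)      // do not serialize embeddings when saving this annotator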

Params and Features: Annotator parameters

Spark ML uses ML Params to store pipeline parameter maps. In Spark NLP, we also use Features, which are a way to store parameter maps that are larger than just a string or a boolean. These features are serialized as either Parquet or RDD objects, allowing much faster and more scalable handling of annotator information. Features are also broadcast among executors for better performance.

Spark ML uses ML Params to store pipeline parameter maps. In Spark NLP, we also use Features, which are a way to store parameter maps that are larger than just a string or a boolean. These features are serialized as either Parquet or RDD objects, allowing much faster and more scalable handling of annotator information. Features are also broadcast among executors for better performance.

Pretrained Pipelines

BasicPipeline: common NLP annotations

BasicPipeline easily returns tokens, normalized tokens, lemmas and part-of-speech tags. It can take either a Spark dataset, a string, or an array of strings (using LightPipelines behind the scenes). It requires an internet connection to download the pipeline from our servers.
Example:

from sparknlp.pretrained.pipeline.en import BasicPipeline
#Annotate with pipeline
BasicPipeline().annotate("Please parse this sentence. Thanks")
BasicPipeline().annotate(["This is a first sentence", "This is another one"])
BasicPipeline().annotate(dataset, "textColumn")
#Just get the pipeline
BasicPipeline().pretrained()

BasicPipeline easily returns tokens, normalized tokens, lemmas and part-of-speech tags. It can take either a Spark dataset, a string, or an array of strings (using LightPipelines behind the scenes). It requires an internet connection to download the pipeline from our servers.
Example:

import com.johnsnowlabs.nlp.pretrained.pipelines.en.BasicPipeline
//Annotate with pipeline
BasicPipeline().annotate("Please parse this sentence. Thanks")
BasicPipeline().annotate(["This is a first sentence", "This is another one"])
BasicPipeline().annotate(dataset, "textColumn")
//Just get the pipeline
BasicPipeline().pretrained()

AdvancedPipeline: All NLP annotations

The AdvancedPipeline returns the same as the BasicPipeline, plus stems, spell-checked tokens and NER entities using the CRF model. It requires an internet connection to download it from our servers.
Example:

from sparknlp.pretrained.pipeline.en import AdvancedPipeline
#Annotate with pipeline
AdvancedPipeline().annotate("Please parse this sentence. Thanks")
AdvancedPipeline().annotate(["This is a first sentence", "This is another one"])
AdvancedPipeline().annotate(dataset, "textColumn")
#Just get the pipeline
AdvancedPipeline().pretrained()

The AdvancedPipeline returns the same as the BasicPipeline, plus stems, spell-checked tokens and NER entities using the CRF model. It requires an internet connection to download it from our servers.
Example:

import com.johnsnowlabs.nlp.pretrained.pipelines.en.AdvancedPipeline
//Annotate with pipeline
AdvancedPipeline().annotate("Please parse this sentence. Thanks")
AdvancedPipeline().annotate(["This is a first sentence", "This is another one"])
AdvancedPipeline().annotate(dataset, "textColumn")
//Just get the pipeline
AdvancedPipeline().pretrained()

SentimentPipeline: Vivekn Sentiment analysis

This pipeline takes a dataset or text or array of text and computes sentiment analysis with spell checking included
Example:

from sparknlp.pretrained.pipeline.en import SentimentPipeline
#Annotate with pipeline
AdvancedPipeline().annotate("Please parse this sentence. Thanks")
AdvancedPipeline().annotate(["This is a first sentence", "This is another one"])
AdvancedPipeline().annotate(dataset, "textColumn")
#Just get the pipeline
SentimentPipeline().pretrained()

This pipeline takes a dataset or text or array of text and computes sentiment analysis with spell checking included
Example:

import com.johnsnowlabs.nlp.pretrained.pipelines.en.SentimentPipeline
//Annotate with pipeline
SentimentPipeline().annotate(data, "textColumn")
//Just get the pipeline
SentimentPipeline().pretrained()

Pretrained Models

Fast lemmatizer

Lemmatizer trained with AntBNC Free corpus
Example:

from sparknlp.annotator import LemmatizerModel
LemmatizerModel.pretrained()

Lemmatizer trained with AntBNC Free corpus
Example:

import com.johnsnowlabs.nlp.annotator.LemmatizerModel
LemmatizerModel.pretrained()

Fast Norvig Spell Checker

Spell Checker trained with Wikipedia corpus
Example:

from sparknlp.annotator import NorvigSweetingModel
NorvigSweetingModel.pretrained()

Spell Checker trained with Wikipedia corpus
Example:

import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
NorvigSweetingModel.pretrained()

Fast SymmDelete Spell Checker

Symmetric delete Spell Checker trained with Wikipedia corpus
Example:

from sparknlp.annotator import SymmetricDeleteModel
SymmetricDeleteModel.pretrained()

Symmetric delete Spell Checker trained with Wikipedia corpus
Example:

import com.johnsnowlabs.nlp.annotator.SymmetricDeleteModel
SymmetricDeleteModel.pretrained()

Fast Part of Speech

Part of Speech trained with ANC American Corpus
Example:

from sparknlp.annotator import PerceptronModel
PerceptronModel.pretrained()

Part of Speech trained with ANC American Corpus
Example:

import com.johnsnowlabs.nlp.annotator.PerceptronModel
PerceptronModel.pretrained()

Fast CRF Named Entity Recognition

Named entity recognition model trained with Glove embeddings
Example:

from sparknlp.annotator import NerCrfModel
NerCrfModel.pretrained()

Named entity recognition model trained with Glove embeddings
Example:

import com.johnsnowlabs.nlp.annotator.NerCrfModel
NerCrfModel.pretrained()

Fast Deep Learning NER

Named entity recognition deep learning model trained with Glove embeddings
Example:

from sparknlp.annotator import NerDLModel
NerDLModel.pretrained()

Named entity recognition deep learning model trained with Glove embeddings
Example:

import com.johnsnowlabs.nlp.annotator.NerDLModel
NerDLModel.pretrained()

Logistic Regression Assertion Status

Logistic Regression based assertion status model on top of Glove word embeddings
Example:

from sparknlp.annotator import AssertionLogRegModel
AssertionLogRegModel.pretrained()

Logistic Regression based assertion status model on top of Glove word embeddings
Example:

import com.johnsnowlabs.nlp.annotator.AssertionLogRegModel
AssertionLogRegModel.pretrained()

Fast Bi-LSTM Assertion Status

Assertion status trained on TensorFlow and Spark-NLP utilizing Glove word embeddings
Example:

from sparknlp.annotator import AssertionDLModel
AssertionDLModel.pretrained()

Assertion status trained on TensorFlow and Spark-NLP utilizing Glove word embeddings
Example:

import com.johnsnowlabs.nlp.annotator.AssertionDLModel
AssertionDLModel.pretrained()

Transformers

DocumentAssembler: Getting data in

In order to get through the NLP process, we need to get raw data annotated. There is a special transformer that does this for us: it creates the first annotation of type Document which may be used by annotators down the road. It can read either a String column or an Array[String]
Example:

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
Settable parameters are:
  • setInputCol()
  • setOutputCol()
  • setIdCol() -> OPTIONAL: String type column with id information
  • setMetadataCol() -> OPTIONAL: Map type column with metadata information
  • setTrimAndClearNewLines(bool) -> Whether to remove new line characters and trim strings. Defaults to true. Useful for later sentence detection if the content contains multiple lines.

In order to get through the NLP process, we need to get raw data annotated. There is a special transformer that does this for us: it creates the first annotation of type Document which may be used by annotators down the road. It can read either a String column or an Array[String]
Example:

import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
Settable parameters are:
  • setInputCol()
  • setOutputCol()
  • setIdCol() -> OPTIONAL: String type column with id information
  • setMetadataCol() -> OPTIONAL: Map type column with metadata information
  • setTrimAndClearNewLines(bool) -> Whether to remove new line characters and trim strings. Defaults to true. Useful for later sentence detection if the content contains multiple lines.

TokenAssembler: Getting data reshaped

This transformer reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.
Example:

token_assembler = TokenAssembler() \
  .setInputCols(["normalized"]) \
  .setOutputCol("assembled")
Settable parameters are:
  • setInputCol()
  • setOutputCol()

This transformer reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc, in order to use this document annotation in further annotators.
Example:

token_assembler = TokenAssembler()
  .setInputCols("normalized")
  .setOutputCol("assembled")
Settable parameters are:
  • setInputCol()
  • setOutputCol()

Finisher: Getting data out

Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where they are easy to use. The Finisher outputs annotation values as strings.
Example:

finisher = Finisher() \
  .setInputCols(["sentiment"]) \
  .setIncludeKeys(True)
Settable parameters are:
  • setInputCols()
  • setOutputCols()
  • setCleanAnnotations(True) -> Whether to remove intermediate annotations
  • setValueSplitSymbol("#") -> character used to split values within an annotation
  • setAnnotationSplitSymbol("@") -> character used to split values between annotations
  • setIncludeMetadata(False) -> Whether to include metadata keys. Sometimes useful in some annotations
  • setOutputAsArray(False) -> Whether to output as Array. Useful as input for other Spark transformers.

Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where they are easy to use. The Finisher outputs annotation values as strings.
Example:

val finisher = new Finisher()
  .setInputCols("token")
Settable parameters are:
  • setInputCols()
  • setOutputCols()
  • setCleanAnnotations(true) -> Whether to remove intermediate annotations
  • setValueSplitSymbol("#") -> character used to split values within an annotation
  • setAnnotationSplitSymbol("@") -> character used to split values between annotations
  • setIncludeMetadata(false) -> Whether to include metadata keys. Sometimes useful in some annotations
  • setOutputAsArray(false) -> Whether to output as Array. Useful as input for other Spark transformers.

Annotators

Tokenizer: Word tokens

Identifies tokens using open tokenization standards. A few rules will help you customize it if the defaults do not fit your needs.
Type: Token
Requires: Document
Functions:

  • setTargetPattern: Basic regex rule to identify a candidate for tokenization. Defaults to \S+ which means anything not a space
  • setSuffixPattern: Regex to identify subtokens that are at the end of the token. The regex has to end with \z and must contain groups (). Each group will become a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses
  • setPrefixPattern: Regex to identify subtokens that come at the beginning of the token. The regex has to start with \A and must contain groups (). Each group will become a separate token within the prefix. Defaults to non-letter characters, e.g. quotes or parentheses
  • setExtensionPatterns: Array of Regex with groups () to match subtokens within the target pattern. Every group () will become its own separate token. Order matters (later rules will apply first). Its default rules should cover most cases, e.g. part-of-speech as single token
  • addInfixPattern: Add an extension pattern regex with groups to the top of the rules (will target first, from more specific to the more general).
  • setCompositeTokens: Adds a list of compound words to keep as single tokens, e.g. New York, so it doesn't get split into New and York

Example:

tokenizer = Tokenizer() \
  .setInputCols(["sentences"]) \
  .setOutputCol("token") \
  .addInfixPattern("(\p{L}+)(n't\b)")

Identifies tokens using open tokenization standards. A few rules will help you customize it if the defaults do not fit your needs.
Type: Token
Requires: Document
Functions:

  • setTargetPattern: Basic regex rule to identify a candidate for tokenization. Defaults to \S+ which means anything not a space
  • setSuffixPattern: Regex to identify subtokens that are at the end of the token. The regex has to end with \z and must contain groups (). Each group will become a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses
  • setPrefixPattern: Regex to identify subtokens that come at the beginning of the token. The regex has to start with \A and must contain groups (). Each group will become a separate token within the prefix. Defaults to non-letter characters, e.g. quotes or parentheses
  • setExtensionPatterns: Array of Regex with groups () to match subtokens within the target pattern. Every group () will become its own separate token. Order matters (later rules will apply first). Its default rules should cover most cases, e.g. part-of-speech as single token
  • addInfixPattern: Add an extension pattern regex with groups to the top of the rules (will target first, from more specific to the more general).
  • setCompositeTokens: Adds a list of compound words to keep as single tokens, e.g. New York, so it doesn't get split into New and York

Example:

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .addInfixPattern("(\p{L}+)(n't\b)")

Normalizer: Text cleaning

Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary
Type: Token
Requires: Token
Functions:

  • setPatterns(patterns): Regular expressions list for normalization, defaults to [^A-Za-z]
  • setLowercase(value): lowercase tokens, default true
  • setSlangDictionary(path): txt file with delimited words to be transformed into something else
Example:

normalizer = Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary
Type: Token
Requires: Token
Functions:

  • setPatterns(patterns): Regular expressions list for normalization, defaults to [^A-Za-z]
  • setLowercase(value): lowercase tokens, default true
  • setSlangDictionary(path): txt file with delimited words to be transformed into something else
Example:

val normalizer = new Normalizer()
  .setInputCols(Array("token"))
  .setOutputCol("normalized")

Stemmer: Hard stems

Returns hard-stems out of words with the objective of retrieving the meaningful part of the word
Type: Token
Requires: Token
Example:

stemmer = Stemmer() \
  .setInputCols(["token"]) \
  .setOutputCol("stem")

Returns hard-stems out of words with the objective of retrieving the meaningful part of the word
Type: Token
Requires: Token
Example:

val stemmer = new Stemmer()
  .setInputCols(Array("token"))
  .setOutputCol("stem")

Lemmatizer: Lemmas

Retrieves lemmas out of words with the objective of returning a base dictionary word
Type: Token
Requires: Token
Input: abduct -> abducted abducting abduct abducts
Functions:

  • setDictionary(path, keyDelimiter, valueDelimiter, readAs, options): Path and options to lemma dictionary, in lemma vs possible words format. readAs can be LINE_BY_LINE or SPARK_DATASET. options contain option passed to spark reader if readAs is SPARK_DATASET.
Example:

lemmatizer = Lemmatizer() \
  .setInputCols(["token"]) \
  .setOutputCol("lemma") \
  .setDictionary("./lemmas001.txt")

Retrieves lemmas out of words with the objective of returning a base dictionary word
Type: Token
Requires: Token
Input: abduct -> abducted abducting abduct abducts
Functions:

  • setDictionary(path, keyDelimiter, valueDelimiter, readAs, options): Path and options to lemma dictionary, in lemma vs possible words format. readAs can be LINE_BY_LINE or SPARK_DATASET. options contain option passed to spark reader if readAs is SPARK_DATASET.
Example:

val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("./lemmas001.txt")

RegexMatcher: Rule matching

Uses a reference file to match a set of regular expressions and put them inside a provided key. File must be comma separated.
Type: Regex
Requires: Document
Input: "the\s\w+", "followed by 'the'"
Functions:

  • setStrategy(strategy): Can be any of MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE
  • setRulesPath(path, delimiter, readAs, options): Path to file containing a set of regex,key pair. readAs can be LINE_BY_LINE or SPARK_DATASET. options contain option passed to spark reader if readAs is SPARK_DATASET.
Example:

regex_matcher = RegexMatcher() \
  .setStrategy("MATCH_ALL") \
  .setOutputCol("regex")

Uses a reference file to match a set of regular expressions and put them inside a provided key. File must be comma separated.
Type: Regex
Requires: Document
Input: "the\s\w+", "followed by 'the'"
Functions:

  • setStrategy(strategy): Can be any of MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE
  • setRulesPath(path, delimiter, readAs, options): Path to file containing a set of regex,key pair. readAs can be LINE_BY_LINE or SPARK_DATASET. options contain option passed to spark reader if readAs is SPARK_DATASET.
Example:

val regexMatcher = new RegexMatcher()
  .setStrategy(strategy)
  .setInputCols(Array("document"))
  .setOutputCol("regex")

TextMatcher: Phrase matching

Annotator to match entire phrases provided in a file against a Document
Type: Entity
Requires: Document
Input: hello world, I am looking for you
Functions:

  • setEntities(path, format, options): Provides a file with phrases to match. Default: Looks up path in configuration.
    path: a path to a file that contains the entities in the specified format.
    readAs: the format of the file, can be one of {ReadAs.LINE_BY_LINE, ReadAs.SPARK_DATASET}. Defaults to LINE_BY_LINE.
    options: a map of additional parameters. Defaults to {"format": "text"}.
Example:

entity_extractor = TextMatcher() \
 .setInputCols(["inputCol"])\
 .setOutputCol("entity")\
 .setEntities("/path/to/file/myentities.txt")

Annotator to match entire phrases provided in a file against a Document
Type: Entity
Requires: Document
Input: hello world, I am looking for you
Functions:

  • setEntities(path, format, options): Provides a file with phrases to match. Default: Looks up path in configuration.
    path: a path to a file that contains the entities in the specified format.
    readAs: the format of the file, can be one of {ReadAs.LINE_BY_LINE, ReadAs.SPARK_DATASET}. Defaults to LINE_BY_LINE.
    options: a map of additional parameters. Defaults to {"format": "text"}.
Example:

val entityExtractor = new TextMatcher()
 .setInputCols("inputCol")
 .setOutputCol("entity")
 .setEntities("/path/to/file/myentities.txt")

Chunker: Meaningful phrase matching

This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document

Type: Document
Requires: Document
Functions:
  • setRegexParsers(patterns): A list of regex patterns to match chunks, for example: Array("<DT>?<JJ>*<NN>")
  • addRegexParser(patterns): adds a pattern to the current list of chunk patterns, for example: "<DT>?<JJ>*<NN>"
Example:
chunker = Chunker() \
    .setInputCols(["pos"]) \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<DT|PP\\$>?<JJ>*<NN>"])

This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document

Type: Document
Requires: Document
Functions:
  • setRegexParsers(patterns): A list of regex patterns to match chunks, for example: Array("<DT>?<JJ>*<NN>")
  • addRegexParser(patterns): adds a pattern to the current list of chunk patterns, for example: "<DT>?<JJ>*<NN>"
Example:

val chunker = new Chunker()
  .setInputCols(Array("pos"))
  .setOutputCol("chunks")
  .setRegexParsers(Array("<NNP>+", "<DT|PP\\$>?<JJ>*<NN>"))

DateMatcher: Date-time parsing

Reads from different forms of date and time expressions and converts them to a provided date format. Extracts only ONE date per sentence. Use with sentence detector for more matches.
Type: Date
Requires: Document
Reads the following kind of dates:

  • 1978-01-28
  • 1984/04/02
  • 1/02/1980
  • 2/28/79
  • The 31st of April in the year 2008
  • Fri, 21 Nov 1997
  • Jan 21, '97
  • Sun, Nov 21
  • jan 1st
  • next thursday
  • last wednesday
  • today
  • tomorrow
  • yesterday
  • next week
  • next month
  • next year
  • day after
  • the day before
  • 0600h
  • 06:00 hours
  • 6pm
  • 5:30 a.m.
  • at 5
  • 12:59
  • 23:59
  • 1988/11/23 6pm
  • next week at 7.30
  • 5 am tomorrow
Functions:
  • setDateFormat(format): SimpleDateFormat standard date formatting. Defaults to yyyy/MM/dd
Example:

date_matcher = DateMatcher() \
  .setOutputCol("date") \
  .setDateFormat("yyyyMM")

Reads from different forms of date and time expressions and converts them to a provided date format. Extracts only ONE date per sentence. Use with sentence detector for more matches.
Type: Date
Requires: Document
Reads the following kind of dates:

  • 1978-01-28
  • 1984/04/02
  • 1/02/1980
  • 2/28/79
  • The 31st of April in the year 2008
  • Fri, 21 Nov 1997
  • Jan 21, '97
  • Sun, Nov 21
  • jan 1st
  • next thursday
  • last wednesday
  • today
  • tomorrow
  • yesterday
  • next week
  • next month
  • next year
  • day after
  • the day before
  • 0600h
  • 06:00 hours
  • 6pm
  • 5:30 a.m.
  • at 5
  • 12:59
  • 23:59
  • 1988/11/23 6pm
  • next week at 7.30
  • 5 am tomorrow
Functions:
  • setDateFormat(format): SimpleDateFormat standard date formatting. Defaults to yyyy/MM/dd
Example:

val dateMatcher = new DateMatcher()
  .setDateFormat("yyyyMM")
  .setOutputCol("date")

SentenceDetector: Sentence Boundary Detector

Finds sentence bounds in raw text. Applies rules from Pragmatic Segmenter.
Type: Document
Requires: Document
Functions:

  • setCustomBounds(string): Custom sentence separator text
  • setUseCustomOnly(bool): Use only custom bounds without considering those of Pragmatic Segmenter. Defaults to false. Needs customBounds.
  • setUseAbbreviations(bool): Whether to consider abbreviation strategies for better accuracy but slower performance. Defaults to true.
  • setExplodeSentences(bool): Whether to split sentences into different Dataset rows. Useful for higher parallelism in fat rows. Defaults to false.
Example:

sentence_detector = SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")

Finds sentence bounds in raw text. Applies rules from Pragmatic Segmenter.
Type: Document
Requires: Document
Functions:

  • setCustomBounds(string): Custom sentence separator text
  • setUseCustomOnly(bool): Use only custom bounds without considering those of Pragmatic Segmenter. Defaults to false. Needs customBounds.
  • setUseAbbreviations(bool): Whether to consider abbreviation strategies for better accuracy but slower performance. Defaults to true.
  • setExplodeSentences(bool): Whether to split sentences into different Dataset rows. Useful for higher parallelism in fat rows. Defaults to false.
Example:

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

POSTagger: Part of speech tagger

Sets a POS tag to each word within a sentence
Type: POS
Input: A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
Requires: Document, Token
Functions:

  • setCorpus(path, delimiter, readAs, options): path to file, delimiter of token and postag, readAs either LINE_BY_LINE or SPARK_DATASET, options passed to reader if SPARK_DATASET
  • setNIterations(number): Number of iterations for training. May improve accuracy but takes longer. Default 5.
  • setPosColumn(colname): Column containing an array of POS Tags matching every token on the line. If set, this column will be used during fit() stage to train from it instead of external corpora
Example:

pos_tagger = PerceptronApproach() \
  .setInputCols(["token", "sentence"]) \
  .setOutputCol("pos") \
  .setCorpusPath("./src/main/resources/anc-pos-corpus") \
  .setNIterations(2) \
  .fit(data)

Sets a POS tag to each word within a sentence
Type: POS
Input: A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
Requires: Document, Token
Functions:

  • setCorpus(path, delimiter, readAs, options): path to file, delimiter of token and postag, readAs either LINE_BY_LINE or SPARK_DATASET, options passed to reader if SPARK_DATASET
  • setNIterations(number): Number of iterations for training. May improve accuracy but takes longer. Default 5.
  • setPosColumn(colname): Column containing an array of POS Tags matching every token on the line. If set, this column will be used during fit() stage to train from it instead of external corpora
Example:

val posTagger = new PerceptronApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")
  .fit(data)

ViveknSentimentDetector: Sentiment analysis

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setSentimentCol(colname): Column with sentiment analysis row's result for training. If not set, external sources need to be set instead.
  • setPositiveSource(path, tokenPattern, readAs, options): Path to file or folder with positive sentiment text, with tokenPattern the regex pattern to match tokens in source. readAs either LINE_BY_LINE or as SPARK_DATASET. If latter is set, options is passed to reader
  • setNegativeSource(path, tokenPattern, readAs, options): Path to file or folder with negative sentiment text, with tokenPattern the regex pattern to match tokens in source. readAs either LINE_BY_LINE or as SPARK_DATASET. If latter is set, options is passed to reader
  • setPruneCorpus(true): when training on small data you may want to disable this so infrequent words are not cut off

Input: File or folder of text files of positive and negative data
Example:

sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["token", "sentence"]) \
    .setOutputCol("vivekn")

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setSentimentCol(colname): Column with sentiment analysis row's result for training. If not set, external sources need to be set instead.
  • setPositiveSource(path, tokenPattern, readAs, options): Path to file or folder with positive sentiment text, with tokenPattern the regex pattern to match tokens in source. readAs either LINE_BY_LINE or as SPARK_DATASET. If latter is set, options is passed to reader
  • setNegativeSource(path, tokenPattern, readAs, options): Path to file or folder with negative sentiment text, with tokenPattern the regex pattern to match tokens in source. readAs either LINE_BY_LINE or as SPARK_DATASET. If latter is set, options is passed to reader
  • setPruneCorpus(true): when training on small data you may want to disable this so infrequent words are not cut off

Input: File or folder of text files of positive and negative data
Example:

new ViveknSentimentApproach()
      .setInputCols(Array("token", "sentence"))
      .setOutputCol("vivekn")
      .setPositiveSourcePath("./positive/1.txt")
      .setNegativeSourcePath("./negative/1.txt")
      .setCorpusPrune(false)

SentimentDetector: Sentiment analysis

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setDictionary(path, delimiter, readAs, options): path to file with list of inputs and their content, with such delimiter, readAs LINE_BY_LINE or as SPARK_DATASET. If latter is set, options is passed to spark reader.
  • setPositiveMultiplier(double): Defaults to 1.0
  • setNegativeMultiplier(double): Defaults to -1.0
  • setIncrementMultiplier(double): Defaults to 2.0
  • setDecrementMultiplier(double): Defaults to -2.0
  • setReverseMultiplier(double): Defaults to -1.0

Input:
  • superb,positive
  • bad,negative
  • lack of,revert
  • very,increment
  • barely,decrement

Example:

sentiment_detector = SentimentDetector() \
  .setInputCols(["lemma", "sentence"]) \
  .setOutputCol("sentiment")

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setDictionary(path, delimiter, readAs, options): path to file with list of inputs and their content, with such delimiter, readAs LINE_BY_LINE or as SPARK_DATASET. If latter is set, options is passed to spark reader.
  • setPositiveMultiplier(double): Defaults to 1.0
  • setNegativeMultiplier(double): Defaults to -1.0
  • setIncrementMultiplier(double): Defaults to 2.0
  • setDecrementMultiplier(double): Defaults to -2.0
  • setReverseMultiplier(double): Defaults to -1.0

Input:
  • superb,positive
  • bad,negative
  • lack of,revert
  • very,increment
  • barely,decrement

Example:

val sentimentDetector = new SentimentDetector
  .setInputCols(Array("token", "sentence"))
  .setOutputCol("sentiment")

Named Entity Recognition CRF annotator

This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. Its inputs are either a labeled dataset with an Annotations column or an external CoNLL 2003 IOB based dataset, and optionally the user can provide both an entities dictionary and a word embeddings file for better accuracy
Type: named_entity
Requires: Document, token, pos
Functions:

  • setExternalDataset(path, readAs, options): Path to a CoNLL 2003 IOB NER and POS annotated file. If this is provided, the label column is not needed. readAs can be LINE_BY_LINE or SPARK_DATASET, with options if the latter is used
  • setLabelColumn: If DatasetPath is not provided, this Seq[Annotation] type of column should have labeled data per token
  • setMinEpochs: Minimum number of epochs to train
  • setMaxEpochs: Maximum number of epochs to train
  • setL2: L2 regularization coefficient for CRF
  • setC0: c0 defines decay speed for gradient
  • setLossEps: If the relative improvement in an epoch is less than this value, training is stopped
  • setMinW: Features with weights less than this value will be filtered out
  • setEmbeddingsSource(path, nDims, format): sets word embeddings options. path is the word embeddings file, nDims is the number of word embeddings dimensions, and format is the format of the word embeddings file:
    1 - spark-nlp format.
    2 - text. This format is usually used by Glove
    3 - binary. This format is usually used by Word2Vec
  • setExternalFeatures(path, delimiter, readAs, options): Path to a file or folder of line-separated files with entries such as Volvo:ORG, using the given delimiter; readAs LINE_BY_LINE or SPARK_DATASET with options passed to the latter.
  • setEntities: Array of entities to recognize
  • setVerbose: Verbosity level
  • setRandomSeed: Random seed
Example:

nerTagger = NerCrfApproach()\
  .setInputCols(["sentence", "token", "pos"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMinEpochs(1)\
  .setMaxEpochs(20)\
  .setLossEps(1e-3)\
  .setDicts(["ner-corpus/dict.txt"])\
  .setDatasetPath("eng.train")\
  .setEmbeddingsSource("glove.6B.100d.txt", 100, 2)\
  .setL2(1)\
  .setC0(1250000)\
  .setRandomSeed(0)\
  .setVerbose(2)

This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. Its inputs are either a labeled dataset with an Annotations column or an external CoNLL 2003 IOB based dataset, and optionally the user can provide both an entities dictionary and a word embeddings file for better accuracy
Type: named_entity
Requires: Document, token, pos
Functions:

  • setExternalDataset(path, readAs, options): Path to a CoNLL 2003 IOB NER and POS annotated file. If this is provided, the label column is not needed. readAs can be LINE_BY_LINE or SPARK_DATASET, with options if the latter is used
  • setLabelColumn: If DatasetPath is not provided, this Seq[Annotation] type of column should have labeled data per token
  • setMinEpochs: Minimum number of epochs to train
  • setMaxEpochs: Maximum number of epochs to train
  • setL2: L2 regularization coefficient for CRF
  • setC0: c0 defines decay speed for gradient
  • setLossEps: If the relative improvement in an epoch is less than this value, training is stopped
  • setMinW: Features with weights less than this value will be filtered out
  • setEmbeddingsSource(path, nDims, format): sets word embeddings options. path is the word embeddings file, nDims is the number of word embeddings dimensions, and format is the format of the word embeddings file:
    1 - spark-nlp format.
    2 - text. This format is usually used by Glove
    3 - binary. This format is usually used by Word2Vec
  • setExternalFeatures(path, delimiter, readAs, options): Path to a file or folder of line-separated files with entries such as Volvo:ORG, using the given delimiter; readAs LINE_BY_LINE or SPARK_DATASET with options passed to the latter.
  • setEntities: Array of entities to recognize
  • setVerbose: Verbosity level
  • setRandomSeed: Random seed
Example:

new NerCrfApproach()
  .setInputCols("sentence", "token", "pos")
  .setLabelColumn("label")
  .setMinEpochs(1)
  .setMaxEpochs(3)
  .setDatasetPath("src/test/resources/ner-corpus/test_ner_dataset.txt")
  .setEmbeddingsSource("src/test/resources/ner-corpus/test_embeddings.txt", 3, WordEmbeddingsFormat.Text)
  .setC0(34)
  .setL2(3.0)
  .setOutputCol("ner")
  .fit(df)

Named Entity Recognition Deep Learning annotator

This Named Entity recognition annotator allows you to train a generic NER model based on neural networks. Its inputs are either a labeled dataset with an Annotations column or an external CoNLL 2003 IOB based dataset. The user also has to provide a word embeddings file.
The neural network architecture is a Char CNN - BLSTM that achieves state-of-the-art results on most datasets.
Type: named_entity
Requires: Document, token
Functions:

  • setExternalDataset(path, readAs, options): Path to a CoNLL 2003 IOB NER file. If this is provided, the label column is not needed. readAs can be LINE_BY_LINE or SPARK_DATASET, with options if the latter is used
  • setLabelColumn: If DatasetPath is not provided, this Seq[Annotation] type of column should have labeled data per token
  • setMaxEpochs: Maximum number of epochs to train
  • setLr: Initial learning rate
  • setPo: Learning rate decay coefficient. Real Learning Rate: lr / (1 + po * epoch)
  • setBatchSize: Batch size for training
  • setDropout: Dropout coefficient
  • setEmbeddingsSource(path, nDims, format): sets word embeddings options. path is the word embeddings file, nDims is the number of word embeddings dimensions, and format is the format of the word embeddings file:
    1 - spark-nlp format.
    2 - text. This format is usually used by Glove
    3 - binary. This format is usually used by Word2Vec
  • setValidationDataset(path, readAs, options): Path to a CoNLL 2003 IOB NER file. If provided, then quality will be logged after every epoch. readAs can be LINE_BY_LINE or SPARK_DATASET, with options if the latter is used
  • setTestDataset(path, readAs, options): Path to a CoNLL 2003 IOB NER file. If provided, then quality will be logged after every epoch. readAs can be LINE_BY_LINE or SPARK_DATASET, with options if the latter is used
  • setVerbose: Verbosity level
  • setRandomSeed: Random seed
Example:

nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(10)\
  .setExternalDataset("file://conll2013/eng.train")\
  .setValidationDataset("file://conll2013/eng.testa")\
  .setTestDataset("file://conll2013/eng.testb")\
  .setEmbeddingsSource("file://glove.6B/glove.6B.100d.txt", 100, 2)\
  .setRandomSeed(0)\
  .setVerbose(2)

This Named Entity recognition annotator allows you to train a generic NER model based on neural networks. Its inputs are either a labeled dataset with an Annotations column or an external CoNLL 2003 IOB based dataset. The user also has to provide a word embeddings file.
The neural network architecture is a Char CNN - BLSTM that achieves state-of-the-art results on most datasets.
Type: named_entity
Requires: Document, token
Functions:

  • setExternalDataset(path, readAs, options): Path to a CoNLL 2003 IOB NER file. If this is provided, the label column is not needed. readAs can be LINE_BY_LINE or SPARK_DATASET, with options if the latter is used
  • setLabelColumn: If DatasetPath is not provided, this Seq[Annotation] type of column should have labeled data per token
  • setMaxEpochs: Maximum number of epochs to train
  • setLr: Initial learning rate
  • setPo: Learning rate decay coefficient. Real Learning Rate: lr / (1 + po * epoch)
  • setBatchSize: Batch size for training
  • setDropout: Dropout coefficient
  • setEmbeddingsSource(path, nDims, format): sets word embeddings options. path is the word embeddings file, nDims is the number of word embeddings dimensions, and format is the format of the word embeddings file:
    1 - spark-nlp format.
    2 - text. This format is usually used by Glove
    3 - binary. This format is usually used by Word2Vec
  • setValidationDataset(path, readAs, options): Path to a CoNLL 2003 IOB NER file. If provided, then quality will be logged after every epoch. readAs can be LINE_BY_LINE or SPARK_DATASET, with options if the latter is used
  • setTestDataset(path, readAs, options): Path to a CoNLL 2003 IOB NER file. If provided, then quality will be logged after every epoch. readAs can be LINE_BY_LINE or SPARK_DATASET, with options if the latter is used
  • setVerbose: Verbosity level
  • setRandomSeed: Random seed
Example:

val nerTagger = new NerDLApproach()
      .setInputCols("sentence", "token")
      .setOutputCol("ner")
      .setLabelColumn("label")
      .setMaxEpochs(120)
      .setRandomSeed(0)
      .setPo(0.03f)
      .setLr(0.2f)
      .setDropout(0.5f)
      .setBatchSize(9)
      .setEmbeddingsSource("glove.6B.100d.txt", 100, WordEmbeddingsFormat.TEXT)
      .setExternalDataset("conll2013/eng.train", ReadAs.LINE_BY_LINE, Map.empty[String, String])
      .setValidationDataset("conll2013/eng.testa", ReadAs.LINE_BY_LINE, Map.empty[String, String])
      .setTestDataset("conll2013/eng.testb", ReadAs.LINE_BY_LINE, Map.empty[String, String])
      .setVerbose(Verbose.Epochs)

SpellChecker: Norvig algorithm

This annotator retrieves tokens and makes corrections automatically if they are not found in an English dictionary
Type: Token
Inputs: Any text for corpus. A list of words for dictionary. A comma separated custom dictionary.
Requires: Tokenizer
Functions:

  • setDictionary(path, tokenPattern, readAs, options): path to file with properly spelled words, tokenPattern is the regex pattern to identify them in text, readAs LINE_BY_LINE or SPARK_DATASET, with options passed to Spark reader if the latter is set.
  • setCorpus(path, tokenPattern, readAs, options): path to training corpus folder or file. If not set, content available in Dataset received in fit() stage will be used. tokenPattern is the regex pattern to identify them in text, readAs LINE_BY_LINE or SPARK_DATASET, with options passed to Spark reader if the latter is set.
  • setSlangDictionary(path, delimiter, readAs, options): path to custom word mapping for spell checking. e.g. gr8 -> great. Uses provided delimiter, readAs LINE_BY_LINE or SPARK_DATASET with options passed to reader if the latter.
  • setCaseSensitive(boolean): defaults to false. Might affect accuracy
  • setDoubleVariants(boolean): enables an extra check for word combinations; more accuracy at the cost of performance
  • setShortCircuit(boolean): faster but less accurate mode
  • setWordSizeIgnore(int): Minimum size of word before moving on. Defaults to 3.
  • setDupsLimit(int): Maximum duplicate of characters to account for. Defaults to 2.
  • setReductLimit(int): Word reduction limit. Defaults to 3
  • setIntersections(int): Hamming intersections to attempt. Defaults to 10.
  • setVowelSwapLimit(int): Vowel swap attempts. Defaults to 6.
Example:

spell_checker = NorvigSweetingApproach() \
  .setInputCols(["token"]) \
  .setOutputCol("spell") \
  .setCorpus("./sherlockholmes.txt")

This annotator retrieves tokens and makes corrections automatically if they are not found in an English dictionary
Type: Token
Inputs: Any text for corpus. A list of words for dictionary. A comma separated custom dictionary.
Requires: Tokenizer
Functions:

  • setDictionary(path, tokenPattern, readAs, options): path to file with properly spelled words, tokenPattern is the regex pattern to identify them in text, readAs LINE_BY_LINE or SPARK_DATASET, with options passed to Spark reader if the latter is set.
  • setCorpus(path, tokenPattern, readAs, options): path to training corpus folder or file. If not set, content available in Dataset received in fit() stage will be used. tokenPattern is the regex pattern to identify them in text, readAs LINE_BY_LINE or SPARK_DATASET, with options passed to Spark reader if the latter is set.
  • setSlangDictionary(path, delimiter, readAs, options): path to custom word mapping for spell checking. e.g. gr8 -> great. Uses provided delimiter, readAs LINE_BY_LINE or SPARK_DATASET with options passed to reader if the latter.
  • setCaseSensitive(boolean): defaults to false. Might affect accuracy
  • setDoubleVariants(boolean): enables an extra check for word combinations; more accuracy at the cost of performance
  • setShortCircuit(boolean): faster but less accurate mode
  • setWordSizeIgnore(int): Minimum size of word before moving on. Defaults to 3.
  • setDupsLimit(int): Maximum duplicate of characters to account for. Defaults to 2.
  • setReductLimit(int): Word reduction limit. Defaults to 3
  • setIntersections(int): Hamming intersections to attempt. Defaults to 10.
  • setVowelSwapLimit(int): Vowel swap attempts. Defaults to 6.
Example:

val spellChecker = new NorvigSweetingApproach()
  .setInputCols(Array("normalized"))
  .setOutputCol("spell")
  .setCorpus("./sherlockholmes.txt")

SpellChecker: Symmetric delete

This spell checker is inspired by the Symmetric Delete algorithm. It retrieves tokens and utilizes distance metrics to compute possible derived words
Type: Token
Inputs: Any text for corpus. A list of words for dictionary. A comma separated custom dictionary.
Requires: Tokenizer
Functions:

  • setCorpus(path, tokenPattern, readAs, options): path to training corpus folder or file. If not set, content available in Dataset received in fit() stage will be used. tokenPattern is the regex pattern to identify them in text, readAs LINE_BY_LINE or SPARK_DATASET, with options passed to Spark reader if the latter is set.
  • setDictionary(path, tokenPattern, readAs, options): Optional dictionary of properly written words. If provided, significantly boosts spell checking performance
  • setMaxEditDistance(distance): Maximum edit distance to calculate possible derived words. Defaults to 3.
Example:

spell_checker = SymmetricDeleteApproach() \
  .setInputCols(["token"]) \
  .setOutputCol("spell") \
  .setCorpus("./sherlockholmes.txt")

This spell checker is inspired by the Symmetric Delete algorithm. It retrieves tokens and utilizes distance metrics to compute possible derived words
Type: Token
Inputs: Any text for corpus. A list of words for dictionary. A comma separated custom dictionary.
Requires: Tokenizer
Functions:

  • setCorpus(path, tokenPattern, readAs, options): path to training corpus folder or file. If not set, content available in Dataset received in fit() stage will be used. tokenPattern is the regex pattern to identify them in text, readAs LINE_BY_LINE or SPARK_DATASET, with options passed to Spark reader if the latter is set.
  • setDictionary(path, tokenPattern, readAs, options): Optional dictionary of properly written words. If provided, significantly boosts spell checking performance
  • setMaxEditDistance(distance): Maximum edit distance to calculate possible derived words. Defaults to 3.
Example:

val spellChecker = new SymmetricDeleteApproach()
  .setInputCols(Array("normalized"))
  .setOutputCol("spell")
  .setCorpus("./sherlockholmes.txt")

AssertionStatus: Assertion Status Classifier

Assigns an assertion status to a target within a sentence. For example, in the sentence "there's no intention to evacuate the area", considering "intention to evacuate the area" as a target, a possible status could be "Negated". This annotator allows you to specify a text, a target, and a set of possible labels describing the assertion status.
Type: assertion
Requires: Document
Functions:

  • setLabelCol(name): sets the name of the column that contains the label for the assertion. The set of labels is inferred from the values present in this column. You don't need to specify them explicitly.
  • setInputCol(document): sets the name of the column that contains the text to be analyzed.
  • setOutputCol(name): this is where the annotations with the label will be after the algorithm runs.
  • setBefore(n): specifies the number of context tokens before the target term(s) that will be used in the algorithm.
  • setAfter(m): specifies the number of context tokens after the first token of the target term(s) that will be used in the algorithm.
  • setEmbeddingsSource(path, size, format): specifies the path to the embeddings file (string), the size of the vectors (integer), and the format of the file (one of the constants Text, Binary, SparkNlp). An example embeddings file is provided in the appropriate python notebook.

Input: a document as output by the Document Assembler.
Example:


assertion_status = AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols("document") \
    .setOutputCol("assertion") \
    .setBefore(11) \
    .setAfter(13) \
    .setEmbeddingsSource(embeddingsFile, 200, 3)

Assigns an assertion status to a target within a sentence. For example, in the sentence "there's no intention to evacuate the area", considering "intention to evacuate the area" as a target, a possible status could be "Negated". This annotator allows you to specify a text, a target, and a set of possible labels describing the assertion status.
Type: assertion
Requires: Document
Functions:

  • setLabelCol(name): sets the name of the column that contains the label for the assertion. The set of labels is inferred from the values present in this column. You don't need to specify them explicitly.
  • setInputCol(document): sets the name of the column that contains the text to be analyzed.
  • setOutputCol(name): this is where the annotations with the label will be after the algorithm runs.
  • setBefore(n): specifies the number of context tokens before the target term(s) that will be used in the algorithm.
  • setAfter(m): specifies the number of context tokens after the first token of the target term(s) that will be used in the algorithm.
  • setEmbeddingsSource(path, size, format): specifies the path to the embeddings file (string), the size of the vectors (integer), and the format of the file (one of the constants Text, Binary, SparkNlp). An example embeddings file is provided in the appropriate python notebook.

Input: a document as output by the Document Assembler.
Example:


val assertionStatus = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols("document")
  .setOutputCol("assertion")
  .setBefore(11)
  .setAfter(13)
  .setEmbeddingsSource(embeddingsFile, 200, WordEmbeddingsFormat.Binary)

Assertion Status: Assertion Status Deep Learning Classifier

Assigns an assertion status to a target within a sentence. For example, in the sentence "there's no intention to evacuate the area", considering "intention to evacuate the area" as a target, a possible status could be "Negated". This annotator allows you to specify a text, a target, and a set of possible labels describing the assertion status.
Type: assertion
Requires: Document
Functions:

  • setLabelCol(name): sets the name of the column that contains the label for the assertion. The set of labels is inferred from the values present in this column. You don't need to specify them explicitly.
  • setInputCols(document): sets the name of the column that contains the text to be analyzed.
  • setOutputCol(name): this is where the annotations with the label will be after the algorithm runs.
  • setBatchSize(n): sets the batch size used during training.
  • setEpochs(m): sets the number of epochs used to train the model.
  • setEmbeddingsSource(path, size, format): specifies the path to the embeddings file (string), the size of the vectors (integer), and the format of the file (one of the constants Text, Binary, SparkNlp). An example embeddings file is provided in the appropriate python notebook.

Input: a document as output by the Document Assembler.
Example:


assertion_status = AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols("document") \
    .setOutputCol("assertion") \
    .setBatchSize(16) \
    .setEpochs(5) \
    .setEmbeddingsSource(embeddingsFile, 200, 3)

Assigns an assertion status to a target within a sentence. For example, in the sentence "there's no intention to evacuate the area", considering "intention to evacuate the area" as a target, a possible status could be "Negated". This annotator allows you to specify a text, a target, and a set of possible labels describing the assertion status.
Type: assertion
Requires: Document
Functions:

  • setLabelCol(name): sets the name of the column that contains the label for the assertion. The set of labels is inferred from the values present in this column. You don't need to specify them explicitly.
  • setInputCols(document): sets the name of the column that contains the text to be analyzed.
  • setOutputCol(name): this is where the annotations with the label will be after the algorithm runs.
  • setBatchSize(n): sets the batch size used during training.
  • setEpochs(m): sets the number of epochs used to train the model.
  • setEmbeddingsSource(path, size, format): specifies the path to the embeddings file (string), the size of the vectors (integer), and the format of the file (one of the constants Text, Binary, SparkNlp). An example embeddings file is provided in the appropriate python notebook.

Input: a document as output by the Document Assembler.
Example:


val assertionStatus = new AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols("document")
  .setOutputCol("assertion")
  .setBatchSize(16)
  .setEpochs(5)
  .setEmbeddingsSource(embeddingsFile, 200, WordEmbeddingsFormat.Binary)