Spark NLP - Annotators

Annotators

1. DocumentAssembler: Getting data in

In order to get through the NLP process, we need to get raw data annotated. There is a special transformer that does this for us: the DocumentAssembler. It creates the first annotation, of type Document, which may be used by annotators down the road.
Example:

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")
Settable parameters are:
  • setInputCol()
  • setOutputCol()
  • setIdCol() -> OPTIONAL: String type column with id information
  • setMetadataCol() -> OPTIONAL: Map type column with metadata information
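
To show where this fits, here is a minimal usage sketch of the DocumentAssembler as the first stage of a Spark ML Pipeline. It assumes an active SparkSession named spark; the sample text, variable names and single-stage pipeline are illustrative only.

# Minimal sketch: the DocumentAssembler produces the "document" column
# that downstream annotators consume (assumes a SparkSession named spark)
data = spark.createDataFrame([["Spark NLP annotates raw text."]], ["text"])

documentAssembler = DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

pipeline = Pipeline(stages=[documentAssembler])
annotated = pipeline.fit(data).transform(data)
annotated.select("document").show(truncate=False)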

In order to get through the NLP process, we need to get raw data annotated. There is a special transformer that does this for us: the DocumentAssembler. It creates the first annotation, of type Document, which may be used by annotators down the road.
Example:

import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
Settable parameters are:
  • setInputCol()
  • setOutputCol()
  • setIdCol() -> OPTIONAL: String type column with id information
  • setMetadataCol() -> OPTIONAL: Map type column with metadata information

2. Tokenizer: Word tokens

Identifies tokens using open tokenization standards. A few rules will help you customize it if the defaults do not fit your needs.
Type: Token
Requires: Document
Functions:

  • setTargetPattern: Basic regex rule to identify a candidate for tokenization. Defaults to \S+, which means anything that is not a space
  • setSuffixPattern: Regex to identify token characters that come at the end of the token. The regex has to end with \z and must contain groups (). Each group will become a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses
  • setPrefixPattern: Regex to identify token characters that come at the beginning of the token. The regex has to start with \A and must contain groups (). Each group will become a separate token within the prefix. Defaults to non-letter characters, e.g. quotes or parentheses
  • setExtensionPatterns: Array of regexes with groups () to match tokens within the target pattern. Every group () will become its own separate token. Order matters (later rules are applied first). The default rules should cover most cases, e.g. part-of-speech as a single token
  • addInfixPattern: Adds an extension pattern regex with groups to the top of the rules (it will be targeted first, then the others)
  • setCompositeTokens: Adds a list of compound words to keep as single tokens, e.g. New York, so it does not get split into New and York (see the sketch after the example below)

Example:

tokenizer = Tokenizer() \
  .setInputCols(["sentences"]) \
  .setOutputCol("token") \
  .addInfixPattern(r"(\p{L}+)(n't\b)")
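
As a further customization sketch, assuming setCompositeTokens accepts a list of phrases as described above (column names are illustrative):

# Hypothetical sketch: keep "New York" as a single token instead of splitting it
tokenizer_compounds = Tokenizer() \
  .setInputCols(["sentences"]) \
  .setOutputCol("token") \
  .setCompositeTokens(["New York"])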

Identifies tokens using open tokenization standards. A few rules will help you customize it if the defaults do not fit your needs.
Type: Token
Requires: Document
Functions:

  • setTargetPattern: Basic regex rule to identify a candidate for tokenization. Defaults to \S+, which means anything that is not a space
  • setSuffixPattern: Regex to identify token characters that come at the end of the token. The regex has to end with \z and must contain groups (). Each group will become a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses
  • setPrefixPattern: Regex to identify token characters that come at the beginning of the token. The regex has to start with \A and must contain groups (). Each group will become a separate token within the prefix. Defaults to non-letter characters, e.g. quotes or parentheses
  • setExtensionPatterns: Array of regexes with groups () to match tokens within the target pattern. Every group () will become its own separate token. Order matters (later rules are applied first). The default rules should cover most cases, e.g. part-of-speech as a single token
  • addInfixPattern: Adds an extension pattern regex with groups to the top of the rules (it will be targeted first, then the others)
  • setCompositeTokens: Adds a list of compound words to keep as single tokens, e.g. New York, so it does not get split into New and York

Example:

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .addInfixPattern("""(\p{L}+)(n't\b)""")

3. Normalizer: Text cleaning

Removes dirty characters from text according to a regex pattern and optionally lowercases tokens.
Type: Token
Requires: Token
Functions:

  • setPattern(pattern): Regular expression for normalization. Defaults to [^A-Za-z]
  • setLowercase(value): Whether to lowercase tokens. Defaults to true
Example:

normalizer = Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")
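
A slightly fuller sketch using the setters listed above; the pattern shown and the variable name are illustrative only.

# Hypothetical sketch: keep letters and digits, and disable lowercasing
normalizer_custom = Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized") \
  .setPattern("[^A-Za-z0-9]") \
  .setLowercase(False)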

Removes dirty characters from text according to a regex pattern and optionally lowercases tokens.
Type: Token
Requires: Token
Functions:

  • setPattern(pattern): Regular expression for normalization. Defaults to [^A-Za-z]
  • setLowercase(value): Whether to lowercase tokens. Defaults to true
Example:

val normalizer = new Normalizer()
  .setInputCols(Array("token"))
  .setOutputCol("normalized")

4. Stemmer: Hard stems

Returns hard stems of words, with the objective of retrieving the meaningful part of the word.
Type: Token
Requires: Token
Example:

stemmer = Stemmer() \
  .setInputCols(["token"]) \
  .setOutputCol("stem")

Returns hard stems of words, with the objective of retrieving the meaningful part of the word.
Type: Token
Requires: Token
Example:

val stemmer = new Stemmer()
  .setInputCols(Array("token"))
  .setOutputCol("stem")

5. Lemmatizer: Lemmas

Retrieves lemmas out of words with the objective of returning a base dictionary word
Type: Token
Requires: Token
Input: abduct -> abducted abducting abduct abducts
Functions:

  • setDictionary(path): Path to a file containing the lemma dictionary, either one key mapped to multiple values or key,value pairs. Default: not provided
  • setLemmaFormat(format): TXT for plain text files or TXTDS for text files read as a dataset (allows HDFS). Default: looks up the path in the configuration
  • setLemmaKeySep(sep): Separator between a key and its values. Default: "->", or looks it up in the configuration
  • setLemmaValSep(sep): Separator among values. Default: "\t", or looks it up in the configuration
Example:

lemmatizer = Lemmatizer() \
  .setDocumentCol("document") \
  .setInputCols(["token"]) \
  .setOutputCol("lemma") \
  .setDictionary("./lemmas001.txt")
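
To illustrate the dictionary format implied by the Input line and the separators above, the following sketch writes a one-entry lemma dictionary; the file name is just an example.

# Hypothetical sketch of the lemma dictionary format: key and values separated
# by "->", values separated by tabs ("\t"), one key per line
with open("lemmas001.txt", "w") as f:
    f.write("abduct -> abducted\tabducting\tabduct\tabducts\n")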

Retrieves lemmas out of words with the objective of returning a base dictionary word
Type: Token
Requires: Token
Input: abduct -> abducted abducting abduct abducts
Functions:

  • setDictionary(path): Path to a file containing the lemma dictionary, either one key mapped to multiple values or key,value pairs. Default: not provided
  • setLemmaFormat(format): TXT for plain text files or TXTDS for text files read as a dataset (allows HDFS). Default: looks up the path in the configuration
  • setLemmaKeySep(sep): Separator between a key and its values. Default: "->", or looks it up in the configuration
  • setLemmaValSep(sep): Separator among values. Default: "\t", or looks it up in the configuration
Example:

val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setLemmaDict("./lemmas001.txt")

6. RegexMatcher: Rule matching

Uses a reference file to match a set of regular expressions and assign each of them to a provided key. The file must be comma separated.
Type: Regex
Requires: Document
Input: "the\s\w+", "followed by 'the'"
Functions:

  • setStrategy(strategy): Can be any of MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE
  • setRulesPath(path): Path to a file containing a set of regex,key pairs. Default: looks up the path in the configuration
  • setRulesFormat(format): TXT for plain text files or TXTDS for text files read as a dataset (allows HDFS). Default: TXT, or looks up the path in the configuration
  • setRulesSeparator(sep): Separator for the rules file. Default: ",", or looks it up in the configuration
Example:

regex_matcher = RegexMatcher() \
  .setInputCols(["document"]) \
  .setStrategy("MATCH_ALL") \
  .setOutputCol("regex")
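
To illustrate the rules file format (comma-separated regex,key pairs, as described above), a hypothetical sketch; the file name is illustrative.

# Hypothetical sketch of a rules file: each line holds a regex and its key,
# separated by the rules separator (default ",")
with open("regex_rules.txt", "w") as f:
    f.write(r"the\s\w+,followed by 'the'" + "\n")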

Uses a reference file to match a set of regular expressions and assign each of them to a provided key. The file must be comma separated.
Type: Regex
Requires: Document
Input: "the\s\w+", "followed by 'the'"
Functions:

  • setStrategy(strategy): Can be any of MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE
  • setRulesPath(path): Path to a file containing a set of regex,key pairs. Default: looks up the path in the configuration
  • setRulesFormat(format): TXT for plain text files or TXTDS for text files read as a dataset (allows HDFS). Default: TXT, or looks up the path in the configuration
  • setRulesSeparator(sep): Separator for the rules file. Default: ",", or looks it up in the configuration
Example:

val regexMatcher = new RegexMatcher()
  .setStrategy("MATCH_ALL")
  .setInputCols(Array("document"))
  .setOutputCol("regex")

7. EntityExtractor: Phrase matching

Annotator to match entire phrases provided in a file against a Document
Type: Entity
Requires: Document
Input: hello world, I am looking for you
Functions:

  • setRequireSentences(value): Whether to require sentence boundary detection (SBD) and use sentence boundaries for better precision. Defaults to false
  • setEntities(path, format, options): Provides a file with phrases to match. Default: Looks up path in configuration.
    path: a path to a file that contains the entities in the specified format.
    format: the format of the file, can be one of {ReadAs.LINE_BY_LINE, ReadAs.SPARK_DATASET}. Defaults to LINE_BY_LINE.
    options: a map of additional parameters. Defaults to {"format": "text"}.
Example:

entity_extractor = EntityExtractor() \
 .setInputCols(["inputCol"])\
 .setOutputCol("entity")\
 .setEntities("/path/to/file/myentities.txt")
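
Given the default LINE_BY_LINE format described above, the entities file appears to be a plain text file with one phrase per line; a hypothetical sketch (phrases and file name are illustrative only):

# Hypothetical sketch of an entities file: one phrase to match per line
with open("myentities.txt", "w") as f:
    f.write("hello world\nlooking for you\n")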

Annotator to match entire phrases provided in a file against a Document
Type: Entity
Requires: Document
Input: hello world, I am looking for you
Functions:

  • setRequireSentences(value): Whether to require sentence boundary detection (SBD) and use sentence boundaries for better precision. Defaults to false
  • setEntities(path, format, options): Provides a file with phrases to match. Default: Looks up path in configuration.
    path: a path to a file that contains the entities in the specified format.
    format: the format of the file, can be one of {ReadAs.LINE_BY_LINE, ReadAs.SPARK_DATASET}. Defaults to LINE_BY_LINE.
    options: a map of additional parameters. Defaults to Map("format" -> "text").
Example:

val entityExtractor = new EntityExtractor()
 .setInputCols("inputCol")
 .setOutputCol("entity")
 .setEntities("/path/to/file/myentities.txt")

8. DateMatcher: Date-time parsing

Reads from different forms of date and time expressions and converts them to a provided date format
Type: Date
Requires: Document
Functions:

  • setDateFormat(format): SimpleDateFormat standard date formatting. Defaults to yyyy/MM/dd
Example:

date_matcher = DateMatcher() \
  .setOutputCol("date") \
  .setDateFormat("yyyyMM")

Reads from different forms of date and time expressions and converts them to a provided date format
Type: Date
Requires: Document
Functions:

  • setDateFormat(format): SimpleDateFormat standard date formatting. Defaults to yyyy/MM/dd
Example:

val dateMatcher = new DateMatcher()
  .setFormat("yyyyMM")
  .setOutputCol("date")

9. SentenceDetector: Sentence Boundary Detector

Finds sentence bounds in raw text.
Type: Document
Requires: Document
Functions:

  • setCustomBounds(characters): Custom sentence separator text
Example:

sentence_detector = SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") \
  .setUseAbbreviations(True)
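
A sketch of the custom bounds option, assuming setCustomBounds takes a list of separator strings as the function list suggests; the separator chosen is illustrative.

# Hypothetical sketch: also break sentences on semicolons
sentence_detector_custom = SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") \
  .setCustomBounds([";"])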

Finds sentence bounds in raw text.
Type: Document
Requires: Document
Functions:

  • setCustomBounds(characters): Custom sentence separator text
Example:

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

10. POSTagger: Part of speech tagger

Sets a POS tag to each word within a sentence
Type: POS
Input: A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
Requires: Document, Token
Functions:

  • setCorpusPath: Path to a pipe-separated file or directory with correctly tagged sentences, one per line, in word|tag format (see the sketch after the example below)
  • setNIterations: Number of iterations for training. More iterations may improve accuracy but take longer. Defaults to 5.
  • setCorpusFormat: TXT or TXTDS; the latter reads corpusPath as a Spark dataset, which should work better for bigger corpora.
  • setCorpusLimit: When the path is a folder, sets a limit on the number of files being read.
Example:

pos_tagger = PerceptronApproach() \
  .setInputCols(["token", "sentence"]) \
  .setOutputCol("pos") \
  .setCorpusPath("./src/main/resources/anc-pos-corpus") \
  .setIterations(2) \
  .fit(data)
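
The corpus expected by setCorpusPath uses the pipe-separated word|tag layout shown in the Input line above; a hypothetical sketch that writes one such training line (the file name is just an example):

# Hypothetical sketch of the POS training corpus: one tagged sentence per line,
# tokens in word|tag form
with open("pos_corpus_example.txt", "w") as f:
    f.write("A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN\n")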

Sets a POS tag to each word within a sentence
Type: POS
Input: A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
Requires: Document, Token
Functions:

  • setCorpusPath: Path to a pipe-separated file or directory with correctly tagged sentences, one per line, in word|tag format
  • setIterations: Number of iterations for training. More iterations may improve accuracy but take longer. Defaults to 5.
  • setCorpusFormat: TXT or TXTDS; the latter reads corpusPath as a Spark dataset, which should work better for bigger corpora.
  • setCorpusLimit: When the path is a folder, sets a limit on the number of files being read.
Example:

val posTagger = new PerceptronApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")
  .fit(data)

11. ViveknSentimentDetector: Sentiment analysis

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setPositiveSource(path)
  • setNegativeSource(path)
  • setPruneCorpus(true): when training on small data you may want to disable this so that infrequent words are not cut off

Input: File or folder of text files of positive and negative data
Example:

sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["token", "sentence"]) \
    .setOutputCol("vivekn") \
    .setPositiveSource("./positive/1.txt") \
    .setNegativeSource("./negative/1.txt")

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setPositiveSourcePath(path)
  • setNegativeSourcePath(path)
  • setCorpusPrune(true): when training on small data you may want to disable this so that infrequent words are not cut off

Input: File or folder of text files of positive and negative data
Example:

new ViveknSentimentApproach()
      .setInputCols(Array("token", "sentence"))
      .setOutputCol("vivekn")
      .setPositiveSourcePath("./positive/1.txt")
      .setNegativeSourcePath("./negative/1.txt")
      .setCorpusPrune(false)

12. SentimentDetector: Sentiment analysis

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setDictPath(path)
  • setDictFormat(format)
  • setDictSeparator(separator)

Input:
  • superb,positive
  • bad,negative
  • lack of,revert
  • very,increment
  • barely,decrement

Example:

sentiment_detector = SentimentDetector() \
  .setInputCols(["lemma", "sentence"]) \
  .setOutputCol("sentiment")
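
A sketch that writes the dictionary shown in the Input list and points the annotator at it; the file and variable names are illustrative, and setDictPath is taken from the function list above.

# Hypothetical sketch: build the sentiment dictionary file and load it
with open("sentiment_dict.txt", "w") as f:
    f.write("superb,positive\nbad,negative\nlack of,revert\nvery,increment\nbarely,decrement\n")

sentiment_detector_with_dict = SentimentDetector() \
  .setInputCols(["lemma", "sentence"]) \
  .setOutputCol("sentiment") \
  .setDictPath("sentiment_dict.txt")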

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setDictPath(path)
  • setDictFormat(format)
  • setDictSeparator(separator)

Input:
  • superb,positive
  • bad,negative
  • lack of,revert
  • very,increment
  • barely,decrement

Example:

val sentimentDetector = new SentimentDetector()
  .setInputCols(Array("token", "sentence"))
  .setOutputCol("sentiment")

13. Named Entity Recognition CRF annotator

This named entity recognition annotator allows a generic model to be trained by utilizing a CRF machine learning algorithm. Its input is either a labeled dataset with an Annotations column or an external CoNLL 2003 IOB-based dataset; optionally, the user can provide both an entities dictionary and a word embeddings file for better accuracy.
Type: named_entity
Requires: Document, Token, POS
Functions:

  • setDatasetPath: Path to a CoNLL 2003 IOB NER and POS annotated file (https://www.clips.uantwerpen.be/conll2003/ner). If this is provided, the label column is not needed.
  • setLabelColumn: If DatasetPath is not provided, this Seq[Annotation] type of column should have labeled data per token
  • setMinEpochs: Minimum number of epochs to train
  • setMaxEpochs: Maximum number of epochs to train
  • setL2: L2 regularization coefficient for CRF
  • setC0: c0 defines the decay speed for the gradient
  • setLossEps: If the relative improvement per epoch is less than this value, training is stopped
  • setMinW: Features with weights lower than this value will be filtered out
  • setEmbeddingsSource(path, nDims, format): sets word embeddings options (https://en.wikipedia.org/wiki/Word_embedding). path: the word embeddings file. nDims: number of word embedding dimensions. format: format of the word embeddings file: 1 - spark-nlp format, 2 - text (the format usually used by GloVe, https://nlp.stanford.edu/projects/glove/), 3 - binary (the format usually used by Word2Vec, https://code.google.com/archive/p/word2vec/)
  • setDicts: Array of additional dictionary paths to use as features. A dictionary .txt file is a line-separated file with entries such as Volvo:ORG
  • setEntities: Array of entities to recognize
  • setVerbose: Verbosity level
  • setRandomSeed: Random seed
Example:

nerTagger = NerCrfApproach()\
  .setInputCols(["sentence", "token", "pos"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMinEpochs(1)\
  .setMaxEpochs(20)\
  .setLossEps(1e-3)\
  .setDicts(["ner-corpus/dict.txt"])\
  .setDatasetPath("eng.train")\
  .setEmbeddingsSource("glove.6B.100d.txt", 100, 2)\
  .setL2(1)\
  .setC0(1250000)\
  .setRandomSeed(0)\
  .setVerbose(2)

This named entity recognition annotator allows a generic model to be trained by utilizing a CRF machine learning algorithm. Its input is either a labeled dataset with an Annotations column or an external CoNLL 2003 IOB-based dataset; optionally, the user can provide both an entities dictionary and a word embeddings file for better accuracy.
Type: named_entity
Requires: Document, Token, POS
Functions:

  • setDatasetPath: Path to a CoNLL 2003 IOB NER and POS annotated file (https://www.clips.uantwerpen.be/conll2003/ner). If this is provided, the label column is not needed.
  • setLabelColumn: If DatasetPath is not provided, this Seq[Annotation] type of column should have labeled data per token
  • setMinEpochs: Minimum number of epochs to train
  • setMaxEpochs: Maximum number of epochs to train
  • setL2: L2 regularization coefficient for CRF
  • setC0: c0 defines the decay speed for the gradient
  • setLossEps: If the relative improvement per epoch is less than this value, training is stopped
  • setMinW: Features with weights lower than this value will be filtered out
  • setEmbeddingsSource(path, nDims, format): sets word embeddings options (https://en.wikipedia.org/wiki/Word_embedding). path: the word embeddings file. nDims: number of word embedding dimensions. format: format of the word embeddings file, from com.johnsnowlabs.nlp.embeddings.WordEmbeddingsFormat: SparkNlp - Spark NLP's own embeddings format, Text - the format usually used by GloVe (https://nlp.stanford.edu/projects/glove/), Binary - the format usually used by Word2Vec (https://code.google.com/archive/p/word2vec/)
  • setDicts: Array of additional dictionary paths to use as features. A dictionary .txt file is a line-separated file with entries such as Volvo:ORG
  • setEntities: Array of entities to recognize
  • setVerbose: Verbosity level
  • setRandomSeed: Random seed
Example:

new NerCrfApproach()
  .setInputCols("sentence", "token", "pos")
  .setLabelColumn("label")
  .setMinEpochs(1)
  .setMaxEpochs(3)
  .setDatasetPath("src/test/resources/ner-corpus/test_ner_dataset.txt")
  .setEmbeddingsSource("src/test/resources/ner-corpus/test_embeddings.txt", 3, WordEmbeddingsFormat.Text)
  .setC0(34)
  .setL2(3.0)
  .setOutputCol("ner")
  .fit(df)

14. SpellChecker: Token spell corrector

This annotator retrieves tokens and makes corrections automatically if they are not found in an English dictionary
Type: Token
Inputs: Any text for corpus. A list of words for dictionary. A comma separated custom dictionary.
Requires: Tokenizer
Functions:

  • setDictPath: path to a dictionary of English words
  • setCorpusPath: path to the training corpus. It can be any good-quality text.
  • setCorpusFormat(format): allowed values are “txt” or “txtds”; the latter reads the text with Spark DataFrames
  • setSlangPath: path to a custom, comma-separated dictionary of slang words
  • setCaseSensitive: defaults to false. Might affect accuracy
  • setDoubleVariants: enables an extra check for word combinations; more accuracy at the expense of performance
  • setShortCircuit: faster but less accurate mode
Example:

spell_checker = NorvigSweetingApproach() \
  .setInputCols(["token"]) \
  .setOutputCol("spell") \
  .setCorpusPath("./sherlockholmes.txt")

This annotator retrieves tokens and makes corrections automatically if they are not found in an English dictionary
Type: Token
Inputs: Any text for corpus. A list of words for dictionary. A comma separated custom dictionary.
Requires: Tokenizer
Functions:

  • setDictPath: path to a dictionary of English words
  • setCorpusPath: path to the training corpus. It can be any good-quality text.
  • setCorpusFormat(format): allowed values are “txt” or “txtds”; the latter reads the text with Spark DataFrames
  • setSlangPath: path to a custom, comma-separated dictionary of slang words
  • setCaseSensitive: defaults to false. Might affect accuracy
  • setDoubleVariants: enables an extra check for word combinations; more accuracy at the expense of performance
  • setShortCircuit: faster but less accurate mode
Example:

val spellChecker = new NorvigSweetingApproach()
  .setInputCols(Array("normalized"))
  .setOutputCol("spell")
  .setCorpusPath("./sherlockholmes.txt")

15. AssertionStatus: Assertion Status Classifier

Assigns an assertion status to a target within a sentence. For example, in the sentence "there's no intention to evacuate the area", considering "intention to evacuate the area" as a target, a possible status could be "Negated". This annotator allows you to specify a text, a target, and a set of possible labels describing the assertion status.
Type: assertion
Requires: Document
Functions:

  • setLabelCol(name): sets the name of the column that contains the label for the assertion. The set of labels is inferred from the values present in this column. You don't need to specify them explicitly.
  • setInputCol(document): sets the name of the column that contains the text to be analyzed.
  • setOutputCol(name): this is where the annotations with the label will be after the algorithm runs.
  • setBefore(n): specifies the number of context tokens before the target term(s) that will be used in the algorithm.
  • setAfter(m): specifies the number of context tokens after the first token of the target term(s) that will be used in the algorithm.
  • setEmbeddingsSource(path, size, format): specifies the path to the embeddings file (string), the size of the vectors (integer), and the format of the file (one of the constants Text, Binary, SparkNlp). An example embeddings file is provided in the appropriate python notebook.

Input: a document as output by the Document Assembler.
Example:


assertion_status = AssertionStatusApproach() \
  .setLabelCol("label") \
  .setInputCols("document") \
  .setOutputCol("assertion") \
  .setBefore(11) \
  .setAfter(13) \
  .setEmbeddingsSource(embeddingsFile, 200, 3)

Assigns an assertion status to a target within a sentence. For example, in the sentence "there's no intention to evacuate the area", considering "intention to evacuate the area" as a target, a possible status could be "Negated". This annotator allows you to specify a text, a target, and a set of possible labels describing the assertion status.
Type: assertion
Requires: Document
Functions:

  • setLabelCol(name): sets the name of the column that contains the label for the assertion. The set of labels is inferred from the values present in this column. You don't need to specify them explicitly.
  • setInputCol(document): sets the name of the column that contains the text to be analyzed.
  • setOutputCol(name): this is where the annotations with the label will be after the algorithm runs.
  • setBefore(n): specifies the number of context tokens before the target term(s) that will be used in the algorithm.
  • setAfter(m): specifies the number of context tokens after the first token of the target term(s) that will be used in the algorithm.
  • setEmbeddingsSource(path, size, format): specifies the path to the embeddings file (string), the size of the vectors (integer), and the format of the file (one of the constants Text, Binary, SparkNlp). An example embeddings file is provided in the appropriate python notebook.

Input: a document as output by the Document Assembler.
Example:


val assertionStatus = new AssertionStatusApproach()
  .setLabelCol("label")
  .setInputCols("document")
  .setOutputCol("assertion")
  .setBefore(11)
  .setAfter(13)
  .setEmbeddingsSource(embeddingsFile, 200, WordEmbeddingsFormat.Binary)

16. Finisher: Getting data out

Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where they are easy to use. The Finisher outputs annotation values as strings.
Example:

finisher = Finisher() \
  .setInputCols(["sentiment"]) \
  .setIncludeKeys(True)
Settable parameters are:
  • setInputCols()
  • setOutputCols()
  • setCleanAnnotations(True) -> Whether to remove intermediate annotations
  • setValueSplitSymbol("#") -> character used to split values within an annotation
  • setAnnotationSplitSymbol("@") -> character used to split values between annotations
  • setIncludeKeys(False) -> Whether to include metadata keys. Sometimes useful in some annotations
  • setOutputAsArray(False) -> Whether to output as an Array. Useful as input for other Spark transformers (see the sketch below)
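
Since setOutputAsArray makes the output usable by other Spark transformers, here is a hypothetical sketch that feeds finished tokens into Spark ML's CountVectorizer; the column names are illustrative only.

# Hypothetical sketch: output finished tokens as an array column for Spark ML
from pyspark.ml.feature import CountVectorizer

finisher = Finisher() \
  .setInputCols(["token"]) \
  .setOutputCols(["token_features"]) \
  .setOutputAsArray(True)

count_vectorizer = CountVectorizer(inputCol="token_features", outputCol="features")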

Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where they are easy to use. The Finisher outputs annotation values as strings.
Example:

val finisher = new Finisher()
  .setInputCols("token")
Settable parameters are:
  • setInputCols()
  • setOutputCols()
  • setCleanAnnotations(true) -> Whether to remove intermediate annotations
  • setValueSplitSymbol("#") -> character used to split values within an annotation
  • setAnnotationSplitSymbol("@") -> character used to split values between annotations
  • setIncludeKeys(false) -> Whether to include metadata keys. Sometimes useful in some annotations
  • setOutputAsArray(false) -> Whether to output as Array. Useful as input for other Spark transformers.

17. TokenAssembler: Getting data reshaped

This transformer reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.
Example:

token_assembler = TokenAssembler() \
  .setInputCols(["normalized"]) \
  .setOutputCol("assembled")
Settable parameters are:
  • setInputCols()
  • setOutputCol()

This transformer reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc, in order to use this document annotation in further annotators.
Example:

val tokenAssembler = new TokenAssembler()
  .setInputCols("normalized")
  .setOutputCol("assembled")
Settable parameters are:
  • setInputCols()
  • setOutputCol()