Spark NLP - Annotators

Annotators

1. DocumentAssembler: Getting data in

In order to get through the NLP process, we need to get raw data annotated. There is a special transformer that does this for us: the DocumentAssembler. It creates the first annotation of type Document, which may be used by annotators down the road.
Example:

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
Settable parameters are:
  • setInputCol()
  • setOutputCol()
  • setIdCol() -> OPTIONAL: String type column with id information
  • setMetadataCol() -> OPTIONAL: Map type column with metadata information

In order to get through the NLP process, we need to get raw data annotated. There is a special transformer that does this for us: the DocumentAssembler. It creates the first annotation of type Document, which may be used by annotators down the road.
Example:

import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
Settable parameters are:
  • setInputCol()
  • setOutputCol()
  • setIdCol() -> OPTIONAL: String type column with id information
  • setMetadataCol() -> OPTIONAL: Map type column with metadata information
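
A minimal usage sketch, assuming an active SparkSession named spark (the sample text is made up): since the DocumentAssembler is a transformer, it can be applied to a DataFrame directly.

data = spark.createDataFrame([["Spark NLP annotators are easy to chain."]]).toDF("text")
# no fit() needed: the assembler simply wraps the raw text into a Document annotation
annotated = documentAssembler.transform(data)
annotated.select("document").show(truncate=False)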

2. RegexTokenizer: Word tokens

Creates tokens from the text using a regular expression (see setPattern below for the default).
Type: Token
Requires: Document
Functions:

  • setPattern(pattern): Regular expression for tokenization, default \S+

Example:

tokenizer = RegexTokenizer() \
  .setInputCols(["sentences"]) \
  .setOutputCol("token")

Creates tokens from the text using a regular expression (see setPattern below for the default).
Type: Token
Requires: Document
Functions:

  • setPattern(pattern): Regular expression for tokenization, default \S+

Example:

val regexTokenizer = new RegexTokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

3. Normalizer: Text cleaning

Removes dirty characters from text, following a regular expression pattern.
Type: Token
Requires: Token
Functions:

  • setPattern(pattern): Regular expression for normalization, defaults to [^A-Za-z]
Example:

normalizer = Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

Removes dirty characters from text, following a regular expression pattern.
Type: Token
Requires: Token
Functions:

  • setPattern(pattern): Regular expression for normalization, defaults to [^A-Za-z]
Example:

val normalizer = new Normalizer()
  .setInputCols(Array("token"))
  .setOutputCol("normalized")

4. Stemmer: Hard stems

Returns hard stems of words, with the objective of retrieving the meaningful part of the word.
Type: Token
Requires: Token
Example:

stemmer = Stemmer() \
  .setInputCols(["token"]) \
  .setOutputCol("stem")

Returns hard stems of words, with the objective of retrieving the meaningful part of the word.
Type: Token
Requires: Token
Example:

val stemmer = new Stemmer()
  .setInputCols(Array("token"))
  .setOutputCol("stem")

5. Lemmatizer: Lemmas

Retrieves the lemmas of words, with the objective of returning a base dictionary word.
Type: Token
Requires: Token
Input: abduct -> abducted abducting abduct abducts
Functions:

  • setDictionary(path): Path to a file containing either a key -> multiple-values lemma dictionary or a key,value lemma dictionary. Default: Not provided
  • setLemmaFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs) Default: Looks up path in configuration
  • setLemmaKeySep(format): Separator for keys and multiple values Default: "->" or Looks up path in configuration
  • setLemmaValSep(format): Separator among values Default: "\t" or Looks up path in configuration
Example:

lemmatizer = Lemmatizer() \
  .setDocumentCol("document") \
  .setInputCols(["token"]) \
  .setOutputCol("lemma") \
  .setDictionary("./lemmas001.txt")

Retrieves the lemmas of words, with the objective of returning a base dictionary word.
Type: Token
Requires: Token
Input: abduct -> abducted abducting abduct abducts
Functions:

  • setDictionary(path): Path to a file containing either a key -> multiple-values lemma dictionary or a key,value lemma dictionary. Default: Not provided
  • setLemmaFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs) Default: Looks up path in configuration
  • setLemmaKeySep(format): Separator for keys and multiple values Default: "->" or Looks up path in configuration
  • setLemmaValSep(format): Separator among values Default: "\t" or Looks up path in configuration
Example:

val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setLemmaDict("./lemmas001.txt")

6. RegexMatcher: Rule matching

Uses a reference file to match a set of regular expressions and store each match under a provided key. The file must be comma separated (see the sample rules file below).
Type: Regex
Requires: Document
Input: "the\s\w+", "followed by 'the'"
Functions:

  • setStrategy(strategy): Can be any of MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE
  • setRulesPath(path): Path to file containing a set of regex,key pair. Default: Looks up path in configuration
  • setRulesFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs) Default: TXT or looks up path in configuration
  • setRulesSeparator(sep): Separator for rules file Default: "," or looks up path in configuration
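
For illustration, a rules file using the default "," separator could look like the following (a made-up sketch based on the Input line above), one regex,key pair per line:

the\s\w+,followed by 'the'
ceremonies,ceremony

Each regular expression is matched against the Document, and every match is annotated under its key.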
Example:

regex_matcher = RegexMatcher() \
  .setInputCols(["document"]) \
  .setStrategy("MATCH_ALL") \
  .setOutputCol("regex")

Uses a reference file to match a set of regular expressions and store each match under a provided key. The file must be comma separated.
Type: Regex
Requires: Document
Input: "the\s\w+", "followed by 'the'"
Functions:

  • setStrategy(strategy): Can be any of MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE
  • setRulesPath(path): Path to file containing a set of regex,key pair. Default: Looks up path in configuration
  • setRulesFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs) Default: TXT or looks up path in configuration
  • setRulesSeparator(sep): Separator for rules file Default: "," or looks up path in configuration
Example:

val regexMatcher = new RegexMatcher()
  .setStrategy("MATCH_ALL")
  .setInputCols(Array("document"))
  .setOutputCol("regex")

7. EntityExtractor: Phrase matching

Annotator to match entire phrases provided in a file against a Document
Type: Entity
Requires: Document
Input: hello world, I am looking for you
Functions:

  • setMaxLen(number): Maximum number of words to look at from the relative position
  • setRequireSentences(false): Whether to require sentence boundary detection (SBD) and use sentence boundaries for better precision
  • setEntitiesPath(path): Provides a file with phrases to match. Default: Looks up path in configuration
  • setEntitiesFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs) Default: TXT or looks up path in configuration
Example:

entity_extractor = EntityExtractor() \
  .setMaxLen(4) \
  .setOutputCol("entity")

Annotator to match entire phrases provided in a file against a Document
Type: Entity
Requires: Document
Input: hello world, I am looking for you
Functions:

  • setMaxLen(number): Maximum number of words to look at from the relative position
  • setRequireSentences(false): Whether to require sentence boundary detection (SBD) and use sentence boundaries for better precision
  • setEntitiesPath(path): Provides a file with phrases to match. Default: Looks up path in configuration
  • setEntitiesFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs) Default: TXT or looks up path in configuration
Example:

val entityExtractor = new EntityExtractor()
  .setMaxLen(4)
  .setOutputCol("entity")

8. DateMatcher: Date-time parsing

Reads from different forms of date and time expressions and converts them to a provided date format
Type: Date
Requires: Document
Functions:

  • setDateFormat(format): SimpleDateFormat standard date formatting. Defaults to yyyy/MM/dd
Example:

date_matcher = DateMatcher() \
  .setOutputCol("date") \
  .setDateFormat("yyyyMM")

Reads from different forms of date and time expressions and converts them to a provided date format
Type: Date
Requires: Document
Functions:

  • setDateFormat(format): SimpleDateFormat standard date formatting. Defaults to yyyy/MM/dd
Example:

val dateMatcher = new DateMatcher()
  .setFormat("yyyyMM")
  .setOutputCol("date")

9. SentenceDetector: Sentence Boundary Detector

Finds sentence bounds in raw text.
Type: Document
Requires: Document
Functions:

  • setCustomBounds(characters): Custom sentence separator text
Example:

sentence_detector = SentenceDetectorModel() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") \
  .setUseAbbreviations(True)

Finds sentence bounds in raw text.
Type: Document
Requires: Document
Functions:

  • setCustomBounds(characters): Custom sentence separator text
Example:

val sentenceDetector = new SentenceDetectorModel()
  .setInputCols("document")
  .setOutputCol("sentence")

10. POSTagger: Part of speech tagger

Sets a POS tag to each word within a sentence
Type: POS
Input: A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
Requires: Document, Token
Functions:

  • setCorpusPath: Path to a pipe-separated file or directory of files with the correct word|tag pairs, one sentence per line
  • setNIterations: Number of iterations for training. May improve accuracy but takes longer. Default 5.
Example:

pos_tagger = PerceptronApproach() \
  .setInputCols(["token", "sentence"]) \
  .setOutputCol("pos") \
  .setCorpusPath("./src/main/resources/anc-pos-corpus") \
  .setIterations(2) \
  .fit(data)

Sets a POS tag to each word within a sentence
Type: POS
Input: A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
Requires: Document, Token
Functions:

  • setCorpusPath: Path to a pipe-separated file or directory of files with the correct word|tag pairs, one sentence per line
  • setIterations: Number of iterations for training. May improve accuracy but takes longer. Default 5.
Example:

val posTagger = new PerceptronApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")
  .fit(data)

11. SentimentDetector: Sentiment analysis

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setDictPath(path)
  • setDictFormat(format)
  • setDictSeparator(separator)

Input:
  • superb,positive
  • bad,negative
  • lack of,revert
  • very,increment
  • barely,decrement

Example:

sentiment_detector = SentimentDetectorModel() \
  .setInputCols(["lemma", "sentence"]) \
  .setOutputCol("sentiment")

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setDictPath(path)
  • setDictFormat(format)
  • setDictSeparator(separator)

Input:
  • superb,positive
  • bad,negative
  • lack of,revert
  • very,increment
  • barely,decrement

Example:

val sentimentDetector = new SentimentDetectorModel
  .setInputCols(Array("token", "sentence"))
  .setOutputCol("sentiment")

12. NERTagger: Named Entity Recognition annotator

The NER annotator extracts entities such as proper nouns from the document.
Type: named_entity
Requires: Document
Functions:

  • setCorpusPath: Path to text files containing training entities
Example:

ner_tagger = NERRegexApproach() \
  .setInputCols(["sentence"]) \
  .setOutputCol("NER")

The NER annotator extracts entities such as proper nouns from the document.
Type: named_entity
Requires: Document
Functions:

  • setCorpusPath: Path to text files containing training entities
Example:

val nerTagger = new NERRegexApproach()
  .setInputCols(Array("sentence"))
  .setOutputCol("ner")
  .setCorpusPath("/ner-corpus/dict.txt")

13. SpellChecker: Token spell corrector

This annotator retrieves tokens and corrects them automatically if they are not found in an English dictionary
Type: Token
Inputs: Any text for corpus. A list of words for dictionary. A comma separated custom dictionary.
Requires: RegexTokenizer
Functions:

  • setDictPath: path to an English words dictionary
  • setCorpusPath: path to a training corpus. Can be any good text.
  • setCorpusFormat(format): Allowed "txt" or "txtds". The latter uses Spark dataframes from text
  • setSlangPath: path to custom dictionaries, comma separated
  • setCaseSensitive: defaults to false. Might affect accuracy
  • setDoubleVariants: enables an extra check for word combinations; more accuracy at the cost of performance
  • setShortCircuit: faster but less accurate mode
Example:

spell_checker = NorvigSweetingApproach() \
  .setInputCols(["token"]) \
  .setOutputCol("spell") \
  .setCorpusPath("./sherlockholmes.txt")

This annotator retrieves tokens and corrects them automatically if they are not found in an English dictionary
Type: Token
Inputs: Any text for corpus. A list of words for dictionary. A comma separated custom dictionary.
Requires: RegexTokenizer
Functions:

  • setDictPath: path to an English words dictionary
  • setCorpusPath: path to a training corpus. Can be any good text.
  • setCorpusFormat(format): Allowed "txt" or "txtds". The latter uses Spark dataframes from text
  • setSlangPath: path to custom dictionaries, comma separated
  • setCaseSensitive: defaults to false. Might affect accuracy
  • setDoubleVariants: enables an extra check for word combinations; more accuracy at the cost of performance
  • setShortCircuit: faster but less accurate mode
Example:

val spellChecker = new NorvigSweetingApproach()
  .setInputCols(Array("normalized"))
  .setOutputCol("spell")
  .setCorpusPath("./sherlockholmes.txt")

14. ViveknSentimentDetector: Sentiment analysis

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setPositiveSource(path)
  • setNegativeSource(path)
  • setPruneCorpus(true): when training on small data you may want to disable this so that infrequent words are not cut off

Input: File or folder of text files of positive and negative data
Example:

sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["token", "sentence"]) \
    .setOutputCol("vivekn") \
    .setPositiveSource("./positive/1.txt") \
    .setNegativeSource("./negative/1.txt")

Scores a sentence for a sentiment
Type: sentiment
Requires: Document, Token
Functions:

  • setPositiveSourcePath(path)
  • setNegativeSourcePath(path)
  • setCorpusPrune(true): when training on small data you may want to disable this so that infrequent words are not cut off

Input: File or folder of text files of positive and negative data
Example:

new ViveknSentimentApproach()
      .setInputCols(Array("token", "sentence"))
      .setOutputCol("vivekn")
      .setPositiveSourcePath("./positive/1.txt")
      .setNegativeSourcePath("./negative/1.txt")
      .setCorpusPrune(false)

15. Finisher: Getting data out

Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where they are easy to work with. The Finisher outputs annotation values as strings (see the end-to-end pipeline sketch after the parameter list below).
Example:

finisher = Finisher() \
  .setInputCols(["sentiment"]) \
  .setIncludeKeys(True)
Settable parameters are:
  • setInputCols()
  • setOutputCols()
  • setCleanAnnotations(True) -> Whether to remove intermediate annotations
  • setValueSplitSymbol("#") -> character used to split values within an annotation
  • setAnnotationSplitSymbol("@") -> character used to split values between annotations
  • setIncludeKeys(False) -> Whether to include metadata keys. Sometimes useful in some annotations
  • setOutputAsArray(False) -> Whether to output as Array. Useful as input for other Spark transformers.
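
A minimal end-to-end sketch showing where the Finisher fits, assuming an active SparkSession named spark and the class names used elsewhere on this page (newer Spark NLP releases rename some of them); the sample text is made up:

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentence_detector = SentenceDetectorModel() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
tokenizer = RegexTokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
finisher = Finisher() \
    .setInputCols(["token"]) \
    .setCleanAnnotations(True)

pipeline = Pipeline(stages=[documentAssembler, sentence_detector, tokenizer, finisher])

data = spark.createDataFrame([["Hello world. Spark NLP makes annotation pipelines simple."]]).toDF("text")
# fit() builds the PipelineModel, transform() runs every stage over the data
result = pipeline.fit(data).transform(data)
result.show(truncate=False)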

Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where they are easy to work with. The Finisher outputs annotation values as strings.
Example:

val finisher = new Finisher()
  .setInputCols("token")
Settable parameters are:
  • setInputCols()
  • setOutputCols()
  • setCleanAnnotations(true) -> Whether to remove intermediate annotations
  • setValueSplitSymbol("#") -> character used to split values within an annotation
  • setAnnotationSplitSymbol("@") -> character used to split values between annotations
  • setIncludeKeys(false) -> Whether to include metadata keys. Sometimes useful in some annotations
  • setOutputAsArray(false) -> Whether to output as Array. Useful as input for other Spark transformers.

16. TokenAssembler: Getting data reshaped

This transformer reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.
Example:

token_assembler = TokenAssembler() \
.setInputCols(["normalized"]) \
.setOutputCol("assembled")
Settable parameters are:
  • setInputCol()
  • setOutputCol()
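
A sketch of where the TokenAssembler usually sits in a pipeline, reusing the stage variables defined in the earlier sections of this page (ordering and column names are illustrative):

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    documentAssembler,   # text -> document
    sentence_detector,   # document -> sentence
    tokenizer,           # sentence -> token
    normalizer,          # token -> normalized
    token_assembler      # normalized -> assembled, a Document-type annotation again
])

Annotators that require a Document, such as the RegexMatcher or DateMatcher above, can then be pointed at the "assembled" column.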

This transformer reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc, in order to use this document annotation in further annotators.
Example:

val tokenAssembler = new TokenAssembler()
.setInputCols("normalized")
.setOutputCol("assembled")
Settable parameters are:
  • setInputCol()
  • setOutputCol()