com.johnsnowlabs.nlp

package annotators

Type Members

  1. class ChunkTokenizer extends Tokenizer

    Tokenizes and flattens extracted NER chunks.

    The ChunkTokenizer will split the extracted NER CHUNK type Annotations and will create TOKEN type Annotations. The result is then flattened, resulting in a single array.

    For extended examples of usage, see the ChunkTokenizerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{ChunkTokenizer, TextMatcher, Tokenizer}
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val entityExtractor = new TextMatcher()
      .setInputCols("sentence", "token")
      .setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT)
      .setOutputCol("entity")
    
    val chunkTokenizer = new ChunkTokenizer()
      .setInputCols("entity")
      .setOutputCol("chunk_token")
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        entityExtractor,
        chunkTokenizer
      ))
    
    val data = Seq(
      "Hello world, my name is Michael, I am an artist and I work at Benezar",
      "Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("entity.result as entity", "chunk_token.result as chunk_token").show(false)
    +-----------------------------------------------+---------------------------------------------------+
    |entity                                         |chunk_token                                        |
    +-----------------------------------------------+---------------------------------------------------+
    |[world, Michael, work at Benezar]              |[world, Michael, work, at, Benezar]                |
    |[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
    +-----------------------------------------------+---------------------------------------------------+
  2. class ChunkTokenizerModel extends TokenizerModel

    Instantiated model of the ChunkTokenizer.

    For usage and examples, see the documentation of the main class.

  3. class Chunker extends AnnotatorModel[Chunker] with HasSimpleAnnotate[Chunker]

    This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from documents.

    Extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped in angle brackets <> to be easily distinguishable in the text itself. For example, this sentence results in the form:

    "Peter Pipers employees are picking pecks of pickled peppers."
    "<NNP><NNP><NNS><VBP><VBG><NNS><IN><JJ><NNS><.>"

    To then extract these tags, regexParsers need to be set with e.g.:

    val chunker = new Chunker()
      .setInputCols("sentence", "pos")
      .setOutputCol("chunk")
      .setRegexParsers(Array("<NNP>+", "<NNS>+"))

    When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here specifically "<NNP>+" means 1 or more nouns in succession. Additional patterns can also be set with addRegexParsers.
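    The matching idea can be sketched in plain Scala, independent of Spark NLP. This is an illustrative sketch only, not the actual implementation: each tag in angle brackets is treated as a regex group over the serialized tag string, so "<NNP>+" behaves like (?:<NNP>)+.

    ```scala
    // Illustrative sketch: how "<NNP>+" matches over a serialized POS tag string.
    // The real Chunker additionally maps matches back to token positions.
    val tagString = "<NNP><NNP><NNS><VBP><VBG><NNS><IN><JJ><NNS><.>"
    val nnpChunks = "(?:<NNP>)+".r.findAllIn(tagString).toList
    // nnpChunks == List("<NNP><NNP>"), i.e. the two successive proper nouns
    ```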

    For more extended examples see the Spark NLP Workshop and the ChunkerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{Chunker, Tokenizer}
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val POSTag = PerceptronModel.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("pos")
    
    val chunker = new Chunker()
      .setInputCols("sentence", "pos")
      .setOutputCol("chunk")
      .setRegexParsers(Array("<NNP>+", "<NNS>+"))
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        tokenizer,
        POSTag,
        chunker
      ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(chunk) as result").show(false)
    +-------------------------------------------------------------+
    |result                                                       |
    +-------------------------------------------------------------+
    |[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
    |[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
    |[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
    |[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
    +-------------------------------------------------------------+
    See also

    PerceptronModel for Part-Of-Speech tagging

  4. class DateMatcher extends AnnotatorModel[DateMatcher] with HasSimpleAnnotate[DateMatcher] with DateMatcherUtils

    Matches standard date formats into a provided format. Reads different forms of date and time expressions and converts them to the provided date format.

    Extracts only one date per document. Use with sentence detector to find matches in each sentence. To extract multiple dates from a document, please use the MultiDateMatcher.

    Reads the following kind of dates:

    "1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
    "Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
    "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
    "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
    "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

    For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

    Pretrained pipelines are available for this module, see Pipelines.

    For extended examples of usage, see the Spark NLP Workshop and the DateMatcherTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.DateMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val date = new DateMatcher()
      .setInputCols("document")
      .setOutputCol("date")
      .setAnchorDateYear(2020)
      .setAnchorDateMonth(1)
      .setAnchorDateDay(11)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      date
    ))
    
    val data = Seq("Fri, 21 Nov 1997", "next week at 7.30", "see you a day after").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("date").show(false)
    +-------------------------------------------------+
    |date                                             |
    +-------------------------------------------------+
    |[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
    |[[date, 0, 8, 2020/01/18, [sentence -> 0], []]]  |
    |[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
    +-------------------------------------------------+
    See also

    MultiDateMatcher for matching multiple dates in a document

  5. trait DateMatcherUtils extends Params
  6. class DocumentNormalizer extends AnnotatorModel[DocumentNormalizer] with HasSimpleAnnotate[DocumentNormalizer]

    Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can remove unwanted characters according to a specific policy, and can apply lowercase normalization.

    For extended examples of usage, see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.DocumentNormalizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val cleanUpPatterns = Array("<[^>]*>")
    
    val documentNormalizer = new DocumentNormalizer()
      .setInputCols("document")
      .setOutputCol("normalizedDocument")
      .setAction("clean")
      .setPatterns(cleanUpPatterns)
      .setReplacement(" ")
      .setPolicy("pretty_all")
      .setLowercase(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentNormalizer
    ))
    
    val text =
      """
    
    
    
      THE WORLD'S LARGEST WEB DEVELOPER SITE
    
    = THE WORLD'S LARGEST WEB DEVELOPER SITE =
    
    
    
    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..
    
    
    """
    val data = Seq(text).toDF("text")
    val pipelineModel = pipeline.fit(data)
    
    val result = pipelineModel.transform(data)
    result.selectExpr("normalizedDocument.result").show(truncate=false)
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  7. class Lemmatizer extends AnnotatorApproach[LemmatizerModel]

    Class to find lemmas out of words with the objective of returning a base dictionary word.

    Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary. The dictionary can be set either in the form of a delimited text file or directly as an ExternalResource. Pretrained models can be loaded with LemmatizerModel.pretrained.

    For available pretrained models please see the Models Hub. For extended examples of usage, see the Spark NLP Workshop and the LemmatizerTestSpec.

    Example

    In this example, the lemma dictionary lemmas_small.txt has the form of

    ...
    pick	->	pick	picks	picking	picked
    peck	->	peck	pecking	pecked	pecks
    pickle	->	pickle	pickles	pickled	pickling
    pepper	->	pepper	peppers	peppered	peppering
    ...

    where each key is separated from its inflected forms by -> and the forms are delimited by \t
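    The dictionary format can be illustrated with a small plain-Scala parsing sketch (illustrative only; the actual parsing is handled internally via ExternalResource readers):

    ```scala
    // Illustrative sketch: parse one dictionary line into a form -> lemma lookup.
    val line = "pick -> pick\tpicks\tpicking\tpicked"
    val Array(lemma, forms) = line.split("->").map(_.trim)
    val lookup = forms.split("\t").map(form => form -> lemma).toMap
    // lookup("picking") == "pick"
    ```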

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Lemmatizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val lemmatizer = new Lemmatizer()
      .setInputCols(Array("token"))
      .setOutputCol("lemma")
      .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        lemmatizer
      ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
      .toDF("text")
    
    val result = pipeline.fit(data).transform(data)
    result.selectExpr("lemma.result").show(false)
    +------------------------------------------------------------------+
    |result                                                            |
    +------------------------------------------------------------------+
    |[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
    +------------------------------------------------------------------+
    See also

    LemmatizerModel for the instantiated model and pretrained models.

  8. class LemmatizerModel extends AnnotatorModel[LemmatizerModel] with HasSimpleAnnotate[LemmatizerModel]

    Instantiated Model of the Lemmatizer.

    For usage and examples, please see the documentation of that class. For available pretrained models please see the Models Hub.

    Example

    The lemmatizer from the example of the Lemmatizer can be replaced with:

    val lemmatizer = LemmatizerModel.pretrained()
      .setInputCols(Array("token"))
      .setOutputCol("lemma")

    This will load the default pretrained model which is "lemma_antbnc".

    See also

    Lemmatizer

  9. class MultiDateMatcher extends AnnotatorModel[MultiDateMatcher] with HasSimpleAnnotate[MultiDateMatcher] with DateMatcherUtils

    Matches standard date formats into a provided format.

    Reads the following kind of dates:

    "1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
    "Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
    "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
    "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
    "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

    For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

    For extended examples of usage, see the Spark NLP Workshop and the MultiDateMatcherTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.MultiDateMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val date = new MultiDateMatcher()
      .setInputCols("document")
      .setOutputCol("date")
      .setAnchorDateYear(2020)
      .setAnchorDateMonth(1)
      .setAnchorDateDay(11)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      date
    ))
    
    val data = Seq("I saw him yesterday and he told me that he will visit us next week")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(date) as dates").show(false)
    +-----------------------------------------------+
    |dates                                          |
    +-----------------------------------------------+
    |[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
    |[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
    +-----------------------------------------------+
  10. class NGramGenerator extends AnnotatorModel[NGramGenerator] with HasSimpleAnnotate[NGramGenerator]

    A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK).

    Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

    When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
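    The n-gram construction itself can be sketched in plain Scala (illustrative only; the annotator additionally tracks character offsets for each chunk):

    ```scala
    // Illustrative sketch: bigram generation over a token sequence.
    val tokens = Seq("This", "is", "my", "sentence", ".")
    val n = 2
    val ngrams =
      if (tokens.length < n) Seq.empty[String] // fewer tokens than n: no n-grams
      else tokens.sliding(n).map(_.mkString(" ")).toSeq
    // ngrams == Seq("This is", "is my", "my sentence", "sentence .")
    ```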

    For more extended examples see the Spark NLP Workshop and the NGramGeneratorTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.NGramGenerator
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val nGrams = new NGramGenerator()
      .setInputCols("token")
      .setOutputCol("ngrams")
      .setN(2)
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentence,
        tokenizer,
        nGrams
      ))
    
    val data = Seq("This is my sentence.").toDF("text")
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(ngrams) as result").show(false)
    +------------------------------------------------------------+
    |result                                                      |
    +------------------------------------------------------------+
    |[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
    |[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
    |[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
    |[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
    +------------------------------------------------------------+
  11. class Normalizer extends AnnotatorApproach[NormalizerModel]

    Annotator that cleans out tokens.

    Requires tokens as input. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.

    For extended examples of usage, see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.{Normalizer, Tokenizer}
    import org.apache.spark.ml.Pipeline
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val normalizer = new Normalizer()
      .setInputCols("token")
      .setOutputCol("normalized")
      .setLowercase(true)
      .setCleanupPatterns(Array("""[^\w\d\s]""")) // remove punctuation (keep alphanumeric chars)
    // if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      normalizer
    ))
    
    val data = Seq("John and Peter are brothers. However they don't support each other that much.")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("normalized.result").show(truncate = false)
    +----------------------------------------------------------------------------------------+
    |result                                                                                  |
    +----------------------------------------------------------------------------------------+
    |[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
    +----------------------------------------------------------------------------------------+
  12. class NormalizerModel extends AnnotatorModel[NormalizerModel] with HasSimpleAnnotate[NormalizerModel]

    Instantiated Model of the Normalizer.

    For usage and examples, please see the documentation of that class.

    See also

    Normalizer for the base class

  13. trait ReadablePretrainedLemmatizer extends ParamsAndFeaturesReadable[LemmatizerModel] with HasPretrained[LemmatizerModel]
  14. trait ReadablePretrainedStopWordsCleanerModel extends ParamsAndFeaturesReadable[StopWordsCleaner] with HasPretrained[StopWordsCleaner]
  15. trait ReadablePretrainedTextMatcher extends ParamsAndFeaturesReadable[TextMatcherModel] with HasPretrained[TextMatcherModel]
  16. trait ReadablePretrainedTokenizer extends ParamsAndFeaturesReadable[TokenizerModel] with HasPretrained[TokenizerModel]
  17. class RecursiveTokenizer extends AnnotatorApproach[RecursiveTokenizerModel] with ParamsAndFeaturesWritable

    Tokenizes raw text recursively based on a handful of definable rules.

    Unlike the Tokenizer, the RecursiveTokenizer operates based on these array string parameters only:

    • prefixes: Strings that will be split off when found at the beginning of a token.
    • suffixes: Strings that will be split off when found at the end of a token.
    • infixes: Strings that will be split when found in the middle of a token.
    • whitelist: Whitelist of strings that will not be split.
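    The rule-based idea can be sketched in plain Scala (illustrative only, not the actual implementation; single-character prefixes and suffixes are assumed here):

    ```scala
    // Illustrative sketch: recursively split off known prefixes and suffixes.
    val prefixes = Seq("(", "\"")
    val suffixes = Seq(")", ",", ".")
    def splitToken(tok: String): List[String] =
      if (tok.length > 1 && prefixes.exists(tok.startsWith))
        tok.take(1) :: splitToken(tok.drop(1))
      else if (tok.length > 1 && suffixes.exists(tok.endsWith))
        splitToken(tok.dropRight(1)) :+ tok.takeRight(1)
      else List(tok)
    // splitToken("(and)") == List("(", "and", ")")
    ```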

    For extended examples of usage, see the Spark NLP Workshop and the TokenizerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.RecursiveTokenizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new RecursiveTokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer
    ))
    
    val data = Seq("One, after the Other, (and) again. PO, QAM,").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("token.result").show(false)
    +------------------------------------------------------------------+
    |result                                                            |
    +------------------------------------------------------------------+
    |[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
    +------------------------------------------------------------------+
  18. class RecursiveTokenizerModel extends AnnotatorModel[RecursiveTokenizerModel] with HasSimpleAnnotate[RecursiveTokenizerModel] with ParamsAndFeaturesWritable

    Instantiated model of the RecursiveTokenizer.

    For usage and examples, see the documentation of the main class.

  19. class RegexMatcher extends AnnotatorApproach[RegexMatcherModel]

    Uses a reference file to match a set of regular expressions and associate them with a provided identifier.

    A dictionary of predefined regular expressions must be provided with setExternalRules. The dictionary can be set either in the form of a delimited text file or directly as an ExternalResource.

    Pretrained pipelines are available for this module, see Pipelines.

    For extended examples of usage, see the Spark NLP Workshop and the RegexMatcherTestSpec.

    Example

    In this example, the rules.txt has the form of

    the\s\w+, followed by 'the'
    ceremonies, ceremony

    where each regex is separated from its identifier by ","

    import ResourceHelper.spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.RegexMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    
    val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    
    val regexMatcher = new RegexMatcher()
      .setExternalRules("src/test/resources/regex-matcher/rules.txt",  ",")
      .setInputCols(Array("sentence"))
      .setOutputCol("regex")
      .setStrategy("MATCH_ALL")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher))
    
    val data = Seq(
      "My first sentence with the first rule. This is my second sentence with ceremonies rule."
    ).toDF("text")
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(regex) as result").show(false)
    +--------------------------------------------------------------------------------------------+
    |result                                                                                      |
    +--------------------------------------------------------------------------------------------+
    |[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
    |[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
    +--------------------------------------------------------------------------------------------+
  20. class RegexMatcherModel extends AnnotatorModel[RegexMatcherModel] with HasSimpleAnnotate[RegexMatcherModel]

    Instantiated model of the RegexMatcher.

    For usage and examples, see the documentation of the main class.

  21. class RegexTokenizer extends AnnotatorModel[RegexTokenizer] with HasSimpleAnnotate[RegexTokenizer]

    A tokenizer that splits text by a regex pattern.

    The pattern needs to be set with setPattern; it defines the delimiting pattern, i.e. how the tokens should be split. By default, this pattern is \s+, which means that tokens are split by one or more whitespace characters.
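    The default behavior corresponds to a plain whitespace split, which can be sketched in plain Scala (illustrative only; the annotator additionally produces TOKEN annotations with character offsets):

    ```scala
    // Illustrative sketch: split on one or more whitespace characters,
    // lowercased as with setToLowercase(true).
    val text = "This is my first sentence.\nThis is my second."
    val tokens = text.split("\\s+").map(_.toLowerCase).toSeq
    // tokens == Seq("this", "is", "my", "first", "sentence.", "this", "is", "my", "second.")
    ```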

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.RegexTokenizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val regexTokenizer = new RegexTokenizer()
      .setInputCols("document")
      .setOutputCol("regexToken")
      .setToLowercase(true)
      .setPattern("\\s+")
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        regexTokenizer
      ))
    
    val data = Seq("This is my first sentence.\nThis is my second.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("regexToken.result").show(false)
    +-------------------------------------------------------+
    |result                                                 |
    +-------------------------------------------------------+
    |[this, is, my, first, sentence., this, is, my, second.]|
    +-------------------------------------------------------+
  22. class Stemmer extends AnnotatorModel[Stemmer] with HasSimpleAnnotate[Stemmer]

    Returns hard-stems out of words with the objective of retrieving the meaningful part of the word.

    For extended examples of usage, see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.{Stemmer, Tokenizer}
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val stemmer = new Stemmer()
      .setInputCols("token")
      .setOutputCol("stem")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      stemmer
    ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("stem.result").show(truncate = false)
    +-------------------------------------------------------------+
    |result                                                       |
    +-------------------------------------------------------------+
    |[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
    +-------------------------------------------------------------+
  23. class StopWordsCleaner extends AnnotatorModel[StopWordsCleaner] with HasSimpleAnnotate[StopWordsCleaner]

    This annotator takes a sequence of strings (e.g.

    This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.

    By default, it uses stop words from MLlib's StopWordsRemover. Stop words can also be defined explicitly with setStopWords(value: Array[String]) or loaded from pretrained models using the pretrained method of its companion object.

    val stopWords = StopWordsCleaner.pretrained()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)
    // will load the default pretrained model `"stopwords_en"`.

    For available pretrained models, please see the Models Hub.
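    Instead of a pretrained list, stop words can be supplied directly with setStopWords. A minimal sketch; the word list below is an illustrative selection, not a recommended set:

    ```scala
    import com.johnsnowlabs.nlp.annotators.StopWordsCleaner

    // Define the stop words explicitly rather than loading a pretrained list.
    // The words below are illustrative.
    val customStopWords = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setStopWords(Array("this", "is", "and", "my"))
      .setCaseSensitive(false)
    ```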

    For extended examples of usage, see the Spark NLP Workshop and StopWordsCleanerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val stopWords = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        stopWords
      ))
    
    val data = Seq(
      "This is my first sentence. This is my second.",
      "This is my third sentence. This is my forth."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("cleanTokens.result").show(false)
    +-------------------------------+
    |result                         |
    +-------------------------------+
    |[first, sentence, ., second, .]|
    |[third, sentence, ., forth, .] |
    +-------------------------------+
  24. class TextMatcher extends AnnotatorApproach[TextMatcherModel] with ParamsAndFeaturesWritable

    Annotator to match exact phrases (by token) provided in a file against a Document.

    Annotator to match exact phrases (by token) provided in a file against a Document.

    A text file of predefined phrases must be provided with setEntities. The text file can also be set directly as an ExternalResource.

    For extended examples of usage, see the Spark NLP Workshop and the TextMatcherTestSpec.

    Example

    In this example, the entities file is of the form

    ...
    dolore magna aliqua
    lorem ipsum dolor. sit
    laborum
    ...

    where each line represents an entity phrase to be extracted.

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.TextMatcher
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
    val entityExtractor = new TextMatcher()
      .setInputCols("document", "token")
      .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
      .setOutputCol("entity")
      .setCaseSensitive(false)
      .setTokenizer(tokenizer.fit(data))
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(entity) as result").show(false)
    +------------------------------------------------------------------------------------------+
    |result                                                                                    |
    +------------------------------------------------------------------------------------------+
    |[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
    |[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
    |[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
    +------------------------------------------------------------------------------------------+
    See also

    BigTextMatcher to match large amounts of text

  25. class TextMatcherModel extends AnnotatorModel[TextMatcherModel] with HasSimpleAnnotate[TextMatcherModel]

    Instantiated model of the TextMatcher.

    Instantiated model of the TextMatcher. For usage and examples see the documentation of the main class.
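    Like other Spark ML models, a fitted TextMatcherModel can be persisted and reloaded without refitting. A minimal sketch, assuming fittedMatcher is the result of fitting a TextMatcher in a pipeline; the path is illustrative:

    ```scala
    import com.johnsnowlabs.nlp.annotators.TextMatcherModel

    // `fittedMatcher` is assumed to come from fitting a TextMatcher;
    // the save path is an illustrative choice.
    fittedMatcher.write.overwrite().save("/tmp/text_matcher_model")
    val reloadedMatcher = TextMatcherModel.load("/tmp/text_matcher_model")
    ```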

  26. class Token2Chunk extends AnnotatorModel[Token2Chunk] with HasSimpleAnnotate[Token2Chunk]

    Converts TOKEN type Annotations to CHUNK type.

    Converts TOKEN type Annotations to CHUNK type.

    This can be useful if entities have already been extracted as TOKEN annotations and subsequent annotators require CHUNK types.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{Token2Chunk, Tokenizer}
    
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val token2chunk = new Token2Chunk()
      .setInputCols("token")
      .setOutputCol("chunk")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      token2chunk
    ))
    
    val data = Seq("One Two Three Four").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(chunk) as result").show(false)
    +------------------------------------------+
    |result                                    |
    +------------------------------------------+
    |[chunk, 0, 2, One, [sentence -> 0], []]   |
    |[chunk, 4, 6, Two, [sentence -> 0], []]   |
    |[chunk, 8, 12, Three, [sentence -> 0], []]|
    |[chunk, 14, 17, Four, [sentence -> 0], []]|
    +------------------------------------------+
  27. class Tokenizer extends AnnotatorApproach[TokenizerModel]

    Tokenizes raw text in document type columns into TokenizedSentence.

    Tokenizes raw text in document type columns into TokenizedSentence.

    This class represents a non-fitted tokenizer. Fitting it causes the internal RuleFactory to construct the tokenization rules from the input configuration.

    Identifies tokens using open tokenization standards. A few rules allow customizing it if the defaults do not fit the user's needs.

    For extended examples of usage, see the Spark NLP Workshop and the Tokenizer test class.
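    The tokenization rules can be adjusted through setters on the approach before fitting. A brief sketch, assuming the exception and split-character setters of this class; the specific values below are illustrative choices:

    ```scala
    import com.johnsnowlabs.nlp.annotators.Tokenizer

    // Customize tokenization rules before fitting.
    // The exception and split characters below are illustrative.
    val customTokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
      .addException("New York")  // keep "New York" as a single token
      .setSplitChars(Array("-")) // additionally split tokens on hyphens
    ```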

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import org.apache.spark.ml.Pipeline
    
    val data = Seq("I'd like to say we didn't expect that. Jane's boyfriend.").toDF("text")
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token").fit(data)
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer)).fit(data)
    val result = pipeline.transform(data)
    
    result.selectExpr("token.result").show(false)
    +-----------------------------------------------------------------------+
    |result                                                                 |
    +-----------------------------------------------------------------------+
    |[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
    +-----------------------------------------------------------------------+
  28. class TokenizerModel extends AnnotatorModel[TokenizerModel] with HasSimpleAnnotate[TokenizerModel]

    Tokenizes raw text into word pieces, tokens.

    Tokenizes raw text into word pieces, tokens. Identifies tokens using open tokenization standards. A few rules allow customizing it if the defaults do not fit the user's needs.

    This class represents an already fitted Tokenizer model.

    See the main class Tokenizer for more examples of usage.
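    Since the companion object mixes in ReadablePretrainedTokenizer, a fitted model can also be obtained directly instead of fitting a Tokenizer. A hedged sketch, assuming a default pretrained model is available:

    ```scala
    import com.johnsnowlabs.nlp.annotators.TokenizerModel

    // Load a pretrained tokenizer model instead of fitting a Tokenizer.
    // Assumes a default pretrained model is published for this annotator.
    val tokenizerModel = TokenizerModel.pretrained()
      .setInputCols("document")
      .setOutputCol("token")
    ```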

Value Members

  1. object ChunkTokenizer extends DefaultParamsReadable[ChunkTokenizer] with Serializable

    This is the companion object of ChunkTokenizer.

    This is the companion object of ChunkTokenizer. Please refer to that class for the documentation.

  2. object ChunkTokenizerModel extends ParamsAndFeaturesReadable[ChunkTokenizerModel] with Serializable
  3. object Chunker extends DefaultParamsReadable[Chunker] with Serializable

    This is the companion object of Chunker.

    This is the companion object of Chunker. Please refer to that class for the documentation.

  4. object DateMatcher extends DefaultParamsReadable[DateMatcher] with Serializable

    This is the companion object of DateMatcher.

    This is the companion object of DateMatcher. Please refer to that class for the documentation.

  5. object DocumentNormalizer extends DefaultParamsReadable[DocumentNormalizer] with Serializable

    This is the companion object of DocumentNormalizer.

    This is the companion object of DocumentNormalizer. Please refer to that class for the documentation.

  6. object EnglishStemmer
  7. object Lemmatizer extends DefaultParamsReadable[Lemmatizer] with Serializable

    This is the companion object of Lemmatizer.

    This is the companion object of Lemmatizer. Please refer to that class for the documentation.

  8. object LemmatizerModel extends ReadablePretrainedLemmatizer with Serializable

    This is the companion object of LemmatizerModel.

    This is the companion object of LemmatizerModel. Please refer to that class for the documentation.

  9. object MultiDateMatcher extends DefaultParamsReadable[MultiDateMatcher] with Serializable

    This is the companion object of MultiDateMatcher.

    This is the companion object of MultiDateMatcher. Please refer to that class for the documentation.

  10. object NGramGenerator extends ParamsAndFeaturesReadable[NGramGenerator] with Serializable
  11. object Normalizer extends DefaultParamsReadable[Normalizer] with Serializable

    This is the companion object of Normalizer.

    This is the companion object of Normalizer. Please refer to that class for the documentation.

  12. object NormalizerModel extends ParamsAndFeaturesReadable[NormalizerModel] with Serializable
  13. object RegexMatcher extends DefaultParamsReadable[RegexMatcher] with Serializable

    This is the companion object of RegexMatcher.

    This is the companion object of RegexMatcher. Please refer to that class for the documentation.

  14. object RegexMatcherModel extends ParamsAndFeaturesReadable[RegexMatcherModel] with Serializable
  15. object Stemmer extends DefaultParamsReadable[Stemmer] with Serializable

    This is the companion object of Stemmer.

    This is the companion object of Stemmer. Please refer to that class for the documentation.

  16. object StopWordsCleaner extends ParamsAndFeaturesReadable[StopWordsCleaner] with ReadablePretrainedStopWordsCleanerModel with Serializable
  17. object TextMatcher extends DefaultParamsReadable[TextMatcher] with Serializable

    This is the companion object of TextMatcher.

    This is the companion object of TextMatcher. Please refer to that class for the documentation.

  18. object TextMatcherModel extends ReadablePretrainedTextMatcher with Serializable

    This is the companion object of TextMatcherModel.

    This is the companion object of TextMatcherModel. Please refer to that class for the documentation.

  19. object Token2Chunk extends DefaultParamsReadable[Token2Chunk] with Serializable

    This is the companion object of Token2Chunk.

    This is the companion object of Token2Chunk. Please refer to that class for the documentation.

  20. object Tokenizer extends DefaultParamsReadable[Tokenizer] with Serializable

    This is the companion object of Tokenizer.

    This is the companion object of Tokenizer. Please refer to that class for the documentation.

  21. object TokenizerModel extends ReadablePretrainedTokenizer with Serializable

    This is the companion object of TokenizerModel.

    This is the companion object of TokenizerModel. Please refer to that class for the documentation.

Ungrouped