Packages

package pragmatic


Type Members

  1. class CustomPragmaticMethod extends PragmaticMethod with Serializable

    Inspired by Kevin Dias's Ruby implementation: https://github.com/diasks2/pragmatic_segmenter This approach extracts sentence bounds by first formatting the data with RuleSymbols and then extracting the bounds with a strong regex-based rule application

  2. class DefaultPragmaticMethod extends PragmaticMethod with Serializable
  3. class MixedPragmaticMethod extends PragmaticMethod with Serializable
  4. class PragmaticContentFormatter extends AnyRef

    Rule-based formatter that adds regex rules at different marking steps. Symbols protect ambiguous bounds from being considered splitters.

  5. trait PragmaticMethod extends AnyRef

    Created by Saif Addin on 5/5/2017.

    Attributes
    protected
  6. class PragmaticSentenceExtractor extends AnyRef

    Reads through symbolized data, and computes the bounds based on regex rules following symbol meaning

  7. trait RuleSymbols extends AnyRef

    Base Symbols that may be extended later on. For now kept in the pragmatic scope.

  8. class SentenceDetector extends AnnotatorModel[SentenceDetector] with HasSimpleAnnotate[SentenceDetector] with SentenceDetectorParams

    Annotator that detects sentence boundaries using any provided approach.

    Each extracted sentence can be returned in an Array, or exploded into separate rows if explodeSentences is set to true.

    For extended examples of usage, see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentence
    ))
    
    val data = Seq("This is my first sentence. This my second. How about a third?").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(sentence) as sentences").show(false)
    +------------------------------------------------------------------+
    |sentences                                                         |
    +------------------------------------------------------------------+
    |[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
    |[document, 27, 41, This my second., [sentence -> 1], []]          |
    |[document, 43, 60, How about a third?, [sentence -> 2], []]       |
    +------------------------------------------------------------------+
    See also

    SentenceDetectorDLModel for pretrained models
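    The explodeSentences option mentioned above changes the output shape from one array of sentences per document to one row per sentence. A minimal sketch of enabling it, assuming the setExplodeSentences setter from SentenceDetectorParams (verify the setter name against your Spark NLP version) and reusing documentAssembler and data from the example above:

    ```scala
    // Sketch: same pipeline as above, but with sentences exploded to rows.
    // setExplodeSentences is assumed from SentenceDetectorParams.
    val explodingSentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
      .setExplodeSentences(true) // one output row per detected sentence

    val explodingPipeline = new Pipeline().setStages(Array(
      documentAssembler,
      explodingSentence
    ))

    val exploded = explodingPipeline.fit(data).transform(data)
    // Each sentence now arrives as its own row, no explode() needed downstream
    exploded.select("sentence.result").show(false)
    ```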

Value Members

  1. object PragmaticContentFormatter
  2. object PragmaticDictionaries

    This is a dictionary that contains common English abbreviations that should be considered sentence bounds.

  3. object PragmaticSymbols extends RuleSymbols

    Extends RuleSymbols with specific symbols used for the pragmatic approach. Right now, the only one.

  4. object SentenceDetector extends DefaultParamsReadable[SentenceDetector] with Serializable
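
The symbolize-then-extract idea behind PragmaticContentFormatter and PragmaticSentenceExtractor can be illustrated outside of Spark. The sketch below is not the library's implementation: the marker characters, abbreviation list, and regex are simplified stand-ins for RuleSymbols and the pragmatic rule set, showing only how protecting ambiguous dots before splitting avoids false sentence bounds.

```scala
// Illustrative sketch of the pragmatic approach (not the library internals):
// first replace protected characters with symbols, then split on the
// remaining break symbol.
object PragmaticSketch {
  // Hypothetical symbols, analogous in spirit to RuleSymbols
  val ProtectedDot = "\u0001" // guards dots inside known abbreviations
  val Break        = "\u0002" // marks a confirmed sentence bound

  val abbreviations = Seq("Dr.", "Mr.", "etc.")

  def symbolize(text: String): String = {
    // Protect dots that belong to known abbreviations so they are
    // not treated as sentence-ending punctuation
    val protectedText = abbreviations.foldLeft(text) { (t, abbr) =>
      t.replace(abbr, abbr.dropRight(1) + ProtectedDot)
    }
    // Mark the remaining sentence-ending punctuation as bounds
    protectedText.replaceAll("""([.!?])\s+""", "$1" + Break)
  }

  def extract(symbolized: String): Seq[String] =
    symbolized
      .split(Break)
      .map(_.replace(ProtectedDot, ".").trim) // restore protected dots
      .filter(_.nonEmpty)
      .toSeq
}
```

With this sketch, "Dr. Smith arrived. He sat down." splits into two sentences, with the dot in "Dr." surviving because it was symbolized before the bound-marking regex ran.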
