package matcher

Type Members

  1. trait ReadablePretrainedTextMatcherInternal extends ParamsAndFeaturesReadable[TextMatcherInternalModel] with HasPretrained[TextMatcherInternalModel]
  2. class SearchTrieInternal extends SearchTrie

    Immutable collection used for fast substring search. Implementation of the Aho-Corasick algorithm: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

  3. class TextMatcherInternal extends AnnotatorApproach[TextMatcherInternalModel] with TextMatcherInternalParams with ParamsAndFeaturesWritable

    Annotator to match exact phrases (by token) provided in a file against a Document.

    A text file of predefined phrases must be provided with setEntities. The text file can also be set directly as an ExternalResource.

    For extended examples of usage, see the

    Example

    In this example, the entities file is of the form

    ...
    dolore magna aliqua
    lorem ipsum dolor. sit
    laborum
    ...

    where each line represents an entity phrase to be extracted.

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.TextMatcherInternal
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
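    // Sample input; "Lorem" is capitalized here but lowercase in the entities file.
    val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
    val entityExtractor = new TextMatcherInternal()
      .setInputCols("document", "token")
      // One entity phrase per line, read as plain text.
      .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
      .setOutputCol("entity")
      // Match regardless of case, so "Lorem ipsum dolor. sit" is still found.
      .setCaseSensitive(false)
      .setTokenizer(tokenizer.fit(data))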
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(entity) as result").show(false)
    +------------------------------------------------------------------------------------------+
    |result                                                                                    |
    +------------------------------------------------------------------------------------------+
    |[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
    |[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
    |[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
    +------------------------------------------------------------------------------------------+
  4. class TextMatcherInternalModel extends AnnotatorModel[TextMatcherInternalModel] with HasSimpleAnnotate[TextMatcherInternalModel] with TextMatcherInternalParams with CheckLicense

    Instantiated model of the TextMatcherInternal. For usage and examples see the documentation of the main class.

  5. trait TextMatcherInternalParams extends Params with HasFeatures

    Trait containing parameters and helper methods for the TextMatcherInternal and TextMatcherInternalModel components.
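
The SearchTrieInternal entry above names the Aho-Corasick algorithm without showing it. The following is a minimal sketch of that algorithm over token sequences, written in plain Scala with no Spark NLP dependency. All names in it (AhoCorasickSketch, Automaton, Match, build, search) are hypothetical and chosen for illustration; it is not the SearchTrieInternal API and makes no claim about how that class is implemented internally.

```scala
import scala.collection.mutable

// Illustrative only: hypothetical names, not part of the matcher package.
object AhoCorasickSketch {
  // A hit: index of the matched phrase and inclusive begin/end token positions.
  final case class Match(phrase: Int, begin: Int, end: Int)

  final case class Automaton(
      edges: Vector[Map[String, Int]], // trie transitions per node
      fail: Vector[Int],               // failure link per node
      out: Vector[List[Int]],          // phrase ids recognized at each node
      lengths: Vector[Int])            // token length of each phrase

  def build(phrases: Seq[Seq[String]]): Automaton = {
    val edges = mutable.ArrayBuffer(mutable.Map.empty[String, Int])
    val out = mutable.ArrayBuffer(List.empty[Int])

    // Step 1: insert every non-empty phrase into the trie (node 0 is the root).
    for ((phrase, id) <- phrases.zipWithIndex if phrase.nonEmpty) {
      var node = 0
      for (tok <- phrase) {
        node = edges(node).getOrElseUpdate(tok, {
          edges += mutable.Map.empty[String, Int]
          out += Nil
          edges.length - 1
        })
      }
      out(node) = id :: out(node)
    }

    // Step 2: breadth-first pass that computes failure links and merges outputs.
    val fail = mutable.ArrayBuffer.fill(edges.length)(0)
    val queue = mutable.Queue.empty[Int]
    queue ++= edges(0).values // depth-1 nodes fail back to the root
    while (queue.nonEmpty) {
      val node = queue.dequeue()
      for ((tok, child) <- edges(node)) {
        var f = fail(node)
        while (f != 0 && !edges(f).contains(tok)) f = fail(f)
        fail(child) = edges(f).getOrElse(tok, 0)
        // Phrases ending at the failure target also end here.
        out(child) = out(child) ::: out(fail(child))
        queue += child
      }
    }

    Automaton(edges.map(_.toMap).toVector, fail.toVector, out.toVector,
      phrases.map(_.length).toVector)
  }

  // Single left-to-right pass over the tokens; reports every phrase occurrence.
  def search(a: Automaton, tokens: Seq[String]): List[Match] = {
    var node = 0
    val found = mutable.ListBuffer.empty[Match]
    for ((tok, i) <- tokens.zipWithIndex) {
      while (node != 0 && !a.edges(node).contains(tok)) node = a.fail(node)
      node = a.edges(node).getOrElse(tok, 0)
      for (id <- a.out(node)) found += Match(id, i - a.lengths(id) + 1, i)
    }
    found.toList
  }
}
```

For example, building the automaton from the phrases (a b c), (b c) and (c d) and searching the tokens a b c d reports all three phrases in one pass, including the overlapping ones. The sketch compares tokens exactly; a case-insensitive variant would lowercase both the phrases and the input tokens before calling build and search.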

Value Members

  1. object SearchTrieInternal extends Serializable
  2. object TextMatcherInternal extends DefaultParamsReadable[TextMatcherInternal] with Serializable

    This is the companion object of TextMatcherInternal. Please refer to that class for the documentation.

  3. object TextMatcherInternalModel extends ReadablePretrainedTextMatcherInternal with Serializable

    This is the companion object of TextMatcherInternalModel. Please refer to that class for the documentation.
