sparknlp.annotator.matcher.big_text_matcher#

Contains classes for the BigTextMatcher.

Module Contents#

Classes#

BigTextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

BigTextMatcherModel

Instantiated model of the BigTextMatcher.

class BigTextMatcher[source]#

Annotator to match exact phrases (by token) provided in a file against a Document.

A text file of predefined phrases must be provided with setStoragePath.

In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.

Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: CHUNK

Parameters:
entities

ExternalResource for entities

caseSensitive

Whether to ignore case in index lookups, by default True

mergeOverlapping

Whether to merge overlapping matched chunks, by default False

tokenizer

TokenizerModel to use to tokenize input file for building a Trie

Examples

In this example, the entities file is of the form:

...
dolore magna aliqua
lorem ipsum dolor. sit
laborum
...

where each line represents an entity phrase to be extracted.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
>>> entityExtractor = BigTextMatcher() \
...     .setInputCols("document", "token") \
...     .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
...     .setOutputCol("entity") \
...     .setCaseSensitive(False)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(entity)").show(truncate=False)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+
setEntities(path, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets ExternalResource for entities.

Parameters:
path : str

Path to the resource

read_as : str, optional

How to read the resource, by default ReadAs.TEXT

options : dict, optional

Options for reading the resource, by default {“format”: “text”}

setCaseSensitive(b)[source]#

Sets whether to ignore case in index lookups, by default True.

Parameters:
b : bool

Whether to ignore case in index lookups

setMergeOverlapping(b)[source]#

Sets whether to merge overlapping matched chunks, by default False.

Parameters:
b : bool

Whether to merge overlapping matched chunks

setTokenizer(tokenizer_model)[source]#

Sets TokenizerModel to use to tokenize input file for building a Trie.

Parameters:
tokenizer_model : TokenizerModel

TokenizerModel to use to tokenize input file

class BigTextMatcherModel(classname='com.johnsnowlabs.nlp.annotators.btm.TextMatcherModel', java_model=None)[source]#

Instantiated model of the BigTextMatcher.

For training your own model, please see the documentation of BigTextMatcher.

Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: CHUNK

Parameters:
caseSensitive

Whether to ignore case in index lookups

mergeOverlapping

Whether to merge overlapping matched chunks, by default False

searchTrie

The SearchTrie used for phrase matching

setMergeOverlapping(b)[source]#

Sets whether to merge overlapping matched chunks, by default False.

Parameters:
b : bool

Whether to merge overlapping matched chunks

setCaseSensitive(v)[source]#

Sets whether to ignore case in index lookups.

Parameters:
v : bool

Whether to ignore case in index lookups

static pretrained(name, lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
BigTextMatcherModel

The restored model

static loadStorage(path, spark, storage_ref)[source]#

Loads the model from storage.

Parameters:
path : str

Path to the model

spark : pyspark.sql.SparkSession

The current SparkSession

storage_ref : str

Identifiers for the model parameters