package er
Type Members
-
class
AhoCorasickAutomatonInternal extends Serializable
Aho-Corasick Algorithm (https://dl.acm.org/doi/10.1145/360825.360855): a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm constructs a finite-state pattern-matching machine from the keywords and then uses that machine to process the text string in a single pass. Constructing the machine takes time linear in the total length of the given patterns, and searching takes time linear in the length of the text.
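The two phases described above (building the machine from the keywords, then scanning the text once) can be illustrated with a minimal, self-contained sketch. It is not the implementation of AhoCorasickAutomatonInternal; the class name SimpleAhoCorasick, its fields and the findAll method are hypothetical and exist only for illustration. It is written in script/REPL style.

```scala
import scala.collection.mutable

// Hypothetical illustration only; not the library's AhoCorasickAutomatonInternal.
final class SimpleAhoCorasick(keywords: Seq[String]) {
  // goto(state) maps a character to the next state; state 0 is the root.
  private val goto = mutable.ArrayBuffer(mutable.Map.empty[Char, Int])
  // fail(state) is the state to fall back to when no transition exists.
  private val fail = mutable.ArrayBuffer(0)
  // out(state) lists the keywords that end at this state.
  private val out = mutable.ArrayBuffer(List.empty[String])

  private def newState(): Int = {
    goto += mutable.Map.empty[Char, Int]
    fail += 0
    out += Nil
    goto.length - 1
  }

  // Phase 1a: insert every keyword into a trie.
  for (word <- keywords if word.nonEmpty) {
    var state = 0
    for (ch <- word) state = goto(state).getOrElseUpdate(ch, newState())
    out(state) = word :: out(state)
  }

  // Phase 1b: compute failure links breadth-first, so that shallower
  // states are finished before the states that fall back to them.
  private val queue = mutable.Queue.empty[Int]
  for ((_, s) <- goto(0)) queue.enqueue(s) // depth-1 states fall back to the root
  while (queue.nonEmpty) {
    val r = queue.dequeue()
    for ((ch, s) <- goto(r)) {
      queue.enqueue(s)
      var f = fail(r)
      while (f != 0 && !goto(f).contains(ch)) f = fail(f)
      fail(s) = goto(f).getOrElse(ch, 0)
      out(s) = out(s) ++ out(fail(s)) // inherit keywords that end at the fallback state
    }
  }

  /** Phase 2: single pass over the text. Returns (start, end, keyword), end inclusive. */
  def findAll(text: String): List[(Int, Int, String)] = {
    val hits = mutable.ListBuffer.empty[(Int, Int, String)]
    var state = 0
    for (i <- text.indices) {
      val ch = text(i)
      while (state != 0 && !goto(state).contains(ch)) state = fail(state)
      state = goto(state).getOrElse(ch, 0)
      for (word <- out(state)) hits += ((i - word.length + 1, i, word))
    }
    hits.toList
  }
}

val automaton = new SimpleAhoCorasick(Seq("Jon", "John", "John Snow", "Winterfell"))
val matches = automaton.findAll("Jon Snow wants to be lord of Winterfell.")
// matches == List((0, 2, "Jon"), (29, 38, "Winterfell"))
```

The failure links are what allow the scan to continue after a mismatch without re-reading earlier characters, which is why the text is processed in one pass.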
-
class
EntityRulerInternalApproach extends AnnotatorApproach[EntityRulerInternalModel] with HasStorage with CheckLicense
Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.
There are multiple ways and formats to set the extraction resource. It can be provided as a "JSON", "JSONL" or "CSV" file. The path to the file needs to be passed to setPatternsResource. The file format needs to be set as the "format" field in the options parameter map and, depending on the file type, additional parameters might need to be set.
If the file is in JSON format, then the rule definitions need to be given in a list with the fields "id", "label" and "patterns":
[
  {
    "id": "person-regex",
    "label": "PERSON",
    "patterns": ["\\w+\\s\\w+", "\\w+-\\w+"]
  },
  {
    "id": "locations-words",
    "label": "LOCATION",
    "patterns": ["Winterfell"]
  }
]
The same fields also apply to a file in the JSONL format:
{"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow"]}
{"id": "names-with-s", "label": "PERSON", "patterns": ["Stark", "Snow"]}
{"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}
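Regex entries such as the "person-regex" patterns in the JSON rule definitions can be sanity-checked with the Scala standard library before they are written to a patterns file. The sketch below only exercises the regular expressions themselves; it makes no claim about how the annotator applies them (for example per token or per sentence). The value names are hypothetical, and the sample sentence is the one used in the example on this page.

```scala
// The patterns as they read after JSON decoding: "\\w" in the JSON text is the regex \w.
val personPatterns = Seq("\\w+\\s\\w+", "\\w+-\\w+").map(_.r)

val sample = "Jon Snow wants to be lord of Winterfell."

// Every non-overlapping match of each pattern in the sample sentence.
val regexHits = personPatterns.flatMap(_.findAllIn(sample).toList)
// regexHits == Seq("Jon Snow", "wants to", "be lord", "of Winterfell")

// Whole-string checks against candidate entity strings.
val twoWords = "Jon Snow".matches("\\w+\\s\\w+")   // true
val hyphenated = "Jean-Luc".matches("\\w+-\\w+")   // true
```

Taken on its own, the first pattern matches any two words separated by a single whitespace character, so it is worth checking how broad a regex is before relying on it.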
To use a CSV file, the additional parameter "delimiter" needs to be set. In this case, the delimiter can be set with
.setPatternsResource("patterns.csv", ReadAs.TEXT, Map("format"->"csv", "delimiter" -> "\\|"))
PERSON|Jon
PERSON|John
PERSON|John Snow
LOCATION|Winterfell
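The delimiter above is written as "\\|" rather than "|". A plausible reason, assuming the delimiter is interpreted as a regular expression (this page does not state how it is used internally), is that an unescaped pipe is the regex alternation operator. The sketch below uses only String.split from the standard library to show the difference; the value names are hypothetical.

```scala
val csvLines = Seq("PERSON|Jon", "PERSON|John", "PERSON|John Snow", "LOCATION|Winterfell")
val delimiter = "\\|" // regex for a literal pipe, as passed in the options map

// Assumption: each line holds exactly one label and one pattern.
val rules = csvLines.map { line =>
  val parts = line.split(delimiter, 2)
  parts(0) -> parts(1)
}

// Group the patterns by entity label.
val byLabel = rules.groupBy(_._1).map { case (label, pairs) => label -> pairs.map(_._2) }
// byLabel("PERSON") == Seq("Jon", "John", "John Snow")
// byLabel("LOCATION") == Seq("Winterfell")

// An unescaped "|" is an empty alternation, so it splits between every character.
val unescaped = "PERSON|Jon".split("|")
```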
Example
In this example, the entities file has the form
PERSON|Jon
PERSON|John
PERSON|John Snow
LOCATION|Winterfell
where each line represents an entity and the associated string delimited by "|".
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.er.EntityRulerInternalApproach
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val entityRuler = new EntityRulerInternalApproach()
  .setInputCols("document", "token")
  .setOutputCol("entities")
  .setPatternsResource(
    path = "src/test/resources/entity-ruler/patterns.csv",
    readAs = ReadAs.TEXT,
    options = Map("format" -> "csv", "delimiter" -> "\\|")
  )

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  entityRuler
))

val data = Seq("Jon Snow wants to be lord of Winterfell.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(entities)").show(false)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 0, 2, Jon, [entity -> PERSON, sentence -> 0], []]           |
|[chunk, 29, 38, Winterfell, [entity -> LOCATION, sentence -> 0], []]|
+--------------------------------------------------------------------+
-
class
EntityRulerInternalModel extends AnnotatorModel[EntityRulerInternalModel] with HasSimpleAnnotate[EntityRulerInternalModel] with HasStorageModel with CheckLicense
Instantiated model of the EntityRulerInternalApproach. For usage and examples, see the documentation of that class.
- trait ReadablePretrainedEntityRulerInternal extends StorageReadable[EntityRulerInternalModel] with HasPretrained[EntityRulerInternalModel]
Value Members
- object EntityRulerInternalModel extends ReadablePretrainedEntityRulerInternal with Serializable