sparknlp.annotator.ner.ner_crf#

Contains classes for NerCrf.

Module Contents#

Classes#

NerCrfApproach

Algorithm for training a Named Entity Recognition Model

NerCrfModel

Extracts Named Entities based on a CRF Model.

class NerCrfApproach[source]#

Algorithm for training a Named Entity Recognition Model

For instantiated/pretrained models, see NerCrfModel.

This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.

Excluding the label column, these columns can be produced with, for example, a SentenceDetector, a Tokenizer, a PerceptronModel and a WordEmbeddingsModel.

Optionally the user can provide an entity dictionary file with setExternalFeatures() for better accuracy.
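The CoNLL 2003 IOB layout referenced above can be illustrated with a short, Spark-independent sketch. This is not Spark NLP code; it assumes the usual four-column `TOKEN POS CHUNK NER` layout with blank lines separating sentences:

```python
# Minimal sketch (not part of Spark NLP): parse CoNLL 2003-style lines
# into (token, pos, ner_label) tuples. Assumed column layout:
# TOKEN POS CHUNK NER, with blank lines separating sentences.
def parse_conll(text):
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, _chunk, ner = line.split()
        current.append((token, pos, ner))
    if current:
        sentences.append(current)
    return sentences

sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O"""
print(parse_conll(sample))
```

In Spark NLP itself this parsing is handled by the `CoNLL()` helper shown in the Examples section below, which also attaches the required annotator types.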

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types

Output Annotation type

DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS

NAMED_ENTITY

Parameters:
labelColumn

Column with the label for each token

entities

Entities to recognize

minEpochs

Minimum number of epochs to train, by default 0

maxEpochs

Maximum number of epochs to train, by default 1000

verbose

Level of verbosity during training, by default 4

randomSeed

Random seed

l2

L2 regularization coefficient, by default 1.0

c0

c0 parameter defining decay speed for gradient descent, by default 2250000

lossEps

Training stops if the relative improvement per epoch is less than this epsilon, by default 0.001

minW

Features with weights lower than this value will be filtered out

includeConfidence

Whether to include confidence scores in annotation metadata, by default False

externalFeatures

Paths to additional dictionaries to use as features

See also

NerDLApproach

for a deep learning based approach

NerConverter

to further process the results

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline

This CoNLL dataset already includes the sentence, token, POS tag and label columns with their respective annotator types. If a custom dataset is used, these need to be created first, for example with:

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("pos")

Then training can start:

>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings") \
...     .setCaseSensitive(False)
>>> nerTagger = NerCrfApproach() \
...     .setInputCols(["sentence", "token", "pos", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setMinEpochs(1) \
...     .setMaxEpochs(3) \
...     .setOutputCol("ner")
>>> pipeline = Pipeline().setStages([
...     embeddings,
...     nerTagger
... ])

We use the sentences, tokens, POS tags and labels from the CoNLL dataset.

>>> conll = CoNLL()
>>> trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> pipelineModel = pipeline.fit(trainingData)
l2[source]#
c0[source]#
lossEps[source]#
minW[source]#
includeConfidence[source]#
externalFeatures[source]#
setL2(self, l2value)[source]#

Sets L2 regularization coefficient, by default 1.0.

Parameters:
l2value : float

L2 regularization coefficient

setC0(self, c0value)[source]#

Sets the c0 parameter defining decay speed for gradient descent, by default 2250000.

Parameters:
c0value : int

c0 parameter defining decay speed for gradient descent

setLossEps(self, eps)[source]#

Sets the loss epsilon: training stops if the relative improvement per epoch is less than this value, by default 0.001.

Parameters:
eps : float

The threshold
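One plausible reading of this stopping rule, sketched in plain Python (this is an illustration of the criterion, not the actual CRF trainer, and the exact improvement formula used internally is an assumption):

```python
def should_stop(prev_loss, curr_loss, eps=0.001):
    # Stop when the relative improvement over the previous epoch
    # falls below eps (sketch of the lossEps criterion; the exact
    # formula Spark NLP uses internally may differ).
    if prev_loss == 0:
        return True
    improvement = (prev_loss - curr_loss) / abs(prev_loss)
    return improvement < eps

print(should_stop(100.0, 99.0))   # a 1% improvement: keep training
print(should_stop(100.0, 99.95))  # a 0.05% improvement: stop
```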

setMinW(self, w)[source]#

Sets minimum weight value.

Features with weights lower than this value will be filtered out.

Parameters:
w : float

Minimum weight value
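The effect of this threshold can be sketched in plain Python. The feature names and the use of absolute weights below are illustrative assumptions, not Spark NLP internals:

```python
def filter_features(weights, min_w):
    # Sketch of minW filtering: keep only features whose absolute
    # learned weight reaches the threshold. Feature names here are
    # hypothetical, not Spark NLP's internal representation.
    return {f: w for f, w in weights.items() if abs(w) >= min_w}

weights = {"word=Baghdad": 2.3, "suffix=ing": 0.004, "pos=NNP": 1.1}
print(filter_features(weights, 0.01))
```

Dropping near-zero features shrinks the model and speeds up inference at a small cost in accuracy.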

setExternalFeatures(self, path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets paths to additional dictionaries to use as features.

Parameters:
path : str

Path to the source files

delimiter : str

Delimiter for the dictionary file. Can also be set in options.

read_as : str, optional

How to read the file, by default ReadAs.TEXT

options : dict, optional

Options to read the resource, by default {"format": "text"}
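A dictionary is a plain delimiter-separated text file. The sketch below writes a tiny hypothetical dictionary; the "phrase<delimiter>label" layout and the ":" delimiter are illustrative assumptions, so check the resource format expected by your Spark NLP version:

```python
import os
import tempfile

# Hypothetical entity dictionary: one "phrase<delimiter>label" entry
# per line. The layout and the ":" delimiter are assumptions for
# illustration, not a documented Spark NLP format.
entries = {"John Snow Labs": "ORG", "Baghdad": "LOC"}
path = os.path.join(tempfile.mkdtemp(), "ner_dict.txt")
with open(path, "w") as f:
    for phrase, label in entries.items():
        f.write(f"{phrase}:{label}\n")

# The file would then be passed to the approach roughly like:
#   nerTagger = NerCrfApproach() \
#       .setExternalFeatures(path, delimiter=":")

print(open(path).read())
```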

setIncludeConfidence(self, b)[source]#

Sets whether to include confidence scores in annotation metadata, by default False.

Parameters:
b : bool

Whether to include the confidence value in the output.

class NerCrfModel(classname='com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel', java_model=None)[source]#

Extracts Named Entities based on a CRF Model.

This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS. These can be extracted with, for example:

  • a SentenceDetector,

  • a Tokenizer and

  • a PerceptronModel.

This is the instantiated model of the NerCrfApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> nerTagger = NerCrfModel.pretrained() \
...     .setInputCols(["sentence", "token", "word_embeddings", "pos"]) \
...     .setOutputCol("ner")

If no name is provided, the default model "ner_crf" is used. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types

Output Annotation type

DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS

NAMED_ENTITY

Parameters:
includeConfidence

Whether to include confidence scores in annotation metadata, by default False

See also

NerDLModel

for a deep learning based approach

NerConverter

to further process the results

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

First extract the prerequisites for the NerCrfModel

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("word_embeddings")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("pos")

Then NER can be extracted

>>> nerTagger = NerCrfModel.pretrained() \
...     .setInputCols(["sentence", "token", "word_embeddings", "pos"]) \
...     .setOutputCol("ner")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     posTagger,
...     nerTagger
... ])
>>> data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+
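The IOB tags in the result above can be grouped into entity chunks, which is the job of the NerConverter listed under See also. A minimal plain-Python sketch of that grouping (not the NerConverter implementation itself):

```python
def iob_chunks(tokens, tags):
    # Group IOB tags into (entity_type, text) chunks. A new chunk
    # starts on a B- tag, or on an I- tag whose type differs from
    # the currently open chunk.
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                chunks.append(current)
                current = None
            continue
        prefix, etype = tag.split("-", 1)
        if current and prefix == "I" and current[0] == etype:
            current = (etype, current[1] + " " + token)
        else:
            if current:
                chunks.append(current)
            current = (etype, token)
    if current:
        chunks.append(current)
    return chunks

# Illustrative token/tag alignment (the pipeline's actual
# tokenization of the sentence above may differ slightly).
tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
tags = ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]
print(iob_chunks(tokens, tags))
```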
name = NerCrfModel[source]#
includeConfidence[source]#
setIncludeConfidence(self, b)[source]#

Sets whether to include confidence scores in annotation metadata, by default False.

Parameters:
b : bool

Whether to include the confidence value in the output.

static pretrained(name='ner_crf', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "ner_crf"

lang : str, optional

Language of the pretrained model, by default "en"

remote_loc : str, optional

Optional remote address of the resource, by default None. Spark NLP's repositories are used otherwise.

Returns:
NerCrfModel

The restored model