sparknlp_jsl.annotator.ner.ner_converter_internal#

Module Contents#

Classes#

  • CommonNerConverterInternalParams – Base class for NerConverterInternal classes.

  • NerConverterInternal – Converts IOB or IOB2 representations of entities to a more user-friendly format.

  • NerConverterInternalModel – Converts IOB or IOB2 representations of entities to a more user-friendly format.

class CommonNerConverterInternalParams#

Bases: sparknlp_jsl.annotator.filtering_params.FilteringParams

Base class for NerConverterInternal classes.

Contains the common parameters and methods used in all internal classes.

Parameters:
  • whiteList – If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.

  • blackList – If defined, list of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.

  • preservePosition – Whether to preserve the original position of the tokens in the original document or use the modified tokens.

  • greedyMode – Whether to ignore B tags for contiguous tokens of the same entity.

  • threshold – Confidence threshold to filter the chunk entities

  • ignoreStopWords – If defined, list of stop words to ignore if present between two entities. It should be a list of tokens/words or characters, and when two entities of the same type are separated by those words, these entities can be combined to produce a single, larger chunk.
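
As a rough illustration of how the whiteList, blackList, and threshold parameters interact, the following plain-Python sketch filters a list of already-converted chunks. It does not use the library: the function filter_chunks and the dictionary keys text, entity, and confidence are hypothetical names chosen for this example, and treating the threshold as inclusive is an assumption.

```python
from typing import Dict, List, Optional


def filter_chunks(
    chunks: List[Dict],
    white_list: Optional[List[str]] = None,
    black_list: Optional[List[str]] = None,
    threshold: Optional[float] = None,
) -> List[Dict]:
    """Hypothetical helper: keep chunks that pass the label lists and the threshold."""
    kept = []
    for chunk in chunks:
        label = chunk["entity"]  # label without IOB prefix, e.g. "PER"
        if white_list is not None and label not in white_list:
            continue  # not white-listed, so ignored
        if black_list is not None and label in black_list:
            continue  # black-listed, so ignored
        # Assumption: a chunk exactly at the threshold is kept.
        if threshold is not None and chunk["confidence"] < threshold:
            continue
        kept.append(chunk)
    return kept


chunks = [
    {"text": "John", "entity": "PER", "confidence": 0.95},
    {"text": "Paris", "entity": "LOC", "confidence": 0.40},
    {"text": "Acme", "entity": "ORG", "confidence": 0.80},
]
print(filter_chunks(chunks, white_list=["PER", "LOC"], threshold=0.5))
```

In this sketch the ORG chunk is dropped by the white list and the LOC chunk by the threshold, so only the PER chunk remains.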

blackList#
caseSensitive#
criteria#
doExceptionHandling#
filterValue#
greedyMode#
ignoreStopWords#
preservePosition#
regex#
threshold#
whiteList#
setAllowList(entities: list)#

Sets the whiteList parameter value.

Parameters:

entities (list) – The list of white-listed entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setAllowList(["PER", "LOC"])
setBlackList(value)#

Sets the list of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.

Parameters:

value (List[str]) – List of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.

setCaseSensitive(value)#

Determines whether the definitions of the white listed and black listed entities are case sensitive or not.

Parameters:

value (bool) – Whether white listed and black listed entities are case sensitive or not. Default: True.

setCriteria(value)#

Sets the criteria used to compare the black-listed and white-listed values with the result of the Annotation.

Possible values are the following: ‘isin’, ‘regex’. Default: ‘isin’.

  • isin : Filter by the chunk

  • regex : Filter by using a regex

Parameters:

value (string) – It is used to compare black and white listed values with the result of the Annotation. Possible values are the following: ‘isin’, ‘regex’. Default: ‘isin’.
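
The difference between the two criteria can be pictured with a small plain-Python sketch. It is not the library's implementation: the function passes is a hypothetical name, and letting a regex match anywhere in the value (re.search rather than a full match) is an assumption.

```python
import re
from typing import List


def passes(value: str, entries: List[str], criteria: str = "isin") -> bool:
    """Hypothetical helper: does `value` match any listed entry under `criteria`?"""
    if criteria == "isin":
        # Exact membership in the list.
        return value in entries
    if criteria == "regex":
        # Assumption: a pattern may match anywhere in the value.
        return any(re.search(pattern, value) for pattern in entries)
    raise ValueError(f"unknown criteria: {criteria!r}")


print(passes("Drug_BrandName", ["Drug_BrandName"]))          # exact entry
print(passes("Drug_BrandName", ["^Drug_"], criteria="regex"))  # pattern entry
```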

setDenyList(entities)#

Sets the blackList parameter value.

Parameters:

entities (list) – The list of black-listed entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setDenyList(["PER", "LOC"])
setDoExceptionHandling(value: bool)#
  • If true, exceptions are handled.

  • If exception-causing data is passed to the model, an error annotation is emitted that contains the exception message. Processing continues with the next one.

  • This comes with a performance penalty.

Parameters:

value (bool) – True or False to handle exceptions.
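
The behaviour described above resembles the following plain-Python pattern. This is only a sketch under assumptions: annotate_rows and shout are hypothetical names, and the shape of the error annotation (an annotatorType of "error" with the message in result) is invented for illustration.

```python
from typing import Callable, Dict, List, Optional


def annotate_rows(
    rows: List[Optional[str]],
    annotate: Callable[[Optional[str]], Dict[str, str]],
    do_exception_handling: bool,
) -> List[Dict[str, str]]:
    """Hypothetical helper: apply `annotate` to each row, optionally trapping errors."""
    results = []
    for row in rows:
        if not do_exception_handling:
            results.append(annotate(row))  # any exception propagates to the caller
            continue
        try:
            results.append(annotate(row))
        except Exception as exc:
            # Emit an error record carrying the message, then move on to the next row.
            results.append({"annotatorType": "error", "result": str(exc)})
    return results


def shout(text: Optional[str]) -> Dict[str, str]:
    """Toy annotator that fails on None input."""
    return {"annotatorType": "chunk", "result": text.upper()}


out = annotate_rows(["a", None, "b"], shout, do_exception_handling=True)
print([o["annotatorType"] for o in out])
```

The extra try/except per row is a simple picture of why handling exceptions is not free.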

setFilterValue(value)#

Sets the value to filter on. Possible values are ‘result’ or ‘entity’.

If the value is ‘result’, it filters according to the result of the Annotation. If the value is ‘entity’, it filters according to the entity field in the metadata of the Annotation.

Parameters:

value (string) – possible values are ‘result’ and ‘entity’.
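
A minimal sketch of the two options, using a plain dictionary as a stand-in for an Annotation. The function value_to_filter is a hypothetical name, and the dictionary layout is an assumption made for this example.

```python
from typing import Dict


def value_to_filter(annotation: Dict, filter_value: str) -> str:
    """Hypothetical helper: pick the field that the filter is compared against."""
    if filter_value == "result":
        return annotation["result"]  # the chunk text
    if filter_value == "entity":
        return annotation["metadata"]["entity"]  # the entity label
    raise ValueError(f"unknown filter value: {filter_value!r}")


annotation = {"result": "aspirin", "metadata": {"entity": "Drug"}}
print(value_to_filter(annotation, "result"))
print(value_to_filter(annotation, "entity"))
```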

setGreedyMode(greedy_mode: bool)#

Sets the greedyMode parameter value.

Parameters:

greedy_mode (bool) – True or False to ignore B tags for contiguous tokens of the same entity.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setGreedyMode(True)
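
To make the effect of greedy mode concrete, the sketch below groups a sequence of IOB tags into spans with and without it. This is a plain-Python illustration, not the annotator's code; group_tags is a hypothetical name and the spans use inclusive token indices.

```python
from typing import List, Optional, Tuple


def group_tags(tags: List[str], greedy: bool = False) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: return (label, start, end) spans, end inclusive."""
    spans: List[Tuple[str, int, int]] = []
    current: Optional[list] = None  # [label, start, end] of the open span
    for i, tag in enumerate(tags):
        if tag == "O":
            if current is not None:
                spans.append(tuple(current))
                current = None
            continue
        prefix, label = tag.split("-", 1)
        # An I tag extends the open span; in greedy mode a B tag of the
        # same label extends it as well instead of starting a new span.
        extends = (
            current is not None
            and current[0] == label
            and (prefix == "I" or greedy)
        )
        if extends:
            current[2] = i
        else:
            if current is not None:
                spans.append(tuple(current))
            current = [label, i, i]
    if current is not None:
        spans.append(tuple(current))
    return spans


tags = ["B-PER", "I-PER", "B-PER", "O", "B-LOC"]
print(group_tags(tags))               # two PER spans and one LOC span
print(group_tags(tags, greedy=True))  # one merged PER span and one LOC span
```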
setIgnoreStopWords(stopwords: list)#

Sets the ignoreStopWords parameter value.

Parameters:

stopwords (list) – List of stop words to ignore.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setIgnoreStopWords(["a", "an", "the"])
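
The merging that ignoreStopWords enables can be sketched in plain Python as follows. The function merge_across_stop_words is a hypothetical name; requiring a non-empty gap and comparing tokens case-sensitively are assumptions of this sketch, not documented behaviour.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end) with inclusive token indices


def merge_across_stop_words(
    tokens: List[str], spans: List[Span], stop_words: Set[str]
) -> List[Span]:
    """Hypothetical helper: join same-label spans separated only by stop words."""
    merged: List[Span] = []
    for label, start, end in spans:
        if merged:
            prev_label, prev_start, prev_end = merged[-1]
            gap = tokens[prev_end + 1:start]  # tokens between the two spans
            if prev_label == label and gap and all(t in stop_words for t in gap):
                merged[-1] = (label, prev_start, end)  # grow the previous span
                continue
        merged.append((label, start, end))
    return merged


tokens = ["pain", "in", "the", "chest", "and", "fever"]
spans = [("Symptom", 0, 0), ("Symptom", 3, 3), ("Symptom", 5, 5)]
print(merge_across_stop_words(tokens, spans, {"in", "the"}))
```

Here "pain" and "chest" are joined because only "in" and "the" separate them, while "fever" stays separate because "and" is not in the stop word list.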
setPreservePosition(preserve_position: bool)#

Sets the preservePosition parameter value.

Parameters:

preserve_position (bool) – True or False to preserve the original position of the tokens in the original document or use the modified tokens.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setPreservePosition(True)
setRegex(value)#

Sets the list of regexes used to process the chunks.

Parameters:

value (List[str]) – List of regexes used to process the chunks.

setThreshold(threshold: float)#

Sets the threshold parameter value.

Parameters:

threshold (float) – Confidence threshold to filter the chunk entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setThreshold(0.5)
setWhiteList(value)#

Sets the list of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.

Parameters:

value (List[str]) – List of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.

class NerConverterInternal#

Bases: sparknlp_jsl.common.AnnotatorApproach, CommonNerConverterInternalParams, sparknlp_jsl.annotator.source_tracking_metadata_params.SourceTrackingMetadataParams

Converts IOB or IOB2 representations of entities to a more user-friendly format.

This is the AnnotatorApproach version of the NerConverterInternalModel annotator.

Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. Chunks with no associated entity (tagged “O”) are filtered.

This licensed annotator extends the open-source NerConverter annotator with the following parameters, which NerConverter does not provide: blackList, greedyMode, threshold, and ignoreStopWords.

See also Inside–outside–beginning (tagging) for more information.
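
The conversion described above can be pictured with a short plain-Python sketch that turns token-level IOB tags into labelled chunks. It is an illustration only, not the annotator's implementation; iob_to_chunks is a hypothetical name and the sample sentence is invented.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical helper: group IOB-tagged tokens into (chunk text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Tokens tagged "O" belong to no entity and close any open chunk.
            if words:
                chunks.append((" ".join(words), label))
            words, label = [], None
            continue
        prefix, entity = tag.split("-", 1)
        if prefix == "I" and entity == label:
            words.append(token)  # continue the open chunk
        else:
            if words:
                chunks.append((" ".join(words), label))
            words, label = [token], entity  # start a new chunk
    if words:
        chunks.append((" ".join(words), label))
    return chunks


tokens = ["John", "Smith", "lives", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_chunks(tokens, tags))
```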

Input Annotation types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotation type: CHUNK

Parameters:
  • replaceDictResource – If defined, path to a dictionary file to replace the tokens with.

  • replaceLabels – If defined, a dictionary that maps old to new labels.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
>>> documentAssembler = DocumentAssembler() \
...    .setInputCol("text") \
...    .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...    .setInputCols(["document"]) \
...    .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...    .setInputCols(["sentence"]) \
...    .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
...    .setInputCols(["sentence", "token"])\
...    .setOutputCol("embeddings")
>>> nerModel = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
...    .setInputCols(["sentence", "token", "embeddings"]) \
...    .setOutputCol("ner")
>>> nerConverter = NerConverterInternal() \
...    .setInputCols(["sentence", "token", "ner"]) \
...    .setOutputCol("ner_chunk")
...
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     nerModel,
...     nerConverter])
allPossibleFieldsToStandardize#
blackList#
caseSensitive#
criteria#
doExceptionHandling#
filterValue#
getter_attrs = []#
greedyMode#
ignoreStopWords#
includeOutputColumn#
includeStandardField#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
name = 'NerConverterInternal'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
outputColumnKey#
preservePosition#
regex#
replaceDictResource#
replaceLabels#
resetSentenceIndices#
standardFieldKey#
threshold#
uid#
whiteList#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict
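
The precedence rule (default param values < user-supplied values < extra) behaves like successive dictionary updates. The sketch below shows only that ordering with plain dictionaries; merge_param_maps is a hypothetical name, and real param maps are keyed by Param objects rather than strings.

```python
from typing import Any, Dict, Optional


def merge_param_maps(
    defaults: Dict[str, Any],
    user_supplied: Dict[str, Any],
    extra: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: later sources win when the same key appears twice."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged


defaults = {"greedyMode": False, "preservePosition": True}
user_supplied = {"greedyMode": True}
extra = {"threshold": 0.5, "greedyMode": False}
print(merge_param_maps(defaults, user_supplied, extra))
```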

fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M#
fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]]#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • paramMaps (collections.abc.Sequence) – A Sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator
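
Because the index values may arrive out of order, a caller typically places each model by its index rather than appending it. The sketch below shows that pattern with a stand-in iterator; collect_in_order is a hypothetical name and the strings stand in for fitted models.

```python
from typing import Any, Iterable, List, Tuple


def collect_in_order(model_iterator: Iterable[Tuple[int, Any]], count: int) -> List[Any]:
    """Hypothetical helper: store each model at the position of its param map."""
    models: List[Any] = [None] * count
    for index, model in model_iterator:
        models[index] = model  # index refers to paramMaps[index]
    return models


# Stand-in for the iterator returned by fitMultiple, yielding out of order.
fake_iterator = iter([(2, "model_c"), (0, "model_a"), (1, "model_b")])
print(collect_in_order(fake_iterator, 3))
```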

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setAllPossibleFieldsToStandardize(fields)#

Sets array with all possible fields containing the value to write in the standard field ordered by priority

Parameters:

fields (list) – array with all possible fields containing the value to write in the standard field ordered by priority

setAllowList(entities: list)#

Sets the whiteList parameter value.

Parameters:

entities (list) – The list of white-listed entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setAllowList(["PER", "LOC"])
setBlackList(value)#

Sets the list of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.

Parameters:

value (List[str]) – List of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.

setCaseSensitive(value)#

Determines whether the definitions of the white listed and black listed entities are case sensitive or not.

Parameters:

value (bool) – Whether white listed and black listed entities are case sensitive or not. Default: True.

setCriteria(value)#

Sets the criteria used to compare the black-listed and white-listed values with the result of the Annotation.

Possible values are the following: ‘isin’, ‘regex’. Default: ‘isin’.

  • isin : Filter by the chunk

  • regex : Filter by using a regex

Parameters:

value (string) – It is used to compare black and white listed values with the result of the Annotation. Possible values are the following: ‘isin’, ‘regex’. Default: ‘isin’.

setDenyList(entities)#

Sets the blackList parameter value.

Parameters:

entities (list) – The list of black-listed entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setDenyList(["PER", "LOC"])
setDoExceptionHandling(value: bool)#
  • If true, exceptions are handled.

  • If exception-causing data is passed to the model, an error annotation is emitted that contains the exception message. Processing continues with the next one.

  • This comes with a performance penalty.

Parameters:

value (bool) – True or False to handle exceptions.

setFilterValue(value)#

Sets the value to filter on. Possible values are ‘result’ or ‘entity’.

If the value is ‘result’, it filters according to the result of the Annotation. If the value is ‘entity’, it filters according to the entity field in the metadata of the Annotation.

Parameters:

value (string) – possible values are ‘result’ and ‘entity’.

setGreedyMode(greedy_mode: bool)#

Sets the greedyMode parameter value.

Parameters:

greedy_mode (bool) – True or False to ignore B tags for contiguous tokens of the same entity.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setGreedyMode(True)
setIgnoreStopWords(stopwords: list)#

Sets the ignoreStopWords parameter value.

Parameters:

stopwords (list) – List of stop words to ignore.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setIgnoreStopWords(["a", "an", "the"])
setIncludeOutputColumn(p)#

Sets whether to include a metadata key/value to specify the output column name for the annotation

Parameters:

p (bool) – whether to include a metadata key/value to specify the output column name for the annotation

setIncludeStandardField(p)#

Sets whether to include a metadata key/value to specify the output column name for the annotation

Parameters:

p (bool) – whether to include a metadata key/value to specify the output column name for the annotation

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setOutputColumnKey(s)#

Set key name for the source column value

Parameters:

s (str) – key name for the source column value

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setPreservePosition(preserve_position: bool)#

Sets the preservePosition parameter value.

Parameters:

preserve_position (bool) – True or False to preserve the original position of the tokens in the original document or use the modified tokens.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setPreservePosition(True)
setRegex(value)#

Sets the list of regexes used to process the chunks.

Parameters:

value (List[str]) – List of regexes used to process the chunks.

setReplaceDictResource(path: str, read_as: str = ReadAs.TEXT, options: dict = None)#

Sets the replaceDictResource parameter.

The method sets the replaceDictResource parameter to the path of the external resource. Usual usage is to set the path to a CSV file containing the dictionary pairs, one substitution per line, with the original label and the new label separated by a comma.

Parameters:
  • path (str) – Path to the external resource

  • read_as (str, optional) – How to read the resource, by default “TEXT”. Possible values: “TEXT” (ReadAs.TEXT), “SPARK” (ReadAs.SPARK) or “BINARY” (ReadAs.BINARY)

  • options (dict, optional) – Options for reading the resource, by default None

Examples

Reading a CSV file

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("replaced_ner_chunk")\
...    .setReplaceDictResource("path/to/dictionary.csv", "TEXT", {"delimiter": ","})

Reading a TSV file

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "jsl_ner"])\
...    .setOutputCol("replaced_ner_chunk")\
...    .setReplaceDictResource("path/to/dictionary.tsv", "TEXT", {"delimiter": "\t"})
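
For orientation, the dictionary file described above holds one substitution per line, with the original label and the new label separated by the delimiter. The plain-Python sketch below parses such content into a mapping; parse_replace_dict is a hypothetical name and the sample labels are illustrative, so this shows the file layout rather than how the library reads it.

```python
import csv
import io
from typing import Dict


def parse_replace_dict(text: str, delimiter: str = ",") -> Dict[str, str]:
    """Hypothetical helper: read 'old label<delimiter>new label' lines into a dict."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Skip blank or incomplete lines; strip stray whitespace around labels.
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


csv_text = "Drug_BrandName,Drug\nFrequency,Drug_Frequency\n"
tsv_text = "Drug_BrandName\tDrug\nFrequency\tDrug_Frequency\n"
print(parse_replace_dict(csv_text))
print(parse_replace_dict(tsv_text, delimiter="\t"))
```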
setReplaceLabels(labels: dict)#

Sets dictionary that maps old to new labels

Parameters:

labels (dict[str, str]) – Dictionary which maps old to new labels

Examples

Change the labels PER to PERSON and LOC to LOCATION:

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setReplaceLabels({"PER": "PERSON", "LOC": "LOCATION"})

Change ner_jsl entities for drugs.

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "jsl_ner"])\
...    .setOutputCol("replaced_ner_chunk")\
...    .setReplaceLabels(
...        {
...            "Drug_BrandName": "Drug",
...            "Frequency": "Drug_Frequency",
...            "Dosage": "Drug_Dosage",
...            "Strength": "Drug_Strength",
...        })
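
The effect of such a mapping on converted chunks can be sketched in plain Python: labels found in the dictionary are renamed and all others pass through unchanged. The function replace_labels and the sample chunks are hypothetical, and leaving unmapped labels untouched is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def replace_labels(
    chunks: List[Tuple[str, str]], labels: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Hypothetical helper: rename chunk labels according to `labels`."""
    # dict.get falls back to the original label when no mapping exists.
    return [(text, labels.get(label, label)) for text, label in chunks]


sample_chunks = [("John Smith", "PER"), ("New York", "LOC"), ("Acme", "ORG")]
print(replace_labels(sample_chunks, {"PER": "PERSON", "LOC": "LOCATION"}))
```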
setResetSentenceIndices(value)#

Sets whether to reset sentence indices to treat the entire output as if it originates from a single document.

When set to true, the metadata of each entity will be updated by assigning the sentence key a value of 0, effectively treating the entire output as if it originates from a single document, regardless of the original sentence boundaries. Default: False.

Parameters:

value (bool) – If set to true, sentence indices will be reset to treat the entire output as if it originates from a single document.
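
As a plain-Python picture of this update, the sketch below rewrites the sentence key in each annotation's metadata. The function reset_sentence_indices and the dictionary layout are hypothetical, and storing the index as the string "0" is an assumption of this example.

```python
from typing import Dict, List


def reset_sentence_indices(annotations: List[Dict]) -> List[Dict]:
    """Hypothetical helper: return copies whose metadata has sentence set to "0"."""
    return [
        {**annotation, "metadata": {**annotation["metadata"], "sentence": "0"}}
        for annotation in annotations
    ]


annotations = [
    {"result": "fever", "metadata": {"entity": "Symptom", "sentence": "0"}},
    {"result": "aspirin", "metadata": {"entity": "Drug", "sentence": "3"}},
]
reset = reset_sentence_indices(annotations)
print([a["metadata"]["sentence"] for a in reset])
```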

setStandardFieldKey(s)#

Set key name for the source column value

Parameters:

s (str) – key name for the source column value

setThreshold(threshold: float)#

Sets the threshold parameter value.

Parameters:

threshold (float) – Confidence threshold to filter the chunk entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setThreshold(0.5)
setWhiteList(value)#

Sets the list of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.

Parameters:

value (List[str]) – List of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.

class NerConverterInternalModel(classname='com.johnsnowlabs.nlp.annotators.ner.NerConverterInternalModel', java_model=None)#

Bases: sparknlp_jsl.common.AnnotatorModelInternal, CommonNerConverterInternalParams, sparknlp_jsl.annotator.source_tracking_metadata_params.SourceTrackingMetadataParams

Converts IOB or IOB2 representations of entities to a more user-friendly format.

Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. Chunks with no associated entity (tagged “O”) are filtered.

Input Annotation types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotation type: CHUNK

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
>>> documentAssembler = DocumentAssembler() \
...    .setInputCol("text") \
...    .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...    .setInputCols(["document"]) \
...    .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...    .setInputCols(["sentence"]) \
...    .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
...    .setInputCols(["sentence", "token"])\
...    .setOutputCol("embeddings")
>>> nerModel = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
...    .setInputCols(["sentence", "token", "embeddings"]) \
...    .setOutputCol("ner")
>>> nerConverter = NerConverterInternal() \
...    .setInputCols(["sentence", "token", "ner"]) \
...    .setOutputCol("ner_chunk")
...
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     nerModel,
...     nerConverter])
blackList#
caseSensitive#
criteria#
doExceptionHandling#
filterValue#
getter_attrs = []#
greedyMode#
ignoreStopWords#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
name = 'NerConverterInternalModel'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
preservePosition#
regex#
skipLPInputColsValidation = True#
threshold#
uid#
whiteList#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setAllowList(entities: list)#

Sets the whiteList parameter value.

Parameters:

entities (list) – The list of white-listed entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setAllowList(["PER", "LOC"])
setBlackList(value)#

Sets the list of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.

Parameters:

value (List[str]) – List of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.

setCaseSensitive(value)#

Determines whether the definitions of the white listed and black listed entities are case sensitive or not.

Parameters:

value (bool) – Whether white listed and black listed entities are case sensitive or not. Default: True.

setCriteria(value)#

Sets the criteria used to compare the black-listed and white-listed values with the result of the Annotation.

Possible values are the following: ‘isin’, ‘regex’. Default: ‘isin’.

  • isin : Filter by the chunk

  • regex : Filter by using a regex

Parameters:

value (string) – It is used to compare black and white listed values with the result of the Annotation. Possible values are the following: ‘isin’, ‘regex’. Default: ‘isin’.

setDenyList(entities)#

Sets the blackList parameter value.

Parameters:

entities (list) – The list of black-listed entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setDenyList(["PER", "LOC"])
setDoExceptionHandling(value: bool)#
  • If true, exceptions are handled.

  • If exception-causing data is passed to the model, an error annotation is emitted that contains the exception message. Processing continues with the next one.

  • This comes with a performance penalty.

Parameters:

value (bool) – True or False to handle exceptions.

setFilterValue(value)#

Sets the value to filter on. Possible values are ‘result’ or ‘entity’.

If the value is ‘result’, it filters according to the result of the Annotation. If the value is ‘entity’, it filters according to the entity field in the metadata of the Annotation.

Parameters:

value (string) – possible values are ‘result’ and ‘entity’.

setForceInputTypeValidation(etfm)#
setGreedyMode(greedy_mode: bool)#

Sets the greedyMode parameter value.

Parameters:

greedy_mode (bool) – True or False to ignore B tags for contiguous tokens of the same entity.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setGreedyMode(True)
setIgnoreStopWords(stopwords: list)#

Sets the ignoreStopWords parameter value.

Parameters:

stopwords (list) – List of stop words to ignore.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setIgnoreStopWords(["a", "an", "the"])
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setPreservePosition(preserve_position: bool)#

Sets the preservePosition parameter value.

Parameters:

preserve_position (bool) – True or False to preserve the original position of the tokens in the original document or use the modified tokens.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setPreservePosition(True)
setRegex(value)#

Sets the list of regexes used to process the chunks.

Parameters:

value (List[str]) – List of regexes used to process the chunks.

setThreshold(threshold: float)#

Sets the threshold parameter value.

Parameters:

threshold (float) – Confidence threshold to filter the chunk entities.

Examples

>>> ner_converter = NerConverterInternal()\
...    .setInputCols(["sentence", "token", "ner"])\
...    .setOutputCol("ner_chunk")\
...    .setThreshold(0.5)
setWhiteList(value)#

Sets the list of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.

Parameters:

value (List[str]) – List of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.