sparknlp_jsl.annotator.ner.ner_converter_internal
#
Module Contents#
Classes#
CommonNerConverterInternalParams — Base class for NerConverterInternal classes.
NerConverterInternal — Converts IOB or IOB2 representations of entities to a more user-friendly format.
NerConverterInternalModel — Converts IOB or IOB2 representations of entities to a more user-friendly format.
- class CommonNerConverterInternalParams#
Bases:
sparknlp_jsl.annotator.filtering_params.FilteringParams
Base class for NerConverterInternal classes.
Contains the common parameters and methods used in all internal classes.
- Parameters:
whiteList – If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.
blackList – If defined, list of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.
preservePosition – Whether to preserve the original position of the tokens in the original document or use the modified tokens.
greedyMode – Whether to ignore B tags for contiguous tokens of the same entity.
threshold – Confidence threshold to filter the chunk entities.
ignoreStopWords – If defined, list of stop words to ignore if present between two entities. It should be a list of tokens/words or characters; when two entities of the same type are separated only by those words, the entities can be combined into a single, larger chunk.
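A plain-Python sketch can make the combined effect of these filtering parameters concrete. The chunk tuples, function name, and filtering order below are illustrative assumptions, not the annotator's internals:

```python
# Simplified sketch of whiteList/blackList/threshold filtering semantics.
# Each chunk is (text, entity_label, confidence); this mimics the described
# behaviour, not the actual Spark NLP implementation.

def filter_chunks(chunks, white_list=None, black_list=None, threshold=0.0):
    kept = []
    for text, entity, confidence in chunks:
        if white_list is not None and entity not in white_list:
            continue  # entity not in the allow list -> ignored
        if black_list is not None and entity in black_list:
            continue  # entity explicitly denied
        if confidence < threshold:
            continue  # below the confidence threshold
        kept.append((text, entity, confidence))
    return kept

chunks = [("John", "PER", 0.95), ("Paris", "LOC", 0.40), ("aspirin", "DRUG", 0.88)]
print(filter_chunks(chunks, white_list=["PER", "DRUG"], threshold=0.5))
# [('John', 'PER', 0.95), ('aspirin', 'DRUG', 0.88)]
```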
- blackList#
- caseSensitive#
- criteria#
- doExceptionHandling#
- filterValue#
- greedyMode#
- ignoreStopWords#
- preservePosition#
- regex#
- threshold#
- whiteList#
- setAllowList(entities: list)#
Sets the whiteList parameter value.
- Parameters:
entities (list) – The list of white-listed entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setAllowList(["PER", "LOC"])
- setBlackList(value)#
Sets the list of entities to ignore; the rest will be processed. Do not include the IOB prefix on labels.
- Parameters:
value (List[str]) – List of entities to ignore. Do not include the IOB prefix on labels.
- setCaseSensitive(value)#
Determines whether the definitions of the white listed and black listed entities are case sensitive or not.
- Parameters:
value (bool) – Whether white listed and black listed entities are case sensitive or not. Default: True.
- setCriteria(value)#
Sets how black- and white-listed values are compared with the result of the Annotation.
Possible values: ‘isin’, ‘regex’. Default: ‘isin’.
isin: filter by the chunk text.
regex: filter by matching a regex pattern.
- Parameters:
value (str) – Comparison mode for black- and white-listed values. Possible values: ‘isin’, ‘regex’. Default: ‘isin’.
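The two criteria behave roughly as follows (an illustrative plain-Python sketch; the function name is an assumption, not library code):

```python
import re

# 'isin' compares chunk text by exact membership; 'regex' matches patterns.
def matches(chunk_text, patterns, criteria="isin"):
    if criteria == "isin":
        return chunk_text in patterns  # exact membership test
    if criteria == "regex":
        return any(re.fullmatch(p, chunk_text) for p in patterns)
    raise ValueError("criteria must be 'isin' or 'regex'")

print(matches("aspirin", ["aspirin", "ibuprofen"]))            # True
print(matches("aspirin 81 mg", [r"aspirin \d+ mg"], "regex"))  # True
```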
- setDenyList(entities)#
Sets the blackList parameter value.
- Parameters:
entities (list) – The list of black-listed entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setDenyList(["PER", "LOC"])
- setDoExceptionHandling(value: bool)#
If true, exceptions are handled.
If exception-causing data is passed to the model, an error annotation is emitted that contains the exception message, and processing continues with the next record.
This comes with a performance penalty.
- Parameters:
value (bool) – True or False to handle exceptions.
- setFilterValue(value)#
Sets which field is filtered; possible values are ‘result’ or ‘entity’.
If the value is ‘result’, it filters by the result of the Annotation. If the value is ‘entity’, it filters by the entity field in the metadata of the Annotation.
- Parameters:
value (str) – Possible values are ‘result’ and ‘entity’.
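The distinction can be sketched in plain Python (the dict-based annotation shape and function name are illustrative assumptions):

```python
# Sketch: filterValue decides which field the white/black lists compare against.
def field_to_filter(annotation, filter_value="result"):
    # annotation is a simplified dict: {"result": ..., "metadata": {"entity": ...}}
    if filter_value == "result":
        return annotation["result"]              # the chunk text itself
    if filter_value == "entity":
        return annotation["metadata"]["entity"]  # the entity label
    raise ValueError("filterValue must be 'result' or 'entity'")

ann = {"result": "aspirin", "metadata": {"entity": "DRUG"}}
print(field_to_filter(ann, "result"))  # aspirin
print(field_to_filter(ann, "entity"))  # DRUG
```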
- setGreedyMode(greedy_mode: bool)#
Sets the greedyMode parameter value.
- Parameters:
greedy_mode (bool) – True or False to ignore B tags for contiguous tokens of the same entity.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setGreedyMode(True)
- setIgnoreStopWords(stopwords: list)#
Sets the ignoreStopWords parameter value.
- Parameters:
stopwords (list) – List of stop words to ignore.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setIgnoreStopWords(["a", "an", "the"])
- setPreservePosition(preserve_position: bool)#
Sets the preservePosition parameter value.
- Parameters:
preserve_position (bool) – True or False to preserve the original position of the tokens in the original document or use the modified tokens.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setPreservePosition(True)
- setRegex(value)#
Sets the list of regex patterns used to process the chunks.
- Parameters:
value (List[str]) – List of regex patterns to process the chunks.
- setThreshold(threshold: float)#
Sets the threshold parameter value.
- Parameters:
threshold (float) – Confidence threshold to filter the chunk entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setThreshold(0.5)
- setWhiteList(value)#
Sets the list of entities to process; the rest will be ignored. Do not include the IOB prefix on labels.
- Parameters:
value (List[str]) – List of entities to process. Do not include the IOB prefix on labels.
- class NerConverterInternal#
Bases:
sparknlp_jsl.common.AnnotatorApproach
,CommonNerConverterInternalParams
,sparknlp_jsl.annotator.source_tracking_metadata_params.SourceTrackingMetadataParams
Converts IOB or IOB2 representations of entities to a more user-friendly format.
This is the AnnotatorApproach version of the NerConverterInternalModel annotator.
Converts an IOB or IOB2 representation of NER to a user-friendly one by associating the tokens of recognized entities with their label. Chunks with no associated entity (tagged “O”) are filtered out.
This licensed annotator adds extra functionality to the open-source version by adding the following parameters: blackList, greedyMode, threshold, and ignoreStopWords that are not available in the NerConverter annotator.
See also Inside–outside–beginning (tagging) for more information.
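As a rough illustration of the conversion itself, here is a minimal plain-Python sketch of IOB tags turned into chunks (illustrative only; the annotator operates on Spark NLP annotations, and this helper is an assumption, not library code):

```python
# Minimal sketch of IOB -> chunk conversion. "O" tokens are dropped;
# "B-" starts a new chunk, "I-" of the same entity extends the current one.

def iob_to_chunks(tokens, tags):
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                chunks.append(current)
            current = None
        elif tag.startswith("B-") or current is None or current[1] != tag[2:]:
            if current:
                chunks.append(current)
            current = (token, tag[2:])  # start a new chunk
        else:  # "I-" tag continuing the current entity
            current = (current[0] + " " + token, current[1])
    if current:
        chunks.append(current)
    return chunks

tokens = ["John", "Smith", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(iob_to_chunks(tokens, tags))  # [('John Smith', 'PER'), ('New York', 'LOC')]
```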
Input Annotation types
Output Annotation type
DOCUMENT, TOKEN, NAMED_ENTITY
CHUNK
- Parameters:
replaceDictResource – If defined, path to a dictionary file to replace the tokens with.
replaceLabels – If defined, a dictionary that maps old to new labels.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> nerModel = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
>>> nerConverter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     nerModel,
...     nerConverter])
- allPossibleFieldsToStandardize#
- blackList#
- caseSensitive#
- criteria#
- doExceptionHandling#
- filterValue#
- getter_attrs = []#
- greedyMode#
- ignoreStopWords#
- includeOutputColumn#
- includeStandardField#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- name = 'NerConverterInternal'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- outputColumnKey#
- preservePosition#
- regex#
- replaceDictResource#
- replaceLabels#
- resetSentenceIndices#
- standardFieldKey#
- threshold#
- uid#
- whiteList#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M #
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer
or a list of Transformer
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]] #
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setAllPossibleFieldsToStandardize(fields)#
Sets the array of all possible fields containing the value to write in the standard field, ordered by priority.
- Parameters:
fields (list) – Array of all possible fields containing the value to write in the standard field, ordered by priority.
- setAllowList(entities: list)#
Sets the whiteList parameter value.
- Parameters:
entities (list) – The list of white-listed entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setAllowList(["PER", "LOC"])
- setBlackList(value)#
Sets the list of entities to ignore; the rest will be processed. Do not include the IOB prefix on labels.
- Parameters:
value (List[str]) – List of entities to ignore. Do not include the IOB prefix on labels.
- setCaseSensitive(value)#
Determines whether the definitions of the white listed and black listed entities are case sensitive or not.
- Parameters:
value (bool) – Whether white listed and black listed entities are case sensitive or not. Default: True.
- setCriteria(value)#
Sets how black- and white-listed values are compared with the result of the Annotation.
Possible values: ‘isin’, ‘regex’. Default: ‘isin’.
isin: filter by the chunk text.
regex: filter by matching a regex pattern.
- Parameters:
value (str) – Comparison mode for black- and white-listed values. Possible values: ‘isin’, ‘regex’. Default: ‘isin’.
- setDenyList(entities)#
Sets the blackList parameter value.
- Parameters:
entities (list) – The list of black-listed entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setDenyList(["PER", "LOC"])
- setDoExceptionHandling(value: bool)#
If true, exceptions are handled.
If exception-causing data is passed to the model, an error annotation is emitted that contains the exception message, and processing continues with the next record.
This comes with a performance penalty.
- Parameters:
value (bool) – True or False to handle exceptions.
- setFilterValue(value)#
Sets which field is filtered; possible values are ‘result’ or ‘entity’.
If the value is ‘result’, it filters by the result of the Annotation. If the value is ‘entity’, it filters by the entity field in the metadata of the Annotation.
- Parameters:
value (str) – Possible values are ‘result’ and ‘entity’.
- setGreedyMode(greedy_mode: bool)#
Sets the greedyMode parameter value.
- Parameters:
greedy_mode (bool) – True or False to ignore B tags for contiguous tokens of the same entity.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setGreedyMode(True)
- setIgnoreStopWords(stopwords: list)#
Sets the ignoreStopWords parameter value.
- Parameters:
stopwords (list) – List of stop words to ignore.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setIgnoreStopWords(["a", "an", "the"])
- setIncludeOutputColumn(p)#
Sets whether to include a metadata key/value to specify the output column name for the annotation
- Parameters:
p (bool) – whether to include a metadata key/value to specify the output column name for the annotation
- setIncludeStandardField(p)#
Sets whether to include a metadata key/value with the standard field for the annotation.
- Parameters:
p (bool) – whether to include a metadata key/value with the standard field for the annotation
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setOutputColumnKey(s)#
Sets the key name for the source column value.
- Parameters:
s (str) – key name for the source column value
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setPreservePosition(preserve_position: bool)#
Sets the preservePosition parameter value.
- Parameters:
preserve_position (bool) – True or False to preserve the original position of the tokens in the original document or use the modified tokens.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setPreservePosition(True)
- setRegex(value)#
Sets the list of regex patterns used to process the chunks.
- Parameters:
value (List[str]) – List of regex patterns to process the chunks.
- setReplaceDictResource(path: str, read_as: str = ReadAs.TEXT, options: dict = None)#
Sets the replaceDictResource parameter.
The method sets the replaceDictResource parameter to the path of the external resource. Usual usage is to set the path to a CSV file containing the dictionary pairs, one substitution per line, with the original label and the new label separated by a comma.
- Parameters:
path (str) – Path to the external resource
read_as (str, optional) – How to read the resource, by default “TEXT”. Possible values: “TEXT” (ReadAs.TEXT), “SPARK” (ReadAs.SPARK) or “BINARY” (ReadAs.BINARY)
options (dict, optional) – Options for reading the resource, by default None
Examples
Reading a CSV file
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("replaced_ner_chunk") \
...     .setReplaceDictResource("path/to/dictionary.csv", "TEXT", {"delimiter": ","})
Reading a TSV file
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "jsl_ner"]) \
...     .setOutputCol("replaced_ner_chunk") \
...     .setReplaceDictResource("path/to/dictionary.tsv", "TEXT", {"delimiter": "\t"})
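For reference, applying such a dictionary amounts to a simple label mapping. The CSV contents and helper below are illustrative assumptions, not the annotator's actual loader:

```python
# Illustrative contents of a replacement dictionary (one substitution per
# line, original label and new label separated by a comma) and how the
# mapping would be applied. Unmapped labels pass through unchanged.
csv_contents = """PER,PERSON
LOC,LOCATION"""

replace = dict(line.split(",") for line in csv_contents.splitlines())
print(replace.get("PER", "PER"))    # PERSON
print(replace.get("DRUG", "DRUG"))  # DRUG
```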
- setReplaceLabels(labels: dict)#
Sets dictionary that maps old to new labels
- Parameters:
labels (dict[str, str]) – Dictionary which maps old to new labels
Examples
Change the labels PER to PERSON and LOC to LOCATION:
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setReplaceLabels({"PER": "PERSON", "LOC": "LOCATION"})
Change ner_jsl entities for drugs:
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "jsl_ner"]) \
...     .setOutputCol("replaced_ner_chunk") \
...     .setReplaceLabels(
...         {
...             "Drug_BrandName": "Drug",
...             "Frequency": "Drug_Frequency",
...             "Dosage": "Drug_Dosage",
...             "Strength": "Drug_Strength",
...         })
- setResetSentenceIndices(value)#
Sets whether to reset sentence indices, treating the entire output as if it originates from a single document.
When set to true, the metadata of each entity is updated by assigning the sentence key a value of 0, effectively treating the entire output as if it originates from a single document, regardless of the original sentence boundaries. Default: False.
- Parameters:
value (bool) – If set to true, sentence indices will be reset to treat the entire output as if it originates from a single document.
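Conceptually, the reset amounts to overwriting the sentence metadata key, as in this illustrative sketch (the dict-based annotation shape is an assumption):

```python
# Sketch: resetting sentence indices sets metadata["sentence"] to "0" for
# every chunk, so downstream stages see a single-document output.
def reset_sentence_indices(annotations):
    for ann in annotations:
        ann["metadata"]["sentence"] = "0"
    return annotations

anns = [{"result": "John", "metadata": {"sentence": "0"}},
        {"result": "Paris", "metadata": {"sentence": "2"}}]
print([a["metadata"]["sentence"] for a in reset_sentence_indices(anns)])  # ['0', '0']
```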
- setStandardFieldKey(s)#
Sets the key name for the standard field value.
- Parameters:
s (str) – key name for the standard field value
- setThreshold(threshold: float)#
Sets the threshold parameter value.
- Parameters:
threshold (float) – Confidence threshold to filter the chunk entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setThreshold(0.5)
- setWhiteList(value)#
Sets the list of entities to process; the rest will be ignored. Do not include the IOB prefix on labels.
- Parameters:
value (List[str]) – List of entities to process. Do not include the IOB prefix on labels.
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class NerConverterInternalModel(classname='com.johnsnowlabs.nlp.annotators.ner.NerConverterInternalModel', java_model=None)#
Bases:
sparknlp_jsl.common.AnnotatorModelInternal
,CommonNerConverterInternalParams
,sparknlp_jsl.annotator.source_tracking_metadata_params.SourceTrackingMetadataParams
Converts IOB or IOB2 representations of entities to a more user-friendly format.
Converts an IOB or IOB2 representation of NER to a user-friendly one by associating the tokens of recognized entities with their label. Chunks with no associated entity (tagged “O”) are filtered out.
Input Annotation types
Output Annotation type
DOCUMENT, TOKEN, NAMED_ENTITY
CHUNK
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> nerModel = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
>>> nerConverter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk")
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     embeddings,
...     nerModel,
...     nerConverter])
- blackList#
- caseSensitive#
- criteria#
- doExceptionHandling#
- filterValue#
- getter_attrs = []#
- greedyMode#
- ignoreStopWords#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- name = 'NerConverterInternalModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- preservePosition#
- regex#
- skipLPInputColsValidation = True#
- threshold#
- uid#
- whiteList#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setAllowList(entities: list)#
Sets the whiteList parameter value.
- Parameters:
entities (list) – The list of white-listed entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setAllowList(["PER", "LOC"])
- setBlackList(value)#
Sets the list of entities to ignore; the rest will be processed. Do not include the IOB prefix on labels.
- Parameters:
value (List[str]) – List of entities to ignore. Do not include the IOB prefix on labels.
- setCaseSensitive(value)#
Determines whether the definitions of the white listed and black listed entities are case sensitive or not.
- Parameters:
value (bool) – Whether white listed and black listed entities are case sensitive or not. Default: True.
- setCriteria(value)#
Sets how black- and white-listed values are compared with the result of the Annotation.
Possible values: ‘isin’, ‘regex’. Default: ‘isin’.
isin: filter by the chunk text.
regex: filter by matching a regex pattern.
- Parameters:
value (str) – Comparison mode for black- and white-listed values. Possible values: ‘isin’, ‘regex’. Default: ‘isin’.
- setDenyList(entities)#
Sets the blackList parameter value.
- Parameters:
entities (list) – The list of black-listed entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setDenyList(["PER", "LOC"])
- setDoExceptionHandling(value: bool)#
If true, exceptions are handled.
If exception-causing data is passed to the model, an error annotation is emitted that contains the exception message, and processing continues with the next record.
This comes with a performance penalty.
- Parameters:
value (bool) – True or False to handle exceptions.
- setFilterValue(value)#
Sets which field is filtered; possible values are ‘result’ or ‘entity’.
If the value is ‘result’, it filters by the result of the Annotation. If the value is ‘entity’, it filters by the entity field in the metadata of the Annotation.
- Parameters:
value (str) – Possible values are ‘result’ and ‘entity’.
- setForceInputTypeValidation(etfm)#
- setGreedyMode(greedy_mode: bool)#
Sets the greedyMode parameter value.
- Parameters:
greedy_mode (bool) – True or False to ignore B tags for contiguous tokens of the same entity.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setGreedyMode(True)
- setIgnoreStopWords(stopwords: list)#
Sets the ignoreStopWords parameter value.
- Parameters:
stopwords (list) – List of stop words to ignore.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setIgnoreStopWords(["a", "an", "the"])
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setPreservePosition(preserve_position: bool)#
Sets the preservePosition parameter value.
- Parameters:
preserve_position (bool) – True or False to preserve the original position of the tokens in the original document or use the modified tokens.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setPreservePosition(True)
- setRegex(value)#
Sets the list of regex patterns used to process the chunks.
- Parameters:
value (List[str]) – List of regex patterns to process the chunks.
- setThreshold(threshold: float)#
Sets the threshold parameter value.
- Parameters:
threshold (float) – Confidence threshold to filter the chunk entities.
Examples
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunk") \
...     .setThreshold(0.5)
- setWhiteList(value)#
Sets the list of entities to process; the rest will be ignored. Do not include the IOB prefix on labels.
- Parameters:
value (List[str]) – List of entities to process. Do not include the IOB prefix on labels.
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.