sparknlp_jsl.legal.token_classification.ner.legal_ner#

Module Contents#

Classes#

LegalNerApproach

Trains a generic NER model based on Neural Networks.

LegalNerModel

This Named Entity recognition annotator is a generic NER model based on Neural Networks.

class LegalNerApproach#

Bases: sparknlp_jsl.annotator.ner.medical_ner.MedicalNerApproach

Trains a generic NER model based on Neural Networks.

The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.

For instantiated/pretrained models, see LegalNerModel.

The training data should be a labeled Spark Dataset, in the format of CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.

Excluding the label, these columns can be produced with, for example:

  • a SentenceDetector,

  • a Tokenizer and

  • a WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings).

For extended examples of usage, see the Spark NLP Workshop.
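
For orientation, the CoNLL 2003 IOB layout mentioned above puts one token per line, with its label in the last column, and separates sentences with a blank line. The sketch below parses such text in plain Python. The helper name parse_conll and the sample rows are illustrative assumptions, not part of the library; in a pipeline, the CoNLL reader from sparknlp.training does this work.

```python
from typing import List, Tuple

# Illustrative sample only: token, POS tag, chunk tag, NER label.
SAMPLE = """-DOCSTART- -X- -X- O

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""


def parse_conll(text: str) -> List[Tuple[List[str], List[str]]]:
    """Split CoNLL-style text into (tokens, labels) pairs, one per sentence."""
    sentences: List[Tuple[List[str], List[str]]] = []
    tokens: List[str] = []
    labels: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or a document marker) closes the current sentence.
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])   # first column: the token
        labels.append(columns[-1])  # last column: the NER label
    if tokens:
        sentences.append((tokens, labels))
    return sentences


sentences = parse_conll(SAMPLE)
```

Each returned pair holds the tokens of one sentence and the matching labels, which is the information the label column has to carry per token.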

Input Annotation types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotation type: NAMED_ENTITY

Notes

Both DocumentAssembler and SentenceDetector output the DOCUMENT annotation type, so either can be used as the first annotator in a pipeline.

Parameters:
  • labelColumn (str) – Column with label per each token.

  • entities – Entities to recognize.

  • minEpochs – Minimum number of epochs to train, by default 0.

  • maxEpochs – Maximum number of epochs to train, by default 70.

  • verbose – Level of verbosity during training, by default 2.

  • randomSeed – Random seed. Set to positive integer to get reproducible results, by default None.

  • lr – Learning Rate, by default 0.001.

  • po – Learning rate decay coefficient (time-based). This is used to calculate the decayed learning rate at each step as: lr = lr / (1 + po * epoch), meaning that the value of the learning rate is updated on each epoch. By default 0.005.

  • batchSize – Batch size, by default 8.

  • dropout

    Dropout coefficient, by default 0.5.

    The coefficient of the dropout layer. The value should be between 0.0 and 1.0. Internally, it is used by Tensorflow as: rate = 1.0 - dropout when adding a dropout layer on top of the recurrent layers.

  • graphFolder

    Folder path that contains external graph files. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

    When instantiating the Tensorflow model, this folder is searched for a suitable Tensorflow graph. The search uses the name of the .pb file, which should be blstm_{ntags}_{embedding_dim}_{lstm_size}_{nchars}.pb. The candidate file is then checked as follows:

      • The embedding dimension should be exactly the same as the one used to train the model.

      • The number of tags should be greater than or equal to the number of tags in the training data.

      • The number of chars should be greater than or equal to the number of chars in the training data.

    The returned file will be the first one that satisfies all the conditions.

    If the name of the file is ill-formed, errors will occur during training.

  • graphFile

    Path that contains the external graph file.

    When specified, the provided file will be used, and no graph search will happen. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

  • pretrainedModelPath – Path to an already trained LegalNerModel, which is used as a starting point for training the new model. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

  • configProtoBytes – ConfigProto from tensorflow, serialized into byte array.

  • useContrib – Whether to use contrib LSTM Cells. Not compatible with Windows. Might slightly improve accuracy. By default True.

  • validationSplit

    Proportion of the training dataset to hold out for validation on each epoch. The value should be between 0.0 and 1.0; by default it is 0.0, which turns validation off.

    The validation dataset is randomly extracted from the training dataset before training starts. If the value is 0.0, no data is held out and no validation is performed.

  • evaluationLogExtended – Whether to extend the validation logs, by default False.

  • testDataset – Path to test dataset in parquet format. If set, the dataset will be used to calculate statistics during training.

  • includeConfidence

    Whether to include confidence scores in annotation metadata, by default False.

    Setting this parameter to True will add the confidence score to the metadata of the NAMED_ENTITY annotation. In addition, if includeAllConfidenceScores is set to True, then the confidence scores of all the tags will be added to the metadata, otherwise only for the predicted tag (the one with maximum score).

  • includeAllConfidenceScores

    Whether to include confidence scores for all tags in annotation metadata or just the score of the predicted tag, by default False.

    Needs the includeConfidence parameter to be set to True.

  • enableOutputLogs – Whether to use stdout in addition to Spark logs, by default False.

  • outputLogsPath – Folder path to save training logs. If no path is specified, the logs won’t be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

  • enableMemoryOptimizer

    Whether to optimize for large datasets or not. Enabling this option can slow down training.

    In practice, if set to True, training iterates over the Spark DataFrame and retrieves the batches from the DataFrame iterator. This can be slower than the default option, as the batches have to be collected for every batch in every epoch, but it can be useful if the dataset is too large to fit in memory.

    If the training data fits in memory, it is recommended to set this option to False (the default value).

  • tagsMapping – A map specifying how old tags are mapped to new ones as a list of comma-separated entities, where the first entity is the old tag and the second entity is the new tag. For example, if the map is set to [“OLDTAG,NEWTAG”, “B-PER,B-VIP”, “I-PER, I-VIP”], then all occurrences of “OLDTAG” will be mapped to “NEWTAG”, all occurrences of “B-PER” will be mapped to “B-VIP”, and all occurrences of “I-PER” will be mapped to “I-VIP”. It only works if setOverrideExistingTags is set to False.

  • earlyStoppingPatience

    Number of epochs to wait before early stopping if no improvement, by default 5.

    Given the earlyStoppingCriterion, if the performance does not improve for the given number of epochs, then the training will stop. If the value is 0, then early stopping will occur as soon as the criterion is met (no patience).

  • earlyStoppingCriterion

    If set, this param specifies the criterion to stop training if performance is not improving. The default value is 0, which means that early stopping is not used.

    The criterion is set to F1-score if validationSplit is greater than 0.0 (F1-score on the validation set) or testDataset is defined (F1-score on the test set); otherwise it is set to model loss. The priority is as follows:

      • If testDataset is defined, then the criterion is set to F1-score on the test set.

      • If validationSplit is greater than 0.0, then the criterion is set to F1-score on the validation set.

      • Otherwise, the criterion is set to model loss.

    Note that while the F1-score ranges from 0.0 to 1.0, the loss ranges from 0.0 to infinity, so a suitable value for the criterion depends heavily on which case applies. For example, if validationSplit is 0.1, then a criterion of 0.01 means that if the difference in F1-score on the validation set from the last epoch is greater than 0.01, then the training should stop. However, if no validation or test set is defined, then a criterion of 2.0 means that if the loss difference between the last epoch and the current one is less than 2.0, then training should stop.

    See also earlyStoppingPatience.

  • logPrefix – A prefix that will be appended to every log, default value is empty.

  • useBestModel

    Whether to restore and use the model from the epoch that has achieved the best performance at the end of the training. By default False (keep the model from the last trained epoch).

    The best model depends on the earlyStoppingCriterion, which can be F1-score on test/validation dataset or the value of loss.
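
The priority described for earlyStoppingCriterion can be written as a small decision function. This is only a sketch of the documented rule; the function name and the returned strings are hypothetical and do not exist in the library.

```python
from typing import Optional


def early_stopping_metric(test_dataset: Optional[str], validation_split: float) -> str:
    """Return which quantity the early stopping criterion is compared against."""
    if test_dataset is not None:
        # A test dataset takes priority over a validation split.
        return "f1_test"
    if validation_split > 0.0:
        return "f1_validation"
    # Neither is available, so only the training loss can be monitored.
    return "loss"
```

For instance, setting both a test dataset and a validation split would, under this reading, monitor the F1-score on the test set.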

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline

First, define the prerequisite stages for the NER approach:

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = BertEmbeddings.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")

Then the training can start

>>> from sparknlp_jsl.legal.token_classification.ner.legal_ner import LegalNerApproach
>>> nerTagger = LegalNerApproach() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setMaxEpochs(1) \
...     .setRandomSeed(0) \
...     .setVerbose(0)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     nerTagger
... ])
>>> conll = CoNLL()
>>> trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> pipelineModel = pipeline.fit(trainingData)
batchSize#
configProtoBytes#
datasetInfo#
dropout#
earlyStoppingCriterion#
earlyStoppingPatience#
enableMemoryOptimizer#
enableOutputLogs#
engine#
entities#
evaluationLogExtended#
getter_attrs = []#
graphFile#
graphFolder#
includeAllConfidenceScores#
includeConfidence#
inputAnnotatorTypes#
inputCols#
labelColumn#
lazyAnnotator#
logPrefix#
lr#
maxEpochs#
minEpochs#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
outputLogsPath#
overrideExistingTags#
po#
pretrainedModelPath#
randomSeed#
randomValidationSplitPerEpoch#
sentenceTokenIndex#
skipLPInputColsValidation = True#
testDataset#
uc#
uid#
useBestModel#
useContrib#
validationSplit#
verbose#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict
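
The merge ordering above (default param values < user-supplied values < extra) behaves like successive dictionary updates. The sketch below uses plain strings as keys for readability, which is an assumption: real param maps are keyed by Param objects, and merge_param_maps is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def merge_param_maps(
    defaults: Dict[str, Any],
    user_supplied: Dict[str, Any],
    extra: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge three maps so that later sources win on conflicts."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged


defaults = {"maxEpochs": 70, "lr": 0.001, "batchSize": 8}
user_supplied = {"maxEpochs": 10}
extra = {"lr": 0.003, "maxEpochs": 5}
merged = merge_param_maps(defaults, user_supplied, extra)
```

Here maxEpochs and lr come from extra, while batchSize keeps its default because nothing overrides it.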

fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M#
fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]]#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • paramMaps (collections.abc.Sequence) – A Sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator
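
Because the index values may arrive out of order, a caller typically places each model at its index instead of appending. The sketch below shows that pattern with a stand-in iterator; collect_models and the string "models" are hypothetical placeholders for real fitted models.

```python
from typing import Any, Iterable, List, Optional, Tuple


def collect_models(model_iterator: Iterable[Tuple[int, Any]], n_param_maps: int) -> List[Optional[Any]]:
    """Place each fitted model at the position of the param map that produced it."""
    models: List[Optional[Any]] = [None] * n_param_maps
    for index, model in model_iterator:
        models[index] = model
    return models


# Stand-in for the iterator returned by fitMultiple, yielding out of order.
fake_iterator = iter([(2, "model_c"), (0, "model_a"), (1, "model_b")])
models = collect_models(fake_iterator, 3)
```

After the loop, models[i] corresponds to paramMaps[i] regardless of the order in which fitting finished.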

getEngine()#
Returns:

Deep Learning engine used for this model

Return type:

str

getInputCols()#

Gets current column names of input annotations.

getLabelColumn()#

Gets column for label per each token.

Returns:

Column with label per each token

Return type:

str

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setBatchSize(value: int)#

Sets batch size, by default 8.

Parameters:

value (int) – Batch size

setConfigProtoBytes(value: list)#

Sets configProto from tensorflow, serialized into byte array.

Parameters:

value (List[str]) – ConfigProto from tensorflow, serialized into byte array

setDatasetInfo(info: str)#

Sets descriptive information about the dataset being used.

Parameters:

info (str) – Descriptive information about the dataset being used.

setDoExceptionHandling(value: bool)#

If True, tries to compute batch-wise as usual. If an exception occurs within a batch, the batch is processed row-wise, and an Error annotation is emitted for any exception that occurs during row processing. This means only the bad rows are lost instead of the entire batch. This comes with a performance penalty.

Parameters:

value (bool) – Whether to enable exception handling

setDropout(value: float)#

Sets dropout coefficient, by default 0.5.

Parameters:

value (float) – Dropout coefficient

setEarlyStoppingCriterion(criterion: float)#

Sets early stopping criterion. A value 0 means no early stopping.

Parameters:

criterion (float) – Early stopping criterion.

setEarlyStoppingPatience(patience: int)#

Sets the number of epochs with no performance improvement before training is terminated.

Parameters:

patience (int) – Early stopping patience.
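
One way to picture how the criterion and the patience interact is the loss-based case, where training stops once the loss decrease between consecutive epochs stays below the criterion. The sketch below is a plain Python illustration under stated assumptions: the helper stopping_epoch is hypothetical, the counter is assumed to reset after a sufficiently large improvement, and stopping is assumed to happen once the number of stale epochs exceeds the patience. The library's internal logic may differ.

```python
from typing import List, Optional


def stopping_epoch(losses: List[float], criterion: float, patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    if criterion == 0:
        return None  # a criterion of 0 means early stopping is not used
    stale_epochs = 0
    for i in range(1, len(losses)):
        improvement = losses[i - 1] - losses[i]  # loss decrease since the last epoch
        if improvement < criterion:
            stale_epochs += 1
            if stale_epochs > patience:
                return i + 1
        else:
            stale_epochs = 0  # assumption: a good epoch resets the counter
    return None


losses = [50.0, 40.0, 39.0, 38.5, 38.4]
```

With a criterion of 2.0, the loss stops improving enough from the third epoch on, so a patience of 0 stops there and a patience of 1 allows one more epoch.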

setEnableMemoryOptimizer(value: bool)#

Sets whether to optimize for large datasets or not, by default False. Enabling this option can slow down training.

Parameters:

value (bool) – Whether to optimize for large datasets
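
The trade-off behind this option is the usual one between materialising all rows up front and pulling batches lazily from an iterator. The sketch below shows the lazy variant in plain Python, not Spark; iter_batches and row_source are hypothetical names used only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size rows without loading all rows at once."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def row_source(n: int) -> Iterator[str]:
    """Stand-in for a large dataset that is produced row by row."""
    for i in range(n):
        yield f"sentence-{i}"


batch_sizes = [len(batch) for batch in iter_batches(row_source(20), 8)]
```

Only one batch is held in memory at a time, at the cost of walking the source again on every epoch.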

setEnableOutputLogs(value)#

Sets whether to use stdout in addition to Spark logs, by default False.

Parameters:

value (bool) – Whether to use stdout in addition to Spark logs

setEntities(tags)#

Sets entities to recognize.

Parameters:

tags (List[str]) – List of entities

setEvaluationLogExtended(v)#

Sets whether to extend the validation logs, by default False. Extended logs display the time and evaluation of each label.

Parameters:

v (bool) – Whether to extend the validation logs

setForceInputTypeValidation(etfm)#
setGraphFile(value: str)#

Sets path that contains the external graph file. When specified, the provided file will be used, and no graph search will happen.

Parameters:

value (str) – Path that contains the external graph file. When specified, the provided file will be used, and no graph search will happen.

setGraphFolder(value: str)#

Sets folder path that contains external graph files.

Parameters:

value (str) – Folder path that contains external graph files
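
The graph search described for the graphFolder parameter can be approximated as follows: parse the four numbers from each file name, then return the first file whose embedding dimension matches exactly and whose tag and char counts are large enough. This is a sketch, not the library's code. The helper find_graph and the sample file names are made up, candidates are assumed to be checked in the order given, and ill-formed names are simply skipped here, whereas the documentation states that they cause errors during training.

```python
import re
from typing import Iterable, Optional

# <prefix>_{ntags}_{embedding_dim}_{lstm_size}_{nchars}.pb
GRAPH_NAME = re.compile(r"^[A-Za-z-]+_(\d+)_(\d+)_(\d+)_(\d+)\.pb$")


def find_graph(file_names: Iterable[str], n_tags: int, embedding_dim: int, n_chars: int) -> Optional[str]:
    """Return the first graph file name that satisfies all three conditions."""
    for name in file_names:
        match = GRAPH_NAME.match(name)
        if match is None:
            continue  # assumption: the sketch ignores ill-formed names
        tags, dim, _lstm_size, chars = (int(group) for group in match.groups())
        if dim == embedding_dim and tags >= n_tags and chars >= n_chars:
            return name
    return None


files = ["blstm_5_100_128_80.pb", "blstm_10_200_128_100.pb", "blstm_12_200_128_120.pb"]
```

For training data with 9 tags, 200-dimensional embeddings and 95 distinct chars, the second file is the first one that qualifies.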

setIncludeAllConfidenceScores(value: bool)#

Sets whether to include all confidence scores in annotation metadata or just the score of the predicted tag, by default False.

Parameters:

value (bool) – Whether to include all confidence scores in annotation metadata or just the score of the predicted tag

setIncludeConfidence(value: bool)#

Sets whether to include confidence scores in annotation metadata, by default False.

Parameters:

value (bool) – Whether to include the confidence value in the output.

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLabelColumn(value)#

Sets name of column for data labels.

Parameters:

value (str) – Column for data labels

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setLogPrefix(value: str)#

Sets the prefix of the training logs.

Parameters:

value (str) – A prefix that will be appended to every log, default value is empty.

setLr(value: float)#

Sets Learning Rate, by default 0.001.

Parameters:

value (float) – Learning Rate

setMaxEpochs(epochs)#

Sets maximum number of epochs to train.

Parameters:

epochs (int) – Maximum number of epochs to train

setMinEpochs(epochs)#

Sets minimum number of epochs to train.

Parameters:

epochs (int) – Minimum number of epochs to train

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setOutputLogsPath(p)#

Sets folder path to save training logs

Parameters:

p (str) – Folder path to save training logs

setOverrideExistingTags(value: bool)#

Sets whether to override already learned tags when using a pretrained model to initialize the new model. By default True.

Parameters:

value (bool) – Whether to override already learned tags when using a pretrained model to initialize the new model. By default True.

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setPo(value: float)#

Sets Learning rate decay coefficient, by default 0.005.

The effective learning rate is lr / (1 + po * epoch).

Parameters:

value (float) – Learning rate decay coefficient
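
To get a feel for the decay, the formula can be evaluated directly. The sketch assumes that lr in the formula is the initial learning rate and that epoch starts at 0; decayed_lr is a hypothetical helper.

```python
def decayed_lr(lr: float, po: float, epoch: int) -> float:
    """Time-based decay: lr / (1 + po * epoch)."""
    return lr / (1.0 + po * epoch)


# Learning rate every 10 epochs with the default lr and po.
schedule = [decayed_lr(0.001, 0.005, epoch) for epoch in range(0, 70, 10)]
```

With the default po of 0.005, the decay is gentle: the rate only halves after 200 epochs, since 1 + 0.005 * 200 = 2.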

setPretrainedModelPath(value: str)#

Sets location of pretrained model.

Parameters:

value (str) – Path to an already trained MedicalNerModel, which is used as a starting point for training the new model.

setRandomSeed(seed)#

Sets random seed for shuffling.

Parameters:

seed (int) – Random seed for shuffling

setRandomValidationSplitPerEpoch(value: bool)#

Do a random validation split after each epoch rather than at the beginning of training only.

Parameters:

value (bool) – Whether to do a random validation split after each epoch.

setSentenceTokenIndex(value: bool)#
Sets whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.

Parameters:

value (bool) – Whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.

setTagsMapping(value: list)#

Sets a map specifying how old tags are mapped to new ones.

The format of the map should be a list of comma-separated entities, where the first entity is the old tag and the second entity is the new tag. For example, if the map is set to [“OLDTAG,NEWTAG”, “B-PER,B-VIP”, “I-PER, I-VIP”], then all occurrences of “OLDTAG” will be mapped to “NEWTAG”, all occurrences of “B-PER” will be mapped to “B-VIP”, and all occurrences of “I-PER” will be mapped to “I-VIP”.

It only works if setOverrideExistingTags is set to False.

Parameters:

value (List[str]) – List of comma-separated entities.
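
The list-of-pairs format can be illustrated by turning it into a dictionary and applying it to a tag sequence. This is a sketch of the format only: parse_tags_mapping and remap are hypothetical helpers, and stripping the whitespace around each tag (as in "I-PER, I-VIP") is an assumption about how such entries are read.

```python
from typing import Dict, List


def parse_tags_mapping(entries: List[str]) -> Dict[str, str]:
    """Turn ["OLD,NEW", ...] into {"OLD": "NEW", ...}."""
    mapping: Dict[str, str] = {}
    for entry in entries:
        old_tag, new_tag = (part.strip() for part in entry.split(","))
        mapping[old_tag] = new_tag
    return mapping


def remap(tags: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace mapped tags and leave all other tags unchanged."""
    return [mapping.get(tag, tag) for tag in tags]


mapping = parse_tags_mapping(["OLDTAG,NEWTAG", "B-PER,B-VIP", "I-PER, I-VIP"])
remapped = remap(["B-PER", "I-PER", "O", "OLDTAG"], mapping)
```

Tags without an entry in the mapping, such as O, pass through untouched.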

setTestDataset(path, read_as=ReadAs.SPARK, options={'format': 'parquet'})#

Path to a parquet file of a test dataset. If set, it is used to calculate statistics on it during training.

The parquet file must be a dataframe that has the same columns as the model that is being trained. For example, if the model needs as input DOCUMENT, TOKEN, WORD_EMBEDDINGS (Features) and NAMED_ENTITY (label) then these columns also need to be present while saving the dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe.

An example on how to create such a parquet file could be:

>>> # assuming preProcessingPipeline
>>> (train, test) = data.randomSplit([0.8, 0.2])
>>> preProcessingPipeline \
...     .fit(test) \
...     .transform(test) \
...     .write \
...     .mode("overwrite") \
...     .parquet("test_data")
>>> annotator.setTestDataset("test_data")
Parameters:
  • path (str) – Path to test dataset

  • read_as (str, optional) – How to read the resource, by default ReadAs.SPARK

  • options (dict, optional) – Options for reading the resource, by default {“format”: “parquet”}

setUseBestModel(value: bool)#

Sets whether to return the model that has achieved the best metrics across epochs.

Parameters:

value (bool) – Whether to return the model that has achieved the best metrics across epochs.

setUseContrib(value: bool)#

Sets whether to use contrib LSTM Cells. Not compatible with Windows. Might slightly improve accuracy.

Parameters:

value (bool) – Whether to use contrib LSTM Cells

Raises:

Exception – Windows not supported to use contrib

setValidationSplit(value: float)#

Sets the proportion of training dataset to be validated against the model on each Epoch, by default it is 0.0 and off. The value should be between 0.0 and 1.0.

Parameters:

value (float) – Proportion of training dataset to be validated
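
Conceptually, a validation split of p sets aside a random fraction p of the training rows before training starts. The sketch below shows this in plain Python; split_for_validation is a hypothetical helper, and the shuffle-then-slice strategy is an assumption about how such a hold-out can be drawn, not a description of the library's implementation.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_for_validation(rows: Sequence[T], validation_split: float, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Return (training rows, validation rows) for the given proportion."""
    if not 0.0 <= validation_split <= 1.0:
        raise ValueError("validation_split should be between 0.0 and 1.0")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = int(len(shuffled) * validation_split)
    return shuffled[n_validation:], shuffled[:n_validation]


train, validation = split_for_validation(range(100), 0.2)
```

With a value of 0.0 the validation part is empty and all rows are used for training.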

setVerbose(value)#

Sets level of verbosity during training

Parameters:

value (int) – Level of verbosity

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.

class LegalNerModel(classname='com.johnsnowlabs.legal.token_classification.ner.LegalNerModel', java_model=None)#

Bases: sparknlp_jsl.annotator.ner.medical_ner.MedicalNerModel

This Named Entity recognition annotator is a generic NER model based on Neural Networks.

The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.

This is the instantiated model of the LegalNerApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> from sparknlp_jsl.legal.token_classification.ner.legal_ner import LegalNerModel
>>> nerModel = LegalNerModel.pretrained() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")

The default model is "legner_contract_doc_parties_lg", if no name is provided.

For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module, see Pipelines.

Note that some pretrained models require specific types of embeddings, depending on which they were trained on. Check the model's entry in the Models Hub for the embeddings it expects.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotation type: NAMED_ENTITY

Parameters:
  • batchSize – Size of every batch, by default 8

  • configProtoBytes – ConfigProto from tensorflow, serialized into byte array.

  • includeConfidence – Whether to include confidence scores in annotation metadata, by default False

  • includeAllConfidenceScores – Whether to include all confidence scores in annotation metadata or just the score of the predicted tag, by default False

  • inferenceBatchSize – Number of sentences to process in a single batch during inference

  • classes – Tags used to train this model

  • labelCasing – Casing to apply to all labels of the NER model. Possible values: upper, lower.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("bert")
>>> from sparknlp_jsl.legal.token_classification.ner.legal_ner import LegalNerModel
>>> nerTagger = LegalNerModel.pretrained() \
...     .setInputCols(["sentence", "token", "bert"]) \
...     .setOutputCol("ner")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     nerTagger
... ])
>>> data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
classes#
configProtoBytes#
datasetInfo#
doExceptionHandling#
getter_attrs = []#
includeAllConfidenceScores#
includeConfidence#
inferenceBatchSize#
inputAnnotatorTypes#
inputCols#
labelCasing#
lazyAnnotator#
name = 'LegalNerModel'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
sentenceTokenIndex#
skipLPInputColsValidation = True#
storageRef#
trainingClassDistribution#
uid#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getClasses()#

Returns the list of entities which are recognized

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

getStorageRef()#

Gets unique reference name for identification.

Returns:

Unique reference name for identification

Return type:

str

getTrainingClassDistribution()#

Gets the class counts for each of the classes during training.

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static loadSavedModel(ner_model_path, folder, spark_session)#

Load a pre-trained MedicalNerModel.

Parameters:
  • ner_model_path (str) – Path to the pre-trained model.

  • folder (str) – Folder where the Tensorflow model is located.

  • spark_session (SparkSession) – SparkSession.

Returns:

A pre-trained MedicalNerModel.

Return type:

MedicalNerModel

static pretrained(name='legner_contract_doc_parties_lg', lang='en', remote_loc='legal/models')#

Downloads and loads a pretrained model.

Parameters:
  • name (str, optional) – Name of the pretrained model, by default “legner_contract_doc_parties_lg”

  • lang (str, optional) – Language of the pretrained model, by default “en”

  • remote_loc (str, optional) – Optional remote address of the resource, by default “legal/models”. If None, Spark NLP’s repositories are used.

Returns:

The restored model

Return type:

LegalNerModel

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setConfigProtoBytes(conf: list)#

Sets configProto from tensorflow, serialized into byte array.

Parameters:

conf (List[str]) – ConfigProto from tensorflow, serialized into byte array

setDatasetInfo(info: str)#

Sets descriptive information about the dataset being used.

Parameters:

info (str) – Descriptive information about the dataset being used.

setForceInputTypeValidation(etfm)#
setIncludeConfidence(value: bool)#

Sets whether to include confidence scores in annotation metadata, by default False.

Parameters:

value (bool) – Whether to include the confidence value in the output.

setInferenceBatchSize(value: int)#

Sets number of sentences to process in a single batch during inference

Parameters:

value (int) – number of sentences to process in a single batch during inference

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLabelCasing(value: str)#

Sets the casing to apply to all labels of the NER model. Possible values: upper, lower.

Parameters:

value (str) – Casing to apply to all labels of the NER model. Possible values: upper, lower.

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setSentenceTokenIndex(value: bool)#

Sets whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.

Parameters:

value (bool) – Whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.

setStorageRef(value)#

Sets unique reference name for identification.

Parameters:

value (str) – Unique reference name for identification

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.