sparknlp_jsl.legal.token_classification.ner.legal_ner
#
Module Contents#
Classes#
LegalNerApproach | Trains generic NER model based on Neural Networks.
LegalNerModel | This Named Entity recognition annotator is a generic NER model based on Neural Networks.
- class LegalNerApproach#
Bases:
sparknlp_jsl.annotator.ner.medical_ner.MedicalNerApproach
Trains generic NER model based on Neural Networks.
The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art in most datasets.
For instantiated/pretrained models, see LegalNerModel.
The training data should be a labeled Spark Dataset, in the format of CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.
Excluding the label, this can be done with, for example:
a SentenceDetector,
a Tokenizer and
a WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings).
For extended examples of usage, see the Spark NLP Workshop.
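To make the expected training format more concrete, the following is a minimal sketch of what CoNLL 2003 IOB data looks like and how the token and label columns line up. The sample text and the helper `parse_conll` are illustrative assumptions and not part of the library; in a real pipeline the CoNLL reader from `sparknlp.training` does this work and also produces the annotation columns.

```python
# Illustrative sample in CoNLL 2003 layout: token, POS tag, chunk tag, NER label.
SAMPLE = """-DOCSTART- -X- -X- O

U.N. NNP B-NP B-ORG
official NN I-NP O
Ekeus NNP B-NP B-PER
heads VBZ B-VP O
for IN B-PP O
Baghdad NNP B-NP B-LOC
. . O O
"""


def parse_conll(text):
    """Hypothetical helper: return a list of sentences as (token, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or a document marker closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # The first column is the token, the last column is the IOB label.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
```

Each sentence becomes one list of (token, label) pairs, which mirrors the TOKEN and NAMED_ENTITY columns the approach expects.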
Input Annotation types
Output Annotation type
DOCUMENT, TOKEN, WORD_EMBEDDINGS
NAMED_ENTITY
Notes
Both DocumentAssembler and SentenceDetector annotators output the DOCUMENT annotation type. Thus, either of them can be used as the first annotator in a pipeline.
- Parameters:
labelColumn (str) – Column with label per each token.
entities – Entities to recognize.
minEpochs – Minimum number of epochs to train, by default 0.
maxEpochs – Maximum number of epochs to train, by default 70.
verbose – Level of verbosity during training, by default 2.
randomSeed – Random seed. Set to positive integer to get reproducible results, by default None.
lr – Learning Rate, by default 0.001.
po – Learning rate decay coefficient (time-based). This is used to calculate the decayed learning rate at each step as: lr = lr / (1 + po * epoch), meaning that the value of the learning rate is updated on each epoch. By default 0.005.
batchSize – Batch size, by default 8.
dropout –
Dropout coefficient, by default 0.5.
The coefficient of the dropout layer. The value should be between 0.0 and 1.0. Internally, it is used by Tensorflow as: rate = 1.0 - dropout when adding a dropout layer on top of the recurrent layers.
graphFolder –
Folder path that contains external graph files. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).
When instantiating the Tensorflow model, uses this folder to search for the adequate Tensorflow graph. The search is done using the name of the .pb file, which should be blstm_{ntags}_{embedding_dim}_{lstm_size}_{nchars}.pb. Then, the verifications on the obtained file are:
- Embedding dimension should be exactly the same as the one used to train the model.
- Number of tags should be greater than or equal to the number of tags in the training data.
- Number of chars should be greater than or equal to the number of chars in the training data.
The returned file will be the first one that satisfies all the conditions.
If the name of the file is ill-formed, errors will occur during training.
graphFile –
Path that contains the external graph file.
When specified, the provided file will be used, and no graph search will happen. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).
pretrainedModelPath – Path to an already trained LegalNerModel, which is used as a starting point for training the new model. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).
configProtoBytes – ConfigProto from tensorflow, serialized into byte array.
useContrib – whether to use contrib LSTM Cells. Not compatible with Windows. Might slightly improve accuracy. By default True.
validationSplit –
Choose the proportion of training dataset to be validated against the model on each Epoch. The value should be between 0.0 and 1.0 and by default it is 0.0 and off.
The validation dataset is randomly extracted from the training dataset before training starts. If the value is 0.0, then no validation will be performed (hold out data).
evaluationLogExtended – Whether logs for validation to be extended, by default False.
testDataset – Path to test dataset in parquet format. If set, the dataset will be used to calculate statistics on it during training.
includeConfidence –
Whether to include confidence scores in annotation metadata, by default False.
Setting this parameter to True will add the confidence score to the metadata of the NAMED_ENTITY annotation. In addition, if includeAllConfidenceScores is set to True, then the confidence scores of all the tags will be added to the metadata, otherwise only for the predicted tag (the one with maximum score).
includeAllConfidenceScores –
Whether to include confidence scores for all tags in annotation metadata or just the score of the predicted tag, by default False.
Needs the includeConfidence parameter to be set to True.
enableOutputLogs – Whether to use stdout in addition to Spark logs, by default False.
outputLogsPath – Folder path to save training logs. If no path is specified, the logs won’t be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).
enableMemoryOptimizer –
Whether to optimize for large datasets or not. Enabling this option can slow down training.
In practice, if set to True, the training will iterate over the Spark DataFrame and retrieve the batches from the DataFrame iterator. This can be slower than the default option, as the batches have to be collected again for every epoch, but it can be useful if the dataset is too large to fit in memory.
If the training data fits in memory, it is recommended to set this option to False (default value).
tagsMapping – A map specifying how old tags are mapped to new ones as a list of comma-separated entities, where the first entity is the old tag and the second entity is the new tag. For example, if the map is set to [“OLDTAG,NEWTAG”, “B-PER,B-VIP”, “I-PER, I-VIP”], then all occurrences of “OLDTAG” will be mapped to “NEWTAG”, all occurrences of “B-PER” will be mapped to “B-VIP”, and all occurrences of “I-PER” will be mapped to “I-VIP”. It only works if setOverrideExistingTags is set to False.
earlyStoppingPatience –
Number of epochs to wait before early stopping if no improvement, by default 5.
Given the earlyStoppingCriterion, if the performance does not improve for the given number of epochs, then the training will stop. If the value is 0, then early stopping will occur as soon as the criterion is met (no patience).
earlyStoppingCriterion –
If set, this param specifies the criterion to stop training if performance is not improving. Default value is 0, which means that early stopping is not used.
The criterion is set to F1-score if the validationSplit is greater than 0.0 (F1-score on validation set) or testDataset is defined (F1-score on test set), otherwise it is set to model loss. The priority is as follows:
- If testDataset is defined, then the criterion is set to F1-score on test set.
- If validationSplit is greater than 0.0, then the criterion is set to F1-score on validation set.
- Otherwise, the criterion is set to model loss.
Note that while the F1-score ranges from 0.0 to 1.0, the loss ranges from 0.0 to infinity. So, depending on which case you are using, the value you use for the criterion can be very different. For example, if validationSplit is 0.1, then a criterion of 0.01 means that if the difference in F1-score on the validation set from the last epoch is greater than 0.01, then the training should stop. However, if no validation or test set is defined, then a criterion of 2.0 means that if the loss difference between the last epoch and the current one is less than 2.0, then training should stop.
See also earlyStoppingPatience.
logPrefix – A prefix that will be appended to every log, default value is empty.
useBestModel –
Whether to restore and use the model from the epoch that has achieved the best performance at the end of the training. By default False (keep the model from the last trained epoch).
The best model depends on the earlyStoppingCriterion, which can be F1-score on test/validation dataset or the value of loss.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
First extract the prerequisites for the NerDLApproach
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = BertEmbeddings.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
Then the training can start
>>> nerTagger = MedicalNerApproach() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setMaxEpochs(1) \
...     .setRandomSeed(0) \
...     .setVerbose(0)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     nerTagger
... ])
>>> conll = CoNLL()
>>> trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> pipelineModel = pipeline.fit(trainingData)
- batchSize#
- configProtoBytes#
- datasetInfo#
- dropout#
- earlyStoppingCriterion#
- earlyStoppingPatience#
- enableMemoryOptimizer#
- enableOutputLogs#
- engine#
- entities#
- evaluationLogExtended#
- getter_attrs = []#
- graphFile#
- graphFolder#
- includeAllConfidenceScores#
- includeConfidence#
- inputAnnotatorTypes#
- inputCols#
- labelColumn#
- lazyAnnotator#
- logPrefix#
- lr#
- maxEpochs#
- minEpochs#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- outputLogsPath#
- overrideExistingTags#
- po#
- pretrainedModelPath#
- randomSeed#
- randomValidationSplitPerEpoch#
- sentenceTokenIndex#
- skipLPInputColsValidation = True#
- testDataset#
- uc#
- uid#
- useBestModel#
- useContrib#
- validationSplit#
- verbose#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
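The precedence described for extractParamMap (default param values < user-supplied values < extra) can be illustrated with plain dictionaries. This is only a sketch: the helper `merge_param_maps` is hypothetical, and string keys stand in for the Param objects that PySpark uses as keys.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Sketch of the precedence: defaults < user-supplied < extra."""
    merged = dict(defaults)          # start from the embedded defaults
    merged.update(user_supplied)     # user-supplied values override defaults
    merged.update(extra or {})       # extra values override everything else
    return merged


# Assumption: plain string keys stand in for pyspark Param objects.
defaults = {"maxEpochs": 70, "lr": 0.001, "batchSize": 8}
user_supplied = {"maxEpochs": 10}
extra = {"maxEpochs": 5, "lr": 0.01}

merged = merge_param_maps(defaults, user_supplied, extra)
```

Here `maxEpochs` is set in all three maps, so the value from `extra` is the one that survives.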
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M #
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]] #
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
- getEngine()#
- Returns:
Deep Learning engine used for this model
- Return type:
str
- getInputCols()#
Gets current column names of input annotations.
- getLabelColumn()#
Gets column for label per each token.
- Returns:
Column with label per each token
- Return type:
str
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setBatchSize(value: int)#
Sets batch size, by default 8.
- Parameters:
value (int) – Batch size
- setConfigProtoBytes(value: list)#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
value (List[str]) – ConfigProto from tensorflow, serialized into byte array
- setDatasetInfo(info: str)#
Sets descriptive information about the dataset being used.
- Parameters:
info (str) – Descriptive information about the dataset being used.
- setDoExceptionHandling(value: bool)#
If true, tries to compute batch-wise as usual. If an exception occurs within a batch, the batch is processed row-wise, and an Error Annotation is emitted for any exception that occurs during row processing. This means only the bad rows are lost instead of the entire batch. This comes with a performance penalty.
- Parameters:
value (bool) – Whether to handle exceptions as described above
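The batch-then-row fallback described for setDoExceptionHandling can be sketched in plain Python. Everything here is an illustrative assumption: `process_with_fallback`, `tag_batch` and `tag_row` are hypothetical names, and a dictionary with an "error" key stands in for the Error Annotation that the library emits.

```python
def process_with_fallback(batch, process_batch, process_row):
    """Try the whole batch first; on failure, fall back to row-wise processing."""
    try:
        return process_batch(batch)
    except Exception:
        results = []
        for row in batch:
            try:
                results.append(process_row(row))
            except Exception as err:
                # Only the failing row is replaced by an error marker.
                results.append({"error": str(err)})
        return results


def tag_row(row):
    """Hypothetical per-row step that fails on non-text input."""
    if not isinstance(row, str):
        raise ValueError("expected text")
    return row.upper()


def tag_batch(batch):
    """Hypothetical batch step: fails as a whole if any row fails."""
    return [tag_row(row) for row in batch]


results = process_with_fallback(["contract", None, "party"], tag_batch, tag_row)
```

The good rows are processed twice in the failing case, once in the batch attempt and once row by row, which is one way to see where the performance penalty comes from.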
- setDropout(value: float)#
Sets dropout coefficient, by default 0.5.
- Parameters:
value (float) – Dropout coefficient
- setEarlyStoppingCriterion(criterion: float)#
Sets early stopping criterion. A value 0 means no early stopping.
- Parameters:
criterion (float) – Early stopping criterion.
- setEarlyStoppingPatience(patience: int)#
Sets the number of epochs with no performance improvement before training is terminated.
- Parameters:
patience (int) – Early stopping patience.
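How earlyStoppingCriterion and earlyStoppingPatience interact can be sketched as follows. This is one possible reading rather than the library's implementation: it assumes that an epoch counts as "no improvement" when the metric improves by less than the criterion compared with the previous epoch, and that training stops once the number of consecutive such epochs exceeds the patience. The function `stopping_epoch` and the metric values are hypothetical.

```python
def stopping_epoch(scores, criterion, patience, higher_is_better=True):
    """Return the 1-based epoch at which training would stop, or None.

    Assumption: an epoch counts as "no improvement" when the metric improves
    by less than `criterion` compared with the previous epoch.
    """
    if criterion == 0:
        return None  # a criterion of 0 means early stopping is not used
    stalled = 0
    for epoch in range(1, len(scores)):
        delta = scores[epoch] - scores[epoch - 1]
        # For F1 a rise is an improvement; for loss a drop is an improvement.
        improvement = delta if higher_is_better else -delta
        stalled = stalled + 1 if improvement < criterion else 0
        if stalled > patience:
            return epoch + 1
    return None


# Made-up metric values, one entry per epoch.
f1_per_epoch = [0.60, 0.70, 0.705, 0.707, 0.708]
loss_per_epoch = [40.0, 25.0, 24.0, 23.5]

stop_f1 = stopping_epoch(f1_per_epoch, criterion=0.01, patience=1)
stop_loss = stopping_epoch(loss_per_epoch, criterion=2.0, patience=0,
                           higher_is_better=False)
```

The two calls use very different criterion values because F1-score lives in the range 0.0 to 1.0 while the loss is unbounded, as noted in the parameter description.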
- setEnableMemoryOptimizer(value: bool)#
Sets whether to optimize for large datasets or not, by default False. Enabling this option can slow down training.
- Parameters:
value (bool) – Whether to optimize for large datasets
- setEnableOutputLogs(value)#
Sets whether to use stdout in addition to Spark logs, by default False.
- Parameters:
value (bool) – Whether to use stdout in addition to Spark logs
- setEntities(tags)#
Sets entities to recognize.
- Parameters:
tags (List[str]) – List of entities
- setEvaluationLogExtended(v)#
Sets whether logs for validation to be extended, by default False. Displays time and evaluation of each label.
- Parameters:
v (bool) – Whether logs for validation to be extended
- setForceInputTypeValidation(etfm)#
- setGraphFile(value: str)#
Sets path that contains the external graph file. When specified, the provided file will be used, and no graph search will happen.
- Parameters:
value (str) – Path that contains the external graph file. When specified, the provided file will be used, and no graph search will happen.
- setGraphFolder(value: str)#
Sets folder path that contains external graph files.
- Parameters:
value (str) – Folder path that contains external graph files
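The graph search described for the graphFolder parameter can be approximated with a short sketch. It assumes file names of the form prefix_{ntags}_{embedding_dim}_{lstm_size}_{nchars}.pb and applies the three documented checks in order; `pick_graph`, the candidate file names, and the choice to raise on an ill-formed name are illustrative assumptions, not the library's code.

```python
import re

# Names of the form <prefix>_{ntags}_{embedding_dim}_{lstm_size}_{nchars}.pb
GRAPH_NAME = re.compile(r"^[A-Za-z]+_(\d+)_(\d+)_(\d+)_(\d+)\.pb$")


def pick_graph(file_names, ntags, embedding_dim, nchars):
    """Return the first file name that satisfies all three checks, else None."""
    for name in file_names:
        match = GRAPH_NAME.match(name)
        if match is None:
            # Assumption: surface ill-formed names early instead of mid-training.
            raise ValueError(f"ill-formed graph file name: {name}")
        g_tags, g_dim, _lstm_size, g_chars = map(int, match.groups())
        # Embedding dimension must match exactly; tags and chars may be larger.
        if g_dim == embedding_dim and g_tags >= ntags and g_chars >= nchars:
            return name
    return None


# Hypothetical folder listing.
candidates = [
    "blstm_10_100_128_100.pb",
    "blstm_20_200_128_200.pb",
    "blstm_30_200_128_300.pb",
]
chosen = pick_graph(candidates, ntags=15, embedding_dim=200, nchars=150)
```

Because the first match wins, the order of the files matters: the third candidate would also pass every check but is never reached.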
- setIncludeAllConfidenceScores(value: bool)#
Sets whether to include all confidence scores in annotation metadata or just the score of the predicted tag, by default False.
- Parameters:
value (bool) – Whether to include all confidence scores in annotation metadata or just the score of the predicted tag
- setIncludeConfidence(value: bool)#
Sets whether to include confidence scores in annotation metadata, by default False.
- Parameters:
value (bool) – Whether to include the confidence value in the output.
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLabelColumn(value)#
Sets name of column for data labels.
- Parameters:
value (str) – Column for data labels
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setLogPrefix(value: str)#
Sets the prefix of the training logs.
- Parameters:
value (str) – A prefix that will be appended to every log, default value is empty.
- setLr(value: float)#
Sets Learning Rate, by default 0.001.
- Parameters:
value (float) – Learning Rate
- setMaxEpochs(epochs)#
Sets maximum number of epochs to train.
- Parameters:
epochs (int) – Maximum number of epochs to train
- setMinEpochs(epochs)#
Sets minimum number of epochs to train.
- Parameters:
epochs (int) – Minimum number of epochs to train
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setOutputLogsPath(p)#
Sets folder path to save training logs
- Parameters:
p (str) – Folder path to save training logs
- setOverrideExistingTags(value: bool)#
Sets whether to override already learned tags when using a pretrained model to initialize the new model. Default is ‘true’
- Parameters:
value (bool) – Whether to override already learned tags when using a pretrained model to initialize the new model. Default is ‘true’
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setPo(value: float)#
Sets Learning rate decay coefficient, by default 0.005.
Real Learning Rate is lr / (1 + po * epoch).
- Parameters:
value (float) – Learning rate decay coefficient
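As a worked example of the decay formula, the sketch below evaluates lr / (1 + po * epoch) for the first few epochs using the documented defaults. It assumes that the formula is applied to the initial learning rate and that epochs are counted from 0; the function name `decayed_learning_rate` is hypothetical.

```python
def decayed_learning_rate(lr, po, epoch):
    """Time-based decay as documented: lr / (1 + po * epoch)."""
    return lr / (1 + po * epoch)


# Documented defaults: lr = 0.001, po = 0.005.
# Assumption: epochs are counted from 0 and decay starts from the initial lr.
schedule = [decayed_learning_rate(0.001, 0.005, epoch) for epoch in range(3)]
```

With these defaults the decay is gentle: a larger po shrinks the learning rate faster, while po = 0 keeps it constant.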
- setPretrainedModelPath(value: str)#
Sets location of pretrained model.
- Parameters:
value (str) – Path to an already trained MedicalNerModel, which is used as a starting point for training the new model.
- setRandomSeed(seed)#
Sets random seed for shuffling.
- Parameters:
seed (int) – Random seed for shuffling
- setRandomValidationSplitPerEpoch(value: bool)#
Do a random validation split after each epoch rather than at the beginning of training only.
- Parameters:
value (bool) – Whether to do a random validation split after each epoch.
- setSentenceTokenIndex(value: bool)#
Sets whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.
- Parameters:
value (bool) – Whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.
- setTagsMapping(value: list)#
Sets a map specifying how old tags are mapped to new ones.
The format of the map should be a list of comma-separated entities, where the first entity is the old tag and the second entity is the new tag. For example, if the map is set to [“OLDTAG,NEWTAG”, “B-PER,B-VIP”, “I-PER, I-VIP”], then all occurrences of “OLDTAG” will be mapped to “NEWTAG”, all occurrences of “B-PER” will be mapped to “B-VIP”, and all occurrences of “I-PER” will be mapped to “I-VIP”.
It only works if setOverrideExistingTags is set to False.
- Parameters:
value (List[str]) – List of comma-separated entities.
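The mapping format can be illustrated with a small sketch that parses the comma-separated entries and applies them to a list of labels. The helpers `parse_tags_mapping` and `remap` are hypothetical, and trimming whitespace around each tag is an assumption made here so that an entry such as "I-PER, I-VIP" behaves like "I-PER,I-VIP".

```python
def parse_tags_mapping(mapping):
    """Turn ["OLD,NEW", ...] into a dict, trimming stray whitespace."""
    pairs = {}
    for entry in mapping:
        old, new = (part.strip() for part in entry.split(","))
        pairs[old] = new
    return pairs


def remap(labels, mapping):
    """Apply the mapping to a list of labels."""
    lookup = parse_tags_mapping(mapping)
    # Tags without an entry in the mapping are left unchanged.
    return [lookup.get(label, label) for label in labels]


mapping = ["OLDTAG,NEWTAG", "B-PER,B-VIP", "I-PER, I-VIP"]
remapped = remap(["B-PER", "I-PER", "O", "OLDTAG"], mapping)
```

Note that B- and I- variants need separate entries, since each entry maps exactly one old tag to one new tag.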
- setTestDataset(path, read_as=ReadAs.SPARK, options={'format': 'parquet'})#
Path to a parquet file of a test dataset. If set, it is used to calculate statistics on it during training.
The parquet file must be a dataframe that has the same columns as the model that is being trained. For example, if the model needs as input DOCUMENT, TOKEN, WORD_EMBEDDINGS (Features) and NAMED_ENTITY (label) then these columns also need to be present while saving the dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe.
An example on how to create such a parquet file could be:
>>> # assuming preProcessingPipeline
>>> (train, test) = data.randomSplit([0.8, 0.2])
>>> preProcessingPipeline \
...     .fit(test) \
...     .transform(test) \
...     .write \
...     .mode("overwrite") \
...     .parquet("test_data")
>>> annotator.setTestDataset("test_data")
- Parameters:
path (str) – Path to test dataset
read_as (str, optional) – How to read the resource, by default ReadAs.SPARK
options (dict, optional) – Options for reading the resource, by default {“format”: “parquet”}
- setUseBestModel(value: bool)#
Sets whether to return the model that has achieved the best metrics across epochs.
- Parameters:
value (bool) – Whether to return the model that has achieved the best metrics across epochs.
- setUseContrib(value: bool)#
Sets whether to use contrib LSTM Cells. Not compatible with Windows. Might slightly improve accuracy.
- Parameters:
value (bool) – Whether to use contrib LSTM Cells
- Raises:
Exception – Windows not supported to use contrib
- setValidationSplit(value: float)#
Sets the proportion of training dataset to be validated against the model on each Epoch, by default it is 0.0 and off. The value should be between 0.0 and 1.0.
- Parameters:
value (float) – Proportion of training dataset to be validated
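The effect of the validation split can be sketched as a random hold-out taken once before training. This is an illustration under stated assumptions, not the library's implementation: `split_for_validation` is a hypothetical helper, integers stand in for training sentences, and the seed argument mirrors the role of randomSeed.

```python
import random


def split_for_validation(sentences, validation_split, seed=None):
    """Randomly hold out a fraction of the sentences before training."""
    if not 0.0 <= validation_split <= 1.0:
        raise ValueError("validation_split should be between 0.0 and 1.0")
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_split)
    # The first slice is held out for validation, the rest is used for training.
    return shuffled[n_validation:], shuffled[:n_validation]


# Integers stand in for training sentences.
train, validation = split_for_validation(range(100), 0.1, seed=0)
```

With a split of 0.0 the validation part is empty, which matches the documented behaviour that no validation is performed.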
- setVerbose(value)#
Sets level of verbosity during training
- Parameters:
value (int) – Level of verbosity
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class LegalNerModel(classname='com.johnsnowlabs.legal.token_classification.ner.LegalNerModel', java_model=None)#
Bases:
sparknlp_jsl.annotator.ner.medical_ner.MedicalNerModel
This Named Entity recognition annotator is a generic NER model based on Neural Networks.
Neural Network architecture is Char CNNs - BiLSTM - CRF that achieves state-of-the-art in most datasets.
This is the instantiated model of the LegalNerApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained() of the companion object:
>>> nerModel = LegalNerModel.pretrained() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
The default model is "legner_contract_doc_parties_lg", if no name is provided.
For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module, see Pipelines.
Note that some pretrained models require specific types of embeddings, depending on which they were trained on. For example, the Spark NLP model "ner_dl" requires the WordEmbeddings "glove_100d".
For extended examples of usage, see the Spark NLP Workshop.
Input Annotation types
Output Annotation type
DOCUMENT, TOKEN, WORD_EMBEDDINGS
NAMED_ENTITY
- Parameters:
batchSize – Size of every batch, by default 8
configProtoBytes – ConfigProto from tensorflow, serialized into byte array.
includeConfidence – Whether to include confidence scores in annotation metadata, by default False
includeAllConfidenceScores – Whether to include all confidence scores in annotation metadata or just the score of the predicted tag, by default False
inferenceBatchSize – Number of sentences to process in a single batch during inference
classes – Tags used to train this model
labelCasing – Sets all labels of the NER model to upper or lower case. Possible values: upper, lower
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("bert")
>>> nerTagger = MedicalNerDLModel.pretrained() \
...     .setInputCols(["sentence", "token", "bert"]) \
...     .setOutputCol("ner")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     nerTagger
... ])
>>> data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
- classes#
- configProtoBytes#
- datasetInfo#
- doExceptionHandling#
- getter_attrs = []#
- includeAllConfidenceScores#
- includeConfidence#
- inferenceBatchSize#
- inputAnnotatorTypes#
- inputCols#
- labelCasing#
- lazyAnnotator#
- name = 'LegalNerModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- sentenceTokenIndex#
- skipLPInputColsValidation = True#
- storageRef#
- trainingClassDistribution#
- uid#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getClasses()#
Returns the list of entities which are recognized
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getStorageRef()#
Gets unique reference name for identification.
- Returns:
Unique reference name for identification
- Return type:
str
- getTrainingClassDistribution()#
Gets the class counts for each of the classes during training.
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static loadSavedModel(ner_model_path, folder, spark_session)#
Load a pre-trained MedicalNerModel.
- Parameters:
ner_model_path (str) – Path to the pre-trained model.
folder (str) – Folder where the Tensorflow model is located.
spark_session (SparkSession) – SparkSession.
- Returns:
A pre-trained MedicalNerModel.
- Return type:
- static pretrained(name='legner_contract_doc_parties_lg', lang='en', remote_loc='legal/models')#
Downloads and loads a pretrained model.
- Parameters:
name (str, optional) – Name of the pretrained model, by default “legner_contract_doc_parties_lg”
lang (str, optional) – Language of the pretrained model, by default “en”
remote_loc (str, optional) – Optional remote address of the resource, by default “legal/models”.
- Returns:
The restored model
- Return type:
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setConfigProtoBytes(conf: list)#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
conf (List[str]) – ConfigProto from tensorflow, serialized into byte array
- setDatasetInfo(info: str)#
Sets descriptive information about the dataset being used.
- Parameters:
info (str) – Descriptive information about the dataset being used.
- setForceInputTypeValidation(etfm)#
- setIncludeConfidence(value: bool)#
Sets whether to include confidence scores in annotation metadata, by default False.
- Parameters:
value (bool) – Whether to include the confidence value in the output.
- setInferenceBatchSize(value: int)#
Sets number of sentences to process in a single batch during inference
- Parameters:
value (int) – number of sentences to process in a single batch during inference
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLabelCasing(value: str)#
Sets all labels of the NER model to upper or lower case. Possible values: upper, lower.
- Parameters:
value (str) – Casing to apply to all labels of the NER model, either upper or lower
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setSentenceTokenIndex(value: bool)#
Sets whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.
- Parameters:
value (bool) – Whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.
- setStorageRef(value)#
Sets unique reference name for identification.
- Parameters:
value (str) – Unique reference name for identification
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.