sparknlp_jsl.finance.chunk_classification.resolution.sentence_entity_resolver#
Module Contents#
Classes#
SentenceEntityResolverApproach – Contains all the parameters and methods to train a SentenceEntityResolverModel.
SentenceEntityResolverModel – Contains all the parameters and methods to use a trained SentenceEntityResolverModel.
- class SentenceEntityResolverApproach#
Bases:
sparknlp_jsl.annotator.resolution.sentence_entity_resolver.SentenceEntityResolverApproach
Contains all the parameters and methods to train a SentenceEntityResolverModel.
The model transforms a dataset with Input Annotation type SENTENCE_EMBEDDINGS, coming from e.g. [BertSentenceEmbeddings](/docs/en/transformers#bertsentenceembeddings), and returns the normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED, etc.).
Input Annotation types: SENTENCE_EMBEDDINGS
Output Annotation type: ENTITY
- Parameters:
labelCol – Column name for the value we are trying to resolve
normalizedCol – Column name for the original, normalized description
pretrainedModelPath – Path to an already trained SentenceEntityResolverModel, which is used as a starting point for training the new model. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).
overrideExistingCodes – Whether to override the codes present in a pretrained model with new codes when continuing training from that model. Default value is False (keep all the codes).
returnCosineDistances – Whether to extract cosine distances. True or False
aux_label_col – Auxiliary label which maps resolved entities to additional labels
useAuxLabel – Whether to use the auxiliary label column
dropCodesList – A list of codes in a pretrained model that will be omitted when the training process begins with a pretrained model
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> sentenceDetector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
>>> tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
>>> bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("embeddings")
>>> snomedTrainingPipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     bertEmbeddings,
... ])
>>> snomedTrainingModel = snomedTrainingPipeline.fit(data)
>>> snomedData = snomedTrainingModel.transform(data).cache()
>>> bertExtractor = SentenceEntityResolverApproach() \
...     .setNeighbours(25) \
...     .setThreshold(1000) \
...     .setInputCols(["embeddings"]) \
...     .setNormalizedCol("normalized_text") \
...     .setLabelCol("label") \
...     .setOutputCol("snomed_code") \
...     .setDistanceFunction("EUCLIDEAN") \
...     .setCaseSensitive(False)
>>> snomedModel = bertExtractor.fit(snomedData)
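To continue training from an existing resolver, the pretrainedModelPath, overrideExistingCodes, and dropCodesList parameters described above can be combined. A minimal, illustrative sketch (the model path and dropped codes are hypothetical placeholders; snomedData is the DataFrame prepared in the example above):
>>> continuedExtractor = SentenceEntityResolverApproach() \
...     .setInputCols(["embeddings"]) \
...     .setLabelCol("label") \
...     .setNormalizedCol("normalized_text") \
...     .setOutputCol("snomed_code") \
...     .setDistanceFunction("EUCLIDEAN") \
...     .setPretrainedModelPath("/models/base_resolver") \
...     .setOverrideExistingCodes(False) \
...     .setDropCodesList(["12345", "67890"])
>>> continuedModel = continuedExtractor.fit(snomedData)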
- aux_label_col#
- caseSensitive#
- confidenceFunction#
- datasetInfo#
- distanceFunction#
- doExceptionHandling#
- dropCodesList#
- engine#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- labelCol#
- lazyAnnotator#
- missAsEmpty#
- neighbours#
- normalizedCol#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'entity'#
- outputCol#
- overrideExistingCodes#
- pretrainedModelPath#
- returnCosineDistances#
- returnResolvedTextEmbeddings#
- skipLPInputColsValidation = True#
- threshold#
- uid = ''#
- useAuxLabel#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
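For example, under the ordering above an extra value overrides a user-supplied one (a small sketch assuming a freshly constructed approach):
>>> resolver = SentenceEntityResolverApproach().setNeighbours(25)
>>> pmap = resolver.extractParamMap({resolver.neighbours: 10})
>>> pmap[resolver.neighbours]  # the extra value (10) wins over the user-supplied 25
10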
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M #
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
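A short sketch of fitting with a list of param maps, assuming bertExtractor and snomedData from the class example above; the threshold values are arbitrary:
>>> paramMaps = [
...     {bertExtractor.threshold: 500.0},
...     {bertExtractor.threshold: 1000.0},
... ]
>>> models = bertExtractor.fit(snomedData, paramMaps)  # one fitted model per param map
>>> len(models)
2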
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]] #
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
- getCaseSensitive()#
Gets whether to ignore case in tokens for embeddings matching.
- Returns:
Whether to ignore case in tokens for embeddings matching
- Return type:
bool
- getEngine()#
- Returns:
Deep Learning engine used for this model
- Return type:
str
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setAuxLabelCol(name: str)#
Sets auxiliary label which maps resolved entities to additional labels
- Parameters:
name (str) – Auxiliary label which maps resolved entities to additional labels
- setCaseSensitive(value)#
Sets whether to ignore case in tokens for embeddings matching.
- Parameters:
value (bool) – Whether to ignore case in tokens for embeddings matching
- setConfidenceFunction(conf_function: str)#
Sets the function used to calculate confidence: either 'INVERSE' or 'SOFTMAX'.
- Parameters:
conf_function (str) – The function to use to calculate confidence: either 'INVERSE' or 'SOFTMAX'.
- setDatasetInfo(info: str)#
Sets descriptive information about the dataset being used.
- Parameters:
info (str) – Descriptive information about the dataset being used.
- setDistanceFunction(dist: str)#
Sets distance function to use for Word Mover’s Distance (WMD): ‘EUCLIDEAN’ or ‘COSINE’.
- Parameters:
dist (str) – Value that selects what distance function to use for WMD: ‘EUCLIDEAN’ or ‘COSINE’.
- setDoExceptionHandling(value: bool)#
If True, exceptions are handled. If exception-causing data is passed to the model, an error annotation is emitted that contains the exception message, and processing continues with the next record. This comes with a performance penalty.
- Parameters:
value (bool) – If True, exceptions are handled.
- setDropCodesList(value: list)#
Sets the list of codes to leave out when continuing training with new data.
- Parameters:
value (list) – List of codes in a pretrained model to leave out when continuing training with new data.
- setExtractCosineDistances(name: bool)#
Sets whether to extract cosine distances. True or False.
- Parameters:
name (bool) – Whether to extract cosine distances. True or False
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLabelCol(name: str)#
Sets column name for the value we are trying to resolve
- Parameters:
name (str) – Column name for the value we are trying to resolve
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMissAsEmpty(value: bool)#
Sets whether or not to return an empty annotation on unmatched chunks.
- Parameters:
value (bool) – whether or not to return an empty annotation on unmatched chunks.
- setNeighbours(k: int)#
Sets the number of neighbours to consider in the KNN algorithm.
- Parameters:
k (int) – Number of neighbours to consider in the KNN query to calculate Word Mover’s Distance (WMD).
- setNormalizedCol(name: str)#
Sets column name for the original, normalized description
- Parameters:
name (str) – Column name for the original, normalized description
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setOverrideExistingCodes(value: bool)#
Sets whether to override the existing codes with new data while continuing training from a pretrained model.
- Parameters:
value (bool) – Whether to override the existing codes with new data while continuing training from a pretrained model. Default value is False (keep all the codes).
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setPretrainedModelPath(path: str)#
Sets path to an already trained SentenceEntityResolverModel.
- Parameters:
path (str) – Path to an already trained SentenceEntityResolverModel, which is used as a starting point for training the new model.
- setReturnResolvedTextEmbeddings(value: bool)#
Sets whether to include embeddings for resolved text candidates.
If set to True, embeddings will be included; if set to False, embeddings will be excluded. (Default: False)
- Parameters:
value (bool) – Whether to include embeddings for resolved text candidates.
- setThreshold(thres: float)#
Sets the threshold value for the last distance calculated.
- Parameters:
thres (float) – Threshold value for the last distance calculated.
- setUseAuxLabel(name: bool)#
Sets whether to use the auxiliary label column.
- Parameters:
name (bool) – Whether to use the auxiliary label column.
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class SentenceEntityResolverModel(classname='com.johnsnowlabs.finance.chunk_classification.resolution.SentenceEntityResolverModel', java_model=None)#
Bases:
sparknlp_jsl.annotator.resolution.sentence_entity_resolver.SentenceEntityResolverModel
Contains all the parameters and methods to use a trained SentenceEntityResolverModel.
The model transforms a dataset with Input Annotation type SENTENCE_EMBEDDINGS, coming from e.g. [BertSentenceEmbeddings](/docs/en/transformers#bertsentenceembeddings), and returns the normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED, etc.).
Input Annotation types: SENTENCE_EMBEDDINGS
Output Annotation type: ENTITY
- Parameters:
returnCosineDistances – Whether to extract cosine distances. True or False
aux_label_col – Auxiliary label which maps resolved entities to additional labels
useAuxLabel – Whether to use the auxiliary label column
searchTree – Search tree for resolution
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> sentenceDetector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
>>> tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
>>> bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("embeddings")
>>> snomedTrainingPipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     bertEmbeddings,
... ])
>>> snomedTrainingModel = snomedTrainingPipeline.fit(data)
>>> snomedData = snomedTrainingModel.transform(data).cache()
>>> bertExtractor = SentenceEntityResolverApproach() \
...     .setNeighbours(25) \
...     .setThreshold(1000) \
...     .setInputCols(["embeddings"]) \
...     .setNormalizedCol("normalized_text") \
...     .setLabelCol("label") \
...     .setOutputCol("snomed_code") \
...     .setDistanceFunction("EUCLIDEAN") \
...     .setCaseSensitive(False)
>>> snomedModel = bertExtractor.fit(snomedData)
- dimension#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- name = 'SentenceEntityResolverModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'entity'#
- outputCol#
- returnCosineDistances#
- searchTree#
- skipLPInputColsValidation = True#
- uid = ''#
- useAuxLabel#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getDimension()#
Gets embeddings dimension.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='finel_edgar_company_name', lang='en', remote_loc='finance/models')#
Download a pre-trained SentenceEntityResolverModel.
- Parameters:
name (str) – Name of the pre-trained model, by default “finel_edgar_company_name”
lang (str) – Language of the pre-trained model, by default “en”
remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.
- Returns:
A pre-trained SentenceEntityResolverModel.
- Return type:
SentenceEntityResolverModel
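A hedged usage sketch: downloading the default pretrained resolver and placing it after a sentence-embedding stage. The embedding model must match the one the resolver was trained with; the stage and column names below are carried over from the examples above for illustration only.
>>> companyResolver = SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models") \
...     .setInputCols(["embeddings"]) \
...     .setOutputCol("normalized_company")
>>> resolutionPipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     bertEmbeddings,
...     companyResolver,
... ])
>>> resolvedData = resolutionPipeline.fit(data).transform(data)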
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setDimension(value)#
Sets embeddings dimension.
- Parameters:
value (int) – Embeddings dimension
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setSearchTree(search_tree)#
Sets the search tree to use.
Usually the user does not need to set this parameter, as an internal encapsulated search tree based on KDTree is used.
- Parameters:
search_tree (SerializableKDTree) – The search tree to use. Should be an instance of SerializableKDTree.
- setUseAuxLabel(name: bool)#
Sets whether to use the auxiliary label column.
- Parameters:
name (bool) – Whether to use the auxiliary label column.
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
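A brief sketch of applying a fitted resolver directly, assuming the snomedModel and snomedData objects from the training example above; the output column name follows that example:
>>> resolved = snomedModel.transform(snomedData)
>>> resolved.select("snomed_code.result").show(5, truncate=False)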
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.