sparknlp_jsl.annotator.resolution.sentence_entity_resolver#

Module Contents#

Classes#

SentenceEntityResolverApproach

Trains a SentenceEntityResolverModel.

SentenceEntityResolverModel

Extracts entities from sentence embeddings and resolves them to a particular ontology / curated dataset.

SentenceResolverParams

Common interface for the Sentence Resolver family.

class SentenceEntityResolverApproach#

Bases: sparknlp_jsl.common.AnnotatorApproachInternal, SentenceResolverParams, sparknlp_jsl.common.HasEngine

Trains a SentenceEntityResolverModel.

The SentenceEntityResolverModel maps sentence embeddings to entities in a knowledge base. To train a custom model, you need to provide a dataset with the following columns:

  • label: Entity name

  • chunk: Occurrence of the entity in the text, without standardization

  • sentence_embeddings: Sentence embeddings from, e.g., the BertSentenceEmbeddings annotator.

Optionally, you can also provide the aux_label column, containing an auxiliary label that maps resolved entities to additional labels. If you have ground truth for the knowledge base entities, setting this column helps the model learn better.

To continue training an already trained model, use the pretrainedModelPath parameter. This loads the pretrained model and continues the training process. To override the codes in the pretrained model, set the overrideExistingCodes parameter to True. To leave out unwanted labels from the pretrained model, set the dropCodesList parameter to the list of labels to ignore, as in the sketch below.
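For instance, a minimal sketch of continued training. The local path and the codes to drop are hypothetical placeholders, and the column names are assumptions; only the setters are taken from this page:

>>> continuedTraining = (
...     medical.SentenceEntityResolverApproach()
...     .setInputCols(["sentence_embeddings"])
...     .setOutputCol("code")
...     .setLabelCol("conceptId")
...     .setPretrainedModelPath("/models/my_resolver")  # hypothetical local path
...     .setOverrideExistingCodes(True)
...     .setDropCodesList(["C21.8", "C21.1"])  # hypothetical codes to omit
... )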

You can find pretrained sentence embeddings (BERT and other flavours) in the NLP Models Hub.

Input Annotation types: SENTENCE_EMBEDDINGS

Output Annotation type: ENTITY

Parameters:
  • labelCol (str) – Column name for the value we are trying to resolve. Usually this contains the entity ID in the knowledge base (e.g., the ICD-10 code).

  • normalizedCol (str) – Column name for the original, normalized description

  • pretrainedModelPath (str) – Path to an already trained SentenceEntityResolverModel. This pretrained model will be used as a starting point for training the new one. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

  • overrideExistingCodes (bool) – Whether to override the existing codes with new data while continuing training from a pretrained model. Default is False (keep all the codes).

  • returnCosineDistances (bool) – Whether to extract cosine distances. Either True or False.

  • aux_label_col (str) – Auxiliary label which maps resolved entities to additional labels

  • useAuxLabel (bool) – Whether to use the auxiliary column or not. Default value is False.

  • dropCodesList (list) – A list of codes in a pretrained model that will be omitted when the training process begins with a pretrained model.

Examples

Let data be a Spark DataFrame with the required columns:
  • conceptId: Concept ID in the knowledge base (SNOMED). Used as the label column.

  • ground_truth: Ground truth for the conceptId in the knowledge base. Can be used as the auxiliary label column.

  • concept_name: Chunk of text, identified in the text data, that we want to map to the knowledge base.

conceptId        | ground_truth         | concept_name
-----------------+----------------------+---------------------
108367008        | Dislocation of joint | Dislocation of joint
3384011000036100 | Arthrotec            | Arthrotec
166717003        | Serum creatinine     | Serum creatinine
3877011000036101 | Lipitor              | Lipitor
402234004        | Foot eczema          | Foot eczema

Then, we can train a SentenceEntityResolverModel as follows:

>>> documentAssembler = nlp.DocumentAssembler().setInputCol("concept_name").setOutputCol("document")
>>> sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
>>> tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
>>> bertEmbeddings = (
...     nlp.BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
...     .setInputCols(["sentence"])
...     .setOutputCol("embeddings")
... )
>>> data_pipeline = nlp.Pipeline(stages=[
...    documentAssembler,
...    sentenceDetector,
...    bertEmbeddings,
... ])
>>> data_processing_model = data_pipeline.fit(data)
>>> prepared_data = data_processing_model.transform(data)
>>> bertExtractor = (
...     medical.SentenceEntityResolverApproach()
...     .setNeighbours(25)
...     .setThreshold(1000)
...     .setInputCols(["embeddings"])
...     .setNormalizedCol("concept_name")
...     .setLabelCol("conceptId")
...     .setOutputCol("snomed_code")
...     .setDistanceFunction("EUCLIDEAN")
...     .setCaseSensitive(False)
...     .setUseAuxLabel(True)
...     .setAuxLabelCol("ground_truth")
... )
>>> snomedModel = bertExtractor.fit(prepared_data)
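The fitted resolver is a regular Spark ML model, so it can be persisted and reloaded with the write()/load() methods documented below; the path here is only a placeholder:

>>> snomedModel.write().overwrite().save("/models/snomed_resolver")
>>> loadedModel = medical.SentenceEntityResolverModel.load("/models/snomed_resolver")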
aux_label_col#
caseSensitive#
confidenceFunction#
datasetInfo#
distanceFunction#
doExceptionHandling#
dropCodesList#
engine#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
labelCol#
lazyAnnotator#
missAsEmpty#
neighbours#
normalizedCol#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
overrideExistingCodes#
pretrainedModelPath#
returnCosineDistances#
returnResolvedTextEmbeddings#
skipLPInputColsValidation = True#
threshold#
uid#
useAuxLabel#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M#
fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]]#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • paramMaps (collections.abc.Sequence) – A Sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator

getCaseSensitive()#

Gets whether to ignore case in tokens for embeddings matching.

Returns:

Whether to ignore case in tokens for embeddings matching

Return type:

bool

getEngine()#
Returns:

Deep learning engine used for this model

Return type:

str

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setAuxLabelCol(name: str)#

Sets the auxiliary label column, which maps resolved entities to additional labels.

Parameters:

name (str) – Auxiliary label which maps resolved entities to additional labels

setCaseSensitive(value)#

Sets whether to ignore case in tokens for embeddings matching.

Parameters:

value (bool) – Whether to ignore case in tokens for embeddings matching

setConfidenceFunction(conf_function: str)#

Sets the function used to calculate confidence: either ‘INVERSE’ or ‘SOFTMAX’.

Parameters:

conf_function (str) – What function to use to calculate confidence: Either ‘INVERSE’ or ‘SOFTMAX’.

setDatasetInfo(info: str)#

Sets descriptive information about the dataset being used.

Parameters:

info (str) – Descriptive information about the dataset being used.

setDistanceFunction(dist: str)#

Sets distance function to use for Word Mover’s Distance (WMD): ‘EUCLIDEAN’ or ‘COSINE’.

Parameters:

dist (str) – Value that selects what distance function to use for WMD: ‘EUCLIDEAN’ or ‘COSINE’.

setDoExceptionHandling(value: bool)#

If True, exceptions are handled: when exception-causing data is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next record. This comes with a performance penalty.

Parameters:

value (bool) – If True, exceptions are handled.

setDropCodesList(value: list)#

Sets the list of codes to leave out when continuing training with new data.

Parameters:

value (list) – List of codes in a pretrained model to leave out when continuing training with new data.

setExtractCosineDistances(name: bool)#

Sets whether to extract cosine distances. Either True or False.

Parameters:

name (bool) – Whether to extract cosine distances. Either True or False.

setForceInputTypeValidation(etfm)#
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLabelCol(name: str)#

Sets column name for the value we are trying to resolve

Parameters:

name (str) – Column name for the value we are trying to resolve

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMissAsEmpty(value: bool)#

Sets whether or not to return an empty annotation on unmatched chunks.

Parameters:

value (bool) – whether or not to return an empty annotation on unmatched chunks.

setNeighbours(k: int)#

Sets the number of neighbours to consider in the KNN algorithm.

Parameters:

k (int) – Number of neighbours to consider in the KNN query to calculate Word Mover’s Distance (WMD).

setNormalizedCol(name: str)#

Sets column name for the original, normalized description

Parameters:

name (str) – Column name for the original, normalized description

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setOverrideExistingCodes(value: bool)#

Sets whether to override the existing codes with new data while continuing training from a pretrained model.

Parameters:

value (bool) – Whether to override the existing codes with new data while continuing training from a pretrained model. Default is False (keep all the codes).

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setPretrainedModelPath(path: str)#

Sets path to an already trained SentenceEntityResolverModel.

Parameters:

path (str) – Path to an already trained SentenceEntityResolverModel, which is used as a starting point for training the new model.

setReturnResolvedTextEmbeddings(value: bool)#

Sets whether to include embeddings for resolved text candidates.

If set to True, embeddings will be included; if set to False, they will be excluded. (Default: False)

Parameters:

value (bool) – Whether to include embeddings for resolved text candidates.

setThreshold(thres: float)#

Sets the threshold value for the last distance calculated.

Parameters:

thres (float) – Threshold value for the last distance calculated.

setUseAuxLabel(name: bool)#

Sets whether to use the auxiliary label column.

Parameters:

name (bool) – Whether to use the auxiliary label column.

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.

class SentenceEntityResolverModel(classname='com.johnsnowlabs.nlp.annotators.resolution.SentenceEntityResolverModel', java_model=None)#

Bases: sparknlp_jsl.common.AnnotatorModelInternal, sparknlp_jsl.common.HasEmbeddingsProperties, sparknlp_jsl.common.HasStorageModel, SentenceResolverParams, sparknlp_jsl.common.HasEngine

Extracts entities from sentence embeddings and resolves them to a particular ontology / curated dataset.

Transforms a dataset with input annotation type SENTENCE_EMBEDDINGS, coming from, e.g., BertSentenceEmbeddings, and returns the normalized entity for a particular trained ontology / curated dataset (e.g., ICD-10, RxNorm, SNOMED).

For a list of pretrained models, please see the NLP Models Hub.

Input Annotation types: SENTENCE_EMBEDDINGS

Output Annotation type: ENTITY

Parameters:
  • returnCosineDistances (bool) – Whether to extract Cosine Distances. Either True or False.

  • useAuxLabel (bool) – Whether to use the auxiliary column or not. Either True or False

  • searchTree (StructFeature_HadoopFix) –

    An encapsulated instance of the SerializableKDTree class. The search tree is used to find the nearest neighbours under the chosen distance function, using the multidimensional binary search tree approach.

    This parameter is used internally by the SentenceEntityResolverModel; usually users do not need to set it.

    Reference: Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (Sept. 1975), 509–517. https://doi.org/10.1145/361002.361007

Examples

>>> documentAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
>>> tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
>>> bertEmbeddings = (
...     nlp.BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
...     .setInputCols(["sentence"])
...     .setOutputCol("embeddings")
... )
>>> bertExtractor = (
...     medical.SentenceEntityResolverModel
...     .pretrained("sbertresolve_icd10cm_slim_billable_hcc_med", "en", "clinical/models")
...     .setInputCols(["embeddings"])
...     .setOutputCol("icd10_code")
...     .setDistanceFunction("EUCLIDEAN")
... )
>>> icd10ExtractorPipeline = nlp.Pipeline(stages=[
...    documentAssembler,
...    sentenceDetector,
...    bertEmbeddings,
...    bertExtractor
... ])
>>> empty_data = spark.createDataFrame([[""]]).toDF("text")
>>> icd10Model = icd10ExtractorPipeline.fit(empty_data)
>>> results = icd10Model.transform(data)
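Each entry of the output column is a Spark NLP Annotation row, so the resolved codes can be inspected with plain DataFrame operations. A sketch (the available metadata keys depend on the model and its settings):

>>> results.selectExpr("explode(icd10_code) AS code") \
...     .selectExpr("code.result AS icd10", "code.metadata AS meta") \
...     .show(truncate=False)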
dimension#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
name = 'SentenceEntityResolverModel'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
returnCosineDistances#
searchTree#
skipLPInputColsValidation = True#
uid#
useAuxLabel#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getDimension()#

Gets embeddings dimension.

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static pretrained(name='sbiobertresolve_icd10cm_augmented_billable_hcc', lang='en', remote_loc='clinical/models')#

Downloads and loads a pretrained model.

Parameters:
  • name (str, optional) – Name of the pretrained model, by default “sbiobertresolve_icd10cm_augmented_billable_hcc”

  • lang (str, optional) – Language of the pretrained model, by default “en”

  • remote_loc (str, optional) – Optional remote address of the resource, by default “clinical/models”. Will use Spark NLP’s repositories otherwise.

Returns:

The restored model

Return type:

SentenceEntityResolverModel
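For example, loading the default model explicitly with the defaults listed above:

>>> resolver = medical.SentenceEntityResolverModel.pretrained(
...     "sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models"
... )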

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setDimension(value)#

Sets embeddings dimension.

Parameters:

value (int) – Embeddings dimension

setForceInputTypeValidation(etfm)#
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setSearchTree(search_tree)#

Sets the search tree to use.

Usualy the user don’t need to set this parameter, as it uses an internal encapsulated search tree based on KDTree.

Parameters:

search_tree (SerializableKDTree) – The search tree to use. Should be an instance of SerializableKDTree.

setUseAuxLabel(name: bool)#

Sets whether to use the auxiliary label column.

Parameters:

name (bool) – Whether to use the auxiliary label column.

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.

class SentenceResolverParams#

Bases: sparknlp_jsl.common.HasCaseSensitiveProperties, sparknlp_jsl.annotator.handle_exception_params.HandleExceptionParams

Common interface for the Sentence Resolver family.

Parameters:
  • distanceFunction (str) – What distance function to use for Word Mover’s Distance (WMD). Either ‘EUCLIDEAN’ or ‘COSINE’.

  • neighbours (int) – Number of neighbours to consider in the KNN query to calculate Word Mover’s Distance (WMD).

  • threshold (float) – Threshold value for the distance calculated in the search tree. This represents the maximum distance between the query and the nearest neighbour. Neighbours with a distance greater than the threshold will not be returned.

  • confidenceFunction (str) – What function to use to calculate confidence: either ‘INVERSE’ or ‘SOFTMAX’ (see the sketch after this list).

  • missAsEmpty (bool) – Whether or not to return an empty annotation on unmatched chunks.

  • returnResolvedTextEmbeddings (bool) – Whether to include embeddings for resolved text candidates (default: False).

  • datasetInfo (str) – Descriptive information about the dataset being used.
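To make the two confidence options concrete, here is an illustrative plain-Python sketch of one plausible reading of them (inverse-distance weights versus a softmax over negative distances, both normalized to sum to one). This is an assumption for illustration only, not the library’s exact formula:

>>> import math
>>> def confidence(distances, conf_function="INVERSE"):
...     # Hypothetical illustration, not the library's exact implementation.
...     if conf_function == "INVERSE":
...         weights = [1.0 / (d + 1e-6) for d in distances]  # closer neighbours weigh more
...     else:  # "SOFTMAX"
...         weights = [math.exp(-d) for d in distances]
...     total = sum(weights)
...     return [w / total for w in weights]
>>> confidence([0.2, 0.8, 1.5])  # nearest neighbour gets the highest confidence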

caseSensitive#
confidenceFunction#
datasetInfo#
distanceFunction#
doExceptionHandling#
inputAnnotatorTypes#
missAsEmpty#
neighbours#
returnResolvedTextEmbeddings#
threshold#
getCaseSensitive()#

Gets whether to ignore case in tokens for embeddings matching.

Returns:

Whether to ignore case in tokens for embeddings matching

Return type:

bool

setCaseSensitive(value)#

Sets whether to ignore case in tokens for embeddings matching.

Parameters:

value (bool) – Whether to ignore case in tokens for embeddings matching

setConfidenceFunction(conf_function: str)#

Sets the function used to calculate confidence: either ‘INVERSE’ or ‘SOFTMAX’.

Parameters:

conf_function (str) – What function to use to calculate confidence: Either ‘INVERSE’ or ‘SOFTMAX’.

setDatasetInfo(info: str)#

Sets descriptive information about the dataset being used.

Parameters:

info (str) – Descriptive information about the dataset being used.

setDistanceFunction(dist: str)#

Sets distance function to use for Word Mover’s Distance (WMD): ‘EUCLIDEAN’ or ‘COSINE’.

Parameters:

dist (str) – Value that selects what distance function to use for WMD: ‘EUCLIDEAN’ or ‘COSINE’.

setDoExceptionHandling(value: bool)#

If True, exceptions are handled: when exception-causing data is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next record. This comes with a performance penalty.

Parameters:

value (bool) – If True, exceptions are handled.

setMissAsEmpty(value: bool)#

Sets whether or not to return an empty annotation on unmatched chunks.

Parameters:

value (bool) – whether or not to return an empty annotation on unmatched chunks.

setNeighbours(k: int)#

Sets the number of neighbours to consider in the KNN algorithm.

Parameters:

k (int) – Number of neighbours to consider in the KNN query to calculate Word Mover’s Distance (WMD).

setReturnResolvedTextEmbeddings(value: bool)#

Sets whether to include embeddings for resolved text candidates.

If set to True, embeddings will be included; if set to False, they will be excluded. (Default: False)

Parameters:

value (bool) – Whether to include embeddings for resolved text candidates.

setThreshold(thres: float)#

Sets the threshold value for the last distance calculated.

Parameters:

thres (float) – Threshold value for the last distance calculated.