sparknlp_jsl.annotator.chunker.chunk_key_phrase_extraction#

Module Contents#

Classes#

ChunkKeyPhraseExtraction

Extracts key phrases from texts.

class ChunkKeyPhraseExtraction(classname='com.johnsnowlabs.nlp.embeddings.ChunkKeyPhraseExtraction', java_model=None)#

Bases: sparknlp.annotator.embeddings.bert_sentence_embeddings.BertSentenceEmbeddings, sparknlp_jsl.common.HasEngine

Extracts key phrases from texts.

ChunkKeyPhraseExtraction uses BertSentenceEmbeddings to determine the most relevant key phrases describing a text, using two approaches:

  • Cosine similarity between the embedding representation of the chunks and the embedding representation of the corresponding sentences/documents.

  • The Maximal Marginal Relevance (MMR) algorithm (enabled via the setDivergence method), which determines the most relevant key phrases while penalizing redundancy among them.

If the selectMostDifferent parameter is set, the annotator returns the key phrases that are the most different from each other, avoiding key phrases that are too similar.

The model compares the chunks against the corresponding sentences/documents and selects the chunks that are most representative of the broader text context (i.e., the document or the sentence they belong to). This makes it possible, for example, to obtain a brief overview of a document by selecting its most relevant phrases.

The input to the model consists of chunk annotations and sentence or document annotations. The input chunks can be generated in various ways:

  • Using NGramGenerator, which produces ranked n-gram chunks from the text (can be used to identify new entities).

  • Using YakeKeywordExtractor, which ranks the keywords extracted with the YAKE algorithm.

  • Using TextMatcher, which ranks the desired chunks produced by that annotator.

  • Using NerConverter, which ranks named entities (showing which entities are the most relevant in the sentence/document).
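For instance, a minimal sketch of ranking n-gram chunks could look like the following (the column names and n-gram length are illustrative assumptions, not fixed by the annotator):

>>> ngram_generator = sparknlp.annotators.NGramGenerator() \
...     .setInputCols(["tokens"]) \
...     .setOutputCol("ngrams") \
...     .setN(3)
>>> ngram_key_phrases = ChunkKeyPhraseExtraction.pretrained() \
...     .setTopN(3) \
...     .setInputCols(["document", "ngrams"]) \
...     .setOutputCol("ngram_key_phrases")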

The model operates either at sentence level (selecting the most descriptive chunks from the sentences they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.

Input Annotation types: DOCUMENT, CHUNK

Output Annotation type: CHUNK

Parameters:
  • topN – The number of key phrases to select. Default: 3.

  • selectMostDifferent – Pre-select topN * 2 key phrases and out of those select the topN that are the most different from each other. This parameter should not be used in conjunction with divergence as they aim to achieve the same goal, but in different ways. Default: False.

  • divergence – The divergence value determines how different from each other the extracted key phrases are. Possible values are within the interval [0, 1]. The higher the value, the more divergence is enforced. A value of 0 means the key phrases are not compared to each other (no divergence is ensured) and their relevance is determined solely by their similarity to the document. This parameter should not be used if selectMostDifferent is set to True, as the two parameters aim to achieve the same goal in different ways. The default value is 0, meaning that there is no diversity constraint on the extracted key phrases. The divergence is calculated using the Maximal Marginal Relevance measure. Default: 0.0.

  • documentLevelProcessing – A flag indicating whether to extract key phrases at the document level, i.e., from all the sentences available in a given row, rather than from the particular sentences the chunks refer to. Default: True.

  • concatenateSentences – A flag indicating whether to concatenate all input document/sentence annotations before computing their embedding. This parameter is only used if documentLevelProcessing is set to True. If concatenateSentences is set to True, then the model will concatenate the document/sentence input annotations and compute a single embedding. If it is set to False, then the model will compute the embedding of each sentence separately, and average the resulting embedding vectors in the end. Default: True.

  • dropPunctuation – This parameter indicates whether to remove punctuation marks from the input chunks. Chunks coming from NER models are not affected. Default: True.
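To make the effect of divergence concrete, the following is a conceptual sketch of Maximal Marginal Relevance selection over precomputed embedding vectors. It only illustrates the idea; it is not the annotator's internal implementation, and the exact weighting used by the model may differ.

import numpy as np

def mmr_select(doc_emb, chunk_embs, top_n=3, divergence=0.0):
    # Cosine similarity between two vectors.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Relevance of each chunk to the document/sentence embedding.
    doc_sims = [cos(c, doc_emb) for c in chunk_embs]
    selected, candidates = [], list(range(len(chunk_embs)))
    while candidates and len(selected) < top_n:
        def mmr_score(i):
            # Redundancy: similarity to the most similar already-selected chunk.
            redundancy = max((cos(chunk_embs[i], chunk_embs[j]) for j in selected), default=0.0)
            # divergence == 0.0 ranks purely by document similarity;
            # higher values increasingly penalize redundant chunks.
            return (1.0 - divergence) * doc_sims[i] - divergence * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chosen key phrase chunks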

Examples

>>> documenter = sparknlp.DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentencer = sparknlp.annotators.SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences")
>>> tokenizer = sparknlp.annotators.Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("tokens")
>>> embeddings = sparknlp.annotators.WordEmbeddingsModel() \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["document", "tokens"]) \
...     .setOutputCol("embeddings")
>>> ner_tagger = MedicalNerModel() \
...     .pretrained("ner_jsl_slim", "en", "clinical/models") \
...     .setInputCols(["sentences", "tokens", "embeddings"]) \
...     .setOutputCol("ner_tags")
>>> ner_converter = NerConverter() \
...     .setInputCols("sentences", "tokens", "ner_tags") \
...     .setOutputCol("ner_chunks")
>>> key_phrase_extractor = ChunkKeyPhraseExtraction \
...     .pretrained() \
...     .setTopN(1) \
...     .setDocumentLevelProcessing(False) \
...     .setDivergence(0.4) \
...     .setInputCols(["sentences", "ner_chunks"]) \
...     .setOutputCol("ner_chunk_key_phrases")
>>> pipeline = sparknlp.base.Pipeline() \
...     .setStages([documenter, sentencer, tokenizer, embeddings, ner_tagger, ner_converter, key_phrase_extractor])
>>> data = spark.createDataFrame([["Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."]]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results \
...     .selectExpr("explode(ner_chunk_key_phrases) AS key_phrase") \
...     .selectExpr(
...         "key_phrase.result",
...         "key_phrase.metadata.entity",
...         "key_phrase.metadata.DocumentSimilarity",
...         "key_phrase.metadata.MMRScore") \
...     .show(truncate=False)

result                          DocumentSimilarity    MMRScore
gestational diabetes mellitus   0.7391447825527298    0.44348688715422274
28-year-old                     0.4366776288430703    0.13577881610104517
type two diabetes mellitus      0.7323921930094919    0.085800103824974

batchSize#
caseSensitive#
concatenateSentences#
configProtoBytes#
dimension#
divergence#
documentLevelProcessing#
dropPunctuation#
engine#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
isLong#
lazyAnnotator#
maxSentenceLength#
max_length_limit = 512#
name = ChunkKeyPhraseExtraction#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
selectMostDifferent#
storageRef#
topN#
clear(param)#

Clears a param from the param map if it has been explicitly set.

copy(extra=None)#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param)#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getBatchSize()#

Gets current batch size.

Returns:

Current batch size

Return type:

int

getCaseSensitive()#

Gets whether to ignore case in tokens for embeddings matching.

Returns:

Whether to ignore case in tokens for embeddings matching

Return type:

bool

getDimension()#

Gets embeddings dimension.

getEngine()#
Returns:

Deep Learning engine used for this model

Return type:

str

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getMaxSentenceLength()#

Gets max sentence length of the model.

Returns:

Max sentence length to process

Return type:

int

getOrDefault(param)#

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName)#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

getStorageRef()#

Gets unique reference name for identification.

Returns:

Unique reference name for identification

Return type:

str

hasDefault(param)#

Checks whether a param has a default value.

hasParam(paramName)#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param)#

Checks whether a param is explicitly set by user or has a default value.

isSet(param)#

Checks whether a param is explicitly set by user.

classmethod load(path)#

Reads an ML instance from the input path, a shortcut of read().load(path).

static loadSavedModel(folder, spark_session)#

Loads a locally saved model.

Parameters:
  • folder (str) – Folder of the saved model

  • spark_session (pyspark.sql.SparkSession) – The current SparkSession

Returns:

The restored model

Return type:

BertSentenceEmbeddings

static pretrained(name='sbert_jsl_medium_uncased', lang='en', remote_loc='clinical/models')#

Downloads and loads a pretrained model.

Parameters:
  • name (str, optional) – Name of the pretrained model.

  • lang (str, optional) – Language of the pretrained model, by default “en”

  • remote_loc (str, optional) – Optional remote address of the resource, by default “clinical/models”. Will use Spark NLP's repositories otherwise.

Returns:

The restored model

Return type:

ChunkKeyPhraseExtraction
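For instance, the default pretrained model can be loaded and wired into a pipeline as follows (the input and output column names are illustrative):

>>> key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") \
...     .setInputCols(["sentences", "ner_chunks"]) \
...     .setOutputCol("ner_chunk_key_phrases")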

classmethod read()#

Returns an MLReader instance for this class.

save(path)#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)#

Sets a parameter in the embedded param map.

setBatchSize(v)#

Sets batch size.

Parameters:

v (int) – Batch size

setCaseSensitive(value)#

Sets whether to ignore case in tokens for embeddings matching.

Parameters:

value (bool) – Whether to ignore case in tokens for embeddings matching

setConcatenateSentences(value: bool)#

Sets the concatenateSentences parameter.

If set to True, the model will concatenate the input sentence annotations before computing their embedding. If set to False, the model will compute the embedding of each sentence separately and then average them. The default value is True.

Parameters:

value (bool) – Whether to concatenate the input sentence/document annotations in order to compute the embedding of the whole document.

setConfigProtoBytes(b)#

Sets configProto from tensorflow, serialized into byte array.

Parameters:

b (List[int]) – ConfigProto from tensorflow, serialized into byte array

setDimension(value)#

Sets embeddings dimension.

Parameters:

value (int) – Embeddings dimension

setDivergence(value: float)#

Sets the divergence value.

The divergence value determines how different from each other the extracted key phrases are. The value must be in the interval [0, 1]. The higher the value, the more divergence is enforced. The default value is 0.0.

Parameters:

value (float) – Divergence value between 0.0 and 1.0.

setDocumentLevelProcessing(value: bool)#

Sets the documentLevelProcessing parameter.

If set to True, the model will extract key phrases from the whole document. If set to False, the model will extract key phrases from each sentence separately. The default value is True.

Parameters:

value (bool) – Whether to extract key phrases from the whole document (all sentences).

setDropPunctuation(value: bool)#

Sets the dropPunctuation parameter.

This parameter determines whether to remove punctuation marks from the input chunks. Chunks coming from NER models are not affected. The default value is True.

Parameters:

value (bool) – Whether to remove punctuation marks from input chunks.

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setIsLong(value)#

Sets whether to use Long type instead of Int type for inputs buffer.

Some Bert models require Long instead of Int.

Parameters:

value (bool) – Whether to use Long type instead of Int type for inputs buffer

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxSentenceLength(value)#

Sets max sentence length to process.

Note that a maximum limit exists depending on the model. If you are working with long single sequences, consider splitting up the input first with another annotator e.g. SentenceDetector.

Parameters:

value (int) – Max sentence length to process

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setSelectMostDifferent(value: bool)#

Sets the selectMostDifferent parameter.

If set to True, the model returns the top N key phrases which are the most different from each other. If set to False, the model returns the top N key phrases which have the highest score.

The default value is False.

Parameters:

value (bool) – whether to select the most different key phrases or not.

setStorageRef(value)#

Sets unique reference name for identification.

Parameters:

value (str) – Unique reference name for identification

setTopN(value: int)#

Sets the number of key phrases to extract. The default value is 3.

Parameters:

value (int) – Number of key phrases to extract.

transform(dataset, params=None)#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write()#

Returns an MLWriter instance for this ML instance.