sparknlp_jsl.finance.sequence_generation.finance_qa#

Module Contents#

Classes#

FinanceQuestionAnswering

MedicalQuestionAnswering is a GPT-based model for answering questions given a context. Unlike span-based models, it generates the answers to the questions rather than selecting phrases from the given context.

class FinanceQuestionAnswering(classname='com.johnsnowlabs.finance.sequence_generation.FinanceQuestionAnswering', java_model=None)#

Bases: sparknlp_jsl.annotator.MedicalQuestionAnswering

MedicalQuestionAnswering is a GPT-based model for answering questions given a context. Unlike span-based models, it generates the answers to the questions rather than selecting phrases from the given context. The model can answer various types of questions, including yes-no and full-text ones.

Input Annotation types: DOCUMENT, DOCUMENT

Output Annotation type: CHUNK

Parameters:
  • questionType – Question type, e.g. “short” or “long”. The available question types depend on the model; check the model card for a description of them.

  • maxNewTokens – Maximum number of new tokens to generate, by default 30

  • maxContextLength – Maximum length of context text

  • configProtoBytes – ConfigProto from tensorflow, serialized into byte array.

  • doSample – Whether or not to use sampling; use greedy decoding otherwise, by default False

  • topK – The number of highest probability vocabulary tokens to consider, by default 1

  • noRepeatNgramSize – The number of tokens that can’t be repeated in the same order. Useful for preventing loops. The default is 0.

  • ignoreTokenIds – A list of token ids which are ignored in the decoder’s output, by default []

  • customPrompt – Custom prompt template. Available variables {QUESTION} and {CONTEXT}

  • useCache – Cache internal state of the model to improve performance, by default True. Disable it only if you are processing short texts or want to reduce memory usage.


Examples

>>> context = "This study aims to evaluate local failure patterns in node negative breast cancer patients ..."
>>> question = "Should chest wall irradiation be included after mastectomy and negative node breast cancer?"
>>> data = spark.createDataFrame([[question, context]]).toDF("question", "context")
>>> from pyspark.ml import Pipeline
>>> from sparknlp.base import MultiDocumentAssembler
>>> document_assembler = MultiDocumentAssembler() \
...     .setInputCols("question", "context") \
...     .setOutputCols("document_question", "document_context")
>>> med_qa = sparknlp_jsl.annotators.qa.MedicalQuestionAnswering \
...     .load("/models/sparknlp/medical_qa_biogpt") \
...     .setQuestionType("short") \
...     .setInputCols(["document_question", "document_context"]) \
...     .setMaxNewTokens(100) \
...     .setOutputCol("answer")
>>> pipeline = Pipeline(stages=[document_assembler, med_qa])
>>> # fit() returns a PipelineModel; transform() produces the DataFrame with the "answer" column
>>> pipeline \
...     .fit(data) \
...     .transform(data) \
...     .select("answer.result") \
...     .show(truncate=False)
+-------+
|result |
+-------+
|[yes]  |
+-------+
batchSize#
configProtoBytes#
customPrompt#
doSample#
getter_attrs = []#
ignoreTokenIds#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
maxContextLength#
maxNewTokens#
mlFrameworkType#
modelType#
name = 'FinanceQuestionAnswering'#
noRepeatNgramSize#
optionalInputAnnotatorTypes = []#
outputAnnotatorType = 'document'#
outputCol#
questionType#
skipLPInputColsValidation = True#
topK#
uid = ''#
useCache#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict
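
The precedence described for extractParamMap (default param values < user-supplied values < extra) can be sketched with plain dictionaries. This is only an illustration: merge_param_maps is a hypothetical helper, and the string keys stand in for the Param objects that pyspark actually uses as keys.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win on conflicting keys."""
    merged = dict(defaults)          # lowest priority
    merged.update(user_supplied)     # overrides defaults
    merged.update(extra or {})       # highest priority
    return merged


# Illustrative values only; the defaults mirror the parameter list above.
defaults = {"maxNewTokens": 30, "doSample": False, "topK": 1}
user_supplied = {"maxNewTokens": 100}
extra = {"topK": 5, "maxNewTokens": 50}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

Here maxNewTokens ends up as 50 because extra overrides the user-supplied 100, which in turn overrides the default 30.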

getBatchSize()#

Gets current batch size.

Returns:

Current batch size

Return type:

int

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

getQuestionTypes()#

Returns a list of question types

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.
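
The difference between hasDefault, isSet, isDefined, getOrDefault and clear is easiest to see on a toy param store. The sketch below is a simplified stand-in, not the pyspark implementation; ParamStore and its snake_case method names are hypothetical.

```python
class ParamStore:
    """Toy param map that keeps defaults and user-supplied values apart."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Only removes an explicitly set value; the default stays in place.
        self._user.pop(name, None)

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        return name in self._user

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)  # neither set nor defaulted


store = ParamStore({"maxNewTokens": 30})
store.set("questionType", "short")

print(store.is_set("maxNewTokens"), store.is_defined("maxNewTokens"))
print(store.is_set("questionType"), store.has_default("questionType"))
```

In this sketch a param with only a default is defined but not set, and clearing a param falls back to its default rather than removing it entirely.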

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static loadMedicalTextGenerator(med_text_generator_path, spark_session)#

Turns a MedicalTextGenerator into a MedicalQuestionAnswering model

Parameters:
  • med_text_generator_path (str) – Path to the MedicalTextGenerator model

  • spark_session (pyspark.sql.SparkSession) – The current SparkSession

Returns:

A MedicalQuestionAnswering model

Return type:

MedicalQuestionAnswering

static loadSavedModel(folder, spark_session)#

Loads a locally saved model.

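Parameters:
  • folder (str) – Folder of the saved model

  • spark_session (pyspark.sql.SparkSession) – The current SparkSession

Returns: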

The restored model

Return type:

FinanceQuestionAnswering

static pretrained(name='biogpt_pubmed_qa', lang='en', remote_loc='finance/models')#

Downloads and loads a pretrained model.

Parameters:
  • name (str, optional) – Name of the pretrained model, by default “biogpt_pubmed_qa”

  • lang (str, optional) – Language of the pretrained model, by default “en”

  • remote_loc (str, optional) – Optional remote address of the resource, by default “finance/models”. If set to None, Spark NLP’s repositories are used.

Returns:

The restored model

Return type:

FinanceQuestionAnswering

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setBatchSize(v)#

Sets batch size.

Parameters:

v (int) – Batch size

setConfigProtoBytes(b)#

Sets configProto from tensorflow, serialized into byte array.

Parameters:

b (List[int]) – ConfigProto from tensorflow, serialized into byte array

setCustomPrompt(value)#

Sets the custom prompt template. Available variables {QUESTION} and {CONTEXT}

Parameters:

value (str) – prompt template
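
As a rough illustration of how a template with the {QUESTION} and {CONTEXT} variables might be filled in, here is a minimal sketch. How the library performs the substitution internally is not documented here, so render_prompt is a hypothetical helper, and the template and texts are made up for the example.

```python
import re


def render_prompt(template, question, context):
    """Replace {QUESTION} and {CONTEXT} in a single pass over the template."""
    values = {"QUESTION": question, "CONTEXT": context}
    # A single regex pass avoids re-substituting text that was just inserted.
    return re.sub(r"\{(QUESTION|CONTEXT)\}", lambda m: values[m.group(1)], template)


# Hypothetical template and inputs, for illustration only.
template = "Context: {CONTEXT}\nQuestion: {QUESTION}\nAnswer:"
prompt = render_prompt(
    template,
    question="Did revenue grow year over year?",
    context="Revenue rose from 10M to 12M.",
)
print(prompt)
```

A template of this shape would then be passed as the string argument to setCustomPrompt.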

setDoSample(value)#

Sets whether or not to use sampling; greedy decoding is used otherwise.

Parameters:

value (bool) – Whether or not to use sampling; use greedy decoding otherwise

setForceInputTypeValidation(etfm)#
setIgnoreTokenIds(value)#

Sets the list of token ids which are ignored in the decoder’s output.

Parameters:

value (List[int]) – The token ids to be filtered out

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxContextLength(value)#

Sets maximum length of context text.

Parameters:

value (int) – Maximum length of context text

setMaxNewTokens(value)#

Sets the maximum number of new tokens to be generated

Parameters:

value (int) – the maximum number of new tokens to be generated
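
The effect of a cap on new tokens can be sketched with a toy generation loop. This is an assumption-laden illustration rather than the model's actual decoder: generate and next_token_fn are hypothetical names, and stopping on a set of terminal token ids is only one plausible reading of the terminal symbols configured via setQuestionAnswerTerminals.

```python
def generate(next_token_fn, prompt_ids, max_new_tokens=30, terminal_ids=()):
    """Generate at most max_new_tokens ids, stopping early on a terminal id."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token_fn(prompt_ids + generated)
        if token in terminal_ids:
            break  # terminal symbol reached; it is not added to the answer
        generated.append(token)
    return generated


def toy_model(ids):
    """Stand-in for a language model: the next id is the sequence length."""
    return len(ids)


capped = generate(toy_model, [0, 1, 2], max_new_tokens=4)
stopped = generate(toy_model, [0, 1, 2], max_new_tokens=4, terminal_ids={5})
print(capped, stopped)
```

The prompt tokens never count against the limit; only newly generated tokens do.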

setNoRepeatNgramSize(value)#

Sets size of n-grams that can only occur once.

If set to int > 0, all ngrams of that size can only occur once.

Parameters:

value (int) – Size of n-grams that can only occur once
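
To make the n-gram constraint concrete, the sketch below computes which next tokens would complete an n-gram that has already been generated. It shows the general technique under the assumption that the model applies it in the usual way; banned_next_tokens is a hypothetical helper and not part of the library.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the token ids that would repeat an already generated n-gram."""
    if ngram_size <= 0 or len(generated) < ngram_size - 1:
        return set()  # constraint disabled, or not enough history yet
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    if ngram_size > 1:
        prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    else:
        prefix = ()
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


history = [1, 2, 3, 1, 2]
print(banned_next_tokens(history, 3))
```

With a size of 3 and the history above, emitting 3 next would repeat the trigram (1, 2, 3), so 3 is banned. With the default size of 0 nothing is banned.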

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setQuestionAnswerTerminals(questionTerminals)#

Set terminal symbols for each question type

Parameters:

questionTerminals (dict[str, list[int]]) –

setQuestionPrompts(questionPrompts)#

Set question prompts for each question type

Parameters:

questionPrompts (dict[str, list[str]]) –

setQuestionSkipLastToken(questionSkipLastToken)#

Set flag indicating whether to include the last symbol in the answer to each question type

Parameters:

questionSkipLastToken (dict[str, list[bool]]) –

setQuestionType(value)#

Sets the question type

Parameters:

value (str) – question type

setRandomSeed(seed)#

Sets random seed.

Parameters:

seed (int) – Random seed

setTopK(value)#

Sets the number of highest probability vocabulary tokens to consider

Parameters:

value (int) – Number of highest probability vocabulary tokens to consider
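
The interaction between doSample and topK can be illustrated with a small next-token picker. This is a generic sketch of greedy decoding versus top-k sampling, not the library's decoder; pick_next_token and the example logits are hypothetical.

```python
import math
import random


def pick_next_token(logits, do_sample=False, top_k=1, rng=None):
    """Pick a token index greedily, or sample among the top_k highest logits."""
    if not do_sample:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax weights restricted to the top_k candidates (shifted for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, 1.5, -1.0]
greedy = pick_next_token(logits)
sampled = pick_next_token(logits, do_sample=True, top_k=2, rng=random.Random(0))
print(greedy, sampled)
```

Note that with topK left at its default of 1, sampling has only one candidate to choose from, so in this sketch it behaves like greedy decoding.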

setUseCache(value)#

Sets whether to cache the internal state of the model to improve performance.

Parameters:

value (bool) – Whether or not to use cache

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.