sparknlp_jsl.legal.sequence_generation.legal_qa
#
Module Contents#
Classes#
- class LegalQuestionAnswering(classname='com.johnsnowlabs.legal.sequence_generation.LegalQuestionAnswering', java_model=None)#
Bases:
sparknlp_jsl.annotator.MedicalQuestionAnswering
MedicalQuestionAnswering is a GPT-based model for answering questions given a context. Unlike span-based models, it generates the answers to the questions rather than selecting phrases from the given context. The model can answer various types of questions, including yes/no and full-text ones.
Input Annotation types: DOCUMENT, DOCUMENT
Output Annotation type: CHUNK
- Parameters:
questionType – Question type, e.g. “short” or “long”. The available question types depend on the model; check the model card for a description of them.
maxNewTokens – Maximum number of new tokens to generate, by default 30
maxContextLength – Maximum length of context text
configProtoBytes – ConfigProto from tensorflow, serialized into byte array.
doSample – Whether or not to use sampling; use greedy decoding otherwise, by default False
topK – The number of highest probability vocabulary tokens to consider, by default 1
noRepeatNgramSize – The number of tokens that can’t be repeated in the same order. Useful for preventing loops. The default is 0.
ignoreTokenIds – A list of token ids which are ignored in the decoder’s output, by default []
customPrompt – Custom prompt template. Available variables {QUESTION} and {CONTEXT}
useCache – Cache the internal state of the model to improve performance, by default True. Disable it only if you are processing short texts or want to reduce memory usage.
custom –
Examples
>>> context = "This study aims to evaluate local failure patterns in node negative breast cancer patients ..."
>>> question = "Should chest wall irradiation be included after mastectomy and negative node breast cancer?"
>>> data = spark.createDataFrame([[question, context]]).toDF("question", "context")
>>> document_assembler = MultiDocumentAssembler() \
...     .setInputCols("question", "context") \
...     .setOutputCols("document_question", "document_context")
>>> med_qa = sparknlp_jsl.annotators.qa.MedicalQuestionAnswering \
...     .load("/models/sparknlp/medical_qa_biogpt") \
...     .setQuestionType("short") \
...     .setInputCols(["document_question", "document_context"]) \
...     .setMaxNewTokens(100) \
...     .setOutputCol("answer")
>>> pipeline = Pipeline(stages=[document_assembler, med_qa])
>>> pipeline \
...     .fit(data) \
...     .transform(data) \
...     .select("answer.result") \
...     .show(truncate=False)
+-------+
|result |
+-------+
|[yes]  |
+-------+
- batchSize#
- configProtoBytes#
- customPrompt#
- doSample#
- getter_attrs = []#
- ignoreTokenIds#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- maxContextLength#
- maxNewTokens#
- mlFrameworkType#
- modelType#
- name = 'LegalQuestionAnswering'#
- noRepeatNgramSize#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- questionType#
- skipLPInputColsValidation = True#
- topK#
- uid#
- useCache#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
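The precedence described above (default param values < user-supplied values < extra) can be sketched with plain dictionaries. This is only an illustration of the merge order: `merge_param_maps` is a hypothetical helper, and the real param map is keyed by `Param` objects rather than strings.

```python
from typing import Any, Dict, Optional


def merge_param_maps(
    defaults: Dict[str, Any],
    user_supplied: Dict[str, Any],
    extra: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge three param layers; later layers win on conflict."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged


# Default values as documented in the class parameter list.
defaults = {"maxNewTokens": 30, "topK": 1, "doSample": False}
user_supplied = {"maxNewTokens": 100}   # e.g. set via setMaxNewTokens(100)
extra = {"topK": 5}                     # passed as the `extra` argument

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)  # {'maxNewTokens': 100, 'topK': 5, 'doSample': False}
```

The inputs are left unmodified, which mirrors the fact that extracting a param map does not change the params stored on the instance.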
- getBatchSize()#
Gets current batch size.
- Returns:
Current batch size
- Return type:
int
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getQuestionTypes()#
Returns a list of question types
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static loadMedicalTextGenerator(med_text_generator_path, spark_session)#
Turns a MedicalTextGenerator into a MedicalQuestionAnswering model
- Parameters:
med_text_generator_path (str) – Path to the MedicalTextGenerator model
spark_session (pyspark.sql.SparkSession) – The current SparkSession
- Returns:
A MedicalQuestionAnswering model
- Return type:
- static loadSavedModel(folder, spark_session)#
Loads a locally saved model.
- Parameters:
folder (str) – Folder of the saved model
spark_session (pyspark.sql.SparkSession) – The current SparkSession
- Returns:
The restored model
- Return type:
- static pretrained(name='biogpt_pubmed_qa', lang='en', remote_loc='legal/models')#
Downloads and loads a pretrained model.
- Parameters:
name (str, optional) – Name of the pretrained model, by default “biogpt_pubmed_qa”
lang (str, optional) – Language of the pretrained model, by default “en”
remote_loc (str, optional) – Optional remote address of the resource, by default “legal/models”. If not set, Spark NLP’s repositories are used.
- Returns:
The restored model
- Return type:
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setBatchSize(v)#
Sets batch size.
- Parameters:
v (int) – Batch size
- setConfigProtoBytes(b)#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
b (List[int]) – ConfigProto from tensorflow, serialized into byte array
- setCustomPrompt(value)#
Sets the custom prompt template. Available variables {QUESTION} and {CONTEXT}
- Parameters:
value (str) – prompt template
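As a rough illustration of how a template with the {QUESTION} and {CONTEXT} variables expands, the sketch below fills the placeholders with plain string replacement. `render_prompt`, the template text, and the sample question and context are all hypothetical; how the annotator performs the substitution internally is not documented here.

```python
def render_prompt(template: str, question: str, context: str) -> str:
    """Fill the {QUESTION} and {CONTEXT} placeholders of a prompt template."""
    # str.replace is used instead of str.format so that unrelated braces
    # in the template do not raise formatting errors.
    return template.replace("{QUESTION}", question).replace("{CONTEXT}", context)


# Hypothetical template and inputs, for illustration only.
template = "Context: {CONTEXT}\nQuestion: {QUESTION}\nAnswer:"
question = "Is the lease assignable?"
context = "The tenant may assign this lease with the landlord's written consent."

prompt = render_prompt(template, question, context)
print(prompt)
```

A string like `template` is what would be passed to setCustomPrompt; the question and context themselves come from the two DOCUMENT input columns.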
- setDoSample(value)#
Sets whether or not to use sampling, use greedy decoding otherwise.
- Parameters:
value (bool) – Whether or not to use sampling; use greedy decoding otherwise
- setForceInputTypeValidation(etfm)#
- setIgnoreTokenIds(value)#
Sets a list of token ids which are ignored in the decoder’s output.
- Parameters:
value (List[int]) – The token ids to be filtered out
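The effect of ignoring token ids can be pictured as a filter over the decoder's output. The sketch below is an assumption about the observable behavior only: `drop_ignored` is a hypothetical helper, the token ids are made up, and the annotator may apply the filter at a different stage (for example, while decoding).

```python
from typing import Iterable, List


def drop_ignored(token_ids: Iterable[int], ignore_token_ids: Iterable[int]) -> List[int]:
    """Return token_ids without any id listed in ignore_token_ids."""
    ignored = set(ignore_token_ids)  # set lookup keeps the filter linear
    return [t for t in token_ids if t not in ignored]


# Made-up ids: 0 and 2 stand in for special tokens that should not surface.
decoded = [0, 318, 2, 1029, 2]
print(drop_ignored(decoded, [0, 2]))  # [318, 1029]
```

With the default of [], nothing is removed and the output passes through unchanged.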
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMaxContextLength(value)#
Sets maximum length of context text.
- Parameters:
value (int) – Maximum length of context text
- setMaxNewTokens(value)#
Sets the maximum number of new tokens to be generated
- Parameters:
value (int) – the maximum number of new tokens to be generated
- setNoRepeatNgramSize(value)#
Sets size of n-grams that can only occur once.
If set to int > 0, all ngrams of that size can only occur once.
- Parameters:
value (int) – Size of n-grams that can only occur once
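The usual way generative decoders enforce this kind of constraint is to ban, at each step, any token that would complete an n-gram already present in the output. The sketch below shows that general technique under the assumption that this parameter follows it; `banned_next_tokens` is a hypothetical helper and not part of the library.

```python
from typing import List, Set


def banned_next_tokens(generated: List[int], ngram_size: int) -> Set[int]:
    """Return tokens that would repeat an n-gram already in `generated`."""
    if ngram_size <= 0 or len(generated) < ngram_size:
        return set()  # constraint disabled, or no full n-gram exists yet
    # The n-1 most recent tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned: Set[int] = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])  # this token would repeat a seen n-gram
    return banned


history = [5, 7, 9, 5, 7]
print(banned_next_tokens(history, 3))  # {9}: emitting 9 would repeat (5, 7, 9)
print(banned_next_tokens(history, 0))  # set(): the default of 0 disables the check
```

This is why the setting helps against loops: once a sequence of tokens has appeared, the decoder cannot reproduce it verbatim.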
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setQuestionAnswerTerminals(questionTerminals)#
Set terminal symbols for each question type
- Parameters:
questionTerminals (dict[str, list[int]]) –
- setQuestionPrompts(questionPrompts)#
Set question prompts for each question type
- Parameters:
questionPrompts (dict[str, list[str]]) –
- setQuestionSkipLastToken(questionSkipLastToken)#
Set flag indicating whether to include the last symbol in the answer to each question type
- Parameters:
questionSkipLastToken (dict[str, list[bool]]) –
- setQuestionType(value)#
Sets the question type
- Parameters:
value (str) – question type
- setRandomSeed(seed)#
Sets random seed.
- Parameters:
seed (int) – Random seed
- setTopK(value)#
Sets the number of highest probability vocabulary tokens to consider
- Parameters:
value (int) – Number of highest probability vocabulary tokens to consider
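To make the interaction between doSample and topK concrete, the sketch below contrasts greedy decoding with top-k sampling over a toy next-token distribution. It illustrates the general decoding technique, not the annotator's internals: `pick_next_token`, the token ids, and the probabilities are hypothetical.

```python
import random
from typing import Dict, Optional


def pick_next_token(
    scores: Dict[int, float],
    do_sample: bool = False,
    top_k: int = 1,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the next token id from a {token_id: probability} mapping."""
    # Rank token ids by probability, highest first.
    ranked = sorted(scores, key=lambda t: scores[t], reverse=True)
    if not do_sample:
        return ranked[0]  # greedy decoding: always the most probable token
    candidates = ranked[:max(1, top_k)]  # keep only the k most probable tokens
    weights = [scores[t] for t in candidates]
    rng = rng or random.Random()
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy distribution with made-up token ids.
scores = {11: 0.6, 42: 0.3, 7: 0.1}

print(pick_next_token(scores))                            # 11 (greedy)
print(pick_next_token(scores, do_sample=True, top_k=1))   # 11 (only one candidate)
sampled = pick_next_token(scores, do_sample=True, top_k=2, rng=random.Random(0))
print(sampled in {11, 42})                                # True
```

With the defaults (doSample False, topK 1) the choice is deterministic. Sampling only introduces variation once topK is greater than 1, and a fixed seed, as set through setRandomSeed, makes such runs repeatable.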
- setUseCache(value)#
Cache internal state of the model to improve performance
- Parameters:
value (bool) – Whether or not to use cache
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.