sparknlp_jsl.annotator.normalizer.drug_normalizer#

Module Contents#

Classes#

DrugNormalizer

Normalizes drug mentions in clinical text.

class DrugNormalizer#

Bases: sparknlp_jsl.common.AnnotatorModelInternal

Normalizes drug mentions in clinical text.

Adds spaces between punctuation and words, and normalizes the drug mentions. The policy parameter controls which drug information is normalized; see the parameter documentation for details.

For usage examples and discussion, check out this blog post.

Input Annotation types: DOCUMENT

Output Annotation type: DOCUMENT

Parameters:
  • lowercase – Whether to convert strings to lowercase

  • policy

    Policy to remove patterns from text. Defaults to “all”.

    Possible values are “all”, “abbreviations”, or “dosages”:

      • “abbreviations” replaces all abbreviations with their full form (for example, replacing “oral sol” with “oral solution”).

      • “dosages” replaces all dosages with a standardized form (for example, replacing “10 million units” with “10000000 unt”).

      • “all” replaces both abbreviations and dosages.

Examples

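>>> from pyspark.ml import Pipeline
>>> from sparknlp.base import DocumentAssembler
>>> from sparknlp_jsl.annotator import DrugNormalizer
>>> data = spark.createDataFrame([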
...   ["Sodium Chloride/Potassium Chloride 13bag"],
...   ["interferon alfa-2b 10 million unit ( 1 ml ) injec"],
...   ["aspirin 10 meq/ 5 ml oral sol"]
... ]).toDF("text")
>>> document = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> drugNormalizer = DrugNormalizer().setInputCols(["document"]).setOutputCol("document_normalized")
>>> trainingPipeline = Pipeline(stages=[document, drugNormalizer])
>>> result = trainingPipeline.fit(data).transform(data)
>>> result.selectExpr("explode(document_normalized.result) as normalized_text").show(truncate=False)
+----------------------------------------------------+
|normalized_text                                     |
+----------------------------------------------------+
|Sodium Chloride / Potassium Chloride 13 bag         |
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|aspirin 2 meq/ml oral solution                      |
+----------------------------------------------------+
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
lowercase#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
policy#
skipLPInputColsValidation = True#
clear(param)#

Clears a param from the param map if it has been explicitly set.

copy(extra=None)#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param)#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()#

Returns the documentation of all params, along with their default values (if any) and user-supplied values.

extractParamMap(extra=None)#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict
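The precedence described for extractParamMap can be sketched with plain dictionaries. This is only a toy illustration, not the library's implementation: merge_param_maps is a hypothetical helper, and the lowercase default of False is an assumed value chosen for the example.

```python
# Toy illustration of the precedence used by extractParamMap:
# default param values < user-supplied values < extra.
# Plain dicts stand in for the annotator's internal param maps.

def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win on conflicting keys."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged


# Hypothetical values: "all" is the documented policy default,
# the lowercase default is assumed for this sketch.
defaults = {"policy": "all", "lowercase": False}
user_supplied = {"policy": "dosages"}
extra = {"lowercase": True}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)  # {'policy': 'dosages', 'lowercase': True}
```

The user-supplied policy replaces the default, and the extra map replaces the lowercase default, which matches the ordering stated above.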

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)#

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName)#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param)#

Checks whether a param has a default value.

hasParam(paramName)#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param)#

Checks whether a param is explicitly set by user or has a default value.

isSet(param)#

Checks whether a param is explicitly set by user.
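The difference between isSet, hasDefault, isDefined, and getOrDefault is easy to confuse. The sketch below models it with a hypothetical ToyParams class that keeps defaults and user-supplied values in separate dictionaries. It is a stand-in written for illustration, assuming only the behavior described in this page, and is not the class the annotator actually inherits from.

```python
class ToyParams:
    """Hypothetical stand-in for a param holder; not the pyspark class."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # values used when nothing is set
        self._user = {}                  # values explicitly set by the user

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Removes only the explicitly set value; the default is kept.
        self._user.pop(name, None)

    def isSet(self, name):
        return name in self._user

    def hasDefault(self, name):
        return name in self._defaults

    def isDefined(self, name):
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} has neither a value nor a default")


params = ToyParams(defaults={"policy": "all"})
print(params.isSet("policy"), params.isDefined("policy"))  # False True
params.set("policy", "dosages")
print(params.getOrDefault("policy"))  # dosages
params.clear("policy")
print(params.getOrDefault("policy"))  # all
```

A param with only a default is defined but not set, and clearing an explicitly set value falls back to the default rather than leaving the param undefined.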

classmethod load(path)#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path)#

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)#

Sets a parameter in the embedded param map.

setForceInputTypeValidation(etfm)#
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setLowercase(value: bool)#

Sets whether to convert strings to lowercase.

Parameters:

value (bool) – Whether to convert strings to lowercase

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setPolicy(value: str)#

Sets policy to remove patterns from text.

Possible values are “all”, “abbreviations”, or “dosages”:

  • “abbreviations” replaces all abbreviations with their full form (for example, replacing “oral sol” with “oral solution”).

  • “dosages” replaces all dosages with a standardized form (for example, replacing “10 million units” with “10000000 unt”).

  • “all” replaces both abbreviations and dosages.

Parameters:

value (str) – policy to remove patterns from text.
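To make the three policy values concrete, here is a small pure-Python sketch that mimics their effect on a single string. Everything in it is hypothetical: toy_normalize, the two-entry abbreviation table, and the single dosage regex are assumptions made for illustration, and the real annotator's rules are not shown on this page and are certainly broader.

```python
import re

# Hypothetical lookup table; the annotator's actual rules are not documented here.
TOY_ABBREVIATIONS = {"sol": "solution", "injec": "injection"}


def toy_expand_abbreviations(text):
    # Replace whole words only, so "solution" itself is left untouched.
    pattern = r"\b(" + "|".join(TOY_ABBREVIATIONS) + r")\b"
    return re.sub(pattern, lambda m: TOY_ABBREVIATIONS[m.group(1)], text)


def toy_standardize_dosages(text):
    # Turn "<n> million unit(s)" into "<n * 1_000_000> unt".
    return re.sub(
        r"\b(\d+) million units?\b",
        lambda m: f"{int(m.group(1)) * 1_000_000} unt",
        text,
    )


def toy_normalize(text, policy="all"):
    if policy not in ("all", "abbreviations", "dosages"):
        raise ValueError(f"unknown policy: {policy!r}")
    if policy in ("all", "dosages"):
        text = toy_standardize_dosages(text)
    if policy in ("all", "abbreviations"):
        text = toy_expand_abbreviations(text)
    return text


sample = "interferon alfa 10 million units oral sol"
for policy in ("abbreviations", "dosages", "all"):
    print(policy, "->", toy_normalize(sample, policy))
# abbreviations -> interferon alfa 10 million units oral solution
# dosages -> interferon alfa 10000000 unt oral sol
# all -> interferon alfa 10000000 unt oral solution
```

With the real annotator, the equivalent choice is made by passing one of the same three strings to setPolicy before the pipeline is fitted.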

transform(dataset, params=None)#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write()#

Returns an MLWriter instance for this ML instance.