case class SynonymAugmentationUMLS(sparkSession: SparkSession, umlsMetaPath: String = "", codeCol: String = "code", descriptionCol: String = "description", caseSensitive: Boolean = false) extends CheckLicense with Product with Serializable
Contains all methods to augment any given DataFrame through combinatorial NER synonym matching using UMLS or SentenceEntityResolvers. The augment function takes a DataFrame and an NER PipelineModel and augments the DataFrame by exactly matching named entities through the UMLS synonym relation, or by using any SentenceEntityResolver's output metadata. When UMLS is used as the SynonymSource, the UMLS META directory is expected to be present on the file system; when RESOLUTIONS is used as the SynonymSource, the umlsMetaPath parameter is ignored. The DataFrame is expected to have two columns: an 'identification' column (ideally unique) and an 'information' text column. When the augment function is called with augmentationMode == "chunk", the 'information' column should be the output of a chunk AnnotatorType. The pipeline is expected to have exactly one stage per Annotator.
Example
Augmenting a simple sentence
Define or load an NER pipeline including a chunk AnnotatorType for your source data:
val doc = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val tkn = new Tokenizer().setInputCols("document").setOutputCol("token")
val embs = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols("document", "token").setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
  .setInputCols("document", "token", "embeddings").setOutputCol("ner")
val conv = new NerConverterInternal().setInputCols("document", "token", "ner").setOutputCol("ner_chunk")
val edf = ResourceHelper.spark.createDataFrame(Array(Tuple1(""))).toDF("text")
val plModel = new Pipeline().setStages(Array(doc, tkn, embs, ner, conv)).fit(edf)
Then we can create the augmenter object and call the augment function as follows:
val augmenter = SynonymAugmentationUMLS(ResourceHelper.spark, "src/test/resources/synonym-augmentation/mini_umls_meta", "id", "text")
val synsSimple = augmenter.augmentDataFrame(edf, plModel, "ENG", false, AugmentationModes.PLAIN_TEXT).cache()
synsSimple.orderBy("code").show(1000, false)
print(synsSimple.count()) // Will most probably exceed the original number of unique rows due to augmentation
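The same augmenter can also produce chunk-level augmentations. A minimal sketch, reusing the augmenter, edf, and plModel defined above and assuming a CHUNK constant is exposed on AugmentationModes as suggested by the signatures below:
// Sketch only: CHUNK mode expects the 'information' column to carry the output of a chunk AnnotatorType.
val synsChunk = augmenter.augmentDataFrame(
  edf, plModel, "ENG",
  doProduct = true,                              // combine all possible synonyms
  augmentationMode = AugmentationModes.CHUNK
).cache()
synsChunk.show(false)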
Instance Constructors
- new SynonymAugmentationUMLS(sparkSession: SparkSession, umlsMetaPath: String = "", codeCol: String = "code", descriptionCol: String = "description", caseSensitive: Boolean = false)
Value Members
- final def !=(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- final def ##(): Int
  Definition Classes: AnyRef → Any
- final def ==(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- def applyCasingAugmentation(df: DataFrame, inputName: String, columnName: String, casingFunctions: Seq[String], delimiter: String = " "): DataFrame
- final def asInstanceOf[T0]: T0
  Definition Classes: Any
- def augmentCsv(corpusCsvPath: String, nlpPipeline: PipelineModel, language: String = "ENG", doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, synonymSource: String = SynonymSources.UMLS, regexParsers: List[(String, String)] = ..., euclideanDistanceThreshold: Double = 10, cosineDistanceThreshold: Double = 0.25, synonymLimit: Int = 5, casingFunctions: List[String] = Seq(CasingFunctions.INFER).asJava): DataFrame
- corpusCsvPath
path to the CSV with the identification and information columns, plus any other columns needed by the provided pipeline
- nlpPipeline
SparkNLP Pipeline including all stages until the AnnotatorType for the AugmentationMode selected (CHUNK / ENTITY)
- language
three letter upper-case code for UMLS language
- doProduct
whether or not to combine all possible synonyms
- augmentationMode
whether to do PLAIN_TEXT-based, CHUNK-based, or ENTITY-based augmentation
- synonymSource
whether to pick the synonyms from UMLS or from SentenceEntityResolver's metadata
- regexParsers
ordered regex-replace patterns applied recursively to the recognized entity in order to find an exact match in UMLS
- euclideanDistanceThreshold
max euclidean distance of a resolution to be considered in the augmentation
- cosineDistanceThreshold
max cosine distance of a resolution to be considered in the augmentation
- synonymLimit
max number of resolutions to be used as synonyms
- casingFunctions
list of strings for additional casing augmentation to apply
- returns
the augmented DataFrame
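A minimal sketch of calling augmentCsv; the CSV path is hypothetical, and its columns are assumed to match the "id" and "text" columns configured on the augmenter in the example above:
// Sketch only: "path/to/corpus.csv" is a hypothetical file containing the identification and information columns.
val augmentedFromCsv = augmenter.augmentCsv(
  "path/to/corpus.csv", plModel, "ENG",
  doProduct = false,
  augmentationMode = AugmentationModes.PLAIN_TEXT,
  synonymSource = SynonymSources.UMLS
)
augmentedFromCsv.show(false)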
- def augmentDataFrame(corpusDF: DataFrame, nlpPipeline: PipelineModel, language: String = "ENG", doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, synonymSource: String = SynonymSources.UMLS, regexParsers: List[(String, String)] = ..., euclideanDistanceThreshold: Double = 10, cosineDistanceThreshold: Double = 0.25, synonymLimit: Int = 5, casingFunctions: List[String] = Seq(CasingFunctions.INFER).asJava): DataFrame
- corpusDF
the DataFrame with the identification and information columns, plus any other columns needed by the provided pipeline
- nlpPipeline
SparkNLP Pipeline including all stages until the AnnotatorType for the AugmentationMode selected (CHUNK / ENTITY)
- language
three letter upper-case code for UMLS language
- doProduct
whether or not to combine all possible synonyms
- augmentationMode
whether to do PLAIN_TEXT-based, CHUNK-based, or ENTITY-based augmentation
- synonymSource
whether to pick the synonyms from UMLS or from SentenceEntityResolver's metadata
- regexParsers
ordered regex-replace patterns applied recursively to the recognized entity in order to find an exact match in UMLS
- euclideanDistanceThreshold
max euclidean distance of a resolution to be considered in the augmentation
- cosineDistanceThreshold
max cosine distance of a resolution to be considered in the augmentation
- synonymLimit
max number of resolutions to be used as synonyms
- casingFunctions
list of strings for additional casing augmentation to apply
- returns
the augmented DataFrame
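A hedged sketch of resolver-based augmentation; corpusDF and resolverPipeline are hypothetical (the pipeline is assumed to end in a SentenceEntityResolver stage), and RESOLUTIONS is assumed to be exposed as a SynonymSources constant. umlsMetaPath is ignored in this configuration:
// Sketch only: corpusDF, resolverPipeline and the RESOLUTIONS constant name are assumptions.
val synsFromResolver = augmenter.augmentDataFrame(
  corpusDF, resolverPipeline, "ENG",
  doProduct = false,
  augmentationMode = AugmentationModes.CHUNK,
  synonymSource = SynonymSources.RESOLUTIONS,    // assumed constant name for the resolver-based source
  euclideanDistanceThreshold = 10.0,
  cosineDistanceThreshold = 0.25,
  synonymLimit = 5
)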
- def buildCorpus(cDF: DataFrame, nByC: DataFrame, sByN: DataFrame, annotationCol: String, chunkCol: String, doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, casingFunctions: Seq[String] = Seq()): DataFrame
Returns the augmented corpus DataFrame, which depending on the AugmentationMode will have two (PLAIN_TEXT) or three (CHUNK) columns: code (document identifier), description (actual text of the augmented document), and annotation (augmented chunk annotations).
- cDF
DataFrame as it comes out after running the nerPipeline (with renamed columns for text and chunk)
- sByN
the Synonym DataFrame as returned from synonymsByNer in its first position
- doProduct
whether or not to combine all possible synonyms
- augmentationMode
whether to do PLAIN_TEXT-based, CHUNK-based, or ENTITY-based augmentation
- returns
the Corpus DataFrame
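A heavily hedged sketch of the call shape only; pipelineOutputDF, nersByCodeDF, and synonymsDF are placeholders for the inputs described above, and the annotation/chunk column names are assumptions rather than the documented contract:
// Sketch only: the three input DataFrames and the column names below are placeholders, not derived here.
val augmentedCorpus = augmenter.buildCorpus(
  pipelineOutputDF,               // cDF: pipeline output with renamed text/chunk columns
  nersByCodeDF,                   // nByC: hypothetical ners-by-code DataFrame
  synonymsDF,                     // sByN: first element of the synonymsByNer result
  "annotation", "ner_chunk",      // assumed annotation and chunk column names
  doProduct = false,
  augmentationMode = AugmentationModes.PLAIN_TEXT
)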
- val caseSensitive: Boolean
- def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
  Definition Classes: CheckLicense
- def checkValidScope(scope: String): Unit
  Definition Classes: CheckLicense
- def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
  Definition Classes: CheckLicense
- def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
  Definition Classes: CheckLicense
- def clone(): AnyRef
  Attributes: protected[lang]
  Definition Classes: AnyRef
  Annotations: @throws( ... ) @native()
- val codeCol: String
- val colFieldMap: Map[String, Map[String, String]]
- val descriptionCol: String
- def detectCasing(text: String, delimiter: String): String
- final def eq(arg0: AnyRef): Boolean
  Definition Classes: AnyRef
- def finalize(): Unit
  Attributes: protected[lang]
  Definition Classes: AnyRef
  Annotations: @throws( classOf[java.lang.Throwable] )
- final def getClass(): Class[_]
  Definition Classes: AnyRef → Any
  Annotations: @native()
- final def isInstanceOf[T0]: Boolean
  Definition Classes: Any
- val mrconsoHeaders: Seq[String]
- val mrconsoTableName: String
- val mrrelHeaders: Seq[String]
- val mrrelTableName: String
- val mrstyHeaders: Seq[String]
- val mrstyTableName: String
- final def ne(arg0: AnyRef): Boolean
  Definition Classes: AnyRef
- def nerByCode(corpusDF: DataFrame, nerPipeline: PipelineModel): DataFrame
Returns the transformed DataFrame after applying the provided nerPipeline.
- corpusDF
DataFrame with a descriptionCol of type text to start the pipeline
- nerPipeline
Spark ML Pipeline of Annotator stages including a NerConverterInternalModel stage
- returns
the transformed DataFrame after applying the pipeline
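A short sketch, assuming rawDF is a hypothetical DataFrame whose text sits in the descriptionCol configured at construction ("text" in the example above):
// Sketch only: rawDF is an assumption; "ner_chunk" matches the NerConverterInternal output column of the example pipeline.
val nByC = augmenter.nerByCode(rawDF, plModel)
nByC.select("ner_chunk").show(false)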
- val nerCuisCols: Seq[String]
- val nerResult: String
- final def notify(): Unit
  Definition Classes: AnyRef
  Annotations: @native()
- final def notifyAll(): Unit
  Definition Classes: AnyRef
  Annotations: @native()
- val originalNer: String
- def parseAndMatch(nersMissingCui: DataFrame, cuis: DataFrame, regexParsersLeft: Seq[(String, String)] = Seq(), nersCuisCum: DataFrame = sparkSession.emptyDataFrame): (DataFrame, DataFrame)
- def resolutionSynonymsByNer(ners: DataFrame): (DataFrame, DataFrame)
- val resolvedCosineDistance: String
- val resolvedEuclideanDistance: String
- val resolvedResult: String
- val sparkSession: SparkSession
- val synColname: String
- final def synchronized[T0](arg0: ⇒ T0): T0
  Definition Classes: AnyRef
- def synonymsByNer(ners: DataFrame, language: String = "ENG", regexParsers: Seq[(String, String)] = Seq()): (DataFrame, DataFrame)
Returns a tuple of two DataFrames. The first is a Synonym DataFrame with columns: cui (CUI code for the matched chunk), original_ner (original mention predicted), ner_result (result after applying enough regex_replace patterns to find a match), and synonym (synonym for ner_result). The second DataFrame consists of NERs missing UMLS matches (it can be helpful for generating new regex_replace patterns and rerunning); its columns are original_ner and ner_result.
- ners
a DataFrame with two columns: a string document identifier and 'ner_result', an array of strings with the predicted mentions
- language
three letter uppercase symbol for the desired language to filter the UMLS CUIs
- regexParsers
a List of tuples of two Strings; (regex, replace) patterns to apply recursively on the predicted mentions
- returns
a tuple of two DataFrames: a Synonym DataFrame and a DataFrame with entities missing UMLS matches
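A hedged sketch; nersDF is a hypothetical DataFrame shaped as described above, and the regex-replace patterns are purely illustrative:
// Sketch only: strip punctuation and collapse whitespace before matching against UMLS.
val parsers = Seq(("[\\.,;:]", ""), ("\\s+", " "))
val (synonymsDF, missingDF) = augmenter.synonymsByNer(nersDF, "ENG", parsers)
synonymsDF.show(false)   // cui, original_ner, ner_result, synonym
missingDF.show(false)    // entities with no UMLS match; useful when crafting new regex patterns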
- val umlsMetaPath: String
- final def wait(): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... )
- final def wait(arg0: Long): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... ) @native()