case class SynonymAugmentationUMLS(sparkSession: SparkSession, umlsMetaPath: String = "", codeCol: String = "code", descriptionCol: String = "description", caseSensitive: Boolean = false) extends CheckLicense with Product with Serializable
Contains all methods to augment any given DataFrame through combinatorial NER synonym matching using UMLS or SentenceEntityResolvers. The augment function takes a DataFrame and an NER PipelineModel and augments the DataFrame by exactly matching named entities through the UMLS synonym relation, or by using any SentenceEntityResolver's output metadata. When UMLS is used as the SynonymSource, the UMLS META directory is expected to be present on the file system; when RESOLUTIONS is used as the SynonymSource, the umlsMetaPath parameter is ignored. The DataFrame is expected to have two columns: an 'identification' column (ideally unique) and an 'information' text column. When the augment function is called with augmentationMode == "chunk", the 'information' column should be the output of a chunk AnnotatorType. The pipeline is expected to have exactly one stage per Annotator.
Example
Augmenting a simple sentence
Define or load an NER pipeline including a chunk AnnotatorType for your source data:
val doc = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val tkn = new Tokenizer().setInputCols("document").setOutputCol("token")
val embs = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols("document", "token").setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
  .setInputCols("document", "token", "embeddings").setOutputCol("ner")
val conv = new NerConverterInternal().setInputCols("document", "token", "ner").setOutputCol("ner_chunk")
val edf = ResourceHelper.spark.createDataFrame(Array(Tuple1(""))).toDF("text")
val plModel = new Pipeline().setStages(Array(doc, tkn, embs, ner, conv)).fit(edf)
Then we can create the augmenter object and call the augment function as follows:
val augmenter = SynonymAugmentationUMLS(ResourceHelper.spark, "src/test/resources/synonym-augmentation/mini_umls_meta", "id", "text")
val synsSimple = augmenter.augmentDataFrame(edf, plModel, "ENG", false, AugmentationModes.PLAIN_TEXT).cache()
synsSimple.orderBy("code").show(1000, false)
print(synsSimple.count()) // Will most probably exceed the original number of unique rows due to augmentation
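The same augmenter can also produce chunk-level augmentations. A minimal sketch, reusing the augmenter, edf, and plModel defined above and assuming a CHUNK constant is exposed on AugmentationModes as suggested by the signatures below:
// Sketch only: CHUNK mode expects the 'information' column to carry the output of a chunk AnnotatorType.
val synsChunk = augmenter.augmentDataFrame(
  edf, plModel, "ENG",
  doProduct = true,                              // combine all possible synonyms
  augmentationMode = AugmentationModes.CHUNK
).cache()
synsChunk.show(false)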
Instance Constructors
- new SynonymAugmentationUMLS(sparkSession: SparkSession, umlsMetaPath: String = "", codeCol: String = "code", descriptionCol: String = "description", caseSensitive: Boolean = false)
Value Members
- final def !=(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- final def ##(): Int
  Definition Classes: AnyRef → Any
- final def ==(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- def applyCasingAugmentation(df: DataFrame, inputName: String, columnName: String, casingFunctions: Seq[String], delimiter: String = " "): DataFrame
- final def asInstanceOf[T0]: T0
  Definition Classes: Any
- def augmentCsv(corpusCsvPath: String, nlpPipeline: PipelineModel, language: String = "ENG", doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, synonymSource: String = SynonymSources.UMLS, regexParsers: List[(String, String)] = ..., euclideanDistanceThreshold: Double = 10, cosineDistanceThreshold: Double = 0.25, synonymLimit: Int = 5, casingFunctions: List[String] = Seq(CasingFunctions.INFER).asJava): DataFrame
- corpusCsvPath
path to the CSV with the identification and information columns, plus any other columns needed by the provided pipeline
- nlpPipeline
SparkNLP Pipeline including all stages until the AnnotatorType for the AugmentationMode selected (CHUNK / ENTITY)
- language
three letter upper-case code for UMLS language
- doProduct
whether or not to combine all possible synonyms
- augmentationMode
whether to do PLAIN_TEXT-based, CHUNK-based, or ENTITY-based augmentation
- synonymSource
whether to pick the synonyms from UMLS or from SentenceEntityResolver's metadata
- regexParsers
ordered regex-replace patterns applied recursively to the recognized entity in order to find an exact match in UMLS
- euclideanDistanceThreshold
max euclidean distance of a resolution to be considered in the augmentation
- cosineDistanceThreshold
max cosine distance of a resolution to be considered in the augmentation
- synonymLimit
max number of resolutions to be used as synonyms
- casingFunctions
list of strings for additional casing augmentation to apply
- returns
the augmented DataFrame
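A minimal sketch of calling augmentCsv; the CSV path is hypothetical, and its columns are assumed to match the "id" and "text" columns configured on the augmenter in the example above:
// Sketch only: "path/to/corpus.csv" is a hypothetical file containing the identification and information columns.
val augmentedFromCsv = augmenter.augmentCsv(
  "path/to/corpus.csv", plModel, "ENG",
  doProduct = false,
  augmentationMode = AugmentationModes.PLAIN_TEXT,
  synonymSource = SynonymSources.UMLS
)
augmentedFromCsv.show(false)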
- def augmentDataFrame(corpusDF: DataFrame, nlpPipeline: PipelineModel, language: String = "ENG", doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, synonymSource: String = SynonymSources.UMLS, regexParsers: List[(String, String)] = ..., euclideanDistanceThreshold: Double = 10, cosineDistanceThreshold: Double = 0.25, synonymLimit: Int = 5, casingFunctions: List[String] = Seq(CasingFunctions.INFER).asJava): DataFrame
- corpusDF
the DataFrame with the identification and information columns, plus any other columns needed by the provided pipeline
- nlpPipeline
SparkNLP Pipeline including all stages until the AnnotatorType for the AugmentationMode selected (CHUNK / ENTITY)
- language
three letter upper-case code for UMLS language
- doProduct
whether or not to combine all possible synonyms
- augmentationMode
whether to do PLAIN_TEXT-based, CHUNK-based, or ENTITY-based augmentation
- synonymSource
whether to pick the synonyms from UMLS or from SentenceEntityResolver's metadata
- regexParsers
ordered regex-replace patterns applied recursively to the recognized entity in order to find an exact match in UMLS
- euclideanDistanceThreshold
max euclidean distance of a resolution to be considered in the augmentation
- cosineDistanceThreshold
max cosine distance of a resolution to be considered in the augmentation
- synonymLimit
max number of resolutions to be used as synonyms
- casingFunctions
list of strings for additional casing augmentation to apply
- returns
the augmented DataFrame
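A hedged sketch of resolver-based augmentation; corpusDF and resolverPipeline are hypothetical (the pipeline is assumed to end in a SentenceEntityResolver stage), and RESOLUTIONS is assumed to be exposed as a SynonymSources constant. umlsMetaPath is ignored in this configuration:
// Sketch only: corpusDF, resolverPipeline and the RESOLUTIONS constant name are assumptions.
val synsFromResolver = augmenter.augmentDataFrame(
  corpusDF, resolverPipeline, "ENG",
  doProduct = false,
  augmentationMode = AugmentationModes.CHUNK,
  synonymSource = SynonymSources.RESOLUTIONS,    // assumed constant name for the resolver-based source
  euclideanDistanceThreshold = 10.0,
  cosineDistanceThreshold = 0.25,
  synonymLimit = 5
)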
- def buildCorpus(cDF: DataFrame, nByC: DataFrame, sByN: DataFrame, annotationCol: String, chunkCol: String, doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, casingFunctions: Seq[String] = Seq()): DataFrame
Returns the augmented corpus DataFrame, which depending on the AugmentationMode will have two (PLAIN_TEXT) or three (CHUNK) columns: code (document identifier), description (actual text of the augmented document), and annotation (augmented chunk annotations).
- cDF
DataFrame as it comes out after running the nerPipeline (with renamed columns for text and chunk)
- sByN
the Synonym DataFrame as returned from synonymsByNer in its first position
- doProduct
whether or not to combine all possible synonyms
- augmentationMode
whether to do PLAIN_TEXT-based, CHUNK-based, or ENTITY-based augmentation
- returns
the Corpus DataFrame
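A heavily hedged sketch of the call shape only; pipelineOutputDF, nersByCodeDF, and synonymsDF are placeholders for the inputs described above, and the annotation/chunk column names are assumptions rather than the documented contract:
// Sketch only: the three input DataFrames and the column names below are placeholders, not derived here.
val augmentedCorpus = augmenter.buildCorpus(
  pipelineOutputDF,               // cDF: pipeline output with renamed text/chunk columns
  nersByCodeDF,                   // nByC: hypothetical ners-by-code DataFrame
  synonymsDF,                     // sByN: first element of the synonymsByNer result
  "annotation", "ner_chunk",      // assumed annotation and chunk column names
  doProduct = false,
  augmentationMode = AugmentationModes.PLAIN_TEXT
)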
- val caseSensitive: Boolean
- def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
  Definition Classes: CheckLicense
- def checkValidScope(scope: String): Unit
  Definition Classes: CheckLicense
- def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
  Definition Classes: CheckLicense
- def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
  Definition Classes: CheckLicense
- def clone(): AnyRef
  Attributes: protected[lang]
  Definition Classes: AnyRef
  Annotations: @throws( ... ) @native()
- val codeCol: String
- val colFieldMap: Map[String, Map[String, String]]
- val descriptionCol: String
- def detectCasing(text: String, delimiter: String): String
- final def eq(arg0: AnyRef): Boolean
  Definition Classes: AnyRef
- def finalize(): Unit
  Attributes: protected[lang]
  Definition Classes: AnyRef
  Annotations: @throws( classOf[java.lang.Throwable] )
- final def getClass(): Class[_]
  Definition Classes: AnyRef → Any
  Annotations: @native()
- final def isInstanceOf[T0]: Boolean
  Definition Classes: Any
- val mrconsoHeaders: Seq[String]
- val mrconsoTableName: String
- val mrrelHeaders: Seq[String]
- val mrrelTableName: String
- val mrstyHeaders: Seq[String]
- val mrstyTableName: String
- final def ne(arg0: AnyRef): Boolean
  Definition Classes: AnyRef
- def nerByCode(corpusDF: DataFrame, nerPipeline: PipelineModel): DataFrame
Returns the transformed DataFrame after applying the provided nerPipeline.
- corpusDF
DataFrame with a descriptionCol of type text to start the pipeline
- nerPipeline
Spark ML Pipeline of Annotator stages including a NerConverterInternalModel stage
- returns
the transformed DataFrame after applying the pipeline
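A short sketch, assuming rawDF is a hypothetical DataFrame whose text sits in the descriptionCol configured at construction ("text" in the example above):
// Sketch only: rawDF is an assumption; "ner_chunk" matches the NerConverterInternal output column of the example pipeline.
val nByC = augmenter.nerByCode(rawDF, plModel)
nByC.select("ner_chunk").show(false)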
- val nerCuisCols: Seq[String]
- val nerResult: String
- final def notify(): Unit
  Definition Classes: AnyRef
  Annotations: @native()
- final def notifyAll(): Unit
  Definition Classes: AnyRef
  Annotations: @native()
- val originalNer: String
- def parseAndMatch(nersMissingCui: DataFrame, cuis: DataFrame, regexParsersLeft: Seq[(String, String)] = Seq(), nersCuisCum: DataFrame = sparkSession.emptyDataFrame): (DataFrame, DataFrame)
- def resolutionSynonymsByNer(ners: DataFrame): (DataFrame, DataFrame)
- val resolvedCosineDistance: String
- val resolvedEuclideanDistance: String
- val resolvedResult: String
- val sparkSession: SparkSession
- val synColname: String
- final def synchronized[T0](arg0: ⇒ T0): T0
  Definition Classes: AnyRef
- def synonymsByNer(ners: DataFrame, language: String = "ENG", regexParsers: Seq[(String, String)] = Seq()): (DataFrame, DataFrame)
Returns a tuple of two DataFrames. The first is a Synonym DataFrame with columns: cui (CUI code for the matched chunk), original_ner (original mention predicted), ner_result (result after applying enough regex_replace patterns to find a match), and synonym (synonym for ner_result). The second DataFrame consists of NERs missing UMLS matches (it can be helpful for generating new regex_replace patterns and rerunning); its columns are original_ner and ner_result.
- ners
a DataFrame with two columns: a string document identifier and 'ner_result', an array of strings with the predicted mentions
- language
three letter uppercase symbol for the desired language to filter the UMLS CUIs
- regexParsers
a List of tuples of two Strings; (regex, replace) patterns to apply recursively on the predicted mentions
- returns
a tuple of two DataFrames: a Synonym DataFrame and a DataFrame with entities missing UMLS matches
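A hedged sketch; nersDF is a hypothetical DataFrame shaped as described above, and the regex-replace patterns are purely illustrative:
// Sketch only: strip punctuation and collapse whitespace before matching against UMLS.
val parsers = Seq(("[\\.,;:]", ""), ("\\s+", " "))
val (synonymsDF, missingDF) = augmenter.synonymsByNer(nersDF, "ENG", parsers)
synonymsDF.show(false)   // cui, original_ner, ner_result, synonym
missingDF.show(false)    // entities with no UMLS match; useful when crafting new regex patterns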
- val umlsMetaPath: String
- final def wait(): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... )
- final def wait(arg0: Long): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... ) @native()