com.johnsnowlabs.nlp.training

SynonymAugmentationUMLS

case class SynonymAugmentationUMLS(sparkSession: SparkSession, umlsMetaPath: String = "", codeCol: String = "code", descriptionCol: String = "description", caseSensitive: Boolean = false) extends CheckLicense with Product with Serializable

Contains all methods to augment a given DataFrame through combinatorial NER synonym matching, using either UMLS or SentenceEntityResolvers as the synonym source. The augment functions take a DataFrame and a NER PipelineModel and augment the data by exactly matching named entities through the UMLS synonym relation, or by using any SentenceEntityResolver's output metadata. If UMLS is used as the SynonymSource, the UMLS META directory is expected to be present on the file system; if RESOLUTIONS is used as the SynonymSource, the umlsMetaPath parameter is ignored. The input DataFrame is expected to have two columns: an 'identification' column (ideally unique) and an 'information' text column. When the augment function is called with augmentationMode == "chunk", the 'information' column should be the output of a chunk AnnotatorType. The pipeline is expected to have exactly one stage per Annotator.

Example

Augmenting a simple sentence

Define or load an NER pipeline including a chunk AnnotatorType for your source data:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal} // healthcare annotators; package may vary by version
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import org.apache.spark.ml.Pipeline

val doc = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val tkn = new Tokenizer().setInputCols("document").setOutputCol("token")
val embs = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
          .setInputCols("document", "token").setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
          .setInputCols("document", "token", "embeddings").setOutputCol("ner")
val conv = new NerConverterInternal().setInputCols("document", "token", "ner").setOutputCol("ner_chunk")
val edf = ResourceHelper.spark.createDataFrame(Array(Tuple1(""))).toDF("text") // empty DataFrame used only to fit the pipeline
val plModel = new Pipeline().setStages(Array(doc, tkn, embs, ner, conv)).fit(edf)

Then we can create the augmenter object and call augmentDataFrame as follows:

val augmenter = SynonymAugmentationUMLS(ResourceHelper.spark, "src/test/resources/synonym-augmentation/mini_umls_meta", "id", "text")
val corpus = ResourceHelper.spark.createDataFrame(Seq(("1", "Patient denies fever and cough."))).toDF("id", "text") // illustrative corpus with the configured id/text columns
val syns = augmenter.augmentDataFrame(corpus, plModel, "ENG", false, AugmentationModes.PLAIN_TEXT).cache()
syns.orderBy("code").show(1000, false)
print(syns.count()) // will likely exceed the original number of unique rows due to augmentation
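
To combine all possible synonym substitutions per document, enable doProduct. A minimal sketch reusing the objects above:

val synsProduct = augmenter.augmentDataFrame(corpus, plModel, "ENG", true, AugmentationModes.PLAIN_TEXT).cache()
synsProduct.show(false) // rows for every combination of matched synonyms
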
Linear Supertypes
Serializable, Serializable, Product, Equals, CheckLicense, AnyRef, Any

Instance Constructors

  1. new SynonymAugmentationUMLS(sparkSession: SparkSession, umlsMetaPath: String = "", codeCol: String = "code", descriptionCol: String = "description", caseSensitive: Boolean = false)

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def applyCasingAugmentation(df: DataFrame, inputName: String, columnName: String, casingFunctions: Seq[String], delimiter: String = " "): DataFrame
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def augmentCsv(corpusCsvPath: String, nlpPipeline: PipelineModel, language: String = "ENG", doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, synonymSource: String = SynonymSources.UMLS, regexParsers: List[(String, String)] = ..., euclideanDistanceThreshold: Double = 10, cosineDistanceThreshold: Double = 0.25, synonymLimit: Int = 5, casingFunctions: List[String] = Seq(CasingFunctions.INFER).asJava): DataFrame

    corpusCsvPath

    path to the CSV containing the identification and information columns, plus any other columns needed by the provided pipeline

    nlpPipeline

    Spark NLP pipeline including all stages up to the AnnotatorType required by the selected AugmentationMode (CHUNK / ENTITY)

    language

    three-letter upper-case UMLS language code

    doProduct

    whether or not to combine all possible synonyms

    augmentationMode

    whether to perform PLAIN_TEXT-, CHUNK-, or ENTITY-based augmentation

    synonymSource

    whether to pick the synonyms from UMLS or from SentenceEntityResolver's metadata

    regexParsers

    ordered (regex, replacement) patterns applied recursively to each recognized entity in order to find an exact match in UMLS

    euclideanDistanceThreshold

    maximum Euclidean distance for a resolution to be considered in the augmentation

    cosineDistanceThreshold

    maximum cosine distance for a resolution to be considered in the augmentation

    synonymLimit

    max number of resolutions to be used as synonyms

    casingFunctions

    list of strings for additional casing augmentation to apply

    returns

    the augmented DataFrame
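
    A minimal usage sketch; the CSV path is hypothetical, and its identification/information columns must match the codeCol and descriptionCol passed to the constructor:

    val augmentedCsv = augmenter.augmentCsv(
      "data/corpus.csv", // hypothetical path
      plModel,
      "ENG",
      false,
      AugmentationModes.PLAIN_TEXT)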

  7. def augmentDataFrame(corpusDF: DataFrame, nlpPipeline: PipelineModel, language: String = "ENG", doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, synonymSource: String = SynonymSources.UMLS, regexParsers: List[(String, String)] = ..., euclideanDistanceThreshold: Double = 10, cosineDistanceThreshold: Double = 0.25, synonymLimit: Int = 5, casingFunctions: List[String] = Seq(CasingFunctions.INFER).asJava): DataFrame

    corpusDF

    the DataFrame with the identification and information columns, plus any other columns needed by the provided pipeline

    nlpPipeline

    Spark NLP pipeline including all stages up to the AnnotatorType required by the selected AugmentationMode (CHUNK / ENTITY)

    language

    three-letter upper-case UMLS language code

    doProduct

    whether or not to combine all possible synonyms

    augmentationMode

    whether to perform PLAIN_TEXT-, CHUNK-, or ENTITY-based augmentation

    synonymSource

    whether to pick the synonyms from UMLS or from SentenceEntityResolver's metadata

    regexParsers

    ordered (regex, replacement) patterns applied recursively to each recognized entity in order to find an exact match in UMLS

    euclideanDistanceThreshold

    maximum Euclidean distance for a resolution to be considered in the augmentation

    cosineDistanceThreshold

    maximum cosine distance for a resolution to be considered in the augmentation

    synonymLimit

    max number of resolutions to be used as synonyms

    casingFunctions

    list of strings for additional casing augmentation to apply

    returns

    the augmented DataFrame
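
    A sketch of resolution-based augmentation; SynonymSources.RESOLUTIONS is assumed to be the constant selecting resolver metadata, and the pipeline must end in a SentenceEntityResolver stage (umlsMetaPath is ignored in this mode):

    val augmentedRes = augmenter.augmentDataFrame(
      corpus,
      plModel, // assumed to include a SentenceEntityResolver stage
      "ENG",
      false,
      AugmentationModes.PLAIN_TEXT,
      SynonymSources.RESOLUTIONS, // assumed constant name; UMLS is the default
      euclideanDistanceThreshold = 10,
      cosineDistanceThreshold = 0.25,
      synonymLimit = 5)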

  8. def buildCorpus(cDF: DataFrame, nByC: DataFrame, sByN: DataFrame, annotationCol: String, chunkCol: String, doProduct: Boolean = false, augmentationMode: String = AugmentationModes.PLAIN_TEXT, casingFunctions: Seq[String] = Seq()): DataFrame

    Returns the augmented corpus DataFrame which, depending on the AugmentationMode, will have two (PLAIN_TEXT) or three (CHUNK) columns:

        code: document identifier
        description: actual text of the augmented document
        annotation: augmented chunk annotations

    cDF

    DataFrame as it comes out after running the nerPipeline (with renamed columns for text and chunk)

    sByN

    the Synonym DataFrame as returned from synonymsByNer in its first position

    doProduct

    whether or not to combine all possible synonyms

    augmentationMode

    whether to perform PLAIN_TEXT-, CHUNK-, or ENTITY-based augmentation

    returns

    the Corpus DataFrame
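
    A rough sketch of how the pieces compose; which DataFrame feeds cDF and nByC, and the column names, are assumptions inferred from the parameter docs (see nerByCode and synonymsByNer below):

    val nByC = augmenter.nerByCode(corpus, plModel) // run the NER pipeline
    val (sByN, _) = augmenter.synonymsByNer(nByC, "ENG") // match synonyms in UMLS
    val corpusAug = augmenter.buildCorpus(
      cDF = nByC, // assumption: pipeline output with renamed text/chunk columns
      nByC = nByC, // assumption
      sByN = sByN,
      annotationCol = "annotation", // illustrative column names
      chunkCol = "ner_chunk")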

  9. val caseSensitive: Boolean
  10. def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
    Definition Classes
    CheckLicense
  11. def checkValidScope(scope: String): Unit
    Definition Classes
    CheckLicense
  12. def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  13. def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  14. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  15. val codeCol: String
  16. val colFieldMap: Map[String, Map[String, String]]
  17. val descriptionCol: String
  18. def detectCasing(text: String, delimiter: String): String
  19. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  20. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  21. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  22. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  23. val mrconsoHeaders: Seq[String]
  24. val mrconsoTableName: String
  25. val mrrelHeaders: Seq[String]
  26. val mrrelTableName: String
  27. val mrstyHeaders: Seq[String]
  28. val mrstyTableName: String
  29. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  30. def nerByCode(corpusDF: DataFrame, nerPipeline: PipelineModel): DataFrame

    Returns the transformed DataFrame after applying the provided nerPipeline.

    corpusDF

    DataFrame with a descriptionCol of type text to start the pipeline

    nerPipeline

    Spark ML Pipeline of Annotator stages including a NerConverterInternalModel stage

    returns

    the transformed DataFrame after applying the pipeline

  31. val nerCuisCols: Seq[String]
  32. val nerResult: String
  33. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  34. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  35. val originalNer: String
  36. def parseAndMatch(nersMissingCui: DataFrame, cuis: DataFrame, regexParsersLeft: Seq[(String, String)] = Seq(), nersCuisCum: DataFrame = sparkSession.emptyDataFrame): (DataFrame, DataFrame)

  37. def resolutionSynonymsByNer(ners: DataFrame): (DataFrame, DataFrame)
  38. val resolvedCosineDistance: String
  39. val resolvedEuclideanDistance: String
  40. val resolvedResult: String
  41. val sparkSession: SparkSession
  42. val synColname: String
  43. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  44. def synonymsByNer(ners: DataFrame, language: String = "ENG", regexParsers: Seq[(String, String)] = Seq()): (DataFrame, DataFrame)

    Returns a tuple of two DataFrames. The first is a Synonym DataFrame with columns:

        cui: CUI code for the matched chunk
        original_ner: the original predicted mention
        ner_result: the result after applying enough regex_replace patterns to find a match
        synonym: a synonym for ner_result

    The second DataFrame contains NERs missing UMLS matches (helpful for generating new regex_replace patterns and rerunning); its columns are:

        original_ner: the original predicted mention
        ner_result: the result after applying the regex_replace patterns

    ners

    a DataFrame with two columns: a string document identifier and 'ner_result', an array of strings with the predicted mentions

    language

    three-letter upper-case code for the desired language, used to filter the UMLS CUIs

    regexParsers

    a list of (regex, replacement) String pairs to apply recursively to the predicted mentions

    returns

    a tuple of two DataFrames: a Synonym DataFrame and a DataFrame with entities missing UMLS matches
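
    A usage sketch with illustrative input; the sample values and regex pattern are assumptions:

    val ners = ResourceHelper.spark.createDataFrame(Seq(
        ("doc1", Seq("acute cough", "fevers")))).toDF("code", "ner_result")
    val (synonyms, missing) = augmenter.synonymsByNer(
      ners, "ENG",
      regexParsers = Seq(("s$", ""))) // e.g. strip a trailing plural 's'
    synonyms.show(false) // matched UMLS synonyms
    missing.show(false) // mentions still lacking a UMLS match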

  45. val umlsMetaPath: String
  46. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  47. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  48. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()

Inherited from Serializable

Inherited from Serializable

Inherited from Product

Inherited from Equals

Inherited from CheckLicense

Inherited from AnyRef

Inherited from Any
