package embeddings


Type Members

  1. class AverageEmbeddings extends AnnotatorModel[AverageEmbeddings] with HasSimpleAnnotate[AverageEmbeddings] with HasStorageRef with HasEmbeddingsProperties with CheckLicense

    Merge embeddings.

  2. class BertSentenceChunkEmbeddings extends BertSentenceEmbeddings with HandleExceptionParams with HasSafeBatchAnnotate[BertSentenceEmbeddings] with CheckLicense

    BERT Sentence embeddings for chunk annotations which take into account the context of the sentence the chunk appeared in.

    This is an extension of BertSentenceEmbeddings which combines the embedding of a chunk with the embedding of the surrounding sentence. For each input chunk annotation, it finds the corresponding sentence, computes the BERT sentence embedding of both the chunk and the sentence, and averages them. The resulting embeddings are useful when a numerical representation of a text chunk needs to be sensitive to the context the chunk appears in.
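    The averaging step described above can be sketched in plain Scala. This is only an illustration under the assumption that the combination is a simple element-wise mean; `averageEmbeddings` is a hypothetical helper and not part of the library API, and the toy vectors stand in for real BERT sentence embeddings.

```scala
// Hypothetical helper: element-wise mean of a chunk embedding and the
// embedding of the sentence that contains the chunk.
def averageEmbeddings(chunk: Array[Float], sentence: Array[Float]): Array[Float] = {
  require(chunk.length == sentence.length, "embeddings must have the same dimension")
  chunk.zip(sentence).map { case (c, s) => (c + s) / 2.0f }
}

// Toy 4-dimensional vectors (assumed values, chosen to be exact in Float).
val chunkVec = Array(0.25f, -0.5f, 1.0f, 0.0f)
val sentenceVec = Array(0.75f, 0.0f, -1.0f, 0.5f)

// Context-sensitive chunk representation: Array(0.5, -0.25, 0.0, 0.25)
val contextual = averageEmbeddings(chunkVec, sentenceVec)
```

    Because the sentence vector contributes half of the result, the same chunk text appearing in two different sentences will generally receive different embeddings.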

    This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at Models Hub.

    Two input columns are required - chunk and sentence.

    val embeddings = BertSentenceChunkEmbeddings.pretrained()
      .setInputCols("sentence", "chunk")
      .setOutputCol("sentence_chunk_bert_embeddings")

    The default model is "sent_small_bert_L2_768", if no name is provided.

    Sources:

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
    import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverter}
    // Package path assumed to match the EntityChunkEmbeddings import used in this package.
    import com.johnsnowlabs.nlp.annotators.embeddings.BertSentenceChunkEmbeddings
    import org.apache.spark.ml.Pipeline

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")

    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("tokens")

    // Token-level embeddings required by the NER model
    val wordEmbeddings = BertEmbeddings
      .pretrained("biobert_pubmed_base_cased")
      .setInputCols(Array("sentence", "tokens"))
      .setOutputCol("word_embeddings")

    val nerModel = MedicalNerModel
      .pretrained("ner_clinical_biobert", "en", "clinical/models")
      .setInputCols(Array("sentence", "tokens", "word_embeddings"))
      .setOutputCol("ner")

    // Groups NER tags into chunk annotations
    val nerConverter = new NerConverter()
      .setInputCols("sentence", "tokens", "ner")
      .setOutputCol("ner_chunk")

    // Averages each chunk embedding with the embedding of its sentence
    val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings
      .pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
      .setInputCols(Array("sentence", "ner_chunk"))
      .setOutputCol("sentence_chunk_embeddings")

    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        wordEmbeddings,
        nerModel,
        nerConverter,
        sentenceChunkEmbeddings))

    val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." +
      " He complains of swelling in his right forearm."

    // All stages are pretrained, so fit() only assembles the pipeline model
    val testDataset = Seq(sampleText).toDS.toDF("text")
    val result = pipeline.fit(testDataset).transform(testDataset)

    result
      .selectExpr("explode(sentence_chunk_embeddings) AS s")
      .selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding")
      .show(truncate = false)
    
    +-----------------------------+-----------------------------------------------------------------+
    |                       result|                                                 averageEmbedding|
    +-----------------------------+-----------------------------------------------------------------+
    |Her Diabetes                 |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072]  |
    |type 2                       |[-0.027161136, -0.24613449, -0.0949309, 0.1825444, -0.2252143]   |
    |her Diabetes                 |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072]  |
    |swelling in his right forearm|[-0.45139068, 0.12400375, -0.0075617577, -0.90806055, 0.12871636]|
    +-----------------------------+-----------------------------------------------------------------+

    See also

    BertEmbeddings for token-level embeddings

    BertSentenceEmbeddings for sentence-level embeddings

    Annotators Main Page for a list of transformer-based embeddings

  3. class EntityChunkEmbeddings extends BertSentenceEmbeddings with CheckLicense

    Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks.

    The input of the model consists of chunks of recognized named entities. One or more entities are selected as target entities, and for each of them a list of related entities is specified (if the list is empty, all other entities are assumed to be related). The model looks for chunks of the target entities and then tries to pair each target entity (e.g. DRUG) with other related entities (e.g. DOSAGE, STRENGTH, FORM, etc.). A target entity is paired with a related entity only if both appear in the same sentence and their maximal syntactic distance is below a predefined threshold.

    The relationship between target and related entities is one-to-many: if there are multiple instances of the same target entity (e.g. DRUG) within a sentence, the model maps a related entity (e.g. DOSAGE) to at most one of those instances. For example, given the sentence "The patient was given 125 mg of paracetamol and metformin", the model pairs "125 mg" with "paracetamol", but not with "metformin".

    The output of the model is an average embedding of the chunks of each target entity and its related entities. It is possible to specify a particular weight for each entity type. An entity can be defined both as a target entity and as a related entity for some other target entity. For example, we may want to compute the embeddings of SYMPTOMs and their related entities, as well as the embeddings of DRUGs and their related entities, one of which is also SYMPTOM. In such cases, the TARGET_ENTITY:RELATED_ENTITY notation can be used to specify the weight of a related entity (e.g. "DRUG:SYMPTOM" sets the weight of SYMPTOM when it appears as a related entity of the target entity DRUG). The relative weights of entities for particular entity chunk embeddings are available in the annotation metadata.
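    The weighting scheme described above can be illustrated with a small, library-independent sketch. `ChunkVec`, `weightFor` and `weightedAverage` are hypothetical names introduced here. The sketch also makes three assumptions that the description does not state: a TARGET:RELATED key takes precedence over a plain entity key, a missing weight defaults to 1.0, and the result is normalized by the sum of the weights.

```scala
// Hypothetical container for one entity chunk: its entity label and its embedding.
final case class ChunkVec(entity: String, vector: Array[Float])

// Resolve the weight of `chunk` when it is combined under the target entity `target`.
// Assumption: "TARGET:RELATED" overrides the plain entity key; default weight is 1.0f.
def weightFor(target: String, chunk: ChunkVec, weights: Map[String, Float]): Float =
  weights.get(s"$target:${chunk.entity}")
    .orElse(weights.get(chunk.entity))
    .getOrElse(1.0f)

// Weighted average of a target chunk and its related chunks.
// Assumption: the sum is normalized by the total weight.
def weightedAverage(
    target: ChunkVec,
    related: Seq[ChunkVec],
    weights: Map[String, Float]): Array[Float] = {
  val all = target +: related
  val ws = all.map(chunk => weightFor(target.entity, chunk, weights))
  val total = ws.sum
  require(total > 0f, "weights must sum to a positive value")
  Array.tabulate(target.vector.length) { i =>
    all.zip(ws).map { case (chunk, w) => chunk.vector(i) * w }.sum / total
  }
}

// Toy 2-dimensional vectors standing in for real sentence embeddings.
val drug = ChunkVec("DRUG", Array(1.0f, 0.0f))
val strength = ChunkVec("STRENGTH", Array(0.0f, 1.0f))
val entityWeights = Map("DRUG" -> 0.75f, "STRENGTH" -> 0.25f, "DRUG:SYMPTOM" -> 0.5f)

// Array(0.75, 0.25): the DRUG chunk dominates its related STRENGTH chunk.
val drugEmbedding = weightedAverage(drug, Seq(strength), entityWeights)
```

    With this lookup order, a "DRUG:SYMPTOM" entry only affects SYMPTOM chunks that are averaged into a DRUG embedding, while SYMPTOM used as a target keeps its own weight.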

    This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at Models Hub.

    Two input columns are required - chunks and dependency annotations.

    val embeddings = EntityChunkEmbeddings.pretrained()
      .setInputCols("ner_chunk", "dependencies")
      .setOutputCol("entity_chunk_embeddings")

    The default model is "sbiobert_base_cased_mli" from "clinical/models".

    Sources:

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal}
    import com.johnsnowlabs.nlp.annotators.embeddings.EntityChunkEmbeddings
    import org.apache.spark.ml.Pipeline

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")

    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("tokens")

    val wordEmbeddings = WordEmbeddingsModel
      .pretrained("embeddings_clinical", "en", "clinical/models")
      .setInputCols(Array("sentence", "tokens"))
      .setOutputCol("word_embeddings")

    val nerModel = MedicalNerModel
      .pretrained("ner_posology_large", "en", "clinical/models")
      .setInputCols(Array("sentence", "tokens", "word_embeddings"))
      .setOutputCol("ner")

    val nerConverter = new NerConverterInternal()
      .setInputCols("sentence", "tokens", "ner")
      .setOutputCol("ner_chunk")

    // POS tags and dependencies are needed to measure syntactic distance
    val posTagger = PerceptronModel
      .pretrained("pos_clinical", "en", "clinical/models")
      .setInputCols("sentence", "tokens")
      .setOutputCol("pos_tags")

    val dependencyParser = DependencyParserModel
      .pretrained("dependency_conllu", "en")
      .setInputCols(Array("sentence", "pos_tags", "tokens"))
      .setOutputCol("dependencies")

    // DRUG is the only target entity; an empty list means all other entities are related
    val drugChunkEmbeddings = EntityChunkEmbeddings
      .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
      .setInputCols(Array("ner_chunk", "dependencies"))
      .setOutputCol("drug_chunk_embeddings")
      .setMaxSyntacticDistance(3)
      .setTargetEntities(Map("DRUG" -> List()))
      .setEntityWeights(Map[String, Float]("DRUG" -> 0.8f, "STRENGTH" -> 0.2f, "DOSAGE" -> 0.2f, "FORM" -> 0.5f))

    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        wordEmbeddings,
        nerModel,
        nerConverter,
        posTagger,
        dependencyParser,
        drugChunkEmbeddings))

    val sampleText = "The patient was given metformin 125 mg, 250 mg of coumadin and then one pill paracetamol."

    // All stages are pretrained, so fit() only assembles the pipeline model
    val testDataset = Seq(sampleText).toDS.toDF("text")
    val result = pipeline.fit(testDataset).transform(testDataset)

    result
      .selectExpr("explode(drug_chunk_embeddings) AS drug_chunk")
      .selectExpr("drug_chunk.result", "slice(drug_chunk.embeddings, 1, 5) AS drugEmbedding")
      .show(truncate = false)
    
    +-----------------------------+-----------------------------------------------------------------+
    |                       result|                                                    drugEmbedding|
    +-----------------------------+-----------------------------------------------------------------+
    |metformin 125 mg             |[-0.267413, 0.07614058, -0.5620966, 0.83838946, 0.8911504]       |
    |250 mg coumadin              |[0.22319649, -0.07094894, -0.6885556, 0.79176235, 0.82672405]    |
    |one pill paracetamol         |[-0.10939768, -0.29242, -0.3574444, 0.3981813, 0.79609615]       |
    +-----------------------------+-----------------------------------------------------------------+

    See also

    BertEmbeddings for token-level embeddings

    BertSentenceEmbeddings for sentence-level embeddings

    Annotators Main Page for a list of transformer-based embeddings

  4. class ExtractiveSummarization extends AnnotatorModel[ExtractiveSummarization] with HasSimpleAnnotate[ExtractiveSummarization] with CheckLicense
  5. trait ReadBertSentenceChunksTensorflowModel extends ReadTensorflowModel
  6. trait ReadEntityChunkEmbeddingsTensorflowModel extends ReadTensorflowModel
  7. trait ReadablePretrainedBertSentenceChunksModel extends ParamsAndFeaturesReadable[BertSentenceChunkEmbeddings] with HasPretrained[BertSentenceChunkEmbeddings]
  8. trait ReadablePretrainedEntityChunkEmbeddings extends ParamsAndFeaturesReadable[EntityChunkEmbeddings] with HasPretrained[EntityChunkEmbeddings]
