embeddings

package embeddings

Type Members

class AverageEmbeddings extends AnnotatorModel[AverageEmbeddings] with HasSimpleAnnotate[AverageEmbeddings] with HasStorageRef with HasEmbeddingsProperties with CheckLicense
Merge embeddings.

class BertSentenceChunkEmbeddings extends BertSentenceEmbeddings with HandleExceptionParams with HasSafeBatchAnnotate[BertSentenceEmbeddings] with CheckLicense

BERT Sentence embeddings for chunk annotations which take into account the context of the sentence the chunk appeared in. This is an extension of BertSentenceEmbeddings which combines the embedding of a chunk with the embedding of the surrounding sentence. For each input chunk annotation, it finds the corresponding sentence, computes the BERT sentence embedding of both the chunk and the sentence, and averages them. The resulting embeddings are useful when one needs a numerical representation of a text chunk that is sensitive to the context in which it appears.
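The averaging step described above can be illustrated in plain Scala. This is only a minimal sketch of an element-wise mean of two vectors; the object and method names are hypothetical and are not part of the library, and the annotator's internal implementation may differ.

```scala
object ChunkSentenceAverageSketch {
  // Element-wise mean of a chunk embedding and the embedding of its sentence.
  // Both vectors are assumed to come from the same model, so they share a dimension.
  def average(chunk: Array[Float], sentence: Array[Float]): Array[Float] = {
    require(chunk.length == sentence.length, "both embeddings must have the same dimension")
    chunk.zip(sentence).map { case (c, s) => (c + s) / 2f }
  }
}
```

Because the sentence vector contributes half of the result, the same chunk text appearing in two different sentences would generally receive two different vectors.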

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at Models Hub.

Two input columns are required - chunk and sentence.

val embeddings = BertSentenceChunkEmbeddings.pretrained()
  .setInputCols("sentence", "chunk")
  .setOutputCol("sentence_chunk_bert_embeddings")

If no name is provided, the default model is "sent_small_bert_L2_768".

Sources:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Example

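import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverter}
import com.johnsnowlabs.nlp.embeddings.{BertEmbeddings, BertSentenceEmbeddings}
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline
// BertSentenceChunkEmbeddings must also be imported from the embeddings package
// of the licensed library.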

val documentAssembler = new DocumentAssembler()
   .setInputCol("text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
   .setInputCols("document")
   .setOutputCol("sentence")

val tokenizer = new Tokenizer()
   .setInputCols("sentence")
   .setOutputCol("tokens")

val wordEmbeddings = BertEmbeddings
   .pretrained("biobert_pubmed_base_cased")
   .setInputCols(Array("sentence", "tokens"))
   .setOutputCol("word_embeddings")

val nerModel = MedicalNerModel
   .pretrained("ner_clinical_biobert", "en", "clinical/models")
   .setInputCols(Array("sentence", "tokens", "word_embeddings"))
   .setOutputCol("ner")

val nerConverter = new NerConverter()
   .setInputCols("sentence", "tokens", "ner")
   .setOutputCol("ner_chunk")

val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings
   .pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
   .setInputCols(Array("sentence", "ner_chunk"))
   .setOutputCol("sentence_chunk_embeddings")

val pipeline = new Pipeline()
     .setStages(Array(
         documentAssembler,
         sentenceDetector,
         tokenizer,
         wordEmbeddings,
         nerModel,
         nerConverter,
         sentenceChunkEmbeddings))

val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." +
   " He complains of swelling in his right forearm."

// Build the input from sampleText and fit/transform on the same dataset.
val testDataset = Seq(sampleText).toDS.toDF("text")
val result = pipeline.fit(testDataset).transform(testDataset)

result
   .selectExpr("explode(sentence_chunk_embeddings) AS s")
   .selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding")
   .show(truncate=false)

+-----------------------------+-----------------------------------------------------------------+
|                       result|                                                 averageEmbedding|
+-----------------------------+-----------------------------------------------------------------+
|Her Diabetes                 |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072]  |
|type 2                       |[-0.027161136, -0.24613449, -0.0949309, 0.1825444, -0.2252143]   |
|her Diabetes                 |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072]  |
|swelling in his right forearm|[-0.45139068, 0.12400375, -0.0075617577, -0.90806055, 0.12871636]|
+-----------------------------+-----------------------------------------------------------------+

See also:
BertEmbeddings for token-level embeddings
BertSentenceEmbeddings for sentence-level embeddings
Annotators Main Page for a list of transformer-based embeddings

class EntityChunkEmbeddings extends BertSentenceEmbeddings with CheckLicense

Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks. The input to the model consists of chunks of recognized named entities. One or more entities are selected as target entities, and for each of them a list of related entities is specified (if the list is empty, all other entities are assumed to be related). The model looks for chunks of the target entities and then tries to pair each target entity (e.g. DRUG) with other related entities (e.g. DOSAGE, STRENGTH, FORM). The criterion for pairing a target entity with a related entity is that they appear in the same sentence and that the maximal syntactic distance between them is below a predefined threshold.

The relationship between target and related entities is one-to-many: if there are multiple instances of the same target entity within a sentence, the model maps a related entity (e.g. DOSAGE) to at most one of those instances. For example, in the sentence "The patient was given 125 mg of paracetamol and metformin", the model pairs "125 mg" with "paracetamol" but not with "metformin".

The output of the model is an average embedding of the chunks of each target entity and its related entities. It is possible to specify a weight for each entity type. An entity can be defined both as a target entity and as a related entity for some other target entity. For example, we may want to compute the embeddings of SYMPTOMs and their related entities, as well as the embeddings of DRUGs and their related entities, one of which is also SYMPTOM. In such cases, the TARGET_ENTITY:RELATED_ENTITY notation specifies the weight of a related entity (e.g. "DRUG:SYMPTOM" sets the weight of SYMPTOM when it appears as a related entity of the target entity DRUG). The relative weights of entities for a particular entity chunk embedding are available in the annotation's metadata.
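The pairing and weighting described above can be sketched in plain Scala. This is an illustration under stated assumptions, not the annotator's implementation: the Chunk case class, the caller-supplied distance function, the choice of the closest target, the inclusive distance bound, the default weight of 1.0 and the normalisation by the sum of weights are all hypothetical choices made for the example.

```scala
object EntityChunkEmbeddingsSketch {
  // Hypothetical stand-in for a recognized chunk; this is not a Spark NLP type.
  final case class Chunk(text: String, entity: String, sentence: Int, embedding: Vector[Double])

  // Assign each related chunk to at most one target chunk: the closest target in the
  // same sentence whose distance does not exceed maxDistance (assumed inclusive).
  def pair(
      targets: Seq[Chunk],
      related: Seq[Chunk],
      distance: (Chunk, Chunk) => Int,
      maxDistance: Int): Map[Chunk, Seq[Chunk]] = {
    val assigned = for {
      r <- related
      candidates = targets.filter(t => t.sentence == r.sentence && distance(t, r) <= maxDistance)
      if candidates.nonEmpty
    } yield candidates.minBy(t => distance(t, r)) -> r
    assigned.groupBy(_._1).map { case (target, pairs) => target -> pairs.map(_._2) }
  }

  // Weight of a related entity: a "TARGET:RELATED" key takes precedence over the
  // plain entity key; the fallback of 1.0 is an assumption.
  def relatedWeight(weights: Map[String, Double], target: String, related: String): Double =
    weights.getOrElse(s"$target:$related", weights.getOrElse(related, 1.0))

  // Weighted average of a target chunk and its related chunks, normalised by the weight sum.
  def weightedAverage(target: Chunk, related: Seq[Chunk], weights: Map[String, Double]): Vector[Double] = {
    val weighted = (target -> weights.getOrElse(target.entity, 1.0)) +:
      related.map(c => c -> relatedWeight(weights, target.entity, c.entity))
    val total = weighted.map(_._2).sum
    require(total > 0.0, "weights must sum to a positive value")
    Vector.tabulate(target.embedding.length) { i =>
      weighted.map { case (c, w) => w * c.embedding(i) }.sum / total
    }
  }
}
```

With weights such as "DRUG" -> 0.8 and "DOSAGE" -> 0.2, the resulting vector in this sketch stays close to the drug chunk and is only nudged towards the dosage chunk.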

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at Models Hub.

Two input columns are required - chunks and dependency annotations.

val embeddings = EntityChunkEmbeddings.pretrained()
  .setInputCols("chunk", "dependencies")
  .setOutputCol("entity_chunk_embeddings")

The default model is "sbiobert_base_cased_mli" from "clinical/models".

Sources:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal}
import com.johnsnowlabs.nlp.annotators.embeddings.EntityChunkEmbeddings
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
   .setInputCol("text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
   .setInputCols("document")
   .setOutputCol("sentence")

val tokenizer = new Tokenizer()
   .setInputCols("sentence")
   .setOutputCol("tokens")

// Input column names below must match the output columns of earlier stages
// ("sentence", "tokens", "ner_chunk").
val wordEmbeddings = WordEmbeddingsModel
   .pretrained("embeddings_clinical", "en", "clinical/models")
   .setInputCols(Array("sentence", "tokens"))
   .setOutputCol("word_embeddings")

val nerModel = MedicalNerModel
   .pretrained("ner_posology_large", "en", "clinical/models")
   .setInputCols(Array("sentence", "tokens", "word_embeddings"))
   .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
   .setInputCols("sentence", "tokens", "ner")
   .setOutputCol("ner_chunk")

val posTagger = PerceptronModel
   .pretrained("pos_clinical", "en", "clinical/models")
   .setInputCols("sentence", "tokens")
   .setOutputCol("pos_tags")

val dependencyParser = DependencyParserModel
   .pretrained("dependency_conllu", "en")
   .setInputCols(Array("sentence", "pos_tags", "tokens"))
   .setOutputCol("dependencies")

val drugChunkEmbeddings = EntityChunkEmbeddings
   .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
   .setInputCols(Array("ner_chunk", "dependencies"))
   .setOutputCol("drug_chunk_embeddings")
   .setMaxSyntacticDistance(3)
   .setTargetEntities(Map("DRUG" -> List()))
   .setEntityWeights(Map[String, Float]("DRUG" -> 0.8f, "STRENGTH" -> 0.2f, "DOSAGE" -> 0.2f, "FORM" -> 0.5f))

val pipeline = new Pipeline()
     .setStages(Array(
         documentAssembler,
         sentenceDetector,
         tokenizer,
         wordEmbeddings,
         nerModel,
         nerConverter,
         posTagger,
         dependencyParser,
         drugChunkEmbeddings))

val sampleText = "The patient was given metformin 125 mg, 250 mg of coumadin and then one pill paracetamol."

// Build the input from sampleText and fit/transform on the same dataset.
val testDataset = Seq(sampleText).toDS.toDF("text")
val result = pipeline.fit(testDataset).transform(testDataset)

result
   .selectExpr("explode(drug_chunk_embeddings) AS drug_chunk")
   .selectExpr("drug_chunk.result", "slice(drug_chunk.embeddings, 1, 5) AS drugEmbedding")
   .show(truncate=false)

+-----------------------------+-----------------------------------------------------------------+
|                       result|                                                    drugEmbedding|
+-----------------------------+-----------------------------------------------------------------+
|metformin 125 mg             |[-0.267413, 0.07614058, -0.5620966, 0.83838946, 0.8911504]       |
|250 mg coumadin              |[0.22319649, -0.07094894, -0.6885556, 0.79176235, 0.82672405]    |
|one pill paracetamol         |[-0.10939768, -0.29242, -0.3574444, 0.3981813, 0.79609615]       |
+-----------------------------+-----------------------------------------------------------------+

See also:
BertEmbeddings for token-level embeddings
BertSentenceEmbeddings for sentence-level embeddings
Annotators Main Page for a list of transformer-based embeddings

class ExtractiveSummarization extends AnnotatorModel[ExtractiveSummarization] with HasSimpleAnnotate[ExtractiveSummarization] with CheckLicense
trait ReadBertSentenceChunksTensorflowModel extends ReadTensorflowModel
trait ReadEntityChunkEmbeddingsTensorflowModel extends ReadTensorflowModel
trait ReadablePretrainedBertSentenceChunksModel extends ParamsAndFeaturesReadable[BertSentenceChunkEmbeddings] with HasPretrained[BertSentenceChunkEmbeddings]
trait ReadablePretrainedEntityChunkEmbeddings extends ParamsAndFeaturesReadable[EntityChunkEmbeddings] with HasPretrained[EntityChunkEmbeddings]

Value Members

object BertSentenceChunkEmbeddings extends ReadablePretrainedBertSentenceChunksModel with ReadBertSentenceChunksTensorflowModel with Serializable
object EntityChunkEmbeddings extends ReadablePretrainedEntityChunkEmbeddings with ReadEntityChunkEmbeddingsTensorflowModel with Serializable
object ExtractiveSummarization extends ParamsAndFeaturesReadable[ExtractiveSummarization] with Serializable
