package embeddings
- Alphabetic
- Public
- All
Type Members
-
class
AverageEmbeddings extends AnnotatorModel[AverageEmbeddings] with HasSimpleAnnotate[AverageEmbeddings] with HasStorageRef with HasEmbeddingsProperties with CheckLicense
Merge embdeddings.
-
class
BertSentenceChunkEmbeddings extends BertSentenceEmbeddings with HandleExceptionParams with HasSafeBatchAnnotate[BertSentenceEmbeddings] with CheckLicense
BERT Sentence embeddings for chunk annotations which take into account the context of the sentence the chunk appeared in.
BERT Sentence embeddings for chunk annotations which take into account the context of the sentence the chunk appeared in. This is an extension of BertSentenceEmbeddings which combines the embedding of a chunk with the embedding of the surrounding sentence. For each input chunk annotation, it finds the corresponding sentence, computes the BERT sentence embedding of both the chunk and the sentence and averages them. The resulting embeddings are useful in cases, in which one needs a numerical representation of a text chunk which is sensitive to the context it appears in.
This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at Models Hub.
Two input columns are required - chunk and sentence.
val embeddings = BertSentenceChunkEmbeddings.pretrained() .setInputCols("sentence", "chunk") .setOutputCol("sentence_chunk_bert_embeddings")
The default model is
"sent_small_bert_L2_768"
, if no name is provided.Sources :
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Example
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotator.SentenceDetector import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings import com.johnsnowlabs.nlp.EmbeddingsFinisher import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("tokens") val wordEmbeddings = BertEmbeddings .pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "tokens")) .setOutputCol("word_embeddings") val nerModel = MedicalNerModel .pretrained("ner_clinical_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "tokens", "word_embeddings")) .setOutputCol("ner") val nerConverter = new NerConverter() .setInputCols("sentence", "tokens", "ner") .setOutputCol("ner_chunk") val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings .pretrained("sbluebert_base_uncased_mli", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk")) .setOutputCol("sentence_chunk_embeddings") val pipeline = new Pipeline() .setStages(Array( documentAssembler, sentenceDetector, tokenizer, wordEmbeddings, nerModel, nerConverter, sentenceChunkEmbeddings)) val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." + " He complains of swelling in his right forearm." val testDataset = Seq("").toDS.toDF("text") val result = pipeline.fit(emptyDataset).transform(testDataset) result .selectExpr("explode(sentence_chunk_embeddings) AS s") .selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding") .show(truncate=false) +-----------------------------+-----------------------------------------------------------------+ | result| averageEmbedding| +-----------------------------+-----------------------------------------------------------------+ |Her Diabetes |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072] | |type 2 |[-0.027161136, -0.24613449, -0.0949309, 0.1825444, -0.2252143] | |her Diabetes |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072] | |swelling in his right forearm|[-0.45139068, 0.12400375, -0.0075617577, -0.90806055, 0.12871636]| +-----------------------------+-----------------------------------------------------------------+
- See also
BertEmbeddings for token-level embeddings
BertSentenceEmbeddings for sentence-level embeddings
Annotators Main Page for a list of transformer based embeddings
-
class
EntityChunkEmbeddings extends BertSentenceEmbeddings with CheckLicense
Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector represention of related entity chunks.
Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector represention of related entity chunks. The input the model consists of chunks of recognized named entities. One or more entities are selected as target entities and for each of them a list of related entities is specified (if empty, all other entities are assumed to be related). The model looks for chunks of the target entities and then tries to pair each target entity (e.g. DRUG) with other related entities (e.g. DOSAGE, STRENGTH, FORM, etc). The criterion for pairing a target entity with another related entity is that they appear in the same sentence and the maximal syntactic distance is below a predefined threshold. The relationship between target and related entities is one-to-many, meaning that if there multiple instances of the same target entity (e.g.) within a sentence, the model will map a related entity (e.g. DOSAGE) to at most one of the instances of the target entity. For example, if there is a sentence "The patient was given 125 mg of paracetamol and metformin", the model will pair "125 mg" to "paracetamol", but not to "metformin". The output of the model is an average embeddings of the chunks of each of the target entities and their related entities. It is possible to specify a particular weight for each entity type. An entity can be defined both as target a entity and as a related entity for some other target entity. For example, we may want to compute the embeddings of SYMPTOMs and their related entities, as well as the embeddings of DRUGs and their related entities, one of each is also SYMPTOM. In such cases, it is possible to use the TARGET_ENTITY:RELATED_ENTITY notation to specify the weight of an related entity (e.g. "DRUG:SYMPTOM" to set the weight of SYMPTOM when it appears as an related entity to target entity DRUG). The relative weights of entities for particular entity chunk embeddings are available in the annotations metadata.
This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at Models Hub.
Two input columns are required - chunks and dependency annotations.
val embeddings = EntityChunkEmbeddings.pretrained() .setInputCols("sentence", "dependencies") .setOutputCol("entity_chunk_embeddings")
The default model is
"sbiobert_base_cased_mli"
from "clinical/models".Sources :
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Example
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotator.SentenceDetector import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal} import com.johnsnowlabs.nlp.annotators.embeddings.EntityChunkEmbeddings import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("tokens") val wordEmbeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("word_embeddings") val nerModel = MedicalNerModel .pretrained("ner_posology_large", "en", "clinical/models") .setInputCols(Array("sentence", "tokens", "word_embeddings")) .setOutputCol("ner") val nerConverter = new NerConverterInternal() .setInputCols("sentence", "tokens", "ner") .setOutputCol("ner_chunk") val posTager = PerceptronModel .pretrained("pos_clinical", "en", "clinical/models") .setInputCols("sentences", "tokens") .setOutputCol("pos_tags") val dependencyParser = DependencyParserModel .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val drugChunkEmbeddings = EntityChunkEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunks", "dependencies")) .setOutputCol("drug_chunk_embeddings") .setMaxSyntacticDistance(3) .setTargetEntities(Map("DRUG" -> List())) .setEntityWeights(Map[String, Float]("DRUG" -> 0.8f, "STRENGTH" -> 0.2f, "DOSAGE" -> 0.2f, "FORM" -> 0.5f)) val pipeline = new Pipeline() .setStages(Array( documentAssembler, sentenceDetector, tokenizer, wordEmbeddings, nerModel, nerConverter, posTager, dependencyParser, drugChunkEmbeddings)) val sampleText = "The patient was given metformin 125 mg, 250 mg of coumadin and then one pill paracetamol." val testDataset = Seq("").toDS.toDF("text") val result = pipeline.fit(emptyDataset).transform(testDataset) result .selectExpr("explode(drug_chunk_embeddings) AS drug_chunk") .selectExpr("drug_chunk.result", "slice(drug_chunk.embeddings, 1, 5) AS drugEmbedding") .show(truncate=false) +-----------------------------+-----------------------------------------------------------------+ | result| drugEmbedding| +-----------------------------+-----------------------------------------------------------------+ |metformin 125 mg |[-0.267413, 0.07614058, -0.5620966, 0.83838946, 0.8911504] | |250 mg coumadin |[0.22319649, -0.07094894, -0.6885556, 0.79176235, 0.82672405] | |one pill paracetamol |[-0.10939768, -0.29242, -0.3574444, 0.3981813, 0.79609615] | +-----------------------------+----------------------------------------------------------------+
- See also
BertEmbeddings for token-level embeddings
BertSentenceEmbeddings for sentence-level embeddings
Annotators Main Page for a list of transformer based embeddings
- class ExtractiveSummarization extends AnnotatorModel[ExtractiveSummarization] with HasSimpleAnnotate[ExtractiveSummarization] with CheckLicense
- trait ReadBertSentenceChunksTensorflowModel extends ReadTensorflowModel
- trait ReadEntityChunkEmbeddingsTensorflowModel extends ReadTensorflowModel
- trait ReadablePretrainedBertSentenceChunksModel extends ParamsAndFeaturesReadable[BertSentenceChunkEmbeddings] with HasPretrained[BertSentenceChunkEmbeddings]
- trait ReadablePretrainedEntityChunkEmbeddings extends ParamsAndFeaturesReadable[EntityChunkEmbeddings] with HasPretrained[EntityChunkEmbeddings]
Value Members
- object BertSentenceChunkEmbeddings extends ReadablePretrainedBertSentenceChunksModel with ReadBertSentenceChunksTensorflowModel with Serializable
- object EntityChunkEmbeddings extends ReadablePretrainedEntityChunkEmbeddings with ReadEntityChunkEmbeddingsTensorflowModel with Serializable
- object ExtractiveSummarization extends ParamsAndFeaturesReadable[ExtractiveSummarization] with Serializable