ELMo Embeddings


Computes contextualized word representations using character-based word representations and bidirectional LSTMs.

The model exposes four outputs: fixed embeddings from the character-based layer and from each of the two LSTM layers, plus a learnable aggregation of all three layers.

  • word_emb: the character-based word representations with shape [batch_size, max_length, 512].
  • lstm_outputs1: the first LSTM hidden state with shape [batch_size, max_length, 1024].
  • lstm_outputs2: the second LSTM hidden state with shape [batch_size, max_length, 1024].
  • elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].
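
To make the "weighted sum of the 3 layers" concrete, here is a minimal sketch of the mixing step for a single token. It follows the formulation in the ELMo paper (softmax-normalized scalar weights and an overall scale `gamma`); whether this module applies exactly that normalization is an assumption. The function names `softmax` and `mix_layers` are hypothetical, the toy vectors are made up and 4-dimensional rather than 1024-dimensional, and the sketch assumes all three layers already share the same dimension.

```python
import math


def softmax(xs):
    """Normalize raw scores so they are positive and sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layers, raw_weights, gamma=1.0):
    """Combine one token's per-layer vectors into a single vector.

    layers: list of equal-length vectors, one per layer.
    raw_weights: one unnormalized scalar per layer (trainable in a real model).
    gamma: overall scale factor (assumed trainable, as in the paper).
    """
    if len(layers) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [
        gamma * sum(s[j] * layers[j][d] for j in range(len(layers)))
        for d in range(dim)
    ]


# Toy stand-ins for word_emb, lstm_outputs1 and lstm_outputs2 (one token).
word_emb = [1.0, 0.0, 2.0, -1.0]
lstm_outputs1 = [0.0, 3.0, 2.0, 1.0]
lstm_outputs2 = [2.0, 0.0, 2.0, 3.0]

# With equal raw weights the mix reduces to the element-wise mean of the layers.
mixed = mix_layers([word_emb, lstm_outputs1, lstm_outputs2], [0.0, 0.0, 0.0])
print(mixed)
```

Because the weights are learned, a downstream task can shift emphasis toward whichever layer carries the most useful signal, rather than committing to a single fixed layer.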

The architecture achieves state-of-the-art results on several benchmarks. Note that this module is far more computationally expensive than word embedding modules that only perform embedding lookups, so using an accelerator is recommended.

The details are described in the paper “Deep contextualized word representations”.


How to use

Python:

from sparknlp.annotator import ElmoEmbeddings
from pyspark.ml import Pipeline
import pandas as pd
# document_assembler, sentence_detector and tokenizer are assumed to be defined earlier
embeddings = ElmoEmbeddings.pretrained("elmo", "en") \
      .setInputCols(["sentence", "token"]) \
      .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame(pd.DataFrame({"text": ["I love NLP"]})))

Scala:

// Assumes import spark.implicits._ is in scope for toDF
val embeddings = ElmoEmbeddings.pretrained("elmo", "en")
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I love NLP").toDF("text")
val result = pipeline.fit(data).transform(data)

NLU:

import nlu

text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.elmo').predict(text, output_level='token')


	en_embed_elmo_embeddings	token
	[0.6922717690467834, -0.32613131403923035, 0.2...	I
	[-0.7348151206970215, -0.09645576030015945, -0...	love
	[0.16370956599712372, 0.1217058077454567, -0.0...	NLP
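
One common thing to do with per-token vectors like those above is to compare them. The sketch below computes cosine similarity in plain Python; the helper name `cosine_similarity` is hypothetical and the 3-dimensional vectors are made-up toy values, not real ELMo output. With a DataFrame like `embeddings_df`, the inputs would presumably be two entries of the `en_embed_elmo_embeddings` column.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Made-up toy vectors standing in for token embeddings.
vec_a = [1.0, 0.0, 1.0]
vec_b = [2.0, 0.0, 2.0]   # same direction as vec_a
vec_c = [0.0, 1.0, 0.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Because ELMo vectors are contextual, the same word can yield different vectors in different sentences, so similarities are best computed between specific token occurrences rather than between vocabulary entries.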

Model Information

Model Name: elmo  
Type: embeddings  
Compatibility: Spark NLP 2.4.0+  
License: Open Source  
Edition: Official  
Input Labels: [sentence, token]  
Output Labels: [word_embeddings]  
Language: [en]  
Dimension: 512 (word_emb) or 1024 (lstm_outputs1, lstm_outputs2, elmo)  
Case sensitive: true  

Data Source

The model is imported from https://tfhub.dev/google/elmo/3