Longformer Large (longformer_large_4096)

Description

Longformer is a transformer model for long documents.

longformer_large_4096 is a BERT-like model initialized from the RoBERTa checkpoint and pretrained with masked language modeling (MLM) on long documents. It supports sequences of up to 4,096 tokens.

Longformer uses a combination of sliding-window (local) attention and global attention. Global attention is configured by the user based on the task, allowing the model to learn task-specific representations.
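The attention pattern described above can be sketched in plain Python. The function below builds a boolean attention mask where each token sees a local window of neighbors, and a few user-chosen "global" positions attend to, and are attended by, every token. The window size and global positions here are hypothetical examples, not values taken from the model; this is an illustration of the pattern, not the library's implementation.

```python
def longformer_attention_mask(seq_len, window, global_positions):
    """Return a seq_len x seq_len boolean mask: mask[i][j] is True
    when token i may attend to token j."""
    global_set = set(global_positions)
    half = window // 2
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= half  # sliding-window neighborhood
            # global tokens attend everywhere and are visible to everyone
            mask[i][j] = local or i in global_set or j in global_set
    return mask

# Tiny example: 8 tokens, window of 2, token 0 (e.g. [CLS]) made global.
mask = longformer_attention_mask(seq_len=8, window=2, global_positions=[0])
for row in mask:
    print("".join("x" if m else "." for m in row))
```

Because each non-global token only attends to a fixed-size window, the cost of this pattern grows linearly with sequence length rather than quadratically, which is what makes 4,096-token inputs practical.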

If you use Longformer in your research, please cite Longformer: The Long-Document Transformer.

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}

Longformer is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.


How to use

Python:

embeddings = LongformerEmbeddings \
      .pretrained("longformer_large_4096", "en") \
      .setInputCols(["document", "token"]) \
      .setOutputCol("embeddings") \
      .setCaseSensitive(True) \
      .setMaxSentenceLength(4096)

Scala:

val embeddings = LongformerEmbeddings.pretrained("longformer_large_4096", "en")
    .setInputCols("document", "token")
    .setOutputCol("embeddings")
    .setCaseSensitive(true)
    .setMaxSentenceLength(4096)

Model Information

Model Name: longformer_large_4096
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [token, sentence]
Output Labels: [embeddings]
Language: en
Case sensitive: true
Max sentence length: 4096

Data Source

https://huggingface.co/allenai/longformer-large-4096