Reformer: The Efficient Transformer
Today, we introduce the Reformer, a Transformer model designed to handle context windows of up to 1 million words, all on a single accelerator and using only 16GB of memory.
When the Transformer was introduced back in 2017, it created a major shift in NLP towards large language models. Transformer-based models like BERT are now a standard part of the NLP toolkit (as demonstrated in Kaggle competitions, for example). Still, BERT and its kind are far from the final word on NLP tasks like summarization, question answering, etc.
BERT can effectively cope only with short contexts (even with tricks like a sliding window with a stride). As Jeff Dean put it: "We'd still like to be able to do much more contextual kinds of models. Like right now BERT and other models work well on hundreds of words, but not 10,000 words as context." BERT is also highly compute-intensive, even for fine-tuning tasks.
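For concreteness, the sliding-window workaround mentioned above usually amounts to something like this hypothetical helper (not BERT code): chop the long input into overlapping fixed-length windows and run the model on each one, so no single window ever sees more than a few hundred tokens of context.

```python
def sliding_windows(token_ids, window=512, stride=256):
    """Split a long token sequence into overlapping fixed-length windows."""
    if len(token_ids) <= window:
        return [token_ids]
    starts = list(range(0, len(token_ids) - window, stride))
    starts.append(len(token_ids) - window)  # make sure the tail is covered
    return [token_ids[s:s + window] for s in starts]

# e.g. a 10,000-token document yields 39 overlapping 512-token windows,
# each processed independently of the others.
chunks = sliding_windows(list(range(10_000)))
print(len(chunks), len(chunks[0]))  # 39 512
```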
This is exactly where the Reformer, an incremental improvement over the Transformer, comes in. We believe the Reformer gives the basis for future use of Transformer models, both for long text and for applications outside natural language processing. It opens the door to summarizing large bodies of text (books, movie transcripts) in one shot, for example. The Reformer Colab demo reads in the whole of Crime and Punishment at once and generates prompt-seeded text in Dostoevsky's style. With such a large context window, the Transformer could also be applied beyond text, to pixels or musical notes, enabling it to generate images and music.
How is the Reformer able to do more with less?
The resource-hungry full attention computation is approximated with less demanding locality-sensitive hashing (LSH) attention: O(L log L) instead of O(L²), where L is the sequence length. Query-key similarity (the Reformer ties queries and keys) is estimated in multiple steps: LSH bucketing, sorting, chunking, and finally standard, compute-intensive attention, but now only within each bucket.
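To make the bucket-then-attend-locally idea concrete, here is a toy NumPy sketch (not the Reformer implementation): it shares queries and keys as the Reformer does, hashes with a single random rotation, and, for simplicity, attends per bucket instead of over the fixed-size chunks and multiple hash rounds the paper uses.

```python
import numpy as np

def lsh_attention(q, v, n_buckets=8, seed=0):
    """Toy LSH attention sketch (not the Reformer implementation).

    q: (L, d) array of queries, also used as keys (Reformer shares Q and K).
    v: (L, d) array of values.
    """
    L, d = q.shape
    rng = np.random.default_rng(seed)

    # 1. LSH bucketing: project onto a random rotation; taking the argmax
    #    over [xR, -xR] sends nearby vectors to the same bucket with high
    #    probability (angular LSH).
    rotation = rng.standard_normal((d, n_buckets // 2))
    rotated = q @ rotation
    buckets = np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

    # 2. Sort positions by bucket so bucket members become adjacent.
    #    (The real Reformer then 3. chunks this sorted sequence into
    #    fixed-size blocks; this sketch attends per bucket directly.)
    order = np.argsort(buckets, kind="stable")

    # 4. Standard softmax attention, but only inside each bucket.
    out = np.zeros_like(v)
    for b in range(n_buckets):
        idx = order[buckets[order] == b]
        if idx.size == 0:
            continue
        scores = q[idx] @ q[idx].T / np.sqrt(d)   # bucket-local Q.K scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# Example: 1,024 positions, 64-dim heads.
q = np.random.randn(1024, 64)
v = np.random.randn(1024, 64)
print(lsh_attention(q, v).shape)  # (1024, 64)
```

Because attention scores are only computed inside each bucket, the quadratic cost of full attention is avoided; the dominant cost becomes the sort over L positions, hence the O(L log L) scaling.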
The Reformer code is written using the brand-new Trax library, which runs with no code changes on CPU/GPU/TPU¹.
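As a rough sketch of what this looks like in practice (based on the public Reformer Colab; the trax.models.ReformerLM arguments shown here are illustrative and may differ between Trax versions):

```python
import trax

# Build a Reformer language model. Hyperparameters are illustrative,
# not the settings used in the Crime and Punishment demo.
model = trax.models.ReformerLM(
    vocab_size=320,    # small subword vocabulary, for illustration
    n_layers=6,
    max_len=16384,     # the long context is the whole point
    mode='predict',    # autoregressive decoding
)
```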
¹ The authors are not using TensorFlow, for performance and functionality reasons. Only the Reformer decoder is available for now; the encoder is still being built.