Infinite Context Transformers

Ranko Mosic
3 min read · Apr 21, 2024

A new paper by researchers at Google ("Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention") claims to give large language models (LLMs) the ability to work with text of infinite length.

Transformer-based language models (LMs) are powerful and widely applicable tools, but their usefulness is constrained by a finite context window and the expensive computational cost of processing long text documents.

Why do we have the transformer context-length problem?

Each input token plays the query, key, and value¹ roles simultaneously, so every token attends to every other token, leading to O(n²) complexity (n is the sequence length).
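
For intuition, here is a minimal numpy sketch of standard (single-head) scaled dot-product attention; the n × n score matrix is what makes it quadratic. Names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention for one head.

    Q, K, V: (n, d) arrays for a sequence of n tokens.
    The scores matrix is (n, n), so compute and memory grow
    quadratically with the sequence length n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n)  <- the O(n^2) term
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d)

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = dot_product_attention(Q, K, V)   # the scores alone hold n*n ≈ 1M floats
```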

How does Infini-attention solve it?

It keeps a compressed version of the past context handy³ to combine with the current context: it attends jointly across both, using standard dot-product attention over the current segment and linear attention over the compressed past.
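
Roughly, the read side of that compressed memory is a linear-attention lookup: the current segment's queries, passed through a feature map, index an accumulated d × d matrix M and a normalizer z. Below is a minimal sketch of the retrieval step, assuming the paper's ELU + 1 feature map; the variable names are mine.

```python
import numpy as np

def elu_plus_one(x):
    # ELU(x) + 1: a simple nonlinearity commonly used as a linear-attention feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

def memory_read(Q, M, z, eps=1e-6):
    """Retrieve past context from the compressed memory.

    Q: (n, d) queries of the current segment.
    M: (d, d) accumulated key-value associations from all past segments.
    z: (d,)   accumulated key normalizer.
    Cost is linear in n and independent of how much past text M summarizes.
    """
    sigma_Q = elu_plus_one(Q)                            # (n, d)
    return (sigma_Q @ M) / (sigma_Q @ z + eps)[:, None]  # (n, d)
```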

How is it achieved?

Similarly² to how RAG is used to store and retrieve facts in conjunction with an LLM, memory external to the neural network (Transformer, LSTM) is used to increase its capacity.

Why not just use standard memory for fact read/write?

The most standard associative memory system is a map data structure (e.g., hashmap, binary tree, relational database); unfortunately, these do not generalize across inputs — either an input is found exactly or it is not. We are interested in memories that can generalize beyond exact lookups, and can learn to do so based on past successes and failures in an incremental, online manner.

The state sharing and reuse between the dot-product attention and compressive memory not only enables efficient plug-and-play long-context adaptation but also speeds up training and inference.

Linear (efficient) attention vs. standard dot-product attention

Our core observation is that in dot-product attention there are two consecutive matrix multiplications. We can therefore utilize the associativity of matrix multiplication to switch the order of computation, massively reducing the size of the intermediate output and thus reducing the complexity from quadratic to linear. The efficient attention module can establish long-range interactions in one shot, in linear space and time.
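
The reordering is easiest to see with the softmax dropped, as in linear/efficient attention: (QKᵀ)V and Q(KᵀV) give identical results, but the second never materializes an n × n intermediate. A quick numpy check (shapes are illustrative only):

```python
import numpy as np

n, d = 2048, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))

quadratic = (Q @ K.T) @ V   # intermediate (n, n): ~4M entries here
linear    = Q @ (K.T @ V)   # intermediate (d, d): 4,096 entries

assert np.allclose(quadratic, linear)   # same result, very different cost
```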

How are segments tied together?

The new memory states M_s and z_s are then passed to the next segment s + 1, building in a recurrence in each attention layer.
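
Schematically, the pair (M, z) is the only state carried from one segment to the next. A toy loop illustrating that recurrence with the paper's simple linear update (reusing elu_plus_one from the read sketch above; all names here are mine):

```python
import numpy as np

# Toy recurrence across segments: (M, z) is the only carried state,
# so cost per segment stays constant no matter how long the document is.
d, seg_len, num_segments = 64, 512, 8
M = np.zeros((d, d))   # compressive memory
z = np.zeros(d)        # normalizer

for s in range(num_segments):
    K = np.random.randn(seg_len, d)   # keys of segment s (stand-ins for real activations)
    V = np.random.randn(seg_len, d)   # values of segment s
    sigma_K = elu_plus_one(K)
    M = M + sigma_K.T @ V             # accumulate key-value associations
    z = z + sigma_K.sum(axis=0)       # accumulate the normalizer
    # M, z are now the "new memory states" handed to segment s + 1
```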

How are local and global attention aggregated?

Via a learned gating scalar β per head: the values retrieved from compressed memory (A_mem) and the local dot-product attention output (A_dot) are blended as sigmoid(β) · A_mem + (1 − sigmoid(β)) · A_dot, so each head learns how much to rely on the long-term memory versus the current segment.
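
A minimal sketch of that blend for a single head (β is a learned parameter; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine(A_mem, A_dot, beta):
    """Blend memory retrieval with local attention for one head.

    A_mem: (n, d) values retrieved from compressive memory.
    A_dot: (n, d) output of local dot-product attention on the current segment.
    beta:  learned scalar gate (one per head).
    """
    g = sigmoid(beta)
    return g * A_mem + (1.0 - g) * A_dot
```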

Compressive memory write mechanism

Compressive memory is written to in the style of fast weight programmers, where a slow network (e.g., an RNN/LSTM) directly generates the weights of a fast network.
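
Concretely, in Infini-attention the "fast weights" are the entries of the memory matrix M itself, and a write is an outer-product update built from the segment's keys and values. A sketch of the paper's delta-rule variant, which retrieves first and then writes only the new part (my naming; elu_plus_one as in the read sketch above):

```python
def memory_write_delta(K, V, M, z, eps=1e-6):
    """Write one segment's keys/values into compressive memory.

    K, V: (n, d) keys and values of the current segment.
    M:    (d, d) memory matrix -- its entries are the "fast weights".
    z:    (d,)   normalizer.
    Delta rule: subtract what the memory already returns for K, so only
    genuinely new associations get added.
    """
    sigma_K = elu_plus_one(K)                                 # (n, d)
    retrieved = (sigma_K @ M) / (sigma_K @ z + eps)[:, None]  # (n, d)
    M_new = M + sigma_K.T @ (V - retrieved)                   # (d, d)
    z_new = z + sigma_K.sum(axis=0)                           # (d,)
    return M_new, z_new
```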

¹ What are the “query”, “key”, and “value” vectors? They’re abstractions that are useful for calculating and thinking about attention.

Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that word. We end up creating a “query”, a “key”, and a “value” projection of each word in the input sentence.
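
In code, those projections are just three learned matrix multiplications per token (a toy example; dimensions are arbitrary):

```python
import numpy as np

d_model, d_head = 512, 64
x1 = np.random.randn(d_model)            # embedding of one input word
W_Q = np.random.randn(d_model, d_head)   # learned projection matrices
W_K = np.random.randn(d_model, d_head)
W_V = np.random.randn(d_model, d_head)

q1 = x1 @ W_Q   # the "query" vector for this word
k1 = x1 @ W_K   # the "key" vector
v1 = x1 @ W_V   # the "value" vector
```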

² The key difference is that in the NN case the external memory is differentiable, i.e., it can be learned.

³ Cache-style, or hot/cold (archive) data tiering. Being able to compress well is closely related to intelligence.

Reading from the neural memory function amounts to pushing an input (the key vector) through the function to produce an output (the value vector). Writing to memory means changing the function; specifically, updating the parameters of the neural network to encode desired information.
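
A tiny illustration with a linear "memory network" (not the paper's mechanism, just the idea): reading is a forward pass through the parameters, writing is a parameter update that encodes a new key-value pair.

```python
import numpy as np

d = 8
W = np.zeros((d, d))                      # the memory *is* the parameter matrix

def read(key, W):
    return key @ W                        # reading = pushing the key through the function

def write(key, value, W, lr=1.0):
    # writing = changing the function: nudge W so that read(key) moves toward value
    error = value - read(key, W)
    return W + lr * np.outer(key, error)

key = np.eye(d)[0]                        # a unit-norm key, for illustration
value = np.random.randn(d)
W = write(key, value, W)
assert np.allclose(read(key, W), value)   # the fact is now stored in the weights
```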

Ranko Mosic

Applied AI Consultant, Full Stack. GLG Network Expert, https://glginsights.com/. AI tech advisor for VCs, investors, and startups.