
Global Convolutional Language Models: Long-Context Sequence Modeling via Frequency-Domain Mixing

Abstract

The dominant models for long-context sequence processing rely on self-attention to integrate information across distant positions. Although effective, the quadratic time and memory cost of attention presents substantial challenges as sequence lengths increase. In this work, we propose Global Convolutional Language Models (GCLMs), a family of architectures that replace attention with a combination of frequency-domain global convolution and local depthwise convolution. The global operator applies a learned sequence-length–sized convolution kernel using the Fast Fourier Transform, enabling O(n log n) global mixing while preserving the parallelism and stability of convolutional networks. Local convolutions complement this mechanism by capturing short-range structure.
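To make the mixing mechanism concrete, the following is a minimal PyTorch sketch, not the authors' reference implementation. The module name GlobalConvMixer, the per-channel kernel parameterization, and the zero-padding scheme for causality are assumptions made for illustration; only the core idea of FFT-based length-n convolution plus a short depthwise convolution is taken from the abstract.

```python
import torch
import torch.nn as nn
import torch.fft


class GlobalConvMixer(nn.Module):
    """Hypothetical sketch of a GCLM mixing block: a learned sequence-length
    global convolution applied via FFT, plus a short depthwise convolution."""

    def __init__(self, d_model: int, max_len: int, local_kernel: int = 3):
        super().__init__()
        # One learned global kernel per channel, as long as the sequence (assumption).
        self.global_kernel = nn.Parameter(torch.randn(d_model, max_len) * 0.02)
        # Short depthwise convolution for local, short-range structure.
        self.local_conv = nn.Conv1d(
            d_model, d_model, kernel_size=local_kernel,
            padding=local_kernel - 1, groups=d_model,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        u = x.transpose(1, 2)  # (batch, d_model, seq_len)

        # Global mixing: convolve with the learned kernel via FFT.
        # Zero-padding to 2n makes the circular convolution equivalent to a
        # linear (causal) one; total cost is O(n log n) per channel.
        k = self.global_kernel[:, :n]
        fft_len = 2 * n
        u_f = torch.fft.rfft(u, n=fft_len)   # (batch, d_model, fft_len//2 + 1)
        k_f = torch.fft.rfft(k, n=fft_len)   # (d_model, fft_len//2 + 1)
        y_global = torch.fft.irfft(u_f * k_f, n=fft_len)[..., :n]

        # Local mixing: depthwise conv, truncated back to length n (causal).
        y_local = self.local_conv(u)[..., :n]

        y = y_global + y_local
        return y.transpose(1, 2)  # (batch, seq_len, d_model)
```

As a usage sanity check, `GlobalConvMixer(d_model=256, max_len=8192)(torch.randn(2, 4096, 256))` returns a tensor of the same shape, with every output position able to draw on all earlier positions through the global kernel.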

Experiments demonstrate that GCLMs train stably at long context lengths on consumer GPUs, converge rapidly even at small scale, and reach low training loss within a fraction of an epoch. These findings suggest that global convolution provides an efficient and practical alternative to attention for long-context language modeling.


See the full paper here.