Context Windows Are Limited by Atoms, Not Bits

1st March 2026

There is a popular narrative in tech right now: AI progress is exponential, context windows will grow to infinity, and all vertical AI products will soon be replaced by general-purpose AI that can use all the context of your entire business. This implies that the big players like Anthropic, OpenAI, and Google, with their general-purpose agents like Claude Cowork, ChatGPT, or Gemini, will subsume all software.

I don’t think this will happen. While advertised context windows have grown to 1M or even 10M tokens, there’s a widening gap between advertised capacity and what models can reliably use. Effective context window sizes have been saturating over the past 6–12 months and remain at 200K–1M tokens for most tasks.

The reason is physics. Most people talk only about model capability, but AI actually depends on three things: atoms (the hardware), bits (the logic), and power (the energy required to move electrons through hardware to perform computations). Recent breakthroughs have been almost entirely in bits, which means that AI progress in general, and context window size specifically, will be constrained by atoms and power.

Why Attention Is Hard to Scale

Attention in transformer models has been the basis of all AI progress in recent years.¹ However, the computational cost of attention scales quadratically with the context window size, and the implications for memory requirements are perhaps surprising.
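To make the quadratic scaling concrete, here is a minimal sketch that counts how many query-key scores full attention has to compute per head and per layer. The function name attention_pairs is made up for illustration, and the count ignores real-world optimizations such as fused kernels, which change constants but not the growth rate.

```python
def attention_pairs(context_len: int, causal: bool = True) -> int:
    """Number of query-key scores computed per attention head, per layer."""
    if causal:
        # Each token attends to itself and to every earlier token.
        return context_len * (context_len + 1) // 2
    # Bidirectional attention: every token attends to every token.
    return context_len * context_len


for n in (100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} tokens -> {attention_pairs(n):.2e} pairs")
```

Growing the context by 10x multiplies the number of pairs by roughly 100x, which is why long contexts get expensive so quickly.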

A 1M-token context window corresponds to roughly 5MB of plain text, which isn’t much for many tasks. However, each token doesn’t require only storing a single number. Depending on the embedding size, each token requires storing thousands of numbers across many different layers of the model. Therefore, the key–value cache for a frontier transformer model to run inference on a 1M-token context window easily requires tens or even hundreds of GB of working memory, which is many orders of magnitude more than the raw text.
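A back-of-the-envelope calculation shows where those gigabytes come from. The sketch below assumes a hypothetical model shape (80 layers, 8 key-value heads, head dimension 128, 16-bit values), chosen to resemble a large open-weight model; frontier model configurations are not public, so treat the result as an order-of-magnitude estimate only. ModelShape and kv_cache_bytes are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    layers: int           # number of transformer layers
    kv_heads: int         # key-value heads per layer (fewer than query heads under grouped-query attention)
    head_dim: int         # size of each head's key or value vector
    bytes_per_value: int  # 2 for 16-bit values, 1 for 8-bit


def kv_cache_bytes(shape: ModelShape, tokens: int) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens."""
    # Factor 2: one key vector and one value vector per head, per layer.
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * shape.bytes_per_value
    return per_token * tokens


# Hypothetical shape, an assumption for illustration only.
shape = ModelShape(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
cache = kv_cache_bytes(shape, 1_000_000)
raw_text = 5 * 10**6  # roughly 5MB of plain text for 1M tokens

print(f"KV cache: {cache / 1e9:.0f} GB")             # KV cache: 328 GB
print(f"Ratio to raw text: {cache / raw_text:,.0f}x")  # Ratio to raw text: 65,536x
```

Note that the cache itself grows linearly with the number of tokens; it is the attention computation over that cache that grows quadratically. Under these assumptions, even 8-bit values would still need about 164 GB for a single 1M-token session.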

Many tricks to extend context windows work by avoiding “true” attention, where each token attends to every other token, and this approximation comes with substantial costs in model quality. Alternative architectures like state-space models promise sub-quadratic scaling, but none have matched current transformer-based frontier models so far. The accuracy tradeoff is likely fundamental, and there is no free lunch.

This means that, in practice, effective context length rarely exceeds half of the advertised maximum. Models exhibit a U-shaped performance curve, performing best on information at the beginning and end while degrading on context in the middle.² And even when models retrieve information perfectly, longer inputs still hurt reasoning.³ Without major breakthroughs in memory technology and power infrastructure, usable context windows are unlikely to grow substantially in the coming years.

Some numbers may help to make this concrete. A frontier model like Opus 4.6 or GPT-5.3 already requires hundreds of gigabytes just to store the weights. NVIDIA’s next-generation GPU, the Rubin R100, which should start shipping in the second half of this year, will have 288 GB of high-bandwidth memory—the same amount as the Blackwell B300 GPU, which started shipping in the second half of 2025. A single long-context session consumes most of the available memory. Therefore, production context windows have expanded on paper, but the effective ceiling at which models reason reliably has barely moved.

High-bandwidth memory and power, not compute, are currently the hard constraints. Memory is expensive, physically difficult to manufacture, and supply-constrained in the coming years.⁴ On the power side, data center electrical capacity in the US is nearly maxed out, with utility connection wait times exceeding 3–5 years.⁵ Increasing AI demand will make both constraints even tighter.

Products as Context Operating Systems

Andrej Karpathy put it well: you can think of the LLM as a CPU and the context window as RAM, which means you need something like an operating system to select and manage context.⁶ ⁷ Therefore, context engineering, the practice of selecting the right data for the model’s context window depending on the task, will remain essential.
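As a toy illustration of what such context selection might look like, the sketch below ranks candidate chunks by relevance to a query and greedily packs them into a fixed token budget. Everything here is an assumption for illustration: the word-overlap score is a crude stand-in for a real retrieval model, the token estimate uses the rough five-characters-per-token ratio implied earlier, and the function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 5 characters per token (an assumption)."""
    return max(1, len(text) // 5)


def relevance(query: str, chunk: str) -> float:
    """Fraction of query words that appear in the chunk (toy scoring)."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def select_context(query: str, chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily pick the most relevant chunks that fit in the token budget."""
    ranked = sorted(chunks, key=lambda ch: relevance(query, ch), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if relevance(query, chunk) == 0.0:
            break  # everything after this point is irrelevant too
        cost = estimate_tokens(chunk)
        if used + cost <= budget_tokens:
            selected.append(chunk)
            used += cost
    return selected


chunks = [
    "Invoice 1042 was paid on 3 February.",
    "The office plant needs watering twice a week.",
    "Invoice 1043 is overdue and awaiting payment.",
]
query = "which invoice is overdue"
print(select_context(query, chunks, budget_tokens=10))
```

With a budget of 10 tokens only the overdue-invoice chunk fits; raising the budget to 20 also admits the other invoice, while the irrelevant chunk is never selected. A real product would replace the scoring and token counting with proper retrieval and a real tokenizer, but the shape of the problem, ranking under a hard budget, stays the same.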

This doesn’t contradict the bitter lesson.⁸ Sutton’s argument is that general methods that leverage more computation eventually outperform all hand-crafted solutions and clever engineering. This is true for algorithms and training, where scaling has consistently won, but the bitter lesson is about what’s theoretically optimal, not what’s deployable given physical constraints. Power grids and memory fabs don’t follow exponential curves. The bitter lesson describes where we’ll end up eventually, but it doesn’t tell us how long it will take to get there.

There is a large gap between what current LLMs can do and the value delivered by current products. This product overhang is real and will remain an opportunity in the coming years. Not everything being built now will be obsolete when better models arrive, because physical constraints will limit how quickly they can be developed and deployed.

Context window saturation might be where general AI progress stalls this time, with breakthroughs likely taking years. In the meantime, products that serve well as context operating systems for their users will become tremendously valuable.

Posted 1st March 2026 at 11:45 am

« Felipe Antolinez »