Felipe Antolinez's Weblog: llms

Anthropic Opus Model Degradation

2026-06-03T11:15:00+00:00

I know that the following is very unscientific and just "vibes," but in my personal experience, Anthropic's models have severely degraded since shortly before Opus 4.7 was released about six weeks ago. And my initial impression from the week or so I've spent with 4.8 is that it isn't any better.

Three months ago, Opus 4.6 was great and highly reliable. The recent 4.7 and 4.8 models, however, are much less reliable for me. They often think for a long time before suddenly speeding up and answering suspiciously quickly. I now also frequently see the model correct itself in the final answer, as if it hadn't already settled on its answer during the reasoning tokens. And the models seem to lose context even in short conversations of no more than a few thousand tokens. For a while, I switched to the Opus 4.6 1M context window model instead of 4.7, but I feel that 4.6 has also degraded in recent weeks, as surprising as that may sound.

I'm reminded of the time around March/April, when Anthropic admitted that a few changes to the Claude Code harness had degraded its performance. According to their postmortem, the bugs were in the harness, not the model weights or the API. But this time, I don't think it's only the Claude Code harness, because I'm seeing the same problems in the Claude desktop app. Yesterday I asked Opus 4.8 to write some SQL queries for an investigation, a task I've done almost daily for well over a year. I usually provide the table schemas and indices, and the models normally write long, complex queries based on my instructions without trouble. This time, it just wouldn't understand what I needed, and I had to be far more explicit than ever before, even though it was a relatively simple task.

I wonder whether it's quantization or context compaction because of Anthropic's compute crunch, or whether the models have been overtrained on benchmarks with very clear task descriptions. But I need models to read between the lines and remember the context of a conversation. If I have to be very explicit, I might as well write the SQL query, code, or text myself. I haven't tried any serious workloads on local models yet, but I'm sure I'd be even more frustrated with them than with the latest Opus models.

The crazy thing is that I don't want the models to be bad. I'd much rather write about the cool things I'm building with them than about how they've regressed. I genuinely want Anthropic models to be good, and the most frustrating part is knowing they were great just a few months ago, but having no way to access them now.

View the original LinkedIn post

Tags: linkedin, ai, llms

Note on 15th May 2026

2026-05-15T11:25:00+00:00

If you're still using BERT-style token classification models for NER tagging in production, you should probably reevaluate.

Last summer, we replaced our token classification model with Google's Gemini 2.5 Flash Lite for NER tagging people, companies, and locations on millions of news articles per day. At first, it felt wrong and overkill to replace a well-established, standard approach with a generative model. However, on our own evaluation datasets, the LLM outperformed every BERT model we had implemented previously because it brings so much more contextual understanding to the task.

There are a few obvious advantages to using LLMs for NER tagging. For example, LLMs can easily handle text like "Michael and Jennifer Smith" and correctly extract both "Michael Smith" and "Jennifer Smith" as separate people. They are also much better at handling formatting issues and messy edge cases you inevitably encounter in real-world data at scale.
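A minimal sketch of what LLM-based NER tagging can look like: the prompt asks for JSON, and the code validates whatever comes back. The prompt wording, the label set, and the `call_llm` parameter are assumptions for illustration only. `call_llm` stands in for whatever inference API you use and is not a function from a real library.

```python
import json
from typing import Callable

# Hypothetical label set matching the three entity types mentioned above.
LABELS = {"person", "company", "location"}

# Illustrative prompt wording, not a tuned production prompt.
PROMPT = (
    "Extract all people, companies, and locations from the text below. "
    "Return only a JSON list of objects with the keys 'text' and 'label', "
    "where label is one of: person, company, location. "
    "Write out full names: 'Michael and Jennifer Smith' yields "
    "'Michael Smith' and 'Jennifer Smith'.\n\nText:\n{text}"
)


def extract_entities(text: str, call_llm: Callable[[str], str]) -> list[dict]:
    """Ask the model for entities and keep only well-formed items."""
    raw = call_llm(PROMPT.format(text=text))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        span, label = item.get("text"), item.get("label")
        # Drop items with missing fields or labels outside the allowed set.
        if isinstance(span, str) and isinstance(label, str) and label in LABELS:
            entities.append({"text": span, "label": label})
    return entities


def fake_llm(prompt: str) -> str:
    """Stand-in for a real inference call, returning a canned response."""
    return (
        '[{"text": "Michael Smith", "label": "person"}, '
        '{"text": "Jennifer Smith", "label": "person"}, '
        '{"text": "n/a", "label": "date"}]'
    )


result = extract_entities("Michael and Jennifer Smith moved to Austin.", fake_llm)
print(result)
```

Validating the output in code matters because a generative model can return malformed JSON or labels you never asked for, which a token classifier cannot do by construction.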

Deployment is also dramatically simpler: instead of managing model serving infrastructure, you're calling an inference API that you can parallelize and scale easily. Additionally, you automatically benefit from LLMs getting better and cheaper over time without changing anything on your end, provided that you have a solid eval dataset.

We're now processing close to 1B input tokens and producing 100M output tokens per day on this pipeline alone. The most popular pre-trained NER models on Hugging Face are still downloaded millions of times per month, which tells me that structured text extraction with LLMs is one of the most underrated applications right now.

View the original LinkedIn post

Tags: linkedin, ai, llms, ren

AI Is Not Great at Writing Prompts for Itself

2026-04-29T11:15:00+00:00

Here is something interesting I've learned in the past few weeks about building AI products: AI is not great at writing prompts for itself. It's a trap I've fallen into repeatedly, and I suspect many others have too.

It usually goes like this: Working with a coding agent like Claude Code, you start building a feature that uses one or more prompts for an LLM call, agent, or workflow. The first version works, and the output is ok, but not great. So you show your coding agent some examples and tell it what you don't like, and it edits the prompt. Now, the new output is different, and on the exact same task, it's slightly better.

After repeating this a few times, you convince yourself that the prompt is better than what you started with. However, when you look at it, you realize it's 3x longer than the original. Deep down, you know it still isn't great, but at that point, other priorities take over, and you move on to something supposedly more important.

The issue is that LLMs love producing tokens. They keep extending the prompt based on your feedback, adding some examples here, and DON'Ts and MUST NOTs all over the place. The LLM also starts decompressing itself, spelling out in more words things it obviously already knows. However, this decompression is overfit to the specific cases you iterated on, so it doesn't generalize well to other cases.

The only way out is to actually figure out what you want and why, and write it down precisely yourself, which will often cut the prompt by 90%. It's hard work, and you have to spend real brain cycles distilling all your observations into as few precise words as possible. But I think this is the only way to achieve a great result with an AI-based product for a task that isn't easy to verify or write evals for.

View the original LinkedIn post

Tags: linkedin, ai, llms, coding-agents

Context Windows Are Limited by Atoms, Not Bits

2026-03-01T11:45:00+00:00

There is a popular narrative in tech right now: AI progress is exponential, context windows will grow to infinity, and all vertical AI products will soon be replaced by general-purpose AI that can use all the context of your entire business. This implies that the big players like Anthropic, OpenAI, and Google, with their general-purpose agents like Claude Cowork, ChatGPT, or Gemini, will subsume all software.

I don’t think this will happen. While advertised context windows have grown to 1M or even 10M tokens, there’s a widening gap between advertised capacity and what models can reliably use. Effective context window sizes have been saturating over the past 6–12 months and remain at 200K–1M tokens for most tasks.

The reason is physics. Most people talk only about model capability, but there are actually three things to AI: atoms (the hardware), bits (the logic), and power (the energy required to move electrons through hardware to make computations). Recent breakthroughs have been almost entirely in bits, which means that AI progress in general and context window size specifically will be constrained by atoms and power.

Why Attention Is Hard to Scale

Attention in transformer models has been the basis of all AI progress in recent years.¹ However, the complexity of attention scales quadratically with the context window size, with perhaps surprising implications for memory requirements.

A 1M-token context window corresponds to roughly 5MB of plain text, which isn’t much for many tasks. However, each token doesn’t require only storing a single number. Depending on the embedding size, each token requires storing thousands of numbers across many different layers of the model. Therefore, the key–value cache for a frontier transformer model to run inference on a 1M-token context window easily requires tens or even hundreds of GB of working memory, which is four to five orders of magnitude more than the raw text.
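A back-of-the-envelope sketch makes the arithmetic concrete. The model dimensions below are hypothetical placeholders, not the published configuration of any frontier model. The formula itself is the standard size of an uncompressed key–value cache: two tensors (keys and values) times layers, KV heads, head dimension, bytes per value, and tokens.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of an uncompressed key-value cache for a single sequence."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical dimensions, chosen only for illustration (16-bit values).
cache = kv_cache_bytes(num_tokens=1_000_000, num_layers=80,
                       num_kv_heads=8, head_dim=128)
raw_text_bytes = 5_000_000  # roughly 5 MB of plain text

print(f"KV cache: {cache / 1e9:.0f} GB")
print(f"Ratio to raw text: {cache // raw_text_bytes:,}x")
```

With these placeholder values, the cache for a single 1M-token session comes out at roughly 330 GB, tens of thousands of times the size of the raw text.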

Many tricks to extend context windows work by avoiding “true” attention, where each token attends to every other token, and this shortcut comes with substantial performance costs. Alternative architectures like state-space models promise sub-quadratic scaling, but none have matched current transformer-based frontier models so far. The accuracy tradeoff is likely fundamental, and there is no free lunch.

This means that, in practice, effective context length rarely exceeds half of the advertised maximum. Models exhibit a U-shaped performance curve, performing best on information at the beginning and end while degrading on context in the middle.² And even when models retrieve information perfectly, longer inputs still hurt reasoning.³ Without major breakthroughs in memory technology and power infrastructure, usable context windows are unlikely to grow substantially in the coming years.

Some numbers may help to make this concrete. A frontier model like Opus 4.6 or GPT-5.3 already requires hundreds of gigabytes just to store the weights. NVIDIA’s next-generation GPU, the Rubin R100, which should start shipping in the second half of this year, will have 288 GB of high-bandwidth memory—the same amount as the Blackwell B300 GPU, which started shipping in the second half of 2025. A single long-context session consumes most of the available memory. Therefore, production context windows have expanded on paper, but the effective ceiling at which models reason reliably has barely moved.

High-bandwidth memory and power, not compute, are currently the hard constraints. Memory is expensive, physically difficult to manufacture, and supply-constrained in the coming years.⁴ On the power side, data center electrical capacity in the US is nearly maxed out, with utility connection wait times exceeding 3–5 years.⁵ Increasing AI demand will make both constraints even tighter.

Products as Context Operating Systems

Andrej Karpathy put it well: you can think of the LLM as a CPU and the context window as RAM, which means you need something like an operating system to select and manage context.⁶, ⁷ Therefore, context engineering, the practice of selecting the right data for the model's context window depending on the task, will remain essential.
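As a toy illustration of context engineering, the sketch below packs the most relevant snippets into a fixed token budget. Everything in it is an assumption: the `Snippet` type, the precomputed relevance scores, and the four-characters-per-token estimate are placeholders, and a real system would use a proper tokenizer and retrieval step.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    relevance: float  # assumed to come from an upstream retrieval or ranking step


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly four characters per token for English text.
    return max(1, len(text) // 4)


def select_context(snippets: list[Snippet], budget_tokens: int) -> list[Snippet]:
    """Greedily keep the highest-relevance snippets that fit into the budget."""
    chosen: list[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(snippet.text)
        if used + cost <= budget_tokens:
            chosen.append(snippet)
            used += cost
    return chosen


candidates = [
    Snippet("a" * 40, relevance=0.9),  # about 10 tokens
    Snippet("b" * 40, relevance=0.5),  # about 10 tokens
    Snippet("c" * 8, relevance=0.7),   # about 2 tokens
]
selected = select_context(candidates, budget_tokens=12)
print([s.text[0] for s in selected])
```

The greedy strategy is the simplest possible choice. It is only meant to show that the budget, not the amount of available data, decides what the model gets to see.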

This doesn’t contradict the bitter lesson.⁸ Sutton’s argument is that general methods that leverage more computation eventually outperform all hand-crafted solutions and clever engineering. This is true for algorithms and training, where scaling has consistently won, but the bitter lesson is about what’s theoretically optimal, not what’s deployable given physical constraints. Power grids and memory fabs don’t follow exponential curves. The bitter lesson describes where we’ll end up eventually, but it doesn’t tell us how long it will take to get there.

There is a large gap between what current LLMs can do and the value delivered by current products. This product overhang is real and will remain an opportunity in the coming years. Not everything being built now will be obsolete when better models arrive, because physical constraints will limit how quickly they can be developed and deployed.

Saturating context windows might be where general AI progress stalls this time, with breakthroughs likely taking years. In the meantime, products that serve as effective context operating systems for their users will become tremendously valuable.

Tags: ai, llms, bitter-lesson, product, saas

10 Years Building Vertical Software: My Perspective on the Selloff

2026-02-26T11:55:51.305492+00:00

Nicolas Bustamante, who has built vertical software on both sides of the LLM disruption (Doctrine for legal, Fintool for equity research), wrote a moat-by-moat analysis of vertical SaaS that is worth reading. In his view, five moats collapse (learned interfaces, custom workflows, public data access, talent scarcity, and bundling), while five hold (proprietary data, regulatory lock-in, network effects, transaction embedding, and system-of-record status).

A few things I think are missing. The biggest threat to vertical software incumbents probably isn't scrappy AI startups building 80% of the features at 20% of the cost (like his new Fintool company). It's that products like Claude Cowork can do 80% of what vertical software does out of the box, with general agents and data access, at marginal implementation cost. Once integrated, enterprises might trust Anthropic, OpenAI, and Google more than they trust a vibe-coded startup.

There's also a scenario Bustamante doesn't address: LLMs themselves will likely commoditize. If that happens, model providers will have to fight for companies and startups to use their tokens. That's precisely why Anthropic, OpenAI, and Google are strongly pushing into the product space themselves, because products might be more defensible than models. This raises an uncomfortable question for Bustamante's own company, Fintool: if what they built is, as he describes, essentially markdown skill files integrating with MCPs and foundation model APIs, what's their defense against the model providers doing the same thing?

Tags: ai, llms, startups, saas

Note on 25th February 2026

2026-02-25T13:00:00+00:00

Is this where LLMs picked up their famous sycophantic phrase and behavior?

Currently reading Conscious Business by Fred Kofman, a classic on values and authentic communication at work, and stumbled across this on page 57.

View the original LinkedIn post

On Conscious Business by Fred Kofman

Tags: linkedin, ai, llms, management

Note on 10th February 2026

2026-02-10T13:00:00+00:00

For the first time, you can build a competitive recommender system without a single user.

The playbook behind the flywheel of many of the most powerful companies has been: build a platform, measure engagement data, identify patterns, and make better recommendations to attract more users. That's the network effect that made Google, Meta, and Amazon so hard to compete with.

But LLMs have a compressed representation of that same knowledge from training on vast amounts of the internet. So to overcome the cold-start problem, you no longer need to measure and train on years of engagement data from your own users.

Now you can just give an LLM some context of a user, for example, their social media profile, to get high-quality recommendations. It's a fundamentally different entry point to personalization, one that doesn't depend on scale.

Are LLMs about to break the recommender-system moat?

View the original LinkedIn post

Tags: linkedin, ai, llms

An Interview with Benedict Evans About AI and Software

2026-02-08T09:09:10+00:00

Benedict Evans articulates a great insight really well here: LLMs might be a real threat to recommender system moats.

The playbook to build a flywheel has been the following: build a platform, measure engagement data, find patterns, and make better recommendations to attract more users. That's the network effect that made Google, Meta, and Amazon so hard to compete with. But LLMs have a compressed representation of that same knowledge from training on vast amounts of the internet, without having to measure engagement of real users.

To overcome the cold-start problem, you don't need years of engagement data anymore. Now you can just give an LLM some context of a user, for example their social media profile, to get high-quality recommendations. It's a fundamentally different entry point to personalization, one that doesn't depend on scale.

Tags: stratechery, ai, llms, recommender-systems

Note on 23rd January 2026

2026-01-23T14:00:00+00:00

LLMs had a rough start to 2026.

Last week, our AI agent told a user their Monday meeting was on Sunday. The reason? In 2025, January 12th was a Sunday—and that's what the model learned during training.

We're seeing this pattern across many date-related tasks: models confidently inferring the wrong day of the week, or hallucinating years that weren't in the input text.

The obvious fix is to inject today's date into every prompt. However, that's not always what you want. For example, if you're extracting dates from text, adding context about "today" can bias the output and cause hallucinations of its own.
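For the cases where injecting the date is appropriate, one option is to compute the weekday in code rather than letting the model infer it. A minimal sketch using only Python's standard library; `build_prompt` and the prompt wording are hypothetical and not taken from any production system.

```python
from datetime import date

# Fixed English names, so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_name(day: date) -> str:
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return WEEKDAYS[day.weekday()]


def build_prompt(task: str, today: date) -> str:
    """Prefix a task with an explicit, code-computed date and weekday."""
    header = f"Today is {weekday_name(today)}, {today.isoformat()}."
    return f"{header}\n\n{task}"


# The same calendar date falls on different weekdays in consecutive years.
print(weekday_name(date(2025, 1, 12)))  # Sunday
print(weekday_name(date(2026, 1, 12)))  # Monday
print(build_prompt("When is my next meeting?", date(2026, 1, 12)))
```

Passing `today` in as a parameter, instead of calling `date.today()` inside the function, keeps the prompt reproducible in tests and evals.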

Is this the new Y2K bug we'll have to deal with every January? Or will future models be more robust to year transitions?

View the original LinkedIn post

Tags: linkedin, ai, llms

Note on 16th December 2025

2025-12-16T14:00:00+00:00

Last weekend, my son was working on a Bandolino puzzle where he had to match questions to answers with a piece of string.

OpenAI released GPT-5.2 last week, claiming it "performs at or above human expert level" across vision, math, and physics benchmarks [1]. I was curious how long it would take GPT-5.2 to solve the puzzle. It failed completely.

How can such a powerful model fail at a task that a four-year-old solved in under a minute after seeing it for the first time?

Andrej Karpathy coined the term "jagged intelligence" for this phenomenon [2]. LLMs can solve complex problems that seem hard to humans while failing at tasks that seem trivially easy. Unlike human intelligence, where abilities tend to correlate and develop together, LLM capabilities are jagged and unpredictable.

To put it another way: while these models are extremely powerful, they can't be trusted.

What does this imply for deploying LLMs in production settings?

(1) Benchmarks give you no guarantees—you have to evaluate models on your own tasks.

(2) Your overall system has to be tolerant of these jagged edges. Use LLMs for the tasks that they are good at and keep a human in the loop on all critical decisions.

(3) You have to take security extremely seriously. Meta's "Agents Rule of Two" is a great framework for AI agent security that is simple to remember and apply in practice [3].

At Ren Systems, we leverage LLMs extensively to create value for our users. But it's always our users who ultimately take action.

What's one lesson you've learned deploying LLMs in production?

References/links

[1] Introducing GPT-5.2 – https://openai.com/index/introducing-gpt-5-2/

[2] Andrej Karpathy on jagged intelligence – https://x.com/karpathy/status/1816531576228053133

[3] Meta's Agents Rule of Two – https://ai.meta.com/blog/practical-ai-agent-security/

View the original LinkedIn post

Tags: linkedin, ai, llms

Quoting Rodney Brooks, Robust.AI

2023-05-21T12:00:00+00:00

What the large language models are good at is saying what an answer should sound like, which is different from what an answer should be.

— Rodney Brooks, Robust.AI

Tags: ai, llms