I know that the following is very unscientific and just "vibes," but in my personal experience, Anthropic's models have severely degraded since shortly before Opus 4.7 was released about six weeks ago. And my initial impression from the week or so I've spent with 4.8 is that it isn't any better.

Three months ago, Opus 4.6 was great and highly reliable. However, the performance of the recent 4.7 and 4.8 models is much less reliable for me. They often think for a long time before suddenly speeding up and answering suspiciously quickly. I now also frequently see the model correct itself in the final answer, as if it hadn't already settled on its answer during the reasoning tokens. And the models seem to lose context even in short conversations of no more than a few thousand tokens. For a while, I switched to the Opus 4.6 1M context window model instead of 4.7, but I feel that 4.6 has also degraded in recent weeks, as surprising as that may sound.

I'm reminded of the time around March/April, when Anthropic admitted that a few changes to the Claude Code harness had degraded its performance. According to their postmortem, the bugs were in the harness, not the model weights or the API. But this time, I don't think it's only the Claude Code harness, because I'm seeing the same problems in the Claude desktop app. Yesterday I asked Opus 4.8 to write some SQL queries for an investigation, a task I've done almost daily for well over a year. I usually provide the table schemas and indices, and the models normally write long, complex queries based on my instructions without trouble. This time, it just wouldn't understand what I needed, and I had to be far more explicit than ever before, even though it was a relatively simple task.

I wonder whether it's quantization or context compaction because of Anthropic's compute crunch, or whether the models have been overtrained on benchmarks with very clear task descriptions. But I need models to read between the lines and remember the context of a conversation. Otherwise, I might as well write the SQL query, code, or text myself if I have to be very explicit. I haven't tried any serious workloads on local models yet, and I'm sure I'd be even more frustrated than with the latest Opus models.

The crazy thing is that I don't want the models to be bad. I'd much rather write about the cool things I'm building with them than about how they've regressed. I genuinely want Anthropic models to be good, and the most frustrating part is knowing they were great just a few months ago, but having no way to access them now.

View the original LinkedIn post

Recent articles