Felipe Antolinez's Weblog

There Is No Exactly-Once Delivery in Asynchronous Task Queues

2026-07-08T11:15:00+00:00

There is no exactly-once delivery in asynchronous task queues, at least not in large-scale real-world systems. Even though many online guides and tutorials lead you to believe it exists, it’s logically impossible. This is the single most important lesson I wish someone had shared with us six years ago when we started running Celery on AWS ECS.

In practice, you can get very close to exactly-once delivery depending on your specific use case. To do this, you have to choose which side of the limit you approach: at-most-once or at-least-once delivery. Each side comes with its own trade-offs and subtleties, which is why you have to design your system very deliberately and can’t just take the default settings of any framework like Celery.

Celery is one of the most widely used task queues in Python, but I think its defaults aren’t great for most applications. Therefore, it takes a lot of work and experience to set it up to run reliably. We paid for most of these lessons with many hours of downtime and incident response, before finally arriving at a setup that works well for us and runs hundreds of thousands of tasks per day.

Our lead backend engineer Jan Giacomelli has now consolidated all these hard-learned lessons into a definitive guide for running Celery on AWS ECS. It covers all the gotchas, from why tasks get lost, stuck, or processed twice in the first place to the settings and task design patterns you can use to prevent them.

I wish this guide had existed back then: https://jangiacomelli.com/blog/celery-on-aws-ecs/

Tags: linkedin, software-engineering, infrastructure, ren

Taste and Judgment in the Product Design Process

2026-07-02T11:15:00+00:00

Taste and judgment have become the new buzzwords of the product world. Everyone agrees they matter more than ever, but I’ve rarely seen anyone define them precisely. So here’s my attempt, in computer science terms.

The classical product design process is a breadth-first search. You extensively map the problem space, interview a lot of users, explore many directions with low-fidelity prototypes, and only commit to one solution late. The process is deliberately systematic, which means it protects you from your own bad intuitions and biases by exploring the whole decision tree. It guides the search for you, at the cost of speed.

With AI, the process is becoming a depth-first search. Now you can go from a rough idea to a production-looking prototype in a few hours, effectively committing to one branch of the tree without ever having explored the others. Taste or judgment, then, is the ability to intuitively direct that depth-first search. It’s picking the right branch early, sensing when a path is a dead end, and knowing when to backtrack instead of digging deeper. Someone with great early judgment finds the solution in a fraction of the time. But if, and only if, their early decisions are right a lot more often than not.

The goal isn’t to skip the classical work. User research still matters, but I think that you can learn much faster from a working prototype than from an abstract discovery phase. Therefore, you can use the same activities in a more directed way and with a larger step size per iteration.

The danger is that every artifact generated with AI looks finished because it produces polished prototypes by default. Before, a prototype earned its polish through deliberate human effort, so its looks told you something about how well the underlying idea had been worked out. Because this signal is now gone, we have to communicate much more explicitly where in the design process a prototype sits.

I think the people who will thrive in the new process are product managers and designers who can also build, and experienced engineers with real product and business sense. Both can traverse the tree quickly, and most importantly, have the judgment to save a lot of time by being directionally right in many of their early decisions.

Tags: linkedin, ai, product, design

Quoting Jim Nielsen

2026-06-25T10:05:05.962627+00:00

So it must be that a key ingredient to blogging is simple: have a willingness to state something that seems obvious to you but nobody else is saying it.

Or if someone else is saying it, just link to them and say, “Yes!!! This!!!”

— Jim Nielsen, Jim Nielsen on the key ingredient of good blogging.

Tags: blogging, writing

The Real Value of MCP Is Connecting Systems That Were Never Meant to Talk

2026-06-18T11:15:00+00:00

I have used MCP servers every now and then in the past, but they have never saved me much time. Last week, however, I got real value out of them for the first time, and I want to share what I learned.

Using an MCP integration as a “glorified form-filling tool” for a service that already offers a purpose-built UI can be useful, but in my experience, it doesn’t meaningfully increase productivity. Instead, the key insight was to use not one but two MCP servers, from entirely different systems that were never designed to communicate, and to use the LLM to connect them.

In my case, I used the Mixpanel and Notion MCP connectors, together with Claude, to build new analytics dashboards in Mixpanel, based on the extensive documentation I had previously written in Notion about our analytics and new onboarding flow. That documentation, together with the implementation tickets, gave Claude enough context to build exactly the dashboards I needed, from only a clear but high-level description of what I wanted. The resulting dashboards only required minimal manual adjustments from me.

I came to see that MCP integrations are not really about interacting with a server in natural language. They’re the piping that an LLM can use to connect or glue together deterministic systems that weren’t designed to talk to each other. And the more services you can wire together in a useful way, the more value they can provide.

For this reason, they are a great tool for prototyping and for producing any stateful output or asset that can be reviewed and refined before it is used or shared. However, I’m a lot less convinced they’ll be useful for building enterprise workflows that automate business processes and need to be robust. At least to date, MCP servers can easily have breaking changes, aren’t versioned, and so on. And the more MCP servers are involved, the more fragile any workflow becomes.

Lastly, my practical advice for the product engineers tasked with building an MCP server is not to think of it as a wrapper around your API, but to ask yourself how to build an interface for your service that is maximally useful when combined with any other existing service, to enable unique and custom use cases for the user.

Tags: linkedin, ai, mcp, product

Quoting Tom Bedor

2026-06-12T12:39:19+00:00

[…] if reading this wasn’t worth your time, why is it worth mine?

Therefore, I’ve adopted this principle in my work:

If you are requesting human attention, demonstrate human effort.

— Tom Bedor, On the practice of sending unsolicited AI slop to colleagues.

Tags: ai, management

The Other Side of the $47 Billion Bill

2026-06-10T11:55:31+00:00

Anthropic announced two weeks ago that its run-rate revenue crossed $47 billion. With most of the coverage focusing on numbers like these, it’s easy to forget the companies that are sitting on the other side of this bill, actually spending all this money on tokens.

To put the number in perspective, Anthropic’s current run-rate revenue alone is roughly half the size of the entire global CRM market. Together with others like OpenAI and the AI revenue flowing through Amazon Web Services (AWS), Google Cloud, and Microsoft Azure, the total token spend is approaching the size of software categories like CRM and ERP systems that took decades to mature.

Almost every company agrees it needs some form of CRM or ERP system, and no one has to justify the existence of that line item. For AI, however, the case is much less settled, and CFOs are now demanding to see ROI on token spend, which can be genuinely hard to measure or attribute with general-purpose tools like Claude or ChatGPT.

On top of that, current changes to pricing models make budgeting and buying decisions even harder for enterprises. In a Stratechery interview last week, Satya Nadella described the future of software pricing as hybrid, combining a per-user model with a consumption model, because “there is real marginal cost to software” now, and that cost will be priced through. But enterprises are used to per-seat licenses and like them precisely because they are predictable. Ironically, many vendors are now moving toward usage-based pricing to control their own costs, just as buyers are asking for predictability.

Therefore, I doubt that frontier-lab revenue will continue to grow as steeply as it has over the past few months, and we might reach a temporary plateau due to uncertainty. I am convinced that this presents an opportunity for startups offering purpose-built AI systems focused on solving specific business problems, with a much more predictable pricing model and cost structure.

Tags: linkedin, ai, business

Anthropic Opus Model Degradation

2026-06-03T11:15:00+00:00

I know that the following is very unscientific and just "vibes," but in my personal experience, Anthropic's models have severely degraded since shortly before Opus 4.7 was released about six weeks ago. And my initial impression from the week or so I've spent with 4.8 is that it isn't any better.

Three months ago, Opus 4.6 was great and highly reliable. However, the performance of the recent 4.7 and 4.8 models is much less reliable for me. They often think for a long time before suddenly speeding up and answering suspiciously quickly. I now also frequently see the model correct itself in the final answer, as if it hadn't already settled on its answer during the reasoning tokens. And the models seem to lose context even in short conversations of no more than a few thousand tokens. For a while, I switched to the Opus 4.6 1M context window model instead of 4.7, but I feel that 4.6 has also degraded in recent weeks, as surprising as that may sound.

I'm reminded of the time around March/April, when Anthropic admitted that a few changes to the Claude Code harness had degraded its performance. According to their postmortem, the bugs were in the harness, not the model weights or the API. But this time, I don't think it's only the Claude Code harness, because I'm seeing the same problems in the Claude desktop app. Yesterday I asked Opus 4.8 to write some SQL queries for an investigation, a task I've done almost daily for well over a year. I usually provide the table schemas and indices, and the models normally write long, complex queries based on my instructions without trouble. This time, it just wouldn't understand what I needed, and I had to be far more explicit than ever before, even though it was a relatively simple task.

I wonder whether it's quantization or context compaction because of Anthropic's compute crunch, or whether the models have been overtrained on benchmarks with very clear task descriptions. But I need models to read between the lines and remember the context of a conversation. Otherwise, I might as well write the SQL query, code, or text myself if I have to be very explicit. I haven't tried any serious workloads on local models yet, and I'm sure I'd be even more frustrated than with the latest Opus models.

The crazy thing is that I don't want the models to be bad. I'd much rather write about the cool things I'm building with them than about how they've regressed. I genuinely want Anthropic models to be good, and the most frustrating part is knowing they were great just a few months ago, but having no way to access them now.

Tags: linkedin, ai, llms

We Should All Be Using Dependency Cooldowns

2026-05-27T11:15:00+00:00

Dependency cooldowns are a highly effective way to mitigate supply-chain attacks, which have become a lot more frequent recently. And because it’s such a simple strategy to implement, I think that every project should adopt it.

Once a malicious release of a popular package is published, the attacker’s window of opportunity is usually less than a week before it’s detected. Therefore, as William Woodruff showed, adding a one- or two-week cooldown period before adopting any new release would have prevented most of the prominent supply-chain attacks in recent months.

Implementation in uv for Python is as simple as adding this one line to your pyproject.toml:

toml exclude-newer = "1 week"

Dependabot, Renovate, and pnpm all have equivalent features.

One common pushback against this strategy is that it stops working if everyone adopts it, but I don’t think this is correct. Compromised releases get caught quickly, not because they are used in a real exploit first, but because there are researchers actively looking for them.

Dependency cooldowns, of course, don’t catch all supply-chain attacks, but for a one-line config change, they offer a lot of protection.

Tags: linkedin, security

Note on 15th May 2026

2026-05-15T11:25:00+00:00

If you're still using BERT-style token classification models for NER tagging in production, you should probably reevaluate.

Last summer, we replaced our token classification model with Google's Gemini 2.5 Flash Lite for NER tagging people, companies, and locations on millions of news articles per day. At first, it felt wrong and overkill to replace a well-established, standard approach with a generative model. However, on our own evaluation datasets, the LLM outperformed every BERT model we had implemented previously because it brings so much more contextual understanding to the task.

There are a few obvious advantages to using LLMs for NER tagging. For example, LLMs can easily handle text like "Michael and Jennifer Smith" and correctly extract both "Michael Smith" and "Jennifer Smith" as separate people. They are also much better at handling formatting issues and messy edge cases you inevitably encounter in real-world data at scale.

Deployment is also dramatically simpler: instead of managing model serving infrastructure, you're calling an inference API that you can parallelize and scale easily. Additionally, you automatically benefit from LLMs getting better and cheaper over time without changing anything on your end, provided that you have a solid eval dataset.

We're now processing close to 1B input tokens and producing 100M output tokens per day on this pipeline alone. The most popular pre-trained NER models on Hugging Face are still downloaded millions of times per month, which tells me that structured text extraction with LLMs is one of the most underrated applications right now.

Tags: linkedin, ai, llms, ren

Note on 7th May 2026

2026-05-07T11:15:00+00:00

Have you ever wondered what “the cloud” actually looks like? It’s a lot more physical than it sounds.

This Google Maps screenshot shows the so-called “Data Center Alley” in Ashburn, Northern Virginia—a cluster of warehouse-sized data centers located between a golf course and a few suburbs that look like they’re straight out of an American movie. AWS’s famous us-east-1 lives here, along with data centers from Microsoft, Google, Meta, IBM, Oracle, and many others. According to a frequently cited estimate from the local economic development office, around 70% of the world’s internet traffic passes through here. I have some doubts about whether this estimate is still accurate, but it is likely the world’s largest concentration of digital infrastructure.

As a software engineer or AI builder, it’s easy to forget that whatever services you call or build on, your code is actually moving photons through optical fibers and electrons across silicon somewhere, drawing real power from an electrical grid. AI is progressing really quickly right now, but the underlying physical infrastructure imposes constraints on how much of that progress is actually deployable.

Tags: linkedin, ai, infrastructure

Appearing Productive in The Workplace

2026-05-07T08:31:23+00:00

Appearing Productive in The Workplace

A nuanced, well-written blog post on the dangers of using AI in the workplace.

The author identifies two distinct failure modes:

Generative AI can produce work that looks expert without being expert, and the failure arrives in two shapes. The first is when novices in a field are able to produce work that resembles what their seniors produce, faster or more advanced than their judgment. The second is when people generate artifacts in disciplines they were never trained in.

Another interesting observation is on workslop, now that the cost of producing a document has fallen to nearly zero:

Requirements documents that were once a page are now twelve. Status updates that were once three sentences are now bulleted summaries of bulleted summaries. Retrospective notes, post-incident reports, design memos, kickoff decks: every artifact that can be elongated is, by people who do not read what they produce, for readers who do not read what they receive.

The author also writes about the implications for companies, which is a view I share:

For firms, the competitive advantage of a firm whose work can be trusted has not disappeared; it has, if anything, appreciated, because so many of the firm’s competitors are quietly converting themselves into content-generation pipelines and counting on the client not to notice.

Tags: ai, management

Quoting yacineMTB

2026-05-04T18:40:19+00:00

you can outsource your thinking but you cannot outsource your understanding

— yacineMTB, in a tweet that Andrej Karpathy mentioned in his AI Ascent talk “From Vibe Coding to Agentic Engineering”.

Tags: ai, coding-agents

AI Is Not Great at Writing Prompts for Itself

2026-04-29T11:15:00+00:00

Here is something interesting I've learned in the past few weeks about building AI products: AI is not great at writing prompts for itself. It's a trap I've fallen into repeatedly, and I suspect many others have too.

It usually goes like this: You start building a feature that uses one or more prompts for an LLM call, agent, or workflow with a coding agent like Claude Code. The first version works, and the output is ok, but not great. So you show your coding agent some examples and tell it what you don't like, and it edits the prompt. Now, the new output is different, and on the exact same task, it's slightly better.

After repeating this a few times, you convince yourself that the prompt is better than what you started with. However, when you look at it, you realize it's 3x longer than the original. Deep down, you know it still isn't great, but at that point, other priorities take over, and you move on to something supposedly more important.

The issue is that LLMs love producing tokens. They keep extending the prompt based on your feedback, adding some examples here, and DON'Ts and MUST NOTs all over the place. The LLM also starts decompressing itself, spelling out in more words things it obviously already knows. However, this decompression is overfit to the specific cases you iterated on, so it doesn't generalize well to other cases.

The only way out is to actually figure out what you want and why, and write it down precisely yourself, which will often cut the prompt by 90%. It's hard work, and you have to spend real brain cycles distilling all your observations into as few precise words as possible. But I think this is the only way to achieve a great result with an AI-based product for a task that isn't easy to verify or write evals for.

Tags: linkedin, ai, llms, coding-agents

Quoting Sally Kornbluth

2026-04-20T14:07:33.482921+00:00

The way you maintain meritocracy and excellence is to make sure that each person you bring in, and for us, this means all of our faculty, all of our staff, all of our students, we have to consistently focus on excellence. There was a colleague of mine at Duke who had a sign in his office that said, if you take a lick of the lollipop of mediocrity, you will suck forever.

— Sally Kornbluth, Long Strange Trip podcast interview with Sally Kornbluth, MIT's president.

Tags: management, podcast

Note on 15th April 2026

2026-04-15T11:15:00+00:00

I think that everyone working in AI should build their own agent from scratch, at least once. Not because it's hard, but because it's surprisingly easy, which is precisely the point.

Exactly one year ago today, I read Thorsten Ball's How to Build an Agent, or: The Emperor Has No Clothes. In this blog post, which is the single piece of text that most influenced me last year, he shows how to build a fully functional coding agent from scratch in under 400 lines of code.

Shortly thereafter, we did a one-day hackathon at Ren Systems, and by the end of the day, we had our own working agent running in the terminal. We didn't even use Anthropic's Claude Code back then, so we actually typed the complete agent code manually.

Only after building our own agent did I really start to understand what agents are. I had been reading about agents for months, but something fundamentally different clicked in my brain when I saw our own agent interact with the user, interpret the intent, and iteratively choose the right tools to achieve the goal. This emergent behavior is hard to appreciate without experiencing it so directly.

Agents have become common by now, but I still recommend everyone working in this space to build one from scratch. It sounds almost esoteric, but you have to touch it yourself to really feel what's going on. The moment your agent does something you didn't explicitly program it to do is when the line between deterministic code execution and something that's conscious and thinking starts to blur. You know it's an illusion, but it's a remarkably convincing one.

Tags: linkedin, coding-agents, ai, agentic-ai, ren

An AI State of the Union: We've Passed the Inflection Point, Dark Factories Are Coming, and Automation Timelines | Simon Willison (Lenny's Podcast)

2026-04-13T11:16:31.378778+00:00

An AI State of the Union: We've Passed the Inflection Point, Dark Factories Are Coming, and Automation Timelines | Simon Willison (Lenny's Podcast)

As someone who follows Simon Willison closely, this interview didn't contain many new ideas for me. But I would still strongly recommend it to anyone who doesn't follow him as closely, because it covers many of his main beliefs and insights in one place, and I consider many of them to be very strong.

The main thing I picked up from the podcast is StrongDM's work on the "dark factory", which Simon covered on his own blog and is definitely worth reading. I have heard the "dark (software) factory" term before and didn't quite understand it, but it is an analogy for manufacturing facilities so automated that the lights are literally turned off because no humans are operating the factory. The core idea of this movement is building development factory in which specs and scenarios drive coding agents to write and review (!) code without humans completely autonomously.

Other things I picked up from this podcast episode: the distinction between agentic engineering and vibe coding, using "red/green TDD" as a micro-prompt to improve coding agent output, and the strategy of building "digital twins" of external services for testing by giving a coding agent just public API docs.

Tags: ai, coding-agents, software-engineering, podcast

Minions: Stripe's One-Shot, End-to-End Coding Agents (Stripe Engineering Blog)

2026-04-12T08:10:38+00:00

Minions: Stripe's One-Shot, End-to-End Coding Agents (Stripe Engineering Blog)

A fascinating two-part blog post (Part 1, Part 2) from Stripe's engineering team on how they built their internal coding agents, which they call minions. What first stood out to me is how remarkably well-written these posts are. At a time when many engineering blog posts read as if they were mostly AI-generated, a piece with this much clarity is a strong signal of Stripe's commitment to quality in everything they do.

Stripe's minions are fully unattended agents built for one-shot coding tasks. An engineer can kick off a minion from Slack, and it produces a pull request that passes CI and is ready for review, with no human interaction in between. Over a thousand PRs merged per week at Stripe are entirely minion-produced.

As someone working at a startup, I find it fascinating to see this level of investment in what I've been calling "engineering the machine that writes the code". What makes this particularly notable is that Stripe is operating in a very high-stakes environment with high demands on reliability and robustness.

Stripe's system is complex, far beyond what a startup with limited resources could build internally. But what makes it interesting is that minions were built on top of infrastructure Stripe had already developed for human engineers:

We built out devboxes for the needs of human engineers, long before LLM coding agents existed. As it turns out, parallelism, predictability, and isolation were also very desirable properties as well for Stripe engineers to be able to work most effectively. What's good for humans is good for agents, and building on this infrastructural primitive paid dividends as a natural home for LLM agents.

The most interesting technical concept in the post is what they call "blueprints." Anthropic's blog post on building effective agents distinguishes between workflows (fixed execution graphs of LLM calls) and agents (loops with tools). Blueprints are a hybrid: a state machine that interleaves agentic nodes (LLMs or agents can work non-deterministically) with deterministic nodes (e.g., linters, git operations, test runners) that don't invoke an LLM at all. The idea is to put the LLMs in a contained box for each subtask, constraining its tools and context as needed, and guarantee that certain steps always happen correctly.

A few other things stuck with me. Stripe built a centralized internal MCP server, called Toolshed, which hosts nearly 500 tools spanning internal systems and SaaS platforms, and to which all of Stripe's agents can connect. Stripe's engineers also make extensive use of agent rule files that are conditionally applied based on which subdirectory or code files the agent is working in. These rules dynamically provide their coding agents with the necessary context, rather than loading a massive global ruleset, e.g., from a CLAUDE.md file, that would bloat the context window. Notably, all coding at Stripe, whether by humans or agents, happens in sandboxed cloud developer environments called devboxes, which can be spun up in about 10 seconds with all necessary dependencies preloaded.

Our backend engineer, Jan Giacomelli, was inspired by this blog post and just last week built our own internal version: a sandboxed coding agent that one-shots tasks and creates pull requests, which we're calling a "renion." I'm very curious to try it and see where this goes. I'm a strong believer that professional engineering organizations need to engineer their own internal AI systems to some extent, because each company's development environment and requirements are different enough that general tools can't provide maximum value on their own. I'm also curious about how we can bring the "blueprint" pattern of wrapping agents in deterministic workflows to other parts of the AI-powered business logic in our backend.

Tags: coding-agents, ai, software-engineering, ren, agentic-ai

Note on 8th April 2026

2026-04-08T11:22:50+00:00

Vibe coding and agentic engineering are two terms that come up constantly right now, but are often confused. Both describe building software where AI writes most or all of the code, but there is a fundamental difference.

Vibe coding, as Andrej Karpathy originally defined it, means building something without looking at the code. You describe what you want, see if it works, and iterate on the vibes without even intending to read or understand the code.

Agentic engineering is something very different. It's not about writing code with agents, but about engineering a system that uses agents to write code that meets specifications and is well-tested. The resulting code needs to match existing patterns in the codebase, adhere to the company's engineering principles, and pass an extensive test suite.

From what I'm seeing, the best engineering organizations right now are spending a lot of their time on building the machine that writes the code. In practice, this means aligning conventions and patterns throughout the codebase, improving the feedback loop for coding agents, automating review processes, tightening deployment practices, and wrapping LLMs in deterministic workflows.

At Ren Systems, our team has been putting a lot of work into this. Together with Jan Giacomelli and Giorgio Nicoli, we have been working on things like fully typing our test suite, writing custom skills for zero-downtime database migrations, implementing custom linters that deterministically check our desired coding patterns and provide feedback to coding agents when they are violated, and building an AI-assisted code review flow trained on our own review comments from many thousands of past code reviews.

All of this was implemented much more quickly with the help of coding agents, but it required deliberate engineering. As a result, we can now build on our codebase with agents much more quickly, and it is in a much better overall state than it was one year ago.

There is absolutely a place for vibe coding, even for professional developers. It's great for prototyping, exploring ideas, and internal tooling, where the stakes are low, and you're the only person who gets hurt if it has bugs. But this isn't what companies employ professional software engineers for, and that job isn't going away with AI.

If you define software engineering as typing out code by hand, then yes, that job is being replaced. But if you define it as engineering the machine that builds production-grade code, software engineers with these skills are going to be more valuable than ever.

Tags: linkedin, coding-agents, ai, software-engineering

An Interview with Asymco's Horace Dediu About Apple at 50 (Stratechery)

2026-04-07T06:58:51.674499+00:00

An Interview with Asymco's Horace Dediu About Apple at 50 (Stratechery)

Horace Dediu, who has worked closely with Clay Christensen, makes an interesting point in this interview about why, in his view, AI is a sustaining technology for the big incumbents rather than a disruption. Google, Microsoft, Meta, and Amazon are all pouring hundreds of billions into AI instead of being repelled by it and thinking "this isn't for us" or "our customers don't want this". So obviously they all think that AI is a sustaining technology for them.

The interesting exception is Apple, which is the only major tech company that, unlike most others, isn't sprinting to spend as much as possible on AI infrastructure. In Horace Dediu's view, Apple has always positioned itself at the interface between humans and computers, and thinks that the current AI interface (essentially a command line for natural language) isn't where they'd want to compete. Whether Apple is making a smart strategic bet by waiting for the technology to commoditize and then controlling the device and interface layer, or whether they're the one incumbent that actually is being disrupted, is the open question.

Tags: stratechery, ai, podcast, apple

Quoting Horace Dediu

2026-04-04T07:30:55.264140+00:00

When you learn engineering, you learn first science and you learn basic physics and chemistry, you learn mathematics, and you learn that things are axiomatic and things are built on top of each other so that there's consistency all the way up the stack.

[...]

But then when you go to a business school, you realize the way I put it retrospectively is that it's like equivalent of sitting around a campfire telling stories to one another.

— Horace Dediu, From Ben Thompson's interview with Asymco's Horace Dediu on Apple at 50.

Tags: stratechery, podcast

Note on 1st April 2026

2026-04-01T10:07:00+00:00

GitLab had 38 public incidents in Q1 2026, up 36% from Q1 2025. March alone had 20, almost one per working day. They even have an active incident as I write this on April 1st, and, sadly, this isn't even an April Fool's joke.

Looking at their status page history, there is a consistent pattern: a code change gets deployed, breaks production, the team identifies the bad MR, and has to revert it. CI/CD pipelines are the most frequently affected component, and the incidents are noticeably impacting our development workflow.

I get it, these things happen to us too. Having to revert bad code changes every now and then is part of working on a complex real-world production system. But what makes this difficult to accept isn't just the trend itself. GitHub's status page is arguably even worse, but at least GitHub appears to be consistently ahead on AI features. GitLab appears to be getting the worst of both worlds: increasing instability without access to the latest capabilities, such as the Claude Code integration. If you're going to break things, you should at least be shipping something valuable and exciting regularly.

I think there's a real question here about whether the industry has crossed a threshold where the speed of shipping with AI tools has outpaced existing deployment systems' ability to catch problems before they reach production. The irony is that GitLab's product is literally about helping teams ship code safely. If the company that builds the CI/CD platform is struggling with this, it might be an early signal that everyone needs to rethink how deployment safety scales when AI dramatically increases the pace of change.

Tags: linkedin, devops, ai, coding-agents

From skeptic to true believer: How OpenClaw changed my life (Lenny's Podcast)

2026-03-30T07:18:50.437529+00:00

From skeptic to true believer: How OpenClaw changed my life (Lenny's Podcast)

This is the podcast on OpenClaw I listened to this weekend after the Karpathy episode. I think I understood the appeal of a proactive system that works independently from the start, but I haven't bought into the hype so far. However, I feel that these two podcasts together have started changing my mind—not because of a single capability, but because of the apparent emergent behavior that arises once a Claw has context about you and access to real tools. Agents, as we typically think of them, are reactive: you give them a task, and they execute what they are asked to do. But I now fully realize that Claws are persistent and have personalities of their own. They run in the background, build up memory over time, check in on a schedule, and start acting on your behalf without being prompted.

Claire Vo, who was apparently a big OpenClaw skeptic when it launched, now manages nine agents across multiple Mac Minis for both personal and professional use.

The first thing that stood out to me in this conversation is how well the onboarding is apparently done. Instead of structured forms and settings pages, your Claw just asks you who it is and who you are, and you figure it out together through conversation, as if you hired a new employee. The second thing I learned is how well-crafted the default behavior of the Claw appears to be. The Claw's behavior emerges from some simple markdown files ("soul document"), but the defaults are apparently surprisingly thoughtful and lead to a really pleasant behavior. It sounds like this is something anyone working in product right now should experience firsthand.

I'm now genuinely intrigued to try it myself. To really get the full experience, you clearly need to run it on a separate machine, both for security and because you don't want to think about whether your laptop is online. I should really try setting one up on my Raspberry Pi, or just buy a Mac Mini for it. The other thing I don't really have yet is a clear use case for a Claw. I wonder whether I should try to come up with one before getting started, or whether this is something you just have to go for, because the onboarding seems good enough that the use case will emerge during the setup process.

Tags: ai, agentic-ai, podcast, product, claws

Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI (No Priors Podcast)

2026-03-29T07:50:40+00:00

Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI (No Priors Podcast)

Andrej Karpathy is always worth listening to because he has the time to experiment and tinker with the latest developments in a way that most people working at companies don't. He effectively lives a few months in the future compared to the rest of us.

Two things stuck with me from this conversation. First, Karpathy frames Claws (from OpenClaw) as another layer of the AI stack: LLMs → Agents → Claws. I have never actually set up a Claw yet, but the persistent memory architecture and how "your Claw" gets to know you over time are things I want to experiment with, as this is directly relevant to what we're working on at Ren as the product becomes more agentic.

Second, his work on AutoResearch. We've discussed the concept internally at Ren multiple times over the past few months, but never found the time to actually try it. We have a concrete problem that would lend itself well to this approach: building a more efficient multi-label classifier. We currently use a relatively heavy model for it, we have abundant training data, and the objective is clear (maximize precision/recall/F1 for a given latency budget). We could just let an AutoResearch system loose on this task. What I'm missing is knowing how to set up a sandbox that's safe enough but has sufficient permissions for the agent to carry out the research on its own. The meta task would then be similar to Claws: build a system in a few markdown files that defines how the agent approaches and documents its research.

Tags: ai, podcast, agentic-ai, coding-agents, ren, claws

Note on 26th March 2026

2026-03-26T13:09:00+00:00

One big danger of AI tools in the workplace is how much easier they make it to pursue side quests.

Side quests used to be self-regulating. You'd think "wouldn't it be cool to try this?", estimate the effort at half a day, and move on. Now you tell yourself "it takes only five minutes", and decide to just go for it out of curiosity, but it's of course never just five minutes.

The result is that you can get to the end of a day having completed ten low-priority items on your todo list very efficiently, while making zero progress on the one high-priority item that needed your attention most. Your only real defense is knowing what the highest-priority item on your list is and holding yourself accountable for making progress on it.

Tags: linkedin, ai, coding-agents

Quoting Paul Graham

2026-03-21T12:18:28+00:00

How do I keep a sane mind? Well, it's important to be married to someone sane.

I mean, it sounds like a strange compliment to describe someone as sane, but the older you are, the more you realize that's actually a fairly unique quality. And so if you're married to someone sane, and as long as you don't both freak out at the same time, then there's always someone to calm the other one down. Right?

That's the advantage. So I recommend to everyone, marry someone sane.

— Paul Graham, The Social Radars podcast interview with Paul Graham, Y Combinator founder.

Tags: podcast

Quoting Jensen Huang

2026-03-18T07:55:21.164863+00:00

The reason why Nvidia can move so fast is because we always have a unifying theory for the company, which is my job [as the CEO of the company]. I need to come up with a unifying theory for what's important and why things connect together and how they connect together and then create an organization, an organism that's really, really good at delivering on that unifying theory.

— Jensen Huang, Stratechery interview with Nvidia CEO Jensen Huang

Tags: stratechery, ai, management

Note on 17th March 2026

2026-03-17T13:00:00+00:00

Em-dashes have become a telltale sign of AI-generated text, which has created some funny side effects.

I now frequently see correct and incorrect usage of hyphens and dashes mixed in the same piece of text. This happens when someone revises a piece of AI-generated text but doesn't understand the difference between hyphens, en-dashes, and em-dashes.

It's also pretty obvious that some people have started find-replacing all em-dashes with single hyphens (-) or double hyphens (--) to hide that they used AI. Which, of course, is its own tell.

But this still doesn't hide the most obvious giveaway, which isn't the em-dash itself. LLMs almost always put spaces around em-dashes: word — word instead of word—word. My guess is that models are heavily trained on news data, where the AP style guide, most commonly used in journalism, recommends spaces around em-dashes. Books and most professional writing use them without spaces.

So if you're taking your writing seriously, there's no way around learning how to use hyphens, en-dashes, and em-dashes correctly. I wrote a short post explaining the differences on my blog: Hyphens and Dashes

Tags: linkedin, writing, ai

Hyphens and Dashes

2026-03-16T20:57:56.802522+00:00

With AI tools becoming widely adopted, em-dashes have become a telltale sign of AI-generated content. Claude and ChatGPT seem to love them, which is unfortunate, because it's now made everyone suspicious of a perfectly good punctuation mark. But this doesn't change the fact that most people (non-native and native English speakers alike) never learned or understood the difference between hyphens (-), en-dashes (–), and em-dashes (—) in the first place. I now frequently see correct usage (AI-generated) and incorrect usage mixed in the same document, which happens when people do not understand the difference and revise a piece of AI-generated text.

The differences may seem subtle at first, but incorrect usage is unprofessional and can change the meaning of a sentence. Once you understand the distinctions, it's hard to unsee when people use them incorrectly. There is also a trust dimension to this. As a reader, if someone didn't put in the effort to write correctly, how can I trust that they put in the necessary effort into the thinking behind what they wrote?

I first learned about these distinctions during my PhD when I started writing scientific papers. David Norris, my supervisor and a great writer, had a habit of returning our manuscripts with red ink covering almost every page but rarely any explanations. You were expected to figure out what was wrong. Freddy Rabouw, then a postdoc in our group, took the feedback he received seriously, dug into the rules, and created a short presentation for the whole lab. The rules are simple once you learn them, but nobody teaches them.

When to Use Which

The best single resource I know is Typography for Lawyers. Here's my condensed version with tech-world examples.

Hyphens (-) connect compound words and phrasal adjectives: AI-generated text, real-time processing, third-party API. The logic is that when two or more words work together to modify a noun, you hyphenate them before the noun. "An AI-generated response" but "the response was AI generated." One exception: don't hyphenate when an adverb ending in -ly does the work. It's highly scalable infrastructure, not highly-scalable infrastructure.

Missing hyphens can create genuine ambiguity. Consider "small business software": is it software for small businesses, or small software for businesses? With a hyphen, small-business software is clearly software for small businesses. Or "new user onboarding": is the onboarding new, or are the users new? New-user onboarding removes this ambiguity. In AI contexts, "few shot prompts" could mean a small number of shot prompts, whatever those are; few-shot prompts makes clear we're talking about prompts for few-shot learning.

En-dashes (–) are slightly wider and are used in two cases. First, they mark ranges: 2020–2023, pages 50–75, chapters 3–7. (But if you start with "from," use "to" instead: from 2020 to 2023, not from 2020–2023.) Second, they denote connections or contrasts: product–market fit, the London–New York route, key–value cache.

Em-dashes (—) create a break when commas are too weak, but colons or semicolons feel too heavy. Used well, they add rhythm and emphasis, but overused, they make text feel breathless, repetitive, and AI-generated (yes, this is a compound modifier too!).

A Bonus Tip for the AI Age

Maybe incorrect usage of hyphens is now a charming sign that someone actually wrote it themselves. Some people have caught on and started find-replacing all em-dashes with single hyphens (-) or double hyphens (--) to hide that they used AI, which is its own tell. But this still doesn't hide the most obvious giveaway, which isn't the em-dash itself. LLMs almost always put spaces around em-dashes: "word — word" instead of "word—word."

My guess is the models are overtrained on news data, and the AP style guide, which is most commonly used in journalism, recommends spaces around the em-dash. Books and most professional writing use them without spaces.

I'd been avoiding em-dashes entirely for the past year because of this association—noticing the space pattern finally lets me reclaim them for my own writing.

Tags: writing

Quoting Jenny Wen

2026-03-09T06:50:57.693883+00:00

I think actually what being an IC across this past year has taught me, is that it actually just gave me a lot of skills that I don't think I would've gained if I was just managing throughout this year.

— Jenny Wen, Lenny's Podcast interview. Wen left a director role at Figma to return to IC design work at Anthropic.

Tags: podcast, management, design

Context Windows Are Limited by Atoms, Not Bits

2026-03-01T11:45:00+00:00

There is a popular narrative in tech right now: AI progress is exponential, context windows will grow to infinity, and all vertical AI products will soon be replaced by general-purpose AI that can use all the context of your entire business. This implies that the big players like Anthropic, OpenAI, and Google, with their general-purpose agents like Claude Cowork, ChatGPT, or Gemini, will subsume all software.

I don’t think this will happen. While advertised context windows have grown to 1M or even 10M tokens, there’s a widening gap between advertised capacity and what models can reliably use. Effective context window sizes have been saturating over the past 6–12 months and remain at 200K–1M tokens for most tasks.

The reason is physics. Most people talk only about model capability, but there are actually three things to AI: atoms (the hardware), bits (the logic), and power (the energy required to move electrons through hardware to make computations). Recent breakthroughs have been almost entirely in bits, which means that AI progress in general and context window size specifically will be constrained by atoms and power.

Why Attention Is Hard to Scale

Attention in transformer models have been the basis of all AI progress in recent years.¹ However, the complexity of attention scales quadratically with the context window size, with perhaps surprising implications for memory requirements.

A 1M-token context window corresponds to roughly 5MB of plain text, which isn’t much for many tasks. However, each token doesn’t require only storing a single number. Depending on the embedding size, each token requires storing thousands of numbers across many different layers of the model. Therefore, the key–value cache for a frontier transformer model to run inference on a 1M-token context window easily requires tens or even hundreds of GB of working memory, which is many orders of magnitude more than the raw text.

Many tricks to extend context windows do so by avoiding “true” attention, where each token attends to every other token, which comes with substantial performance costs. Alternative architectures like state-space models promise sub-quadratic scaling, but none have matched current transformer-based frontier models so far. The accuracy tradeoff is likely fundamental, and there is no free lunch.

This means that, in practice, effective context length rarely exceeds half of the advertised maximum. Models exhibit a U-shaped performance curve, performing best on information at the beginning and end while degrading on context in the middle.² And even when models retrieve information perfectly, longer inputs still hurt reasoning.³ Without major breakthroughs in memory technology and power infrastructure, usable context windows are unlikely to grow substantially in the coming years.

Some numbers may help to make this concrete. A frontier model like Opus 4.6 or GPT-5.3 already requires hundreds of gigabytes just to store the weights. NVIDIA’s next-generation GPU, the Rubin R100, which should start shipping in the second half of this year, will have 288 GB of high-bandwidth memory—the same amount as the Blackwell B300 GPU, which started shipping in the second half of 2025. A single long-context session consumes most of the available memory. Therefore, production context windows have expanded on paper, but the effective ceiling at which models reason reliably has barely moved.

High-bandwidth memory and power, not compute, are currently the hard constraints. Memory is expensive, physically difficult to manufacture, and supply-constrained in the coming years.⁴ On the power side, data center electrical capacity in the US is nearly maxed out, with utility connection wait times exceeding 3–5 years.⁵ Increasing AI demand will make both constraints even tighter.

Products as Context Operating Systems

Andrej Karpathy put it well: you can think of the LLM as a CPU and the context window as RAM, which means you need something like an operating system to select and manage context.^6, ⁷ Therefore, context engineering, the practice of selecting the right data for the model's context window depending on the task, will remain essential.

This doesn’t contradict the bitter lesson.⁸ Sutton’s argument is that general methods that leverage more computation eventually outperform all hand-crafted solutions and clever engineering. This is true for algorithms and training, where scaling has consistently won, but the bitter lesson is about what’s theoretically optimal, not what’s deployable given physical constraints. Power grids and memory fabs don’t follow exponential curves. The bitter lesson describes where we’ll end up eventually, but it doesn’t tell us how long it will take to get there.

There is a large gap between the capabilities of current LLMs can do and the value delivered by current products. This product overhang is real and will remain an opportunity in the coming years. Not everything being built now will be obsolete when better models arrive, because physical constraints will limit how quickly they can be developed and deployed.

Context windows saturating might be where general AI progress stalls this time, with breakthroughs likely taking years. In the meantime, products that effectively serve as context operating systems for users and do this well will become tremendously valuable.

Tags: ai, llms, bitter-lesson, product, saas