If you're still using BERT-style token classification models for NER tagging in production, you should probably reevaluate.
Last summer, we replaced our token classification model with Google's Gemini 2.5 Flash Lite for NER tagging people, companies, and locations on millions of news articles per day. At first, it felt wrong and overkill to replace a well-established, standard approach with a generative model. However, on our own evaluation datasets, the LLM outperformed every BERT model we had implemented previously because it brings so much more contextual understanding to the task.
There are a few obvious advantages to using LLMs for NER tagging. For example, LLMs can easily handle text like "Michael and Jennifer Smith" and correctly extract both "Michael Smith" and "Jennifer Smith" as separate people. They are also much better at handling formatting issues and messy edge cases you inevitably encounter in real-world data at scale.
Deployment is also dramatically simpler: instead of managing model serving infrastructure, you're calling an inference API that you can parallelize and scale easily. Additionally, you automatically benefit from LLMs getting better and cheaper over time without changing anything on your end, provided that you have a solid eval dataset.
We're now processing close to 1B input tokens and producing 100M output tokens per day on this pipeline alone. The most popular pre-trained NER models on Hugging Face are still downloaded millions of times per month, which tells me that structured text extraction with LLMs is one of the most underrated applications right now.

I think that everyone working in AI should build their own agent from scratch, at least once. Not because it's hard, but because it's surprisingly easy, which is precisely the point.
Exactly one year ago today, I read Thorsten Ball's How to Build an Agent, or: The Emperor Has No Clothes. In this blog post, which is the single piece of text that most influenced me last year, he shows how to build a fully functional coding agent from scratch in under 400 lines of code.
Shortly thereafter, we did a one-day hackathon at Ren Systems, and by the end of the day, we had our own working agent running in the terminal. We didn't even use Anthropic's Claude Code back then, so we actually typed the complete agent code manually.
Only after building our own agent did I really start to understand what agents are. I had been reading about agents for months, but something fundamentally different clicked in my brain when I saw our own agent interact with the user, interpret the intent, and iteratively choose the right tools to achieve the goal. This emergent behavior is hard to appreciate without experiencing it so directly.
Agents have become common by now, but I still recommend everyone working in this space to build one from scratch. It sounds almost esoteric, but you have to touch it yourself to really feel what's going on. The moment your agent does something you didn't explicitly program it to do is when the line between deterministic code execution and something that's conscious and thinking starts to blur. You know it's an illusion, but it's a remarkably convincing one.
Minions: Stripe’s One-Shot, End-to-End Coding Agents (Stripe Engineering Blog). A fascinating two-part blog post (Part 1, Part 2) from Stripe's engineering team on how they built their internal coding agents, which they call minions. What first stood out to me is how remarkably well-written these posts are. At a time when many engineering blog posts read as if they were mostly AI-generated, a piece with this much clarity is a strong signal of Stripe's commitment to quality in everything they do.
Stripe's minions are fully unattended agents built for one-shot coding tasks. An engineer can kick off a minion from Slack, and it produces a pull request that passes CI and is ready for review, with no human interaction in between. Over a thousand PRs merged per week at Stripe are entirely minion-produced.
As someone working at a startup, I find it fascinating to see this level of investment in what I've been calling "engineering the machine that writes the code". What makes this particularly notable is that Stripe is operating in a very high-stakes environment with high demands on reliability and robustness.
Stripe's system is complex, far beyond what a startup with limited resources could build internally. But what makes it interesting is that minions were built on top of infrastructure Stripe had already developed for human engineers:
We built out devboxes for the needs of human engineers, long before LLM coding agents existed. As it turns out, parallelism, predictability, and isolation were also very desirable properties as well for Stripe engineers to be able to work most effectively. What's good for humans is good for agents, and building on this infrastructural primitive paid dividends as a natural home for LLM agents.
The most interesting technical concept in the post is what they call "blueprints." Anthropic's blog post on building effective agents distinguishes between workflows (fixed execution graphs of LLM calls) and agents (loops with tools). Blueprints are a hybrid: a state machine that interleaves agentic nodes (LLMs or agents can work non-deterministically) with deterministic nodes (e.g., linters, git operations, test runners) that don't invoke an LLM at all. The idea is to put the LLMs in a contained box for each subtask, constraining its tools and context as needed, and guarantee that certain steps always happen correctly.
A few other things stuck with me. Stripe built a centralized internal MCP server, called Toolshed, which hosts nearly 500 tools spanning internal systems and SaaS platforms, and to which all of Stripe's agents can connect. Stripe's engineers also make extensive use of agent rule files that are conditionally applied based on which subdirectory or code files the agent is working in. These rules dynamically provide their coding agents with the necessary context, rather than loading a massive global ruleset, e.g., from a CLAUDE.md file, that would bloat the context window. Notably, all coding at Stripe, whether by humans or agents, happens in sandboxed cloud developer environments called devboxes, which can be spun up in about 10 seconds with all necessary dependencies preloaded.
Our backend engineer, Jan Giacomelli, was inspired by this blog post and just last week built our own internal version: a sandboxed coding agent that one-shots tasks and creates pull requests, which we're calling a "renion." I'm very curious to try it and see where this goes. I'm a strong believer that professional engineering organizations need to engineer their own internal AI systems to some extent, because each company's development environment and requirements are different enough that general tools can't provide maximum value on their own. I'm also curious about how we can bring the "blueprint" pattern of wrapping agents in deterministic workflows to other parts of the AI-powered business logic in our backend.
Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI (No Priors Podcast). Andrej Karpathy is always worth listening to because he has the time to experiment and tinker with the latest developments in a way that most people working at companies don't. He effectively lives a few months in the future compared to the rest of us.
Two things stuck with me from this conversation. First, Karpathy frames Claws (from OpenClaw) as another layer of the AI stack: LLMs → Agents → Claws. I have never actually set up a Claw yet, but the persistent memory architecture and how "your Claw" gets to know you over time are things I want to experiment with, as this is directly relevant to what we're working on at Ren as the product becomes more agentic.
Second, his work on AutoResearch. We've discussed the concept internally at Ren multiple times over the past few months, but never found the time to actually try it. We have a concrete problem that would lend itself well to this approach: building a more efficient multi-label classifier. We currently use a relatively heavy model for it, we have abundant training data, and the objective is clear (maximize precision/recall/F1 for a given latency budget). We could just let an AutoResearch system loose on this task. What I'm missing is knowing how to set up a sandbox that's safe enough but has sufficient permissions for the agent to carry out the research on its own. The meta task would then be similar to Claws: build a system in a few markdown files that defines how the agent approaches and documents its research.
OpenClaw took the internet by storm last week. It's obviously a security nightmare, but its viral success confirms something I've believed for a while: most AI products today are fundamentally limited by requiring too much agency from their users.
There's real demand for AI that works for you, not just with you. A system that monitors, anticipates, and acts proactively instead of waiting for your next prompt.
At Ren Systems, we've been building exactly this kind of product: AI that works on behalf of the user and surfaces what matters before they ask for it. There's something almost magical when an AI system works this way.
Proactive AI is where the next value unlock lies. But building it in a way that's secure and compliant isn't something you can prompt out of Claude Code overnight.