You Don't Need a Breakthrough. You Need a Hundred Small Wins.
Andrej Karpathy, former AI lead at Tesla and a cofounder of OpenAI, has been building a toy: a small language model called nanochat, trainable on a single GPU. Think of it as a “teaching LLM”, the kind of thing you build to learn from and experiment with, not ship to production.
The interesting part isn’t the model itself; it’s what he built to improve it.
Karpathy created a tool he calls “autoresearch.” The concept is simple: point an AI agent at a codebase, give it a benchmark to optimize against, and let it run. The agent makes a change, runs the benchmark, checks if the score improved, keeps or discards, and moves on to the next experiment. Over and over, autonomously.
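In rough pseudocode, the loop looks something like this. This is a minimal Python sketch of the idea as described above; the function names and signatures are placeholders of mine, not Karpathy’s actual code:

```python
from typing import Any, Callable

def autoresearch_loop(
    propose_change: Callable[[], Any],      # agent drafts an edit to the codebase
    apply_change: Callable[[Any], None],    # apply it to a working copy
    revert_change: Callable[[Any], None],   # undo it if it doesn't help
    run_benchmark: Callable[[], float],     # objective score; lower is better here
    n_experiments: int = 100,
) -> float:
    best = run_benchmark()                  # measure the baseline first
    for _ in range(n_experiments):
        change = propose_change()
        apply_change(change)
        score = run_benchmark()
        if score < best:
            best = score                    # keep it: the win stacks on prior wins
        else:
            revert_change(change)           # discard it and try something else
    return best
```

The loop itself is almost trivially simple; all the leverage is in having a benchmark worth optimizing against.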
My first reaction was to dismiss this because I’m not planning on making my own LLM any time soon. But on closer inspection, I realized his application just happened to be improving his own AI training code. The approach is general-purpose and works on anything with an objectively measurable outcome. As Karpathy put it: “It’s just a recipe/idea — give it to your agent and apply to what you care about.”
He left it running for two days. The agent ran 276 experiments on its own and found about 20 changes that improved training efficiency by 11%, each one tested, measured, and verified. It also surfaced configuration issues and parameter choices that were measurably wrong, things Karpathy himself hadn’t caught. Every improvement stacked.
He open-sourced it.
Then Shopify CEO Tobi Lütke saw it on X.
Odds are you’ve bought something from one of the 5.6 million stores that run on Shopify. Last year those stores processed over $300 billion in gross merchandise volume. Shopify’s annual revenue crossed $11 billion.
Every one of those transactions touches a template engine called Liquid. Lütke originally wrote Liquid in 2005. When a customer loads any product page on any Shopify store, Liquid parses the template, executes the logic, and renders the HTML. Every product page, every collection page, every checkout screen: it all flows through Liquid.
After 20 years, Liquid is the kind of codebase that has been optimized by experienced engineers over and over again. At Shopify’s scale, even small rendering inefficiencies multiply across billions of page loads, so performance has always mattered. But the easy gains were found a long time ago. What’s left is the kind of incremental, painstaking work where smart people dig in for weeks and maybe move the needle a few percent.
Lütke saw Karpathy’s autoresearch and adapted the approach. His team pointed an AI agent at Liquid, gave it a benchmark called ThemeRunner (which tests real Shopify themes with production-like data), and let it run.
Over roughly 120 iterations, the agent produced 93 commits that survived the keep/discard filter.
The result: Liquid now parses and renders 53% faster with 61% fewer memory allocations.
Getting a 53% improvement on a 20-year-old codebase that world-class engineers have been continuously improving is no small matter. It’s not like Shopify has been hiring B-team engineers all these years. Humans have to make tradeoff decisions every day about where to put their attention. Outsourcing that attention to an LLM turns out to be a great way to make progress on something that probably every engineer was tired of looking at.
I’ve seen this problem a hundred times.
I’ve been building software products in healthcare and life sciences for a long time. There’s always some part of a product that is just barely good enough. It’s a thorn in your side, but it’s too mission-critical to start taking apart. You can’t afford the downtime and you certainly can’t afford to break it.
So you put smart people on the case. They dig in for a month and come back with something maybe 10% better. That’s not enough to change the equation, but you can’t keep throwing your best engineers at a problem forever when there’s other work to do. So you just live with it until the next pain point surfaces. Then you expend a bunch of energy to get back to a baseline you weren’t happy with in the first place.
In a less disciplined organization, someone will always propose a ground-up rewrite (nearly always a worse choice than doing nothing).
What autoresearch introduces is an automated way to stack compounding improvements: 5% here, 8% there, a configuration nobody thought to test. Run that around the clock for weeks and you make real gains, not by being smarter but by being relentless in a way humans can’t sustain and robots can.
Most organizations aren’t ready for this yet.
I get asked by clients who are getting more sophisticated in their AI adoption: “What is the next level for us to stay competitive?” I keep coming back to the same idea: deploying agents that work autonomously.
The reason is exactly what Karpathy and Lütke demonstrated. When you can run autonomously against a measurable target, small gains compound. You don’t need one brilliant optimization. You need a hundred small ones that stack, discovered by agents that don’t sleep or get discouraged by a failure rate of 90%.
The key insight from autoresearch is the part most people skip past: you need a target that can be objectively measured so the AI has a way to evaluate its own output. Karpathy had validation loss. Lütke had the ThemeRunner benchmark. Without the scores, the autonomous loop has no way to know if it’s making things better or worse. Most organizations have vibes and not actual hard measures for the things they care about. Autonomy amplifies whatever you already have. If what you have is inconsistent, you’ll get inconsistent results that pile up.
The real prerequisite is getting serious about measurement. You don’t need to wait for better models or more compute. You just need to know how to keep score.
The companies that figure out the Venn diagram intersection of “this matters to the business” and “this is measurable” are the ones that will pull ahead.