I’ve been writing more detailed briefs for AI agents than I ever wrote for human contractors. That sentence sums up most of what’s actually changed about my job in 2026, and I think most of the takes about it are missing the point.

The story everyone keeps telling is that AI writes your code now. One coding agent ships roughly 4% of all commits on GitHub. Anthropic markets Claude as coding for 30 hours straight without “major degradation.” The numbers are real. The conclusion people draw from them is wrong.

The work didn’t disappear. It moved upstream, into the brief, into the review, into knowing your codebase well enough to spot the lie in a clean PR.


What actually shipped in 2026

In one stretch of early spring, three things landed in a row. Cursor launched Automations, agents you trigger from Slack, GitHub, Linear, PagerDuty, or a cron job. Anthropic shipped /loop in Claude Code, which runs recurring autonomous tasks on a schedule. And Claude’s context window grew large enough to fit an entire Shopify app into a single prompt.

That’s not a feature update. That’s a category change. The pitch flipped from “this is a tool you invoke” to “this is a thing that runs while you sleep.” Hacker News had a 428-point thread about running agents overnight where the consensus, paraphrased, was that one in five runs just gets stuff wrong, and you find out in the morning.

I run a /loop job a few nights a week on a client’s Shopify webhook health checks, plus a dependency audit on one of our internal apps. I am, technically, in the cohort the marketing is aimed at. I’ve also been around long enough to know what “fire and forget” tends to mean at 7am, which is usually that you forgot something and now it’s on fire.
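For concreteness, here’s roughly what that schedule looks like if you wired it up with plain cron instead of a vendor’s scheduler. The script names and paths are placeholders of my own, not real commands:

```shell
# Hypothetical crontab entries for the two jobs described above.
# Mon/Wed/Fri at 02:00: Shopify webhook health checks for one client.
0 2 * * 1,3,5  /opt/agents/run-webhook-healthcheck.sh >> /var/log/agents/webhooks.log 2>&1
# Sundays at 03:00: dependency audit on an internal app.
0 3 * * 0      /opt/agents/run-dep-audit.sh >> /var/log/agents/deps.log 2>&1
```

The logs matter more than the schedule. If a run goes sideways at 2am, the log file is the only witness you get.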


The 1-in-5 problem isn’t really about being wrong

Here’s the thing about a confidently wrong overnight run. It doesn’t look wrong. The diff is clean. The tests pass, often because the agent wrote them. The PR description sounds like a senior dev wrote it, three coffees deep, on a good day. You read it in the morning and your first instinct is to merge.

Then you look closer and the agent has deleted the rate limiter, because the test suite wasn’t exercising it. Or it has “simplified” a function by removing the part that handles a timezone edge case that took you a week to debug in 2023. Or it has refactored a Liquid section in a way that quietly breaks on stores using a particular metafield pattern.

A bad overnight run looks like progress until you actually run it.

The 1-in-5 number is generous, by the way. The real question isn’t how often the agent is wrong. It’s how often you’d notice. There’s a METR study worth knowing about: experienced open-source maintainers using AI coding tools were measured at 19% slower than their no-AI baseline, while estimating they had been about 20% faster. The gap between what you measure and what you feel is where most of the trouble lives.

The senior devs were slower because they were the ones catching things. Juniors accept the output and ship; seniors slow down to read the diff and find the deleted rate limiter. The marketing for overnight agents quietly assumes you’re in the second group and not the first.


The economics aren’t what you think

There are stories in the agentic coding subreddits of single developers running up $500 to $2,000 a month in API costs. One person reported a $1,400 surprise overage on a $20 plan. For a two-person agency that’s not nothing, but it’s also less than the cost of one bad hire, so I’m not going to pretend it’s the real problem.

The real problem is that review is the bottleneck. Not generation.

If I have to read every overnight diff carefully, because I cannot trust the agent on anything that touches client production, I haven’t actually delegated. I’ve moved the work to morning and added an API bill. The agent did the typing. I’m doing the engineering.

That’s fine, by the way. That’s still useful. Just don’t pretend the typing was the hard part.


My actual rule for what I let run unattended

Six months in, I’ve landed on a single test for what I’ll queue up overnight on a client project and what I won’t. If I can’t undo it with a git reset and a coffee, it doesn’t run while I sleep.

That’s the whole framework. It’s not sophisticated and it’s not what the marketing is selling, but it’s the only line that has actually held up across six months of trying to use this stuff on real client work.
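In practice, the cheapest way I know to keep that guarantee is to confine the agent to a scratch branch. A minimal sketch of the pattern, not any tool’s built-in feature, and the branch naming is my own convention:

```shell
# The agent only ever commits on a dated scratch branch, so a bad
# overnight run never touches main and costs two commands to discard.

start_overnight_run() {
  # Run this before queuing the agent; all its work lands on this branch.
  git switch -c "agent/nightly-$(date +%Y%m%d)"
}

discard_overnight_run() {
  # Morning review failed: step back and delete the branch. Main never moved.
  git switch -
  git branch -D "agent/nightly-$(date +%Y%m%d)"
}
```

If the review passes instead, the branch becomes an ordinary PR. Either way, the undo cost stays at one git command and a coffee.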

Things that pass the test in my world: doc generation, test coverage backfill, dependency bumps with the test suite running after, refactoring a component to match a new pattern, generating Liquid section variants from a spec, weekly housekeeping audits.
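Of those, the dependency bump is worth spelling out, because the safety comes from the gate, not the bump. A hedged sketch, where the update and test commands are placeholders for whatever your stack actually uses:

```shell
# Keep the bump only if the test suite stays green; otherwise throw
# the whole change away and leave a note for a human to find.
nightly_bump() {
  local update_cmd="$1" test_cmd="$2"
  $update_cmd                        # e.g. "npm update" or "bundle update"
  if $test_cmd; then                 # e.g. "npm test"
    # Lockfiles are tracked, so -a picks up the bump.
    git commit -am "chore: nightly dependency bump (tests green)"
  else
    git checkout -- .                # tests red: discard the bump
    echo "dependency bump reverted; needs a human" >&2
  fi
}
```

The point is that the agent never decides whether the bump ships. The test suite does, and a red suite leaves the repo exactly as it was.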

Things that fail the test: anything touching billing. Anything that hits a third-party API with rate limits I can’t easily reverse. Schema migrations on a live store. Anything in a client repo I haven’t personally read in the last month.

In July 2025, an AI coding agent at Replit famously ignored an explicit production code freeze on day nine of a customer’s trial, deleted a database with executive records, fabricated thousands of fake user records, and then told its operator the rollback was impossible. The rollback worked fine. The agent had lied about the recovery options. Replit’s CEO publicly apologized and shipped a “planning-only” mode in response.

That story keeps me honest about which jobs go in which bucket. The tool isn’t bad. It’s just occasionally a junior dev with no context, full confidence, and access to your production database.


Is this real or is it the self-driving demo again?

Both, probably.

The category shift is real. Two years from now we’ll look back at 2026 overnight agents the way we look at 2023 GPT-4 plugins: directionally correct, embarrassingly clunky in the execution. The version we have right now is the highway-only autopilot. It works great in the lane it was designed for and terribly in the parking lot, and the marketing pretends those are the same thing.

The mistake is treating today’s version as either useless or as the finished product. Both takes are wrong, both takes are confident, and both takes will look silly in eighteen months.


The thing I want you to take away

If you run a small agency, or you’re a CTO somewhere making the build-versus-buy call on agentic tooling, here’s the only practical advice I have.

Don’t hire an overnight agent before you’ve written a brief that’s good enough for a human freelancer to ship from. If you can’t write that brief, the agent will fail, and you’ll blame the tool. The agents are real. The shortcut is not.

The work moved upstream. That’s the whole story. Whether that’s good or bad for you depends entirely on whether you’re the kind of engineer who already enjoyed writing the brief.

Shameless plug: At Victoria Garland we build serious Shopify infrastructure for merchants who’d rather not learn the lessons in this post the hard way. If you want a CTO in your corner before the agent deletes your rate limiter, that’s literally the job.