Essay

The Feature Flag Lifecycle in the Age of AI-Written Code

Jeff Dwyer · June 19, 2026

When AI agents write most of the code, flags get created far faster than humans retire them. Every guarded change adds a toggle; almost nothing takes one away. Call it flag explosion. The scarce discipline is no longer creating flags — a dashboard click or a CLI call was never the bottleneck — it's managing their full lifecycle: birth, rollout, and a deliberate, safe death. That lifecycle is the hard part, and it is what a flag system now has to be good at.

This is an argument, not a pitch. I build a feature flag platform, so I have a horse in this race and I'll name it at the end — but the parts worth your time are true no matter who built the tool: where flags help in an agent-heavy codebase, where they emphatically do not, and why the unsolved problem is retirement, not creation.

Three different things people call "AI feature flags"

Vendors blur three problems that have almost nothing in common. Keeping them apart is the prerequisite for thinking clearly.

Flagging AI-written code. An agent ships a change behind a toggle so that landing in main is decoupled from exposing the behavior to users. This is ordinary trunk-based development at a new volume. The flag guards code; the code happens to be machine-authored.
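A minimal sketch of what that guard looks like, assuming a plain dictionary stands in for the flag client. The flag name `new-banner-copy` and the function are hypothetical, made up for illustration.

```python
from typing import Mapping


def render_banner(user_name: str, flags: Mapping[str, bool]) -> str:
    """Return banner text, using the new copy only when the flag is on."""
    # Hypothetical flag name. A missing flag evaluates to False, so code that
    # has merged to main stays dark until someone turns the flag on.
    if flags.get("new-banner-copy", False):
        return f"Welcome back, {user_name}!"  # new, machine-authored path
    return f"Hello, {user_name}."  # existing path, still the default
```

Both branches live in main at once. That is the decoupling, and it is also the inventory the rest of this essay worries about.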

Flagging AI models and prompts. The thing behind the flag isn't a code path, it's a prompt, a model name, a temperature, a retrieval setting. You want to swap claude-opus for a cheaper model for 10% of traffic, or A/B two system prompts, without a deploy.
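A rough sketch of the percentage case, assuming a hash of the flag key and user id is an acceptable way to assign buckets. The flag key, the `cheaper-model` placeholder, and the temperature value are all hypothetical; a real flag SDK would own the bucketing.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_model_config(user_id: str, rollout_percent: int) -> dict:
    """Choose model settings for one user without a deploy.

    rollout_percent is the share of users (0 to 100) who get the cheaper model.
    """
    # "model-rollout" is a hypothetical flag key.
    if bucket("model-rollout", user_id) < rollout_percent:
        return {"model": "cheaper-model", "temperature": 0.2}  # placeholder name
    return {"model": "claude-opus", "temperature": 0.2}
```

Because the bucket depends only on the flag key and the user id, a given user keeps seeing the same model as the percentage rises, which keeps comparisons between the two groups clean.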

Governing AI agents at runtime. Kill switches, rate limits, and tool-use permissions for autonomous agents acting in production. This is a real and growing need, but it is a policy-and-safety problem, not a flag-lifecycle problem.

This essay is about the first two — code and prompts. The third is a different essay. Conflating them is how you end up buying a "kill switch for your AI" when what you actually have is a tech-debt problem wearing a trench coat.

The objection is correct: flags are inventory with a carrying cost

The strongest case against "just add more flags" was made years before agents could write them, and it has only gotten stronger.

The canonical treatment on Martin Fowler's site, written by Pete Hodgson, is blunt about the cost side: toggles are "inventory" that comes with a "carrying cost" and should be kept "as low as possible." Its practical advice is to put "a toggle removal task onto the team's backlog" the moment the toggle is born.

Dave Farley goes further. In his telling, feature flags are "a practice of last, or at least late, resort" — a tool you reach for when you can't find a cleaner way to decouple deployment from release, not a default you sprinkle everywhere. Branch by abstraction, keystone interfaces, and dark launching are, in his view, often better than reaching for a toggle.

Here's the thing: they're right, and that's exactly the point. The conditional logic a flag adds is real complexity. Two flags interacting is a combinatorial test surface. A flag left in the codebase after its rollout finished is dead weight every future reader has to reason about. None of that gets cheaper because a machine wrote it — and the machine removes the one natural brake the old world had: a human's reluctance to hand-create yet another toggle.
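To put a number on "combinatorial": n independent boolean flags have 2^n on/off combinations. A small sketch that enumerates them, with hypothetical flag names supplied by the caller:

```python
from itertools import product


def flag_states(flag_names: list[str]) -> list[dict[str, bool]]:
    """Enumerate every on/off combination for the given boolean flags."""
    return [
        dict(zip(flag_names, values))
        for values in product([False, True], repeat=len(flag_names))
    ]
```

Two flags give 4 states to test and ten flags give 1,024. Nobody tests all of them, which is why a flag that lingers after its rollout keeps costing something.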

So the honest framing isn't "flags are good, add more." It's: flags are inventory, agents are about to mass-produce that inventory, and the carrying cost is now the whole ballgame. The interesting question is whether that cost is payable.

Cleanup is tractable — the proof already shipped

Mechanically, the cost is payable. There is strong evidence that removing a stale flag (finding the dead branch, collapsing the conditional, deleting the now-unreachable code) can be automated at industrial scale.

Uber's Piranha is the reference point. The PLDI 2024 paper behind it, "A Lightweight Polyglot Code Transformation Language," reports that PolyglotPiranha-based tools deleted roughly 210,000 lines of stale code (and migrated another 20,000) across Uber's Android and iOS codebases, each on the order of 7.5M lines, over 1,611 pull requests. Uber has separately described using Piranha to remove around two thousand stale feature flags and their related code. An individual cleanup diff is generated in under three minutes.

Read that again: a machine can already take a flag name and emit the PR that deletes the flag and the code it guarded. The mechanical half of retirement is a solved problem with a public, peer-reviewed track record.

And yet the dedicated commercial space for this just emptied out. Gitar — a startup whose first product fully automated feature-flag removal across many languages, founded by the very Uber engineers who built Piranha — was acquired by Sonar in May 2026 and folded into a broader AI code-review platform. The standalone "we delete your stale flags" company stopped being a standalone company.

That's a signal worth sitting with. If mechanical removal were the whole problem, an automated-cleanup business would be a license to print money in exactly the moment flag volume is about to explode. The fact that it isn't tells you the value was never in the diff generation.

The unsolved part is "is this flag safe to remove?"

The mechanical edit is easy. The decision to make it is hard.

Deleting a flag is a one-way door. To walk through it safely you have to answer questions a refactoring tool cannot:

  • Has this rollout actually completed — is it at 100% for everyone, in every environment, or just in the one you're looking at?
  • Is anything still reading it — a cron job, a mobile client three versions back, a downstream service, a contractor's script?
  • Was the flag a temporary release toggle (delete it) or a permanent operational switch like a kill switch or a circuit breaker (keep it forever)? Fowler's taxonomy matters here: release toggles are transient, ops toggles are not, and you cannot tell them apart from the code alone.
  • Did the rollout succeed, or is it sitting at 100% only because someone forgot to roll it back after it half-broke?

These are lifecycle and provenance questions, not syntax questions. They require knowing the flag's history — when it was created, by whom (or what), what it was for, how its rollout actually progressed, and who still depends on it. A tool that can rewrite the AST but can't see that history is bringing a chainsaw to a decision that needs a paper trail. This is the genuinely under-solved part of flag lifecycle, and it gets worse the more flags exist.
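One way to picture the decision is as a function over the flag's recorded history rather than over its syntax. This is a sketch under stated assumptions: the `FlagFacts` fields, the 30-day quiet window, and the idea that read telemetry is available per reader are all hypothetical, not any product's schema.

```python
from dataclasses import dataclass
from enum import Enum


class ToggleKind(Enum):
    RELEASE = "release"  # temporary, should die when the rollout completes
    OPS = "ops"          # permanent kill switch or circuit breaker


@dataclass
class FlagFacts:
    key: str
    kind: ToggleKind
    rollout_percent: dict[str, int]       # environment -> current percent
    days_since_last_read: dict[str, int]  # reader name -> days since it last read the flag
    rollout_verified_healthy: bool        # someone confirmed the rollout actually succeeded


def removal_blockers(facts: FlagFacts, quiet_days: int = 30) -> list[str]:
    """Return the reasons this flag is not yet safe to delete (empty means safe)."""
    blockers = []
    if facts.kind is not ToggleKind.RELEASE:
        blockers.append("permanent operational toggle")
    if not facts.rollout_percent:
        blockers.append("no rollout data")
    for env, percent in sorted(facts.rollout_percent.items()):
        if percent < 100:
            blockers.append(f"{env} is at {percent}%")
    for reader, days in sorted(facts.days_since_last_read.items()):
        if days < quiet_days:
            blockers.append(f"{reader} read it {days} days ago")
    if not facts.rollout_verified_healthy:
        blockers.append("rollout health not confirmed")
    return blockers
```

The code is trivial. Every input to it is a fact about history that has to come from somewhere, and that is the hard part.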

The loop: a flag is born when the agent ships and dies when the rollout completes

The clean version of this future ties every flag to one lifecycle object with a defined beginning and end:

  1. An agent finishes a unit of work and ships it behind a flag at 0% — landing in main no longer means "live for users."
  2. The flag carries its own intent from birth: what it guards, that it is a temporary release toggle, and the condition that ends its life ("retire at 100%, all environments, no readers").
  3. A human (or a policy) promotes the rollout. Each step is recorded.
  4. When the exit condition is met and nothing still reads it, the flag is retired — the toggle removed and the dead branch collapsed, by the same kind of automated edit Piranha proved out.

A flag that is born with its own death condition is a flag that doesn't become permanent debt by default. That is the shape of the answer to flag explosion.
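The four steps above can be sketched as a single record that carries its exit condition from birth and logs every step. The field names, event names, and JSON layout are hypothetical; this is an illustration of the shape, not a real file format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReleaseFlag:
    key: str
    guards: str                   # what the flag protects
    exit_condition: str           # the stated death condition
    environments: dict[str, int]  # environment -> rollout percent
    history: list[dict] = field(default_factory=list)
    retired: bool = False

    @classmethod
    def born(cls, key: str, guards: str, environments: list[str], by: str) -> "ReleaseFlag":
        """Step 1 and 2: created at 0% everywhere, with its intent attached."""
        flag = cls(key, guards, "retire at 100%, all environments, no readers",
                   {env: 0 for env in environments})
        flag.history.append({"event": "created", "by": by})
        return flag

    def promote(self, env: str, percent: int, by: str) -> None:
        """Step 3: each rollout step is recorded."""
        if self.retired:
            raise ValueError("flag is already retired")
        if env not in self.environments:
            raise KeyError(env)
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.environments[env] = percent
        self.history.append({"event": "promoted", "env": env, "percent": percent, "by": by})

    def exit_condition_met(self, active_readers: int) -> bool:
        at_full = bool(self.environments) and all(p == 100 for p in self.environments.values())
        return at_full and active_readers == 0

    def retire(self, active_readers: int, by: str) -> None:
        """Step 4: retirement is allowed only once the exit condition holds."""
        if not self.exit_condition_met(active_readers):
            raise ValueError("exit condition not met")
        self.retired = True
        self.history.append({"event": "retired", "by": by})

    def to_file_text(self) -> str:
        """Serialize the record so it could be committed as a file."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

The `active_readers` count is passed in rather than computed, because knowing who still reads a flag requires telemetry this sketch does not have.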

Be precise about status, because overclaiming here would undercut the whole essay. Today you can assemble this loop from real primitives — config stored as files an agent writes in the same change as the code, a scriptable CLI an agent can call, and OpenFeature for the evaluation interface. The agent creates the flag file in its PR; CI and delivery telemetry tell you when the rollout is complete and whether anything still reads it; a scheduled job opens the retirement PR. Every piece exists. What does not exist yet — anywhere, from anyone — is a single seamless, one-click native version where the platform watches the rollout and retires the flag for you with no glue. That's the direction the industry is walking toward, mine included. It is not a button you can buy today, and anyone who says it is, is selling you the vision as the product.

Why git-native storage is the real lever here

Granting all of the above, here is the specific, defensible claim: the flag lifecycle should live in git, because the hard part of lifecycle is provenance, and git is a provenance machine.

Every question that makes "is this safe to remove?" hard is a question about history. Who created this flag? When? In the same change as which code? What did its rollout look like over time? Was the last edit a promotion or a panicked rollback?

If your flags live as rows in a vendor's database with a changelog bolted on, those answers live in a system separate from your code, reviewed (if at all) by a separate process, owned by someone else. If your flags live as files in your own git repo, every one of those answers is already there, in the form every engineer and every agent already knows how to read:

  • Creation is a commit — with an author (human or agent), a timestamp, and the guarded code in the same diff. Provenance is automatic.
  • Every rollout step is a commit — promotion from 0% to 50% to 100% is a reviewable diff, not an opaque mutation in someone's database.
  • Retirement is a commit — the flag's death is a reviewed PR with an author, sitting in the same history as its birth. The lifecycle is legible end to end.
  • git log and git blame answer the safety questions directly — and so can an agent, with no special API, because the substrate is just files.
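As a sketch of how little machinery that takes, the following reads one flag file's history with plain `git log`. The file path is whatever your repo uses; the function names and the dictionary keys are my own. It assumes `git` is on the PATH and the path is tracked.

```python
import subprocess

# %H commit hash, %an author name, %aI author date (strict ISO 8601), %s subject.
# %x1f emits a unit-separator byte so subjects containing spaces parse cleanly.
LOG_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"


def parse_log(output: str) -> list[dict]:
    """Turn `git log --format=LOG_FORMAT` output into one dict per commit."""
    entries = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        commit, author, date, subject = line.split("\x1f", 3)
        entries.append({"commit": commit, "author": author, "date": date, "subject": subject})
    return entries


def flag_history(path: str, repo: str = ".") -> list[dict]:
    """Return the commits that touched one flag file, newest first."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--follow", f"--format={LOG_FORMAT}", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_log(result.stdout)
```

The last entry of `flag_history(...)` is the flag's creation and the first is its most recent change, so "who created this, and was the last edit a promotion or a rollback?" reduces to reading two list elements.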

This is not "git-backed with a database in front." It is config as actual files in actual git, so the lifecycle of a flag is a git history — diffable, attributable, reviewable, and clonable by you at any time. That is the angle I'll defend: not that flags are good (they're inventory), and not that creation is hard (it never was), but that if the lifecycle is the hard part, the lifecycle belongs somewhere built for auditable history.

Where flags do not help — and pretending otherwise is the trap

A trustworthy essay has to mark its own boundaries. Feature flags are a deployment-decoupling tool. They are not a general safety net for everything an agent does, and treating them as one is how teams get burned.

Flags do not improve an agent's reasoning. A flag controls whether a code path runs. It has nothing to say about whether the code the agent wrote is correct, whether its plan was sound, or whether it hallucinated an API. Gating bad code at 0% keeps it away from users; it does not make the code good. That is the job of tests, review, and evaluation. The flag just buys you time to do them.

Flags do not make irreversible operations reversible. This is the dangerous one. A flag around a code path is cheap to flip back. A flag around a data migration, a destructive write, a schema change, or an external side effect (a charge, an email, a third-party API call) is a trap, because flipping the flag off does not un-send the email or un-drop the column. Anything that mutates state outside the toggle's control is not actually behind the toggle. Agents are especially prone to this — they'll happily wrap a migration in a flag and call it safe.
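A small sketch of the trap, using an in-memory `Outbox` as a stand-in for any external system. The class, the flag name, and the function are hypothetical.

```python
class Outbox:
    """Stand-in for an external system: once delivered, a message stays delivered."""

    def __init__(self):
        self.delivered = []

    def send(self, recipient: str, body: str) -> None:
        self.delivered.append((recipient, body))


def complete_order(order_id: int, email: str, flags: dict, outbox: Outbox) -> str:
    """Finish an order, sending the new receipt email only when the flag is on."""
    if flags.get("new-receipt-email", False):  # hypothetical flag name
        outbox.send(email, f"Receipt for order {order_id}")  # external side effect
    return f"order {order_id} complete"


outbox = Outbox()
flags = {"new-receipt-email": True}
complete_order(1, "a@example.com", flags, outbox)

flags["new-receipt-email"] = False  # the "rollback"
complete_order(2, "b@example.com", flags, outbox)
# The first email is still in outbox.delivered: flipping the flag stopped
# future sends, but it did nothing about the one that already went out.
```

The same holds for a dropped column or a charged card. The flag governs whether the code runs next time, not what it already did.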

Flags are not a substitute for the practices Farley named. Sometimes branch by abstraction, a keystone interface, or simply not shipping the change yet is the right call, and adding a toggle is the lazy one. The agent era multiplies flags; it does not repeal the judgment about when not to use one.

The honest summary: flags decouple deploy from release for reversible code changes. That is a genuinely valuable thing, and in an agent-heavy codebase it is valuable thousands of times a day. It is also all that they do.

The takeaway

Agents are about to create flags faster than humans have ever retired them. The objections from Fowler and Farley aren't refuted by that — they're amplified by it, because flag debt is precisely the carrying cost that explodes. The mechanical half of cleanup is already a solved, peer-reviewed problem; the unsolved half is the lifecycle decision of whether a flag is safe to retire, and that decision is a question about provenance and history. So the feature flag system for the age of AI-written code is not the one with the best dashboard for creating flags. It's the one that makes a flag's entire life — birth, rollout, and a safe, deliberate death — a reviewable, attributable, auditable history you own.

That's the bet behind Quonfig: config as files in git, so the lifecycle is a git log. The fully-automatic loop is still the direction, not a finished button — but the substrate it needs to stand on is the part you can have today.
