
Developer Diary

We let agents ship behind auto-created flags — here's the explosion curve

Jeff Dwyer · June 19, 2026

We run an autonomous agent loop on Quonfig's own monorepo. It picks the highest-priority issue off a queue, dispatches an agent to do the work, and writes the result to a daily digest. Over roughly ten weeks that loop closed 1,018 issues (out of 1,046 filed). Every config and flag change it made is a real git commit — author, diff, timestamp — in a real repository we can clone. This post is the honest version of what that did to our flag count, and what kept it from getting out of hand.

The setup: a loop that ships, not a demo

The orchestrator is a Python script (scripts/agent-loop.py) that does something deliberately boring: it reads the issue tracker, finds the most important ready task, spawns a Claude agent in the relevant sub-repo, and lets it work to completion. The agent writes the failing test, makes it pass, and runs the suite. Then it either ships or, for anything that touches a public contract, blocks until a human weighs in. When the agent finishes, the loop appends what happened to digests/YYYY-MM-DD.md.
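A minimal sketch of one iteration of that loop, to make the shape concrete. This is not the contents of scripts/agent-loop.py: the `Issue` fields, the "lower number means more important" priority convention, the `dispatch` callback, and the one-line digest format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class Issue:
    id: str
    priority: int  # assumption: lower number = more important
    ready: bool    # assumption: the tracker marks unblocked work as ready
    repo: str      # sub-repo the agent should be spawned in


def pick_next(issues):
    """Return the most important ready issue, or None if nothing is ready."""
    ready = [i for i in issues if i.ready]
    return min(ready, key=lambda i: i.priority, default=None)


def digest_path(root, day):
    """Build the digests/YYYY-MM-DD.md path for a given day."""
    return Path(root) / "digests" / f"{day.isoformat()}.md"


def run_once(issues, dispatch, root, day):
    """Pick one issue, hand it to the agent, append the outcome to the digest."""
    issue = pick_next(issues)
    if issue is None:
        return None
    # Hypothetical callback: spawns the agent and returns e.g. "shipped" or "blocked".
    outcome = dispatch(issue)
    path = digest_path(root, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- {issue.id}: {outcome}\n")
    return issue.id, outcome
```

Passing `dispatch` in as a callback keeps the selection and digest logic testable without spawning a real agent.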

There are 61 of those digest files, from 2026-04-09 through today. They reference 338 distinct issues by ID. This is not a sandbox that generates throwaway PRs to look busy — it's the loop that built the secondary delivery network, migrated six customers to a new billing meter, and standardized failover behavior across all six backend SDKs. It ships to production-adjacent code daily, and it's protected by an agent constitution that defines exactly what it can ship freely versus what it has to stop and ask about.

What we mean by flag explosion

The thesis we keep coming back to — see the flag lifecycle in the age of AI-written code — is that when agents write most of your code, they create most of your flags. A flag gets born when an agent ships work behind it, and it should die when the rollout completes. The scary version of that world is unbounded growth: thousands of half-dead flags nobody remembers, because the thing creating them never comes back to clean up.

So when we turned the loop loose on our own config, the honest question was: does the flag count run away?

The interesting answer is that the throughput exploded but the live flag count didn't. Over those ten weeks the loop shipped a large pile of guarded changes, but our packaged workspace (our-config) today holds 8 feature flags, 34 configs, and 3 dynamic log-level keys — 45 config files total. That's the live set, right now, in git. The explosion is in changes shipped, not in flags accumulated — and that gap is the whole point.

Why the count stayed small: the flag dies in the same place it was born

Here's the part I genuinely didn't appreciate until I watched it happen. Because a Quonfig flag is just a JSON file in a git repo, creating one and retiring one are the same kind of operation: a commit. There's no separate "go clean up the dashboard" chore that lives outside the codebase and never gets prioritized. The flag's whole life is in the diff stream.

A real example from the loop's work: infra.chaos.freeze-sse-flushes. The agent needed a kill-switch to fault-inject SSE delivery during failover testing. So it added the flag, with a targeting rule that turns it on for exactly one staging workspace:

{
  "key": "infra.chaos.freeze-sse-flushes",
  "type": "feature_flag",
  "valueType": "bool",
  "environments": [
    {
      "id": "staging",
      "rules": [
        {
          "criteria": [
            {
              "operator": "PROP_IS_ONE_OF",
              "propertyName": "orgWorkspace.key",
              "valueToMatch": {
                "type": "string_list",
                "value": ["qfg-migrate-e2e-20260511/synthetic-monitor"]
              }
            }
          ],
          "value": { "type": "bool", "value": true }
        },
        {
          "criteria": [{ "operator": "ALWAYS_TRUE" }],
          "value": { "type": "bool", "value": false }
        }
      ]
    }
  ]
}

Default false everywhere, true for one synthetic monitor in staging. That's a flag created to do the work, not to be admired. When the chaos run is done, retiring it is one more commit — git rm plus deleting the read in the code, reviewable together, revertable together. The flag and the change that needed it are a single artifact, which is exactly the property that makes the long tail manageable instead of terrifying.
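To make the rule order in that file concrete, here is a small evaluator sketch. It is not the Quonfig SDK. It assumes rules are checked top to bottom with the first match winning, that all criteria in a rule must match, that context properties are looked up by their flat dotted name, and that an environment with no entry falls back to a caller-supplied default.

```python
# Trimmed copy of the flag file above, as a Python dict.
FLAG = {
    "key": "infra.chaos.freeze-sse-flushes",
    "environments": [
        {
            "id": "staging",
            "rules": [
                {
                    "criteria": [
                        {
                            "operator": "PROP_IS_ONE_OF",
                            "propertyName": "orgWorkspace.key",
                            "valueToMatch": {
                                "type": "string_list",
                                "value": ["qfg-migrate-e2e-20260511/synthetic-monitor"],
                            },
                        }
                    ],
                    "value": {"type": "bool", "value": True},
                },
                {
                    "criteria": [{"operator": "ALWAYS_TRUE"}],
                    "value": {"type": "bool", "value": False},
                },
            ],
        }
    ],
}


def matches(criterion, context):
    """Check a single criterion against a flat context dict."""
    op = criterion["operator"]
    if op == "ALWAYS_TRUE":
        return True
    if op == "PROP_IS_ONE_OF":
        actual = context.get(criterion["propertyName"])
        return actual in criterion["valueToMatch"]["value"]
    raise ValueError(f"unsupported operator: {op}")


def evaluate(flag, env_id, context, default=False):
    """Return the value of the first rule whose criteria all match (assumed semantics)."""
    for env in flag["environments"]:
        if env["id"] != env_id:
            continue
        for rule in env["rules"]:
            if all(matches(c, context) for c in rule["criteria"]):
                return rule["value"]["value"]
    return default  # assumption: no entry for this environment means "use the default"
```

Under those assumptions, only a staging context whose `orgWorkspace.key` is the synthetic monitor gets `True`; every other staging context falls through to the `ALWAYS_TRUE` rule and gets `False`.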

Other flags the loop touched look the same: quonfig.delivery.warm-on-boot (a delivery-layer behavior, default on), app.billing.metering-enabled, and billing.grace-period-enabled (unblocks writes for a specific org, matched by organization ID, during a billing grace period). Each is a small, legible JSON file. None of them is a mystery row in a database.

The audit trail came for free

The thing that surprised me least but matters most: every one of these changes is auditable by default, because the storage is git. our-config has 66 commits so far, the first on 2026-03-31. I can git log the whole history of any flag — when it flipped, what the diff was, and who (or what) made the change. Most of those commits are mine, a handful are the migrator and the agent loop, and I can tell them apart at a glance. No export, no API, no "contact us for your audit log." Clone the repo and it's all there.
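As an illustration of "git log the whole history of any flag", here is a short sketch that shells out to plain `git log` and turns the result into records. Only standard git options are used (`--follow`, `--format` with the `%h`, `%an`, `%aI`, `%s` placeholders and a `%x1f` separator byte). The file path in the usage note is hypothetical, since the post does not show where flag files live inside our-config.

```python
import subprocess

SEP = "\x1f"  # unit separator; git emits this byte for %x1f in the format string


def history_cmd(path):
    """Build a git log command for one file: short hash, author, ISO date, subject."""
    return ["git", "log", "--follow", "--format=%h%x1f%an%x1f%aI%x1f%s", "--", path]


def parse_history(output):
    """Turn the separator-delimited git log output into a list of dicts."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, author, when, subject = line.split(SEP, 3)
        rows.append({"sha": sha, "author": author, "when": when, "subject": subject})
    return rows


def flag_history(repo_dir, path):
    """Run git in repo_dir and return the commit history for one config file."""
    result = subprocess.run(
        history_cmd(path), cwd=repo_dir, check=True, capture_output=True, text=True
    )
    return parse_history(result.stdout)
```

Calling `flag_history("our-config", "feature-flags/infra.chaos.freeze-sse-flushes.json")` (path assumed) would list every commit that touched that file, newest first, with the author on each row. That author column is what distinguishes human commits from migrator or agent commits.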

This is the property the git-native architecture was built for, and watching an agent author config made me appreciate it in a new way. When a machine is the one writing flags by the dozen, "is this change reviewable?" stops being a nice-to-have and becomes the only thing standing between you and config chaos. A diff is reviewable. A row that silently mutated in someone else's database is not.

We dogfood the prompt path too

One more honest detail, because it's the same idea pointed at AI features instead of infra. Quonfig's "infer a JSON Schema from example documents" feature reads its LLM prompt from a Quonfig config — our-config/configs/llm.infer-json-schema.json — not from hardcoded source. The model (claude-sonnet-4-6), the system prompt, the temperature, and the token budget all live in that config file. To tune the prompt, we edit the config and commit; there's no redeploy and no code change. Prompts are config, and config is git. We ship that feature on exactly the primitive we sell.
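A sketch of what "prompts are config" can look like on the consuming side. It assumes the config resolves to a JSON object and that the caller already has that object in hand. The key names (`model`, `systemPrompt`, `temperature`, `maxTokens`) are invented for illustration and are not the actual schema of llm.infer-json-schema.json.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptConfig:
    model: str
    system_prompt: str
    temperature: float
    max_tokens: int


def parse_prompt_config(value):
    """Validate a resolved config value (a dict) and return typed prompt settings.

    Key names here are hypothetical; adjust them to the real config's shape.
    """
    missing = [k for k in ("model", "systemPrompt", "temperature", "maxTokens") if k not in value]
    if missing:
        raise KeyError(f"prompt config is missing keys: {missing}")
    temperature = float(value["temperature"])
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    max_tokens = int(value["maxTokens"])
    if max_tokens <= 0:
        raise ValueError("maxTokens must be positive")
    return PromptConfig(
        model=str(value["model"]),
        system_prompt=str(value["systemPrompt"]),
        temperature=temperature,
        max_tokens=max_tokens,
    )
```

Validating at read time matters more when the prompt can change without a deploy: a bad commit to the config should fail loudly in the caller, not produce a silently malformed LLM request.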

The honest lessons

A few things I'd tell anyone pointing an agent loop at their own config:

  • The lifecycle is the hard part, and git makes it tractable, not automatic. Co-locating the flag with the change means retirement is possible in one reviewable step — but something still has to decide a rollout is done. We do that with issue discipline and human review on the flags that matter, not with magic.
  • Single-digit live flags is a feature, not luck. We didn't end up with 8 flags because the loop was timid; we ended up there because retiring a flag is as cheap as creating one when both are commits. The cost asymmetry that creates flag debt elsewhere mostly isn't here.
  • An agent authoring config is only safe if the config is reviewable. The constitution makes the loop block on anything touching storage format, SDK contracts, auth, or schema. The reason that's enforceable at all is that the artifact under review is a diff a human can actually read.
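The blocking rule in the last bullet can be sketched as a path-based gate over the files a change touches. This is an assumption about how such a check could work, not a description of how the constitution is enforced. The four category names come from the post, but every glob pattern below is made up.

```python
from fnmatch import fnmatch

# Hypothetical path patterns per protected category; the real layout will differ.
PROTECTED = {
    "storage format": ["storage/*", "*/storage-format/*"],
    "SDK contracts": ["sdk/*/public/*"],
    "auth": ["auth/*", "*/auth/*"],
    "schema": ["schema/*", "*.schema.json"],
}


def blocked_reasons(changed_paths, protected=PROTECTED):
    """Return the sorted protected categories that a set of changed paths touches."""
    reasons = set()
    for path in changed_paths:
        for category, patterns in protected.items():
            if any(fnmatch(path, pattern) for pattern in patterns):
                reasons.add(category)
    return sorted(reasons)


def may_ship_freely(changed_paths, protected=PROTECTED):
    """True when no protected category is touched, so no human sign-off is needed."""
    return not blocked_reasons(changed_paths, protected)
```

A gate like this only works because the unit under review is a diff with file paths. An agent that mutates rows through an API gives the gate nothing to match on.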

We're not claiming the fully-automated "create on commit, retire at 100%" loop is a shipped button — it isn't, and I'd be lying if I said otherwise. What's real today is the substrate: git-native config files plus a scriptable qfg CLI plus OpenFeature, which is enough to assemble the lifecycle yourself. We run a version of it on ourselves every day, and the explosion curve so far bends the right way: lots shipped, very little left lying around.

If you want the thesis behind all this — why most flags in the agent era will be machine-written, and what that demands of a flag system — read the flag lifecycle essay next.

-- Jeff
