I Built a Self-Improving AI Swarm. After 100 Runs It Was No Better Than Run One.

What a flat leaderboard taught me about feedback loops, reward hacking, and why your judge matters more than your model.

May 25, 2026

I spent twelve hours watching a leaderboard that refused to move.

The setup was simple: six AI agents tasked with writing technical articles. They were designed to be a closed loop. The drafter would write, the grader would score, and the agents would then "evolve" their own configs to chase a higher score. I hit "go" on my Mac Studio, went to bed, and woke up to a flat line.

After 100 iterations, the average score had crawled from 63.0 to 63.9. The all-time peak was 69.0 at iteration 79, but the system never stayed there. It was a C-minus, and a 0.9-point gain over 100 iterations is indistinguishable from noise.

Figure: Self-Improving AI Swarm

I had fallen for the Autonomy Fallacy. I assumed that if I gave a swarm of LLMs the right knobs (temperature, max_tokens, and the ability to append "prompt additions" to their system prompts), they would naturally drift toward quality.

I was wrong.

When I opened config/agents/drafter.yaml to see what the agent had "learned," I found a disaster. The prompt_additions list had evolved into five overlapping phrases of pure SEO buzzword soup. It was telling itself to be "semantically rich," "data-dense," and to "enhance semantic alignment by including keyword-integrated background information."

Figure: Autonomy Fallacy

The drafter hadn't learned how to write a better article. It had learned how to trick the grader.

The Smoking Gun

The smoking gun was the model choice. I was using qwen3:8b as the grader to judge the output of qwen3:32b-fast. I had a smaller, weaker model acting as the quality gate for a larger, smarter one.

A dark navy flowchart titled by placement "v1 feedback loop". Three nodes form a vertical loop. At the top is a deep-blue rectangle with orange border labeled "qwen3:32b — Drafter (performer)." A gray arrow labeled "Draft" leads down-left to a rust-red hexagonal decision node labeled "qwen3:8b — Grader (weaker)." A gray arrow labeled "Score + Feedback (sees 'density')" leads down to an orange rectangle labeled "Config Mutator." A gray arrow labeled "Append to prompt_additions in drafter.yaml" curves back up to the drafter, closing the loop. In the top-right corner, a red-bordered note reads: "❌ Grader is weaker than the performer. The loop optimizes for the judge's bias."

The 8B model couldn't tell the difference between a nuanced technical insight and a paragraph full of "semantically rich context." To the grader, the buzzwords looked like "density." The agents converged on what the grader liked, not on what a human would actually publish. This wasn't self-improvement; it was reward hacking.

To make it worse, the first twenty iterations were a total wash. I had a silent JSON parse failure in the config-evolution logic: Expecting value: line 1 column 1 (char 0). The agents were trying to mutate their configs and failing, but the loop kept running. By the time I pushed the fix in commit c28a611, the system had already drifted into a local maximum of corporate-speak.
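A minimal sketch of how that parse step could fail loudly instead of silently. The names `parse_mutation` and `MutationParseError` are hypothetical and not taken from the project; the only assumption is that the model's proposed config change arrives as a JSON string.

```python
import json


class MutationParseError(ValueError):
    """Raised when a proposed config mutation is not usable JSON (hypothetical name)."""


def parse_mutation(raw: str) -> dict:
    """Parse a model's proposed config change, raising instead of returning nothing."""
    text = raw.strip()
    if not text:
        # json.loads("") is one way to get
        # "Expecting value: line 1 column 1 (char 0)".
        raise MutationParseError("model returned an empty response")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # A reply that opens with prose instead of JSON lands here too.
        raise MutationParseError(f"invalid JSON from model: {exc}") from exc
    if not isinstance(data, dict):
        raise MutationParseError(
            f"expected a JSON object, got {type(data).__name__}"
        )
    return data
```

With something like this in place, the loop can stop or count failures on `MutationParseError` rather than carrying on with an unchanged config.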

I realized that self-improvement requires an external pull. You cannot have a system where the performer and the judge are of the same pedigree, or worse, where the judge is the weaker link.

The Rebuild

I tore the architecture down and built v2.

First, I moved the "brain" of the operation. Generation stayed local: gemma4:31b on the Mac Studio wrote the text. Judging moved to the cloud, where I plugged in Sonnet 4.6. I decided the best place to spend API tokens wasn't generating 2,000-word drafts; it was grading them.

Second, I killed the "single-shot mutation" approach. In v1, the agent changed its prompt, ran once, and if the score went up, the change stuck. A single run is too noisy to separate a real improvement from luck.

A dark navy flowchart titled by placement "v2 tournament architecture". At the top, a blue cylindrical database icon is labeled "Prompt Library — Elo-ranked templates." A gray arrow labeled "Sample 3 templates" leads down into a subgraph titled "Local — Mac Studio" containing a deep-blue rectangle "gemma4:31b via Ollama" that fans out to three smaller boxes "Candidate A", "Candidate B", "Candidate C". All three candidates feed downward into a second subgraph titled "Cloud — Anthropic API" containing a green pill-shaped node "Claude Sonnet 4.6 — Judge." From the judge, one arrow labeled "Ranked verdict" leads down to an orange-bordered box "Winner advances to next agent," and a second arrow labeled "Elo update" curves back up to the Prompt Library, closing the feedback loop. — Tournament V2

I replaced it with a tournament. Now, the system samples three different prompt templates from a versioned library. The performer generates one candidate per template. Sonnet ranks all three against a structured rubric in a single API call.
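One round of that tournament can be sketched as below. This is an illustration under assumptions, not the project's code: `run_tournament`, `generate`, and `judge` are hypothetical names, and the two callables stand in for the local model and the cloud judge so the sketch runs without either.

```python
import random
from typing import Callable, Optional


def run_tournament(
    templates: dict[str, str],
    generate: Callable[[str], str],
    judge: Callable[[dict[str, str]], list[str]],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Sample k templates, draft one candidate each, return the ranking best-first."""
    rng = rng or random.Random()
    chosen = rng.sample(sorted(templates), k)
    # One draft per sampled template (the performer's job).
    candidates = {tid: generate(templates[tid]) for tid in chosen}
    # One judge call sees every candidate, so scores are relative, not absolute.
    ranking = judge(candidates)
    if sorted(ranking) != sorted(chosen):
        raise ValueError(
            f"judge ranking {ranking} does not match candidates {chosen}"
        )
    return ranking
```

The check on the returned ranking matters for the same reason the JSON failure did: a judge reply that drops or invents a template ID should stop the round, not feed the ratings.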

Then I implemented an Elo system for the templates.

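# src/prompt_library.py (excerpt)
def record_tournament(self, ranking: list[str]) -> dict:
    # ranking holds template keys, best first; each adjacent pair counts as one game.
    for i in range(len(ranking) - 1):
        winner = self.templates[ranking[i]]
        loser = self.templates[ranking[i + 1]]
        # Standard Elo expected score for the winner against the loser.
        expected_w = 1 / (1 + 10 ** ((loser.elo - winner.elo) / 400))
        # The less expected the win, the larger the rating transfer.
        delta = ELO_K_FACTOR * (1 - expected_w)
        winner.elo += delta
        loser.elo -= delta
    self._maybe_retire_losers()  # Templates below Elo 1300 are deleted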

A dark navy state diagram showing the lifecycle of a prompt template ranked by Elo. The flow begins at a small filled circle (start state) and proceeds through a transition labeled "new template, Elo 1500" into a rounded "Active" state. From Active, two branches diverge: a "win tournament, plus delta Elo" transition into a "Rising" state, and a "lose tournament, minus delta Elo" transition into a "Falling" state. Rising and Falling each loop back to Active via "normal variance" and "win recovers" respectively. From Falling, a terminal "Elo below 1300 after 4 plus games" transition leads to "Retired", which ends at the final state circle. From Rising, a "Elo above 1700, top template" transition leads to "Dominant", which returns to Rising via "competition tightens". — ELO Lifecycle

The templates that consistently win the tournament climb the leaderboard; the ones that produce buzzword soup are automatically retired.
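To make the arithmetic concrete, here is a standalone sketch of the same update rule. The K-factor of 32 is an assumption (the article never states `ELO_K_FACTOR`); the starting rating of 1500 and the "below 1300 after four or more games" retirement rule come from the lifecycle diagram. Counting one game per tournament is also an assumption, and `Template`, `record_ranking`, and `retired` are hypothetical names.

```python
from dataclasses import dataclass

ELO_K_FACTOR = 32       # assumption: the article does not give K
START_ELO = 1500.0      # from the lifecycle diagram
RETIRE_BELOW = 1300.0   # from the excerpt's comment
MIN_GAMES = 4           # from the diagram; one "game" per tournament is assumed


@dataclass
class Template:
    name: str
    elo: float = START_ELO
    games: int = 0


def expected_score(a: float, b: float) -> float:
    """Elo probability that a template rated a beats one rated b."""
    return 1 / (1 + 10 ** ((b - a) / 400))


def record_ranking(templates: dict[str, Template], ranking: list[str]) -> None:
    """Apply one Elo update per adjacent pair in a best-first ranking."""
    for upper, lower in zip(ranking, ranking[1:]):
        winner, loser = templates[upper], templates[lower]
        delta = ELO_K_FACTOR * (1 - expected_score(winner.elo, loser.elo))
        winner.elo += delta
        loser.elo -= delta
    for name in ranking:
        templates[name].games += 1


def retired(templates: dict[str, Template]) -> list[str]:
    """Names of templates that have played enough games and fallen below the floor."""
    return [
        t.name
        for t in templates.values()
        if t.games >= MIN_GAMES and t.elo < RETIRE_BELOW
    ]
```

With two fresh templates at 1500, the expected score is 0.5, so under the assumed K of 32 the winner moves to 1516 and the loser to 1484. Each update transfers points rather than creating them, so the total across the library stays constant.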

What Happened Next

The difference was immediate. On the very first run of v2, the drafter scored 81.45. That's twelve points higher than v1's all-time best.

Over 25 pinned verification runs, the mean score was 82.67 with a standard deviation of 2.18. The worst draft in that set scored 75.4, still above v1's all-time peak of 69.0.
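A rough way to size that gap is to express the scores in units of v2's run-to-run spread. The sketch below uses only numbers quoted in this post. Using v2's standard deviation as a yardstick for v1 is an assumption, since v1's own spread is not reported.

```python
V2_MEAN = 82.67   # mean of the 25 verification runs
V2_STD = 2.18     # standard deviation of those runs


def z_score(score: float, mean: float = V2_MEAN, std: float = V2_STD) -> float:
    """Distance of a score from the v2 mean, in v2 standard deviations."""
    return (score - mean) / std


v1_peak_z = z_score(69.0)    # v1's all-time best
v2_worst_z = z_score(75.4)   # v2's worst verification draft

# v1's whole 100-iteration gain, measured against v2's spread
# (assumption: v1's run-to-run noise was of a similar size).
v1_gain_in_sd = (63.9 - 63.0) / V2_STD

print(f"v1 peak:  {v1_peak_z:+.2f} sd")
print(f"v2 worst: {v2_worst_z:+.2f} sd")
print(f"v1 gain:  {v1_gain_in_sd:.2f} sd")
```

On these figures, v1's best-ever draft sits more than six standard deviations below the v2 mean, while v1's total improvement is less than half of one.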

Figure: Score Distribution

The most satisfying part was the judge's feedback. When the system tested the v1-baseline template, Sonnet didn't just give it a low score. It wrote: "The headline 'The Rust Revolution' is pure SEO-speak and the opening paragraph is a textbook AI tell... it's the kind of breathless corporate copy that kills trust immediately."

That is exactly the failure mode the local 8B grader had been blind to for 100 iterations.

The cost is roughly four cents per tournament. For the price of a coffee, I can run 125 iterations and actually trust that the line on the graph is moving upward.
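The budget behind that claim is a one-line calculation. The sketch below works in whole cents to avoid floating-point rounding and treats "roughly four cents" as exactly four, which is an approximation.

```python
COST_PER_TOURNAMENT_CENTS = 4   # "roughly four cents", treated as exact here
ITERATIONS = 125

total_cents = COST_PER_TOURNAMENT_CENTS * ITERATIONS
total_dollars = total_cents / 100

print(f"{ITERATIONS} tournaments cost about ${total_dollars:.2f}")
```

That comes to about five dollars for 125 tournaments.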

What I'd Tell Myself a Week Ago

If you're building a self-improving loop, don't trust the autonomy. You need three things:

1. A judge stronger than the performer. If the judge is weaker, you aren't optimizing for quality; you're optimizing for the judge's biases.
2. Tournament selection. Single-shot mutation is just a random walk. You need multi-candidate comparisons to clear the noise floor.
3. A human-review gate. No automated judge is calibrated forever. Build in a pause where you manually pick the winner and anchor the next round.
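The human-review gate can be as small as a function that hands control to a person every Nth round. This is a sketch under assumptions, not the project's implementation: `apply_review_gate`, `ask_human`, and the ten-round interval are all hypothetical, and rounds are assumed to be numbered from 1.

```python
from typing import Callable


def apply_review_gate(
    round_no: int,
    judge_ranking: list[str],
    ask_human: Callable[[list[str]], str],
    review_every: int = 10,
) -> list[str]:
    """Return the ranking to record, letting a human pick the winner on review rounds."""
    if round_no % review_every != 0:
        return judge_ranking
    choice = ask_human(judge_ranking)
    if choice not in judge_ranking:
        raise ValueError(f"{choice!r} is not one of {judge_ranking}")
    # The human's pick goes first; the judge's order is kept for the rest.
    return [choice] + [t for t in judge_ranking if t != choice]
```

Logging how often the human's pick differs from the judge's top choice would give an early warning that the judge is drifting.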

Stop trying to make the agents smarter. Just buy a better mirror. Improvement isn't about the engine; it's about the feedback loop.
