I Built a Self-Improving AI Swarm. After 100 Runs It Was No Better Than Run One.
What a flat leaderboard taught me about feedback loops, reward hacking, and why your judge matters more than your model.
I spent twelve hours watching a leaderboard that refused to move.
The setup was simple: six AI agents tasked with writing technical articles. They were designed to be a closed loop. The drafter would write, the grader would score, and the agents would then "evolve" their own configs to chase a higher score. I hit "go" on my Mac Studio, went to bed, and woke up to a flat line.
After 100 iterations, the average score had crawled from 63.0 to 63.9. The all-time peak was 69.0 at iteration 79, but the system never stayed there. It was a C-minus, and the 0.9-point gain was indistinguishable from noise.
I had fallen for the Autonomy Fallacy. I assumed that if I gave a swarm of LLMs the right knobs—temperature, max_tokens, and the ability to append "prompt additions" to their system prompts—they would naturally drift toward quality.
I was wrong.
When I opened config/agents/drafter.yaml to see what the agent had "learned," I found a disaster. The prompt_additions list had evolved into five overlapping phrases of pure SEO buzzword soup. It was telling itself to be "semantically rich," "data-dense," and to "enhance semantic alignment by including keyword-integrated background information."
The drafter hadn't learned how to write a better article. It had learned how to trick the grader.
The Smoking Gun
The smoking gun was the model choice. I was using qwen3:8b as the grader to judge the output of qwen3:32b-fast. I had a smaller, weaker model acting as the quality gate for a larger, smarter one.
The 8B model couldn't tell the difference between a nuanced technical insight and a paragraph full of "semantically rich context." To the grader, the buzzwords looked like "density." The agents converged on what the grader liked, not on what a human would actually publish. This wasn't self-improvement; it was reward hacking.
To make it worse, the first twenty iterations were a total wash. I had a silent JSON parse failure in the config-evolution logic: Expecting value: line 1 column 1 (char 0). The agents were trying to mutate their configs and failing, but the loop kept running. By the time I pushed the fix in commit c28a611, the system had already drifted into a local maximum of corporate-speak.
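A minimal sketch of how a failure like this can be made loud instead of silent. The function and exception names here are hypothetical, not taken from the project; the only assumption is that the agent's proposed mutation arrives as a JSON string.

```python
import json


class ConfigMutationError(ValueError):
    """Raised when a proposed config mutation cannot be used (hypothetical name)."""


def parse_mutation(raw: str) -> dict:
    """Parse an agent's proposed config change, failing loudly on bad input."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # An empty or non-JSON reply produces the message quoted above:
        # "Expecting value: line 1 column 1 (char 0)".
        raise ConfigMutationError(f"unparseable mutation: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ConfigMutationError(
            f"expected a JSON object, got {type(parsed).__name__}"
        )
    return parsed
```

With something like this in place, the loop can stop, or at least count the failures, rather than quietly running twenty iterations with no mutations applied.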
I realized that self-improvement requires an external pull. You cannot have a system where the performer and the judge are of the same pedigree, or worse, where the judge is the weaker link.
The Rebuild
I tore the architecture down and built v2.
First, I moved the "brain" of the operation. Generation stayed local: I used gemma4:31b on the Mac Studio to write the text. Judging moved to the cloud, where I plugged in Sonnet 4.6. I decided the best place to spend API tokens wasn't on generating 2,000-word drafts, but on grading them.
Second, I killed the "single-shot mutation" approach. In v1, the agent changed its prompt, ran once, and if the score went up, the change stuck. A single run is too noisy to tell a real improvement from a lucky draw.
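The noise problem can be illustrated with a small simulation. This is a sketch under stated assumptions, not a measurement from the project: it assumes the grader's score is the draft's true quality plus Gaussian noise, and the noise level passed in is a made-up parameter.

```python
import random


def single_shot_accept_rate(trials: int, noise_sd: float, seed: int = 0) -> float:
    """Fraction of do-nothing mutations that 'stick' under a single comparison."""
    rng = random.Random(seed)
    true_quality = 63.0  # identical before and after the mutation
    accepted = 0
    for _ in range(trials):
        before = true_quality + rng.gauss(0, noise_sd)
        after = true_quality + rng.gauss(0, noise_sd)
        if after > before:  # the v1-style rule: keep the change if the score went up
            accepted += 1
    return accepted / trials
```

Under these assumptions, a mutation that changes nothing is kept about half the time, which is what a random walk looks like from the inside.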
I replaced it with a tournament. Now, the system samples three different prompt templates from a versioned library. The performer generates three candidates. Sonnet ranks all three against a structured rubric in a single API call.
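One round of that tournament might be sketched as follows. The function name and both callables are hypothetical stand-ins: `generate` represents the local performer and `rank` represents the single judge call, so no real model API is assumed here.

```python
import random
from typing import Callable, Optional


def run_tournament(
    template_ids: list[str],
    generate: Callable[[str], str],
    rank: Callable[[dict[str, str]], list[str]],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Sample k templates, draft once per template, return the judge's ranking."""
    rng = rng or random.Random()
    contenders = rng.sample(template_ids, k)  # k distinct templates
    drafts = {tid: generate(tid) for tid in contenders}
    ranking = rank(drafts)  # best first, one call for all drafts
    if sorted(ranking) != sorted(contenders):
        raise ValueError("judge ranking must contain each contender exactly once")
    return ranking
```

The validation step matters: a judge that drops or duplicates a candidate should stop the round rather than feed a malformed ranking into the ratings.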
Then I implemented an Elo system for the templates.
    # src/prompt_library.py (excerpt)
    def record_tournament(self, ranking: list[str]) -> dict:
        # ranking is ordered best-first; each adjacent pair is scored as a win/loss.
        for i in range(len(ranking) - 1):
            winner = self.templates[ranking[i]]
            loser = self.templates[ranking[i + 1]]
            # Standard Elo expected score for the winner.
            expected_w = 1 / (1 + 10 ** ((loser.elo - winner.elo) / 400))
            delta = ELO_K_FACTOR * (1 - expected_w)
            winner.elo += delta
            loser.elo -= delta
        self._maybe_retire_losers()  # Templates below Elo 1300 are deleted

The templates that consistently win the tournament climb the leaderboard; the ones that produce buzzword soup are automatically retired.
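To see the update in isolation, here is a self-contained sketch of the same adjacent-pair rule. The K-factor of 32 and the starting rating of 1500 are assumptions chosen for illustration; the post does not state the actual values. `Template` and `apply_ranking` are hypothetical names.

```python
from dataclasses import dataclass

ELO_K_FACTOR = 32   # assumed value, not taken from the project
START_ELO = 1500.0  # assumed starting rating


@dataclass
class Template:
    name: str
    elo: float = START_ELO


def apply_ranking(templates: dict[str, Template], ranking: list[str]) -> None:
    """Update ratings in place, treating each adjacent pair as a win/loss."""
    for i in range(len(ranking) - 1):
        winner = templates[ranking[i]]
        loser = templates[ranking[i + 1]]
        expected_w = 1 / (1 + 10 ** ((loser.elo - winner.elo) / 400))
        delta = ELO_K_FACTOR * (1 - expected_w)
        winner.elo += delta
        loser.elo -= delta


pool = {name: Template(name) for name in ("a", "b", "c")}
apply_ranking(pool, ["a", "b", "c"])
```

Starting from equal ratings, the first-place template gains exactly half the K-factor, and the total rating across the pool is unchanged, since every point a winner gains is taken from a loser.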
What Happened Next
The difference was immediate. On the very first run of v2, the drafter scored 81.45. That's more than twelve points higher than v1's all-time best of 69.0.
Over 25 pinned verification runs, the mean score was 82.67 with a standard deviation of 2.18. The worst draft in that set scored 75.4, still above v1's ceiling of 69.0.
The most satisfying part was the judge's feedback. When the system tested the v1-baseline template, Sonnet didn't just give it a low score. It wrote: "The headline 'The Rust Revolution' is pure SEO-speak and the opening paragraph is a textbook AI tell... it's the kind of breathless corporate copy that kills trust immediately."
That is exactly the failure mode the local 8B grader had been blind to for 100 iterations.
The cost is roughly four cents per tournament. For the price of a coffee, I can run 125 iterations and actually trust that the line on the graph is moving upward.
What I'd Tell Myself a Week Ago
If you're building a self-improving loop, don't trust the autonomy. You need three things:
1. A judge stronger than the performer. If the judge is weaker, you aren't optimizing for quality; you're optimizing for the judge's biases.
2. Tournament selection. Single-shot mutation is just a random walk. You need multi-candidate comparisons to clear the noise floor.
3. A human-review gate. No automated judge is calibrated forever. Build in a pause where you manually pick the winner and anchor the next round.
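The human-review gate can be as small as two functions. This is a sketch, not the project's code: the ten-round cadence and both function names are assumptions made for illustration.

```python
from typing import Optional

REVIEW_EVERY = 10  # assumed cadence; pick whatever interval you can sustain


def needs_human_review(round_index: int, every: int = REVIEW_EVERY) -> bool:
    """True on every `every`-th round, counting rounds from 1."""
    return round_index > 0 and round_index % every == 0


def final_ranking(judge_ranking: list[str], human_pick: Optional[str]) -> list[str]:
    """Put the human's pick first; keep the judge's order for everything else."""
    if human_pick is None:
        return list(judge_ranking)
    if human_pick not in judge_ranking:
        raise ValueError("human pick must be one of the ranked candidates")
    return [human_pick] + [t for t in judge_ranking if t != human_pick]
```

On a review round, the loop would pause, show the drafts, and pass the chosen template through `final_ranking` before any ratings are updated, so the human choice is what anchors the next round.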
Stop trying to make the agents smarter. Just buy a better mirror. Improvement isn't about the engine—it’s about the feedback loop.








