97% on train, 82% on test: validation and test data matter more than the loop
Fable 5 had just come out, so we used it to run a prompt-improvement loop end to end. The loop worked. What we learned was why validation and test datasets are what make the result mean anything.
We had just argued in "AI is eating the AI engineering loop" that agents can run more of the loop now, but only if the loop itself is set up well. When Fable 5 came out and was promoted for loop-shaped work, it felt like a good moment to test that on a simple benchmark.
We picked a task with one of the cleanest target functions in AI engineering: exact-match accuracy against gold labels.
So we gave Claude Fable 5, running in Claude Code, a classification task, a train/test split in Langfuse Datasets, a prompt in Prompt Management, and a goal: iterate on the train set until you hit 95% accuracy or 15 runs, whichever comes first. Then run the held-out test set once.
It hit the target in 4 runs. Whether Fable could operate the loop was never the interesting question; it could. The lesson was about evaluation design: without a validation split, even a clean train/test setup is not enough, because the loop optimizes on train and the eventual test result lands as a surprise instead of a check. By the end, 11 test errors were shared across every prompt variant we ran on the test set. That turned the story from "how do we run loops better?" into "what would the dataset have needed to contain for this loop to actually learn something that transfers?"
Why a classification task
If you are going to hand a loop to an agent, classification is the ideal first candidate:
- A clear target function. Exact-match accuracy. No LLM-as-judge calibration debates.
- Known to be hard. Getting humans to agree on labels is famously difficult. Expecting a model to hit 100% against noisy gold labels is unrealistic, which makes the optimization dynamics interesting.
- It is everywhere. User-intent routing, email triage, support-ticket tagging, legal-case bucketing, fraud and risk categorization, lead qualification, content labeling. These are usually sub-tasks inside larger workflows, exactly the kind of contained, measurable component where a loop like this can already be useful today.
Our concrete task: classify arXiv papers into one of 10 categories, such as Databases, Information Retrieval, Software Engineering, and Sound, from title, authors, and abstract.
The setup
We kept it deliberately simple:
- a train split with 200 labeled examples and a held-out test split with 100, stored in Langfuse Datasets
- a system prompt, with model config and a strict JSON response schema, managed in Langfuse Prompt Management and fetched by the production label at runtime
- a small Python runner using Langfuse Experiments via the SDK: it runs a dataset against the current production prompt, scores every row with exact-match accuracy, and links everything back to the dataset run
- gpt-4o-mini at temperature 0 as the task model
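To make the scoring concrete, here is a minimal sketch of exact-match accuracy in plain Python. It is not the runner we used: the function names are hypothetical, and it assumes predictions and gold labels arrive as two aligned lists of label strings, with no normalization because the strict JSON schema already constrains the output.

```python
def exact_match(predicted: str, gold: str) -> bool:
    """A row is correct only if the predicted label equals the gold label."""
    return predicted == gold


def accuracy(predictions: list[str], gold_labels: list[str]) -> float:
    """Fraction of rows where the prediction matches the gold label exactly."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must be the same length")
    if not gold_labels:
        raise ValueError("cannot score an empty dataset")
    correct = sum(exact_match(p, g) for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Illustrative rows only, not taken from the actual dataset.
example_predictions = ["Databases", "Sound", "Security", "Databases"]
example_gold = ["Databases", "Sound", "Software Engineering", "Information Retrieval"]
print(accuracy(example_predictions, example_gold))
```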
We chose gpt-4o-mini deliberately. A stronger model like gpt-5.5 likely would have done better out of the box, but it is also much more expensive. For a narrow task like this, that tradeoff matters: if a bit of prompt iteration can make a cheaper model perform well enough, that is often a better production choice than paying frontier-model prices on every classification call.
The starting prompt was as bare as it gets: "Classify this paper with a label" plus the flat list of allowed labels.
The auto-improving loop, powered by Fable
The instructions we gave the agent translated into this loop:
- Run the train dataset with the current prompt.
- Score every row as correct or incorrect.
- Write a short qualitative annotation on every error: what went wrong, what likely pulled the model to the wrong label. These were posted as comments on the Langfuse trace.
- Form a hypothesis and revise only the prompt, published as a new prompt version to Langfuse.
- Repeat until accuracy reached 95% or 15 train runs.
- Run the final prompt once on the held-out test split and report the gap.
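The steps above reduce to a small control loop. The sketch below is an illustration under assumptions, not the agent's actual implementation: `evaluate` and `revise` are hypothetical callables standing in for "run the train dataset and return accuracy" and "analyze errors and publish a new prompt version".

```python
from typing import Callable


def improvement_loop(
    evaluate: Callable[[str], float],
    revise: Callable[[str, float], str],
    initial_prompt: str,
    target: float = 0.95,
    max_runs: int = 15,
) -> tuple[str, list[float]]:
    """Iterate on the train set until the target accuracy or the run budget is hit."""
    prompt = initial_prompt
    history: list[float] = []
    for run in range(1, max_runs + 1):
        score = evaluate(prompt)       # one full train run with the current prompt
        history.append(score)
        if score >= target or run == max_runs:
            break                      # stopping condition: target or budget
        prompt = revise(prompt, score)  # hypothesis-driven prompt revision
    return prompt, history
```

Note what the stopping condition selects on: train accuracy only. The loop returns whichever prompt was current when it stopped, and nothing in it checks whether that prompt generalizes.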
We ran this with Claude Code's goal mode, which keeps the agent working autonomously until the stopping condition holds. Experiments ran as background tasks; the agent picked up each result, did its error analysis, published the next prompt version, and kicked off the next run without intervention.
Round 1: the hill sprint
| Run | Prompt strategy | Train accuracy |
|---|---|---|
| 1 | v1 - flat label list. "Classify this paper with a label" plus 10 label names | 78.0% |
| 2 | v2 - definitions + decision rules. One-line definition per label, general boundary rules from error analysis | 90.5% |
| 3 | v3 - sharpened boundary rules. More aggressive IR-vs-DB and HCI-vs-Society rules | 90.0% |
| 4 | v4 - precedent list. Around 30 concrete "pattern -> label" precedents distilled from prior failures | 97.0% |
The first jump is the legitimate one: v1's errors showed the model treating "Emerging Technologies" as a catch-all for anything mentioning LLMs, and missing that education and policy papers belong to "Computers and Society." v2 fixed that with general definitions, a 12.5-point jump.
Run 3 is where it got interesting: the sharpened rules fixed 10 errors and broke 11 papers that run 2 had right. Classic whack-a-mole. Every boundary you push captures lookalikes on the other side.
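Spotting whack-a-mole requires comparing runs item by item, not just by headline accuracy. A minimal sketch, assuming each run is available as a mapping from item ID to a correct/incorrect flag (the function name and the data shape are hypothetical):

```python
def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, set[str]]:
    """Compare per-item correctness between two runs over the same items."""
    shared = before.keys() & after.keys()
    fixed = {i for i in shared if not before[i] and after[i]}
    broken = {i for i in shared if before[i] and not after[i]}
    return {"fixed": fixed, "broken": broken}


# Illustrative item IDs only.
run_a = {"p1": True, "p2": False, "p3": True, "p4": False}
run_b = {"p1": False, "p2": True, "p3": True, "p4": False}
delta = diff_runs(run_a, run_b)
print(len(delta["fixed"]) - len(delta["broken"]))  # net change in correct items
```

For run 3, the net change was 10 fixed minus 11 broken, or one item out of 200, which is the half-point drop from 90.5% to 90.0%. The headline number hides that 21 papers changed sides.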
The agent's response to the whack-a-mole was clever, and exactly wrong: it replaced abstract rules with a list of concrete precedents distilled from the training failures, things like "a census of Windows drivers -> Software Engineering" and "watermarking RAG databases -> Security." Train accuracy jumped to 97%. Stopping condition met, in 4 of the allowed 15 runs.
Then came the held-out test set:
| Prompt | Train | Test | Gap |
|---|---|---|---|
| v2 - general definitions | 90.5% | 84.0% | 6.5 |
| v4 - train-derived precedents | 97.0% | 82.0% | 15.0 |
The precedent list was memorization wearing a trench coat. On test, v4's precedents fixed 4 papers that matched trained patterns and miscaptured 6 lookalikes they were never meant for. Net negative. The "worse" prompt won.
Round 2: "generalize this time"
So we restarted the loop from v2 with new instructions: every prompt change must be a general taxonomy principle backed by a class of errors, at least three failures sharing a mechanism, never a single-paper precedent. And no touching the test set.
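The "at least three failures sharing a mechanism" rule can be expressed as a simple filter. This sketch assumes each error annotation has been reduced to a short mechanism tag; in our run the annotations were free-text comments, so the tagging step and all names here are hypothetical.

```python
from collections import defaultdict


def error_classes(
    annotations: list[tuple[str, str]], min_size: int = 3
) -> dict[str, list[str]]:
    """Group (item_id, mechanism) pairs and keep mechanisms with enough failures."""
    groups: dict[str, list[str]] = defaultdict(list)
    for item_id, mechanism in annotations:
        groups[mechanism].append(item_id)
    # Only classes of errors justify a prompt change; single papers do not.
    return {m: ids for m, ids in groups.items() if len(ids) >= min_size}


# Illustrative annotations only.
example = [
    ("p1", "llm-mention pulls to Emerging Technologies"),
    ("p2", "llm-mention pulls to Emerging Technologies"),
    ("p3", "llm-mention pulls to Emerging Technologies"),
    ("p4", "one-off ambiguous abstract"),
]
print(error_classes(example))
```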
| Run | Prompt strategy | Train accuracy |
|---|---|---|
| 5 | v5 - principles rewrite + a reasoning output field | 84.0% |
| 6 | v6 - v2 base + class-level principles such as hardware -> Emerging Tech, "what is success measured by", and "level of analysis" | 91.0% |
| 7 | v7 - IR owns search/recommendation infrastructure; audio is a data-type rule; crypto code -> Security | 93.5% |
| 8 | v8 - subject-vs-representation for audio; rule precedence; serving-cost -> Databases | 94.0% |
| 9 | v9 - unified audio rule; requirements engineering -> Software Engineering | 94.0% |
Two things are worth pausing on.
First, the v5 regression: adding a chain-of-thought-style reasoning field, a change that feels like it should always help, made things worse. The model used the reasoning to rationalize surface cues. At one point it justified labeling a robot-navigation paper as Human-Computer Interaction by calling the vision-language model a "user." Structural changes are hypotheses too. They need the same experimental treatment.
Second, the plateau was honest. By run 8 the agent reported, unprompted, that many of the remaining papers had been missed repeatedly under every general formulation and that fixing them would require the very paper-specific precedents we were deliberately avoiding. Its conclusion was that the realistic ceiling for a generalizable prompt on this train set was around 94 to 95%, and that we should stop instead of chasing the ambiguous tail.
And the final test run?
| Prompt | Train | Test | Gap |
|---|---|---|---|
| v2 - general definitions, round 1 | 90.5% | 84.0% | 6.5 |
| v4 - precedent list, round 1 | 97.0% | 82.0% | 15.0 |
| v9 - general principles, round 2 | 94.0% | 81.0% | 13.0 |
81%. But that headline misses the more important result: 11 test errors were shared by all three prompt variants. With only 100 test items, the whole spread from 81% to 84% is three papers, so these held-out scores are statistically indistinguishable, which means the disciplined second round bought zero measurable test improvement. The loop got better at fitting the train split while leaving the same hard cases unresolved.
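One way to check the "statistically indistinguishable" claim is a confidence interval on each test score. The sketch below uses the Wilson score interval at 95% (z = 1.96); that choice is ours for illustration, and a paired test on per-item results would be sharper since all three prompts saw the same 100 items.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default: 95% level)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Test-set results from the table above: correct items out of 100.
for name, correct in [("v2", 84), ("v4", 82), ("v9", 81)]:
    low, high = wilson_interval(correct, 100)
    print(f"{name}: {correct}% (95% interval {low:.1%} to {high:.1%})")
```

At this sample size each interval is roughly 15 percentage points wide, and every interval contains the other two scores. A three-point difference cannot be separated from noise with 100 items.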
That is the real ceiling in the data. Once the same errors survive every prompt variant, the problem stops looking like "we need another round of prompt edits" and starts looking like "prompt iteration is now the wrong tool." Some of those papers sit on boundaries humans would argue about too. At that point, more examples, clearer label definitions, or a different task model are probably higher-leverage than another pass through the loop.
What we learned
1. Train, validation, and test do different jobs. Exact-match accuracy against gold labels sounds foolproof, but if you keep selecting prompt versions on train accuracy, you still overfit. In hindsight, we should have stuck with the boring best practice: train for fitting, a validation split for selection, and the test set used once at the end.
2. The stronger finding is not just that we skipped validation. It is that even a disciplined round 2, with more principled and general prompt changes, bought zero measurable test improvement. By that point the loop was no longer teaching us how to prompt better. It was teaching us that prompt iteration had run out of room.
3. That leaves the open question in the dataset. What would the benchmark itself have needed to look like for this loop to work? More repeated boundary cases, clearer category definitions, more examples of ambiguous classes, a true validation split, maybe even an "unsure" bucket? Until that answer is clearer, running the loop harder is probably the wrong move.
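The first lesson is cheap to implement. Here is a minimal sketch of carving a validation split out of the existing train set, stratified by label, and selecting the prompt version on validation accuracy. The function names, the 25% fraction, and the fixed seed are assumptions for illustration, not what we ran.

```python
import random
from collections import defaultdict


def carve_validation(
    items: list[tuple[str, str]], val_fraction: float = 0.25, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Split (item_id, label) pairs into train and validation IDs, per label."""
    by_label: dict[str, list[str]] = defaultdict(list)
    for item_id, label in items:
        by_label[label].append(item_id)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train: list[str] = []
    validation: list[str] = []
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)
        n_val = round(len(ids) * val_fraction)
        validation.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, validation


def select_prompt(validation_accuracy: dict[str, float]) -> str:
    """Pick the prompt version with the best validation accuracy, not train."""
    return max(validation_accuracy, key=validation_accuracy.get)
```

With this in place, the agent iterates and analyzes errors on train, the version to ship is chosen on validation, and the test set is run once at the end as a check on that choice.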
Where this is actually useful today
None of this means "do not automate the loop." It means: automate the inner loop, own the outer one. A realistic split for a classification task like this:
- Agent-owned: running experiments, scoring, per-error annotation, drafting hypothesis-driven prompt revisions, diffing errors across runs, flagging plateaus
- Human-owned: the target function, including the validation and held-out test data nobody optimizes against, dataset composition, when to restart with different constraints, and when to stop
The infrastructure for the agent-owned half is exactly what Langfuse provides: datasets, prompt versioning, experiments, and trace comments give the agent a full read/write workbench, and give you the audit trail to vouch for what it did.
That last part matters most. The agent will get to your target. Make sure it is the right one. And if the validation split stays flat, the next move is probably not a better loop. It is a better dataset.
All experiments: gpt-4o-mini, temperature 0, strict JSON schema output. Optimizer agent: Claude Fable 5 in Claude Code with goal mode. 9 train runs, 200 items, plus 3 test runs, 100 items, across both rounds. Full prompt version history and per-run error annotations live in Langfuse.