March 2026
# Preference TOML
Use a config format the model already knows.
More importantly, use semantics the model has already seen in alignment, evaluation, and critique papers.
## The Basic Idea
TOML is not magical. It is just readable.
The useful part is that keys like `reward`, `criterion`, `preference`, `critique`, `revision`, and `accept_if` map onto concepts that recur in RLHF, RLAIF, rubric judging, and self-refinement work.
I am making one inference here: models likely respond well to these names because they appear throughout pretraining and post-training corpora. The direct ablation on field names is still missing.
## Why This Works Better Than a Random DSL Flavor
Do not invent cute nouns when standard ones exist.
If you call a score bucket `vibes`, the model has to infer what you mean. If you call it `criterion` with a `weight`, it already knows the shape of the task.
That matters because empirical work keeps landing on the same pattern: explicit principles, rubrics, critique steps, and structured scoring improve control.
## TOML Example
```toml
task = "add retry support to api client"
objective = "retry 429 and 503 with bounded exponential backoff"

[reward]
style = "weighted_rubric"

[[criterion]]
name = "correctness"
weight = 5
required = true

[[criterion]]
name = "tests"
weight = 5
required = true

[[criterion]]
name = "edge_cases"
weight = 5

[[criterion]]
name = "error_handling"
weight = 5

[[criterion]]
name = "readability"
weight = 5

[[criterion]]
name = "architecture_fit"
weight = 5

[feedback]
mode = "critique_then_revision"
evidence_required = true
focus_low_scores_only = true

[accept_if]
total_gte = 24
correctness_gte = 4
tests_gte = 4
architecture_fit_gte = 4

[reject_on]
silent_failure = true
hardcoded_fixture_logic = true
new_dependency = true
```
This gives the agent a familiar contract.
It can fill the criteria, explain failures, revise weak spots, and check acceptance conditions without guessing what success means.
## Real Workflow
The loop below is the practical version.
1. load toml spec
2. generate patch against the stated objective
3. score each criterion
4. emit critique for scores below threshold
5. revise only weak criteria
6. rescore
7. accept or reject via explicit gates
The first pass can return a structured result like this.
```toml
[score]
correctness = 4
tests = 3
edge_cases = 2
error_handling = 2
readability = 4
architecture_fit = 5

[critique]
tests = "missing max-retry coverage"
edge_cases = "503 exhaustion path not covered"
error_handling = "final upstream exception is swallowed"
```
That output is already actionable. No prose detective work required.
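Step 7 of the loop, accept or reject via explicit gates, can be sketched in a few lines. The `gates_pass` helper and the `_gte` suffix convention below are illustrative, not part of any standard:

```python
# Scores from the first pass (the [score] block above).
score = {
    "correctness": 4, "tests": 3, "edge_cases": 2,
    "error_handling": 2, "readability": 4, "architecture_fit": 5,
}

# Gates from [accept_if]: "<criterion>_gte" thresholds a single criterion,
# "total_gte" thresholds the sum of all scores.
accept_if = {"total_gte": 24, "correctness_gte": 4,
             "tests_gte": 4, "architecture_fit_gte": 4}

def gates_pass(score: dict, accept_if: dict) -> list[str]:
    """Return the list of failed gates; an empty list means accept."""
    failed = []
    for key, threshold in accept_if.items():
        name = key.removesuffix("_gte")
        value = sum(score.values()) if name == "total" else score[name]
        if value < threshold:
            failed.append(key)
    return failed

print(gates_pass(score, accept_if))  # ['total_gte', 'tests_gte']
```

With the example scores the total is 20, so `total_gte` fails alongside `tests_gte`; the patch is rejected even though correctness and architecture fit both clear their individual gates.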
## Why the Semantics Matter
The names should match concepts from real alignment and evaluation workflows.
| DSL Term | Empirical Concept | Why It Helps |
|---|---|---|
| `criterion` | rubric dimension | breaks one fuzzy target into scoreable axes |
| `weight` | reward shaping | makes tradeoffs explicit |
| `feedback` | verbal reinforcement | turns scores into revision targets |
| `preference` | pairwise comparison | lets the agent rank alternatives when scalar scoring is weak |
| `accept_if` | policy gate | prevents high total scores from masking critical failures |
| `reject_on` | hard constraint | blocks known reward-hacking patterns |
| `evidence_required` | evidence-anchored judging | forces the model to point at code or tests |
## Preference Mode
Sometimes scalar scoring is not enough.
If two patches are both plausible, a pairwise preference block can work better because RLHF systems are often trained from ranked comparisons.
```toml
[preference]
mode = "pairwise"
prompt = "choose the patch that better satisfies the rubric"
require_rationale = true

[choose_if]
correctness = "higher"
architecture_fit = "higher"
readability = "higher"
new_complexity = "lower"
```
This is especially useful when the agent has two candidate implementations and the better one is more obvious in comparison than in isolation.
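One way to implement the `[choose_if]` directions is a simple per-criterion win count. The `prefer` helper and the candidate scores below are hypothetical, a sketch rather than a fixed semantics for the block:

```python
# Per-criterion scores for two candidate patches (illustrative numbers).
patch_a = {"correctness": 5, "architecture_fit": 4, "readability": 4, "new_complexity": 2}
patch_b = {"correctness": 5, "architecture_fit": 3, "readability": 5, "new_complexity": 3}

# Directions from [choose_if]: "higher" prefers the larger value, "lower" the smaller.
choose_if = {"correctness": "higher", "architecture_fit": "higher",
             "readability": "higher", "new_complexity": "lower"}

def prefer(a: dict, b: dict, choose_if: dict) -> str:
    """Count per-criterion wins; a tie on a criterion counts for neither side."""
    wins_a = wins_b = 0
    for name, direction in choose_if.items():
        if a[name] == b[name]:
            continue
        better = max if direction == "higher" else min
        if better(a[name], b[name]) == a[name]:
            wins_a += 1
        else:
            wins_b += 1
    return "a" if wins_a > wins_b else "b" if wins_b > wins_a else "tie"

print(prefer(patch_a, patch_b, choose_if))  # a
```

Here patch A wins on architecture fit and lower complexity, patch B only on readability, so A is preferred 2 wins to 1.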
## Alternative Syntaxes
The semantics matter more than the wrapper.
### JSON

```json
{
  "reward": { "style": "weighted_rubric" },
  "criterion": [
    { "name": "correctness", "weight": 5, "required": true },
    { "name": "tests", "weight": 5, "required": true }
  ],
  "accept_if": { "total_gte": 10, "correctness_gte": 4 }
}
```
### XML

```xml
<reward style="weighted_rubric">
  <criterion name="correctness" weight="5" required="true" />
  <criterion name="tests" weight="5" required="true" />
  <accept_if total_gte="10" correctness_gte="4" />
</reward>
```
### Ruby-ish DSL

```ruby
reward do
  criterion :correctness, weight: 5, required: true
  criterion :tests, weight: 5, required: true
  accept_if total_gte: 10, correctness_gte: 4
end
```
TOML tends to be the least annoying of the bunch for human editing.
## Empirical Mapping
Several primary results line up with this pattern.
- Constitutional AI showed that a list of explicit principles can drive critique, revision, and preference modeling with far fewer human labels.
- G-Eval improved evaluator alignment with chain-of-thought and a form-filling rubric.
- Reflexion showed that verbal feedback can materially improve coding performance.
- Self-Refine showed that iterative self-feedback improves outputs without extra training.
- RULERS argues that executable rubrics, evidence anchoring, and calibrated scales beat loose prompt phrasing for reliable judging.
The common thread is boring and useful: explicit criteria plus explicit feedback loops.
## Practical Notes
Keep the vocabulary plain.
Keep the schema small enough that the agent can hold the whole thing in working memory.
Use hard constraints for known failure modes. Use weighted criteria for everything else.
If you want the model to check all the boxes, make the boxes literal.
## Empirical Findings (Starfish Method)

### START

- Start naming fields after concepts the model has probably already seen: `criterion`, `weight`, `preference`, `critique`, `revision`.
- Start using explicit acceptance gates for critical dimensions like correctness and architecture fit.
- Start requiring evidence in critique output when the workflow feeds into review automation.

### STOP

- Stop inventing cute schema names that obscure the semantics. Novel wording is mostly friction.
- Stop collapsing everything into one score. That just recreates the reward-hacking problem in a prettier file format.
- Stop writing config files that are longer than the patch they evaluate. At that point the process is eating itself.

### CONTINUE

- Continue using TOML for hand-edited workflows. It is readable and does not fight back.
- Continue separating weighted criteria from hard rejections. The distinction matters operationally.
- Continue using pairwise preference blocks when two candidate patches are easier to compare than to score independently.

### INVESTIGATE

- Investigate whether criterion names taken directly from benchmark rubrics improve first-pass compliance further.
- Investigate schema-specific drift across models. Some models may parse XML more rigidly and TOML more flexibly.
- Investigate whether preference-mode evaluation beats scalar scoring on refactors where correctness is similar but architecture fit differs.

### AMPLIFY

- Amplify hard constraints for known bad behaviors like silent failure and fixture-specific logic. They eliminate a lot of junk early.
- Amplify explicit critique and revision sections. Those fields turned static specs into actual working loops.
- Amplify simple semantics over fancy syntax. The useful part is the contract, not the DSL cosplay.
## Related Research
- Reward Rubric DSL - A smaller rubric-first version of the same idea
- Constitutional AI: Harmlessness from AI Feedback - Explicit principles driving critique and preference modeling
- G-Eval - Form-filling rubric evaluation with chain-of-thought
- Reflexion - Verbal reinforcement for agents, including coding
- Self-Refine - Iterative feedback and revision at inference time
- RULERS - Executable rubrics and evidence-anchored scoring