March 2026
# Preference TOML
Use a config format the model already knows.
More importantly, use semantics the model has already seen in alignment, evaluation, and critique papers.
## The Basic Idea
TOML is not magical. It is just readable.
The useful part is that keys like `reward`, `criterion`, `preference`, `critique`, `revision`, and `accept_if` map onto concepts that recur in RLHF, RLAIF, rubric judging, and self-refinement work.
I am making one inference here: models likely respond well to these names because they appear throughout pretraining and post-training corpora. The direct ablation on field names is still missing.
## Why This Works Better Than a Random DSL Flavor
Do not invent cute nouns when standard ones exist.
If you call a score bucket `vibes`, the model has to infer what you mean. If you call it `criterion` with a `weight`, it already knows the shape of the task.
That matters because empirical work keeps landing on the same pattern: explicit principles, rubrics, critique steps, and structured scoring improve control.
## TOML Example
```toml
task = "add retry support to api client"
objective = "retry 429 and 503 with bounded exponential backoff"

[reward]
style = "weighted_rubric"

[[criterion]]
name = "correctness"
weight = 5
required = true

[[criterion]]
name = "tests"
weight = 5
required = true

[[criterion]]
name = "edge_cases"
weight = 5

[[criterion]]
name = "error_handling"
weight = 5

[[criterion]]
name = "readability"
weight = 5

[[criterion]]
name = "architecture_fit"
weight = 5

[feedback]
mode = "critique_then_revision"
evidence_required = true
focus_low_scores_only = true

[accept_if]
total_gte = 24
correctness_gte = 4
tests_gte = 4
architecture_fit_gte = 4

[reject_on]
silent_failure = true
hardcoded_fixture_logic = true
new_dependency = true
```
This gives the agent a familiar contract.
It can fill the criteria, explain failures, revise weak spots, and check acceptance conditions without guessing what success means.
## Real Workflow
The loop below is the practical version.
1. load toml spec
2. generate patch against the stated objective
3. score each criterion
4. emit critique for scores below threshold
5. revise only weak criteria
6. rescore
7. accept or reject via explicit gates
The first pass can return a structured result like this.
```toml
[score]
correctness = 4
tests = 3
edge_cases = 2
error_handling = 2
readability = 4
architecture_fit = 5

[critique]
tests = "missing max-retry coverage"
edge_cases = "503 exhaustion path not covered"
error_handling = "final upstream exception is swallowed"
```
That output is already actionable. No prose detective work required.
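Step 7 of the loop, accept or reject via explicit gates, can be sketched in a few lines. The `gates_pass` helper and the `_gte` suffix convention below are illustrative, not part of any standard:

```python
# Scores from the first pass (the [score] block above).
score = {
    "correctness": 4, "tests": 3, "edge_cases": 2,
    "error_handling": 2, "readability": 4, "architecture_fit": 5,
}

# Gates from [accept_if]: "<criterion>_gte" thresholds a single criterion,
# "total_gte" thresholds the sum of all scores.
accept_if = {"total_gte": 24, "correctness_gte": 4,
             "tests_gte": 4, "architecture_fit_gte": 4}

def gates_pass(score: dict, accept_if: dict) -> list[str]:
    """Return the list of failed gates; an empty list means accept."""
    failed = []
    for key, threshold in accept_if.items():
        name = key.removesuffix("_gte")
        value = sum(score.values()) if name == "total" else score[name]
        if value < threshold:
            failed.append(key)
    return failed

print(gates_pass(score, accept_if))  # ['total_gte', 'tests_gte']
```

With the example scores the total is 20, so `total_gte` fails alongside `tests_gte`; the patch is rejected even though correctness and architecture fit both clear their individual gates.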
## Why the Semantics Matter
The names should match concepts from real alignment and evaluation workflows.
| DSL Term | Empirical Concept | Why It Helps |
|---|---|---|
| `criterion` | rubric dimension | breaks one fuzzy target into scoreable axes |
| `weight` | reward shaping | makes tradeoffs explicit |
| `feedback` | verbal reinforcement | turns scores into revision targets |
| `preference` | pairwise comparison | lets the agent rank alternatives when scalar scoring is weak |
| `accept_if` | policy gate | prevents high total scores from masking critical failures |
| `reject_on` | hard constraint | blocks known reward-hacking patterns |
| `evidence_required` | evidence-anchored judging | forces the model to point at code or tests |
## Preference Mode
Sometimes scalar scoring is not enough.
If two patches are both plausible, a pairwise preference block can work better because RLHF systems are often trained from ranked comparisons.
```toml
[preference]
mode = "pairwise"
prompt = "choose the patch that better satisfies the rubric"
require_rationale = true

[choose_if]
correctness = "higher"
architecture_fit = "higher"
readability = "higher"
new_complexity = "lower"
```
This is especially useful when the agent has two candidate implementations and the better one is more obvious in comparison than in isolation.
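One way to implement the `[choose_if]` directions is a simple per-criterion win count. The `prefer` helper and the candidate scores below are hypothetical, a sketch rather than a fixed semantics for the block:

```python
# Per-criterion scores for two candidate patches (illustrative numbers).
patch_a = {"correctness": 5, "architecture_fit": 4, "readability": 4, "new_complexity": 2}
patch_b = {"correctness": 5, "architecture_fit": 3, "readability": 5, "new_complexity": 3}

# Directions from [choose_if]: "higher" prefers the larger value, "lower" the smaller.
choose_if = {"correctness": "higher", "architecture_fit": "higher",
             "readability": "higher", "new_complexity": "lower"}

def prefer(a: dict, b: dict, choose_if: dict) -> str:
    """Count per-criterion wins; a tie on a criterion counts for neither side."""
    wins_a = wins_b = 0
    for name, direction in choose_if.items():
        if a[name] == b[name]:
            continue
        better = max if direction == "higher" else min
        if better(a[name], b[name]) == a[name]:
            wins_a += 1
        else:
            wins_b += 1
    return "a" if wins_a > wins_b else "b" if wins_b > wins_a else "tie"

print(prefer(patch_a, patch_b, choose_if))  # a
```

Here patch A wins on architecture fit and lower complexity, patch B only on readability, so A is preferred 2 wins to 1.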
## Alternative Syntaxes
The semantics matter more than the wrapper.
### JSON

```json
{
  "reward": { "style": "weighted_rubric" },
  "criterion": [
    { "name": "correctness", "weight": 5, "required": true },
    { "name": "tests", "weight": 5, "required": true }
  ],
  "accept_if": { "total_gte": 10, "correctness_gte": 4 }
}
```
### XML

```xml
<reward style="weighted_rubric">
  <criterion name="correctness" weight="5" required="true" />
  <criterion name="tests" weight="5" required="true" />
  <accept_if total_gte="10" correctness_gte="4" />
</reward>
```
### Ruby-ish DSL

```ruby
reward do
  criterion :correctness, weight: 5, required: true
  criterion :tests, weight: 5, required: true
  accept_if total_gte: 10, correctness_gte: 4
end
```
TOML tends to be the least annoying of the bunch for human editing.
## Empirical Mapping
Several primary results line up with this pattern.
- Constitutional AI showed that a list of explicit principles can drive critique, revision, and preference modeling with far fewer human labels.
- G-Eval improved evaluator alignment with chain-of-thought and a form-filling rubric.
- Reflexion showed that verbal feedback can materially improve coding performance.
- Self-Refine showed that iterative self-feedback improves outputs without extra training.
- RULERS argues that executable rubrics, evidence anchoring, and calibrated scales beat loose prompt phrasing for reliable judging.
The common thread is boring and useful: explicit criteria plus explicit feedback loops.
## Practical Notes
Keep the vocabulary plain.
Keep the schema small enough that the agent can hold the whole thing in working memory.
Use hard constraints for known failure modes. Use weighted criteria for everything else.
If you want the model to check all the boxes, make the boxes literal.
## Empirical Findings (Starfish Method)

### START

- Start naming fields after concepts the model has probably already seen: `criterion`, `weight`, `preference`, `critique`, `revision`.
- Start using explicit acceptance gates for critical dimensions like correctness and architecture fit.
- Start requiring evidence in critique output when the workflow feeds into review automation.

### STOP

- Stop inventing cute schema names that obscure the semantics. Novel wording is mostly friction.
- Stop collapsing everything into one score. That just recreates the reward-hacking problem in a prettier file format.
- Stop writing config files that are longer than the patch they evaluate. At that point the process is eating itself.

### CONTINUE

- Continue using TOML for hand-edited workflows. It is readable and does not fight back.
- Continue separating weighted criteria from hard rejections. The distinction matters operationally.
- Continue using pairwise preference blocks when two candidate patches are easier to compare than to score independently.

### INVESTIGATE

- Investigate whether criterion names taken directly from benchmark rubrics improve first-pass compliance further.
- Investigate schema-specific drift across models. Some models may parse XML more rigidly and TOML more flexibly.
- Investigate whether preference-mode evaluation beats scalar scoring on refactors where correctness is similar but architecture fit differs.

### AMPLIFY

- Amplify hard constraints for known bad behaviors like silent failure and fixture-specific logic. They eliminate a lot of junk early.
- Amplify explicit critique and revision sections. Those fields turned static specs into actual working loops.
- Amplify simple semantics over fancy syntax. The useful part is the contract, not the DSL cosplay.
## Related Research
- Reward Rubric DSL - A smaller rubric-first version of the same idea
- Constitutional AI: Harmlessness from AI Feedback - Explicit principles driving critique and preference modeling
- G-Eval - Form-filling rubric evaluation with chain-of-thought
- Reflexion - Verbal reinforcement for agents, including coding
- Self-Refine - Iterative feedback and revision at inference time
- RULERS - Executable rubrics and evidence-anchored scoring