StayFresh

Static archive of workflow research and patterns

March 2026

Reward Rubric DSL

Agents behave more predictably when success criteria are explicit and machine-readable.

A small rubric DSL is often enough. No fancy RL stack required.

The DSL

The goal is to define evaluation criteria in a format that both agents and automation can parse.

reward_rubric {
  criterion correctness weight=5
  criterion tests weight=5
  criterion edge_cases weight=5
  criterion readability weight=5
  criterion naming weight=5
  criterion architecture weight=5
  criterion docs weight=5
}

This is intentionally small.

The useful part is not syntax purity. It is that the evaluation surface becomes visible, stable, and reusable.
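The syntax is loose enough that a few lines of Python can parse it. A minimal sketch (a regex-based parser of my own, not part of any published tooling):

```python
import re

def parse_rubric(text):
    """Extract {criterion: weight} pairs from a reward_rubric block."""
    pairs = re.findall(r"criterion\s+(\w+)\s+weight=(\d+)", text)
    return {name: int(weight) for name, weight in pairs}

dsl = """
reward_rubric {
  criterion correctness weight=5
  criterion tests weight=5
  criterion docs weight=5
}
"""
print(parse_rubric(dsl))  # {'correctness': 5, 'tests': 5, 'docs': 5}
```

Anything that can produce this dictionary can drive the rest of the loop.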

Why It Helps

Plain-language prompts drift. Rubrics do not.

If the agent can read the criteria before writing code, it can aim for the right target on the first attempt.

If the evaluator can read the same criteria after the attempt, scoring becomes consistent across runs.

Typical Loop

task
→ agent attempt
→ rubric evaluation
→ critique
→ revision
→ rescore

This loop is simple enough to run in a shell script, CI job, or agent harness.
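The loop above can be sketched as a small driver. Everything here is a hypothetical harness: `generate`, `score_attempt`, and `critique` are caller-supplied callables standing in for the agent, the evaluator, and the critique step.

```python
def run_loop(task, rubric, generate, score_attempt, critique, max_rounds=3):
    """task -> attempt -> rubric evaluation -> critique -> revision -> rescore."""
    attempt = generate(task, rubric, feedback=None)
    scores = score_attempt(attempt, rubric)
    for _ in range(max_rounds):
        if all(s >= 4 for s in scores.values()):
            break  # every criterion at an acceptable level
        feedback = critique(scores, rubric)
        attempt = generate(task, rubric, feedback=feedback)
        scores = score_attempt(attempt, rubric)  # rescore
    return attempt, scores

# Demo with stub callables: each revision bumps the weakest criterion by one.
state = {"tests": 2, "correctness": 4}

def generate(task, rubric, feedback=None):
    if feedback:
        state[feedback] += 1
    return dict(state)

def score_attempt(attempt, rubric):
    return attempt

def critique(scores, rubric):
    return min(scores, key=scores.get)

attempt, scores = run_loop("fix retry logic", {"tests": 5, "correctness": 5},
                           generate, score_attempt, critique)
print(scores)  # {'tests': 4, 'correctness': 4}
```

The same skeleton works whether the callables wrap an LLM API, a test runner, or a shell command.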

Use Cases

Autonomous Coding Agents

The rubric gives the agent a fixed target instead of a vague quality request.

CI Evaluation Pipelines

The same rubric can drive test gates, lint thresholds, reviewer prompts, and release checks.

Agent Self-Grading

An agent can score its own patch criterion by criterion, explain the weak points, then revise.

Review Automation

Review bots are less noisy when they have named criteria instead of free-form opinions.

Multi-Agent Critique Loops

One agent writes. Another scores. A third summarizes deltas between revisions. The rubric keeps the whole thing from turning into abstract nonsense.

Example Evaluation Pass

reward_rubric {
  criterion correctness weight=5
  criterion tests weight=5
  criterion error_handling weight=5
  criterion architecture weight=5
}

score {
  correctness 4 "behavior matches task, one edge path missing"
  tests 5 "new tests cover primary branch and regression case"
  error_handling 2 "invalid input path still throws raw exception"
  architecture 4 "fits existing module boundaries"
}

The critique step now has something concrete to act on. It can target the error_handling gap instead of waving its hands about code quality.
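With weights attached, the score block collapses to a weighted total. A sketch using the numbers above (the helper name is mine, not part of any spec):

```python
rubric = {"correctness": 5, "tests": 5, "error_handling": 5, "architecture": 5}
scores = {"correctness": 4, "tests": 5, "error_handling": 2, "architecture": 4}

def weighted_total(rubric, scores):
    """Sum each criterion's score multiplied by its weight."""
    return sum(scores[c] * w for c, w in rubric.items())

print(weighted_total(rubric, scores))  # 75 out of a possible 100
```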

Real Workflow Example

A common use case is a small bug fix with one visible failing test and several likely hidden edge cases.

task {
  title "fix retry logic in api client"
  objective "retry 429 and 503 responses with backoff"
  constraints "do not add dependencies"
}
reward_rubric {
  criterion correctness weight=5
  criterion tests weight=5
  criterion edge_cases weight=5
  criterion readability weight=5
  criterion error_handling weight=5
  criterion architecture weight=5
}

The first agent attempt usually fixes the happy path and adds one regression test.

score {
  correctness 4 "429 retry works, 503 path incomplete"
  tests 3 "covers retry once, no max-retry test"
  edge_cases 2 "timeout and retry exhaustion not handled"
  readability 4 "patch is easy to follow"
  error_handling 2 "last failure reason is discarded"
  architecture 5 "fits existing client abstraction"
}

That score gives the critique loop something specific to do.
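Deriving those targets is mechanical: take every criterion below a threshold, weakest first. A sketch with the scores above (`revision_targets` is an illustrative helper, not a fixed API):

```python
scores = {"correctness": 4, "tests": 3, "edge_cases": 2,
          "readability": 4, "error_handling": 2, "architecture": 5}

def revision_targets(scores, threshold=4):
    """Criteria scoring below the threshold, weakest first."""
    weak = [c for c in scores if scores[c] < threshold]
    return sorted(weak, key=scores.get)

print(revision_targets(scores))  # ['edge_cases', 'error_handling', 'tests']
```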

critique {
  revise edge_cases "handle retry exhaustion and timeout path"
  revise tests "add max-retry and 503 coverage"
  revise error_handling "preserve final upstream failure"
}

The second attempt is narrower. It is not "improve the patch." It is "raise the weak criteria."

rescore {
  correctness 5
  tests 5
  edge_cases 4
  readability 4
  error_handling 4
  architecture 5
}

At that point the pipeline can accept the patch with a simple rule.

accept_if {
  total_gte 27
  minimum correctness 4
  minimum tests 4
  minimum architecture 4
}
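The accept_if block translates to a one-function gate. A sketch with the rescore numbers above (`accept` is an illustrative name, and the total here is the unweighted sum, matching the example):

```python
def accept(scores, total_gte, minimums):
    """Apply an accept_if rule: overall total plus per-criterion floors."""
    if sum(scores.values()) < total_gte:
        return False
    return all(scores.get(c, 0) >= floor for c, floor in minimums.items())

rescore = {"correctness": 5, "tests": 5, "edge_cases": 4,
           "readability": 4, "error_handling": 4, "architecture": 5}
print(accept(rescore, total_gte=27,
             minimums={"correctness": 4, "tests": 4, "architecture": 4}))  # True
```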

This is the practical advantage of the DSL. The loop is inspectable.

You can see why a patch passed, why it failed, and which criteria drove the next revision.

When the Agent Sees the Reward

The timing matters.

Direct coding-agent studies that compare "rubric shown up front" versus "revealed mid-task" versus "revealed only at the end" are still thin.

But adjacent evidence points in one direction: earlier reward visibility usually produces less backtracking and more stable behavior.

Reward Up Front

If the agent sees the rubric before it starts, it can plan around the real target instead of reverse-engineering it from failures.

This matches OpenAI's deliberative alignment results: models trained to read an explicit specification before acting referenced those principles in their reasoning, and out-of-distribution scheming rates dropped sharply under that setup.

For coding workflows, the practical implication is simple. Show the rubric before generation if you want first-pass behavior to track the evaluation surface.

Reward Mid-Task

Mid-task rubric feedback is usually the next best option.

It lets the agent redirect before the whole attempt hardens into a bad patch.

This pattern shows up in anticipatory and proactive reflection work. DEVIL'S ADVOCATE reports better efficiency by reflecting before each action instead of after a full trial. PASR reports that proactive refinement during generation improved accuracy while reducing token use by 41.6 percent on Qwen3-8B.

The inference for coding agents is straightforward: if a criterion becomes visible halfway through the task, the agent can still salvage the run, but it will usually pay in backtracking.

Reward Only at the End

End-only feedback still helps. It is just more expensive.

Self-Refine improved performance by about 20 percent absolute on average over one-step generation, and Reflexion improved coding pass@1 on HumanEval from 80 percent to 91 percent by learning from trial feedback.

But this is reactive. The agent has already spent tokens on a full attempt before learning what mattered.

A Practical Take

If you want the cleanest first draft, show the rubric up front.

If you want the cheapest repair path, surface low-scoring criteria as soon as they are detectable.

If you wait until the end, expect more revision loops and more opportunities for the agent to optimize for looking correct instead of being correct.

There is also a catch: end-only self-grading can amplify self-bias. Separate work on self-refinement found that models tend to favor their own outputs, and that external feedback with accurate assessment reduces that bias.

Why This Is Usually Cheaper Than Full RL

Most coding-agent teams do not need full reinforcement learning infrastructure.

They need repeatable evaluation.

A rubric DSL is cheaper because it uses components most teams already have: prompts, scripts, tests, and CI.

It also fails more transparently. When the rubric is wrong, you can read the damn thing and fix it.

Design Notes

Keep criteria operational. "Good code" is useless. "Error handling" is testable.

Keep weights small and roughly balanced. Large single weights invite reward hacking.

Keep the rubric short enough to survive repeated use. If scoring it is annoying, nobody will keep using it.

Connection to Reward Engineering

This DSL is just a serialization of reward engineering principles.

The research point is simple: prompts shape language, but rubrics shape incentives.

A machine-readable rubric makes those incentives explicit, which is why it works well for self-grading, critique loops, and CI evaluation.

Minimal Implementation Pattern

1. define rubric
2. generate attempt
3. score each criterion
4. revise lowest scores
5. accept only if total and critical criteria pass

This is enough to get most of the value.

Related Research