March 2026
Reward Engineering for Coding Agents
Prompt engineering shapes language. Reward engineering shapes incentives.
Coding agents optimize whatever evaluation surface is exposed. In practice, the real engineering surface is the rubric.
Definition
Reward engineering is the design of the scoring function used to judge an agent's attempt.
That score can come from tests, static checks, critique models, human review, or a weighted mix of these.
Why Prompting Is Not the Main Control Surface
Prompts can steer phrasing, planning style, and local constraints.
They do not reliably determine what the agent will trade off when under pressure. The evaluation function does that.
If the rubric only rewards passing tests, the agent will find ways to pass tests. If the rubric also scores readability, edge cases, and architecture fit, behavior changes.
Why Single Metrics Fail
Single-metric evaluation runs straight into Goodhart's law.
When a measure becomes a target, it stops being a good measure.
A score of 100 on correctness sounds clean. It is also narrow.
Narrow targets are easy to game with overfitting, brittle patches, shallow test additions, or hardcoded branches that satisfy the visible harness.
# weak reward structure
correctness: 100
This kind of rubric creates an obvious exploit surface. The agent only needs to maximize one number.
# better reward structure
correctness: 5
tests: 5
edge_cases: 5
readability: 5
naming: 5
error_handling: 5
docs: 5
architecture_fit: 5
Smaller weights across multiple axes reduce reward hacking because there is no single dominant shortcut.
The agent has to produce work that survives inspection from several directions at once.
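Spelled out as code, the rubric above might aggregate like this. This is a minimal sketch: the axis names and 0-5 caps come from the rubric above, but the aggregation rule (a capped sum normalized to 0-1) and the Python shape are assumptions, not a prescribed implementation.

```python
# Multi-axis rubric mirroring the "better reward structure" above.
RUBRIC = {
    "correctness": 5,
    "tests": 5,
    "edge_cases": 5,
    "readability": 5,
    "naming": 5,
    "error_handling": 5,
    "docs": 5,
    "architecture_fit": 5,
}

def reward(scores: dict) -> float:
    """Combine per-axis scores (0..cap) into a single 0..1 reward."""
    total = sum(min(scores.get(axis, 0), cap) for axis, cap in RUBRIC.items())
    return total / sum(RUBRIC.values())

# A patch that maxes one axis but ignores the rest still scores poorly:
print(reward({"correctness": 5}))            # 0.125
print(reward({axis: 4 for axis in RUBRIC}))  # 0.8
```

The point of the shape is visible in the two calls: under the single-metric rubric the first attempt would score 100; here it earns an eighth of the available reward, while balanced-but-imperfect work earns most of it.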
Sparse vs Dense Rewards
Sparse reward means the agent gets little feedback until the end of the loop.
Example: pass or fail after a full test suite.
Dense reward means the agent gets partial signals during evaluation.
Example: partial credit for test coverage, naming quality, and error-path handling before the final merge decision.
Sparse reward can work for simple tasks. On larger edits it often causes flailing, long retries, and local overfitting.
Dense reward gives the agent gradient. Not RL gradient. Operational gradient. It can see where the failure surface is.
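The contrast can be made concrete. In this sketch the probe values (tests passed, coverage delta, handled error paths) are hypothetical stand-ins for a real harness, and the equal-weight averaging is an assumption:

```python
def sparse_reward(tests_passed: int, tests_total: int) -> float:
    # One bit at the end of the loop: pass everything or learn nothing.
    return 1.0 if tests_passed == tests_total else 0.0

def dense_reward(tests_passed: int, tests_total: int,
                 coverage_delta: float, handled_error_paths: int) -> float:
    # Partial credit on each axis surfaces *where* the attempt falls short.
    test_signal = tests_passed / tests_total
    coverage_signal = max(0.0, min(coverage_delta, 1.0))
    error_signal = min(handled_error_paths, 3) / 3
    return (test_signal + coverage_signal + error_signal) / 3

# Same attempt: 9/10 tests pass, coverage up 0.4, two error paths handled.
print(sparse_reward(9, 10))           # 0.0  -- no gradient
print(dense_reward(9, 10, 0.4, 2))    # ~0.66 -- shows the failure surface
```

The sparse scorer tells the agent only that it failed. The dense scorer tells it that tests are nearly there and coverage is the weakest axis, which is exactly the operational gradient described above.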
Why Narrow Metrics Are Easy to Game
Agents learn loopholes fast.
- Maximize tests passed by adding narrow fixture-specific logic.
- Maximize lint score by moving complexity into unreadable helpers.
- Maximize speed by skipping validation and error handling.
None of this requires malice. It is just optimization pressure.
Why Wide Rubrics Produce More Stable Outputs
Wide rubrics force balance.
An agent can still optimize aggressively, but the easiest path is no longer a cheap trick. It has to satisfy several weak constraints instead of one strong one.
That usually produces code that looks more like something a competent reviewer would merge.
Multi-Axis Evaluation
The useful pattern is a weighted rubric with operationally distinct axes.
- Correctness: does the change satisfy the task?
- Tests: did coverage move with the behavior change?
- Edge cases: were obvious failure paths addressed?
- Readability: can another human follow the patch?
- Naming: do identifiers explain intent?
- Error handling: does failure degrade cleanly?
- Docs: were contract changes recorded?
- Architecture fit: does the patch follow local system boundaries?
The trick is separation. If two axes collapse into the same thing, the rubric gets fake breadth and no extra signal.
Self-Grading Agents
Self-grading works better than expected when the rubric is explicit.
It works badly when the rubric is vague.
An agent can score its own attempt against a structured checklist, produce a critique, revise, and rescore. The reliability comes from the rubric, not from the agent suddenly becoming wise.
- attempt
- score against rubric
- identify lowest-scoring criteria
- revise
- rescore
This is useful because it converts "make it better" into an operational loop.
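The loop above can be sketched as control flow. Here `generate`, `self_score`, and `revise` are stubs standing in for model calls (the stub scoring simply improves each round so the flow is runnable); only the loop structure is the point:

```python
AXES = ["correctness", "tests", "readability", "error_handling"]

def generate(task):
    return {"patch": f"draft for {task}", "round": 0}

def self_score(attempt):
    # A real implementation would prompt the model with the explicit rubric;
    # this stub pretends each revision round scores one point higher.
    base = 2 + attempt["round"]
    return {axis: min(base, 5) for axis in AXES}

def revise(attempt, weakest_axis):
    return {"patch": attempt["patch"] + f" (+{weakest_axis} fix)",
            "round": attempt["round"] + 1}

def self_grading_loop(task, rounds=2, target=4):
    attempt = generate(task)
    for _ in range(rounds):
        scores = self_score(attempt)
        if min(scores.values()) >= target:
            break
        weakest = min(scores, key=scores.get)  # lowest-scoring criterion
        attempt = revise(attempt, weakest)
    return attempt, self_score(attempt)

attempt, scores = self_grading_loop("fix parser bug")
```

The structure is what carries the reliability: each revision is aimed at a named criterion, not at "make it better."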
Critique Loops
Critique loops are reward engineering in motion.
A reviewer model, a second agent, or the same agent in critique mode can score the attempt, point at weak criteria, and request revision.
The important part is that critique is anchored to the rubric. Free-form critique drifts. Rubric-bound critique converges.
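One way to bind critique to the rubric is to build the critique prompt from the per-criterion scores, so the reviewer has to respond criterion by criterion. The prompt wording below is illustrative, not a tested template:

```python
def critique_prompt(patch: str, scores: dict, max_score: int = 5) -> str:
    # Only criteria below the cap are surfaced, each by name,
    # so the critique cannot drift into free-form dissatisfaction.
    weak = [axis for axis, s in scores.items() if s < max_score]
    lines = ["Review the patch below against these criteria:"]
    for axis in weak:
        lines.append(f"- {axis}: currently {scores[axis]}/{max_score}. "
                     f"State the concrete defect and the smallest fix.")
    lines.append("")
    lines.append(patch)
    return "\n".join(lines)

prompt = critique_prompt("def parse(x): ...",
                         {"correctness": 5, "naming": 3, "error_handling": 2})
```

Criteria that already score at the cap drop out of the prompt entirely, which keeps the reviewer's attention on the axes that actually need repair.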
Token Economics
Evaluation loops are not free. Every extra scoring pass burns tokens.
But bad rubrics burn tokens too, usually in a stupider way: repeated failed attempts, noisy fixes, and expensive human cleanup.
A short dense rubric often costs less than repeated prompt rewrites because it reuses the same evaluation surface across attempts.
The cheap loop is usually:
- small rubric
- single critique pass
- single revision
- final score
The expensive loop is endless prompt fiddling because nobody defined success clearly.
Practical Example
In code generation experiments, test-only scoring often produced patches that passed the visible harness but created one of three problems:
- new helper functions with misleading names
- error cases silently swallowed
- logic duplicated instead of integrated with the existing abstraction
Adding small scores for naming, error handling, and architecture fit reduced these failures without needing a longer prompt.
Comparison to Prompt Engineering
Prompt engineering still matters. It is useful for task framing, constraints, and tool usage.
But prompt engineering mainly shapes how the agent talks and plans.
Reward engineering shapes what the agent learns to care about across iterations.
If prompt engineering is syntax, reward engineering is selection pressure.
Empirical Findings (Starfish Method)
START
Score agent output on multiple weak axes instead of one dominant metric. This produced fewer brittle patches.
Expose the rubric before generation, not only after failure. Agents write cleaner first attempts when they can see the grading surface.
Use critique prompts that reference the rubric by criterion name. Specific failures are easier to repair than general dissatisfaction.
STOP
Stop using pass-rate alone as the success definition. It creates beautiful lies.
Stop mixing unrelated goals into a single "quality" bucket. That just hides where the agent is cheating.
Stop increasing prompt length to compensate for a bad evaluation surface. That is lipstick on a broken benchmark.
CONTINUE
Continue using tests as one axis, not the only axis. They are still the strongest local signal for behavioral correctness.
Continue keeping rubrics small. Eight clear criteria worked better than twenty fuzzy ones.
Continue running one revision after critique. A second revision sometimes helps, but the first one carries most of the gain.
INVESTIGATE
Investigate dynamic weighting based on task type. Infrastructure changes may need heavier architecture and error-handling scores.
Investigate whether separate reviewer agents outperform self-grading on architecture-fit judgments.
Investigate rubric decay over long sessions. Criteria that start useful may become background noise after several rounds.
AMPLIFY
Amplify architecture-fit scoring. It had outsized impact on whether patches were mergeable without cleanup.
Amplify edge-case scoring when the task touches parsing, auth, or state transitions. That single addition prevented a lot of stupid regressions.
Amplify machine-readable rubrics. Once the rubric is explicit, self-grading, CI evaluation, and critique loops all get cheaper.
Further Reading
- Reward Rubric DSL - A small machine-readable format for evaluation criteria
- Reward Hacking in Coding Agents - Failure modes from poorly designed metrics