February 2026
AGENTS.md Effectiveness: What the Research Says
Reference: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (Gloaguen et al., ETH Zurich / LogicStar.ai, February 2026)
The Surprising Finding
Context files like AGENTS.md and CLAUDE.md are widely recommended. Over 60,000 repositories include them. But rigorous evaluation reveals:
| Context Type | Success Rate Change | Cost Change |
|---|---|---|
| None | Baseline | Baseline |
| LLM-generated | -3% | +20% |
| Developer-written | +4% | +19% |
LLM-generated context files lower the success rate and raise cost. Developer-written files help slightly, but at nearly the same cost increase.
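To compare the two rows on one scale, the table can be reduced to a relative cost per solved task. The sketch below is only illustrative: it assumes both columns are relative changes against the no-context baseline (if the success figures are percentage points instead, the formula would need the baseline success rate), and the helper name `relative_cost_per_success` is hypothetical.

```python
# Relative cost per solved task, derived from the table above.
# ASSUMPTION: both changes are relative to the no-context baseline.

def relative_cost_per_success(success_change: float, cost_change: float) -> float:
    """Return cost per solved task as a multiple of the baseline."""
    return (1.0 + cost_change) / (1.0 + success_change)

# (success rate change, cost change) taken from the table.
rows = {
    "LLM-generated": (-0.03, 0.20),
    "Developer-written": (0.04, 0.19),
}

for name, (success_change, cost_change) in rows.items():
    ratio = relative_cost_per_success(success_change, cost_change)
    print(f"{name}: {ratio:.2f}x baseline cost per solved task")
# LLM-generated: 1.24x baseline cost per solved task
# Developer-written: 1.14x baseline cost per solved task
```

Under that assumption, both kinds of context file make each solved task more expensive than running with no context file at all.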
Why Context Files Underperform
1. Redundant Documentation
When researchers removed all existing documentation (READMEs, docs folders), LLM-generated context files suddenly became useful (+2.7% improvement). This suggests:
Context files are mostly redundant with what's already in the repository.
2. No Effective Overview
One recommended use of context files is providing a codebase overview. But agents with context files don't find relevant files faster—they often take more steps because they:
- Issue multiple commands to find the context file
- Read it multiple times despite it being in context
- Explore more broadly without better targeting
3. Unnecessary Requirements Make Tasks Harder
Context files add instructions. Agents follow them. But additional requirements—even well-intentioned ones—increase cognitive load and reasoning tokens (14-22% more reasoning with context files).
More instructions do not equal better outcomes.
What Context Files Do Well
Agents Follow Instructions
If a tool is mentioned in the context file, agents use it:
- `uv`: 1.6 uses/instance when mentioned vs. <0.01 when not
- Repository-specific tools: 2.5 uses/instance when mentioned vs. <0.05 when not
This isn't an instruction-following problem. Agents comply—they're just not being helped by what they're told.
More Exploration, More Testing
Context files increase:
- Test execution frequency
- File traversal (grep, read, glob)
- Repository-specific tool usage
This is the "thoroughness" that drives up costs without improving outcomes.
Practical Recommendations
When to Skip AGENTS.md
- Well-documented repositories with README, docs, examples
- Popular repositories with strong conventions (models already know them)
- Simple tasks that don't require context
When AGENTS.md Helps
- Niche repositories with no documentation
- Custom tooling that differs from standard conventions
- Team-specific patterns that aren't discoverable
What to Include (If You Write One)
Based on the research, context files should contain only minimal requirements:
# Build & Test
- Run tests: `pytest tests/`
- Lint: `ruff check .`
# Conventions
- Use `uv` for dependency management
- Follow existing module patterns
Not:
- Long codebase overviews
- Redundant information from READMEs
- Style guides the model already knows
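One way to keep a context file minimal is to check it mechanically before committing it. The following is a rough sketch, not a method from the paper: the function `audit_context_file`, the 150-word budget, and the "backticks mean a concrete command" heuristic are all assumptions chosen for illustration.

```python
def audit_context_file(text: str, max_words: int = 150) -> list[str]:
    """Return warnings for a context file that looks longer than 'minimal'.

    ASSUMPTION: the word budget and the backtick heuristic are illustrative,
    not thresholds taken from the research.
    """
    warnings = []
    word_count = len(text.split())
    if word_count > max_words:
        warnings.append(f"{word_count} words exceeds budget of {max_words}")
    # Lines containing backticks are treated as concrete commands or tool names.
    lines = [line for line in text.splitlines() if line.strip()]
    command_lines = [line for line in lines if "`" in line]
    if lines and not command_lines:
        warnings.append("no concrete commands or tool names found")
    return warnings


example = """# Build & Test
- Run tests: `pytest tests/`
- Lint: `ruff check .`
# Conventions
- Use `uv` for dependency management
- Follow existing module patterns
"""
print(audit_context_file(example))  # []
```

The minimal example from this section passes with no warnings, while a long prose overview with no commands would trigger both checks.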
The "Surprising Behavior" Pattern
When agents encounter something unexpected, that's signal—not noise.
When agents fail, fix the code, not the prompt. Surprising behavior reveals architectural friction.
Instead of adding more instructions to AGENTS.md, consider:
- Is the codebase structure confusing? Rename, reorganize, add comments
- Are conventions unclear? Add type hints, improve names, add docstrings
- Is the task underspecified? Improve the issue description, not the context file
Agent Psychology: The Step-3 Trick
Counterintuitive but effective: if an agent struggles with step 2, tell it to do step 3. The agent often completes step 2 in the process.
This works because:
- Agents reason forward from instructions
- Changing the target reframes the problem
- The "lie" is the feature: controlled misdirection that produces better outcomes
Token Economics
Context files consume tokens in every request. For a 600-word AGENTS.md:
- ~800 tokens of context per request
- Multiplied by every step in every task
- Compounds quickly in long-running sessions
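The multiplication above can be made concrete with a small calculation. This is a back-of-the-envelope sketch: the ~800-token figure comes from the text, while the step and task counts (30 and 50) and the function name `context_overhead_tokens` are hypothetical values picked only to show the scale.

```python
def context_overhead_tokens(file_tokens: int, steps_per_task: int, tasks: int) -> int:
    """Total tokens spent re-sending the context file, one copy per step."""
    return file_tokens * steps_per_task * tasks


# ~800 tokens is the figure from the text.
# ASSUMPTION: 30 steps per task and 50 tasks are hypothetical workload numbers.
overhead = context_overhead_tokens(file_tokens=800, steps_per_task=30, tasks=50)
print(overhead)  # 1200000
```

With these assumed numbers, a single 600-word file accounts for 1.2 million input tokens before any task-specific code is read.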
Question: Is that token budget better spent on task-specific context (the actual code being modified) or on generic repository context?
The research suggests: task-specific context wins.
Further Reading
- AgentBench Harness - Benchmark for evaluating context files