AGENTS.md and Skills: useful, but narrower than the hype

Two recent papers challenge a popular assumption in agent tooling: “just add more guidance files and things get better.”

TL;DR

Both papers agree on one core point: extra guidance can help, but only under specific conditions.

  • AGENTS.md-style context files are often less useful than expected: developer-written files give only modest gains on average, LLM-generated files can regress performance, and both increase cost.
  • Skills can produce substantial gains when carefully curated, but effects vary heavily by domain and task, regressions are real, and self-generated skills show little to no average benefit.
  • The practical takeaway is not “never use AGENTS.md or skills.” It is: use minimal, high-signal, task-relevant guidance; avoid broad, bloated, or auto-generated boilerplate.

What the AGENTS.md evidence says

From the AGENTS.md paper:

  • Developer-provided context files: roughly +4% average improvement vs no context.
  • LLM-generated context files: roughly -3% average vs no context in the paper’s headline aggregate; in the main-results breakdown this is -0.5% (SWE-bench Lite) and -2% (AGENTbench).
  • Cost impact: >20% increase in inference cost on average.
  • Mechanism: agents generally follow these instructions, but spend more steps, more tool calls, and more reasoning tokens doing so.

This is a subtle but important result. The issue is usually not that agents ignore AGENTS.md. The issue is that they obey it in ways that increase overhead and can dilute focus on the shortest path to solving the task.

The paper also reports that context files become more useful in documentation-sparse setups. So AGENTS.md can be a compensating control for poor docs, but that is different from “it reliably improves performance in already-documented repos.”

What the SkillsBench evidence says

From SkillsBench:

  • Curated skills improved mean pass rate by +16.2pp overall.
  • Strong domain variance: gains from +4.5pp (Software Engineering) up to +51.9pp (Healthcare).
  • Regressions exist: 16/84 tasks got worse with skills.
  • Self-generated skills: no average benefit (slightly negative in aggregate).
  • Design finding: 2-3 focused modules tend to outperform larger, comprehensive skill bundles.

So yes, skills can work well. But they are not universally positive, and they are highly quality- and domain-dependent.

Why these methods underdeliver in practice

The two papers point to a shared pattern:

  1. Instruction burden is real. More directives can increase search, testing, and reasoning work even when they are followed correctly.

  2. Redundancy is expensive. If AGENTS.md/skills mostly repeat existing docs or obvious conventions, they consume context budget and attention without enough incremental signal (a rough cost sketch follows this list).

  3. Generality vs. specificity is a real tension. Broad guidance often lacks task-specific utility, while highly specific guidance risks leakage, brittleness, or overfitting.

  4. Authoring quality dominates. Curated, reviewed procedural knowledge can help a lot; auto-generated guidance is much less reliable.

  5. Benchmark averages hide tails. Positive means can coexist with many negative tasks. Teams feel those regressions in production.
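
To make the redundancy point concrete, here is a back-of-envelope sketch of how a static context file compounds over an agent trajectory. The chars-per-token ratio and the turn count are illustrative assumptions, not figures from either paper:

```python
# Rough estimate of the extra prompt tokens a static context file adds to one
# task when it is re-sent on every agent turn. The ~4 chars/token ratio and the
# turn count below are assumptions for illustration, not measured values.

def estimate_context_overhead(agents_md_chars: int,
                              turns_per_task: int,
                              chars_per_token: float = 4.0) -> int:
    tokens_per_turn = agents_md_chars / chars_per_token
    return int(tokens_per_turn * turns_per_task)

# Example: a 6 KB AGENTS.md replayed over 30 turns adds roughly 45,000 prompt
# tokens to a single task, before any extra tool calls or reasoning it induces.
print(estimate_context_overhead(agents_md_chars=6_000, turns_per_task=30))
```

A preamble that is replayed on every turn scales linearly with trajectory length, which helps explain why even modest files show up in the >20% average cost increase reported above.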

Caveats and where they still help

A fair reading is not “AGENTS.md and skills are useless.” It is:

  • They are most useful when repositories are under-documented or workflows are nonstandard.
  • They help most when they encode procedural know-how that is hard to infer from code alone.
  • They can materially lift weaker models in some settings.
  • They should be treated as operational artifacts that require maintenance, pruning, and measurement.

Also, both papers have scope limits (benchmark design, task distributions, and environment assumptions), so absolute numbers should be interpreted as directional guidance rather than universal constants.

Practical playbook for maintainers

If you maintain a repo and still want these mechanisms (which is reasonable), use a stricter playbook:

  • Keep AGENTS.md short and high-signal: build/test commands, repo-specific tooling, hard constraints (a minimal example is sketched after this list).
  • Remove “nice-to-have” narrative sections that do not change decisions.
  • Prefer 2-3 focused skill modules over one mega-skill document.
  • Do not trust self-generated skills by default; require human review.
  • Track both success and cost (tokens/time/tool calls), not pass rate alone.
  • Run periodic ablations: with vs without AGENTS.md/skills on representative tasks (a measurement harness is sketched at the end of this section).
  • Prune aggressively: if guidance does not move outcomes, delete it.
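
As an illustration of the first bullet, a minimal AGENTS.md might contain little beyond the commands and constraints an agent cannot reliably infer from the code itself. The section names and commands below are placeholders for a hypothetical repo, not a recommended standard:

```
# AGENTS.md (keep this short)

## Build & test
- Install dependencies: `pnpm install`
- Run the test suite: `pnpm test`

## Repo-specific tooling
- Files under `src/gen/` are generated; do not edit them by hand, run `pnpm codegen` instead.

## Hard constraints
- Do not bump dependency versions in a feature PR.
- Every new module needs a test file next to it.
```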

A good default in 2026 is minimal guidance first, then add only what shows measurable lift.
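
Measuring that lift does not need heavy infrastructure. The sketch below is a minimal ablation harness along the lines of the playbook; run_agent is a hypothetical adapter into whatever agent framework you already use, and RunResult mirrors the cost signals listed above (tokens, tool calls, wall-clock time) alongside pass/fail.

```python
# Minimal "with vs without AGENTS.md" ablation sketch. `run_agent` is a
# hypothetical adapter around your agent framework; it is assumed to report
# pass/fail plus basic cost counters for one task.

from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    passed: bool
    total_tokens: int
    tool_calls: int
    wall_seconds: float


def run_agent(task_id: str, use_agents_md: bool) -> RunResult:
    """Replace with a real call into your agent harness."""
    raise NotImplementedError


def ablate(task_ids: list[str]) -> None:
    # Run every representative task twice: baseline vs. with the context file.
    for use_md in (False, True):
        results = [run_agent(t, use_agents_md=use_md) for t in task_ids]
        label = "with AGENTS.md" if use_md else "baseline"
        pass_rate = mean(int(r.passed) for r in results)
        print(
            f"{label:>14}: "
            f"pass={pass_rate:.0%}  "
            f"tokens={mean(r.total_tokens for r in results):.0f}  "
            f"tool_calls={mean(r.tool_calls for r in results):.1f}  "
            f"time={mean(r.wall_seconds for r in results):.1f}s"
        )

# Example: ablate(["task-001", "task-002"])  # representative tasks from your backlog
```

Even a handful of representative tasks, re-run after each AGENTS.md or skill change, is enough to surface the regressions and cost creep described earlier.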

2026-02-17