
Agent Skills — testing and iteration

A working Skill has to clear three hurdles in order: the agent must trigger it on the right turns, the agent must execute it correctly when triggered, and the cost / latency of triggering and executing it must stay within a budget the user is willing to pay. This page covers how to test all three, the signals each one gives off when it fails, and the iteration moves that fix the most common failure modes.

The single biggest mistake in Skill testing is treating it like unit testing of scripts/. A Skill's failure modes are behavioural — the model loads the wrong one, doesn't load any, follows the body but skips a step, calls a tool the Skill didn't intend — and you cannot find any of that with pytest. The testing loop below is built around real conversations and falsifiable behavioural assertions.

1. Three testing modes

There are three orthogonal ways to test a Skill. Use all three; they cover different failure modes.

1.1 Manual testing

Run the Skill in a real agent session, in the host you are targeting (Claude Code, Copilot in VS Code, Codex CLI), and observe. Manual testing is the only way to catch failures that depend on the host's UI, tool surface, or system prompt. It is also where you build the intuition for what a "good" trigger looks like. The downside: it is slow, irreproducible, and gets stale fast.

Minimum manual test pass: walk through the positive trigger examples from the planning phase and confirm the Skill loads on every one; walk through the negative trigger examples and confirm it does not load on any; walk through the execution-result examples and confirm the output matches the expected shape.

1.2 Scripted testing

Write a short script (often Python or a Bash one-liner) that drives the agent's API or CLI with each trigger phrase from your planning sheet and records which Skills loaded, which tools were called, and what the final assistant message looked like. Scripted tests are reproducible, fast enough to run on every change, and most agents expose enough of their internal state (loaded Skills, tool calls, token counts) to make assertions worth writing.

Each of the three hosts surfaces this differently:

  • Claude Agent SDK — ClaudeAgentOptions(skills=[...]) plus a debug flag will log loaded Skills per turn. You can assert on the loaded Skill set.
  • VS Code Copilot — the Chat panel emits Skill load events in the developer log; you can scrape them from the VS Code remote SSH session.
  • Codex CLI — codex --debug prints the level-1 Skills list and which one(s) triggered.

A minimum scripted suite has one test case per planning-phase positive/negative/execution example, asserts a single fact, and runs in under a minute. Skill iteration without one of these suites is guesswork.
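
A minimal sketch of such a suite, in plain Python with asserts rather than a unit-test framework. run_turn() is hypothetical host glue (the three-line logger wrapper from §3), and the phrases and Skill name are illustrative:

```python
# test_triggers.py -- one scripted case per planning-sheet example.
# run_turn() is hypothetical host glue: send one user phrase to the agent
# (ClaudeAgentOptions plus the debug flag, the Copilot developer log, or
# `codex --debug`) and return the set of Skill names loaded on that turn.
from host_glue import run_turn  # assumption: your own wrapper

CASES = [
    ("extract this PDF",        "pdf-processing", True),   # positive
    ("convert this scan",       "pdf-processing", True),   # positive
    ("open this Word document", "pdf-processing", False),  # negative
]

failures = 0
for phrase, skill, should_load in CASES:
    loaded = run_turn(phrase)
    ok = (skill in loaded) == should_load
    failures += not ok
    print(f"{'PASS' if ok else 'FAIL'}  {phrase!r} -> {sorted(loaded)}")

raise SystemExit(failures)  # non-zero exit code if any case failed
```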

1.3 Programmatic / agent-as-evaluator testing

Use a second agent (or the same agent in a separate session) to grade the output of the Skill under test. This is the "model-graded eval" pattern. Useful for judgement-driven outputs (release notes in your voice, triage decisions with subjective fields) where there is no machine-checkable ground truth. The grader gets a rubric and the Skill's output; you collect grades over a fixed test set and watch the score across iterations.

Programmatic testing is not a replacement for the first two — it is good at scoring quality, weak at catching triggering and tool-use bugs. Use it on top of scripted tests, not instead of them.
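
A minimal sketch of the grader side, assuming a hypothetical complete() wrapper around whatever model endpoint does the grading; the rubric, the 1–5 scale, and the file layout are illustrative:

```python
# grade_outputs.py -- model-graded eval: a second model scores the Skill's
# output against a fixed rubric. complete() is a hypothetical wrapper
# around the grading model's API.
from grader_glue import complete  # assumption: your own wrapper

RUBRIC = """Score these release notes from 1 (unusable) to 5 (ship as-is):
- matches the team's voice
- mentions every user-facing change
- leaks no internal ticket IDs
Reply with the number only."""

def grade(skill_output: str) -> int:
    return int(complete(f"{RUBRIC}\n\n---\n{skill_output}").strip())

# Grade a fixed test set; watch the mean move across Skill iterations.
outputs = open("outputs.txt").read().split("\n===\n")
scores = [grade(o) for o in outputs]
print(f"mean {sum(scores) / len(scores):.2f} over {len(scores)} outputs")
```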

2. Three things to test

Every Skill needs to pass three kinds of assertion.

2.1 Triggering

Does the Skill load on the right user turns and stay quiet on the wrong ones?

Concrete asserts to write:

  • For each positive trigger phrase, the Skill is in the loaded Skills list for that turn.
  • For each negative trigger phrase, the Skill is not in the loaded Skills list.
  • For each ambiguous neighbour Skill, the load decision goes the right way. For example, if you also have a pdf-processing Skill and an image-processing Skill, and scans are PDFs in your team's workflow, "convert this scan" should load pdf-processing and leave image-processing unloaded (or the reverse, if you intended the other Skill to own scans), as in the sketch below.
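
The neighbour case, using the same hypothetical run_turn() glue as in §1.2 (Skill names illustrative):

```python
# Ambiguous-neighbour check: exactly one candidate Skill owns the phrase.
loaded = run_turn("convert this scan")
assert "pdf-processing" in loaded        # scans are PDFs in this workflow
assert "image-processing" not in loaded  # the neighbour stays quiet
```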

Trigger failures are almost always a description problem. Fixes are covered in §4.1.

2.2 Functional behaviour

Once triggered, does the Skill produce the right outputs and side effects?

Concrete asserts to write:

  • The Skill calls the scripts it should call, with the arguments it should pass.
  • The Skill opens the references it should open and does not over-open ones it shouldn't.
  • The Skill respects the decision rules in the body — when input is missing, when a tool fails, when an edge case is hit.
  • The output shape matches the planning-phase execution-result examples.

Functional failures are usually a body problem (missing step, ambiguous rule, missing example) or a script problem (the script does the wrong thing). Fixes are covered in §4.2.
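
A sketch of the functional asserts, assuming run_turn() can also return a full transcript. The transcript fields, the script path, and the expected output shape are all illustrative:

```python
# Functional check for one planning-sheet execution example.
# transcript is hypothetical: .tool_calls is a list of (tool, args)
# pairs, .final is the assistant's last message.
transcript = run_turn("extract the tables from report.pdf", full=True)

tools = [tool for tool, _ in transcript.tool_calls]
assert "Bash" in tools                               # a script actually ran
assert any("extract_tables.py" in str(a) for _, a in transcript.tool_calls)
assert transcript.final.lstrip().startswith("|")     # shape: Markdown table
```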

2.3 Performance / cost

Does triggering and executing the Skill stay within a budget?

Concrete metrics to record:

  • Tokens added to the system prompt at session start by the Skill's level-1 metadata.
  • Tokens added to the context when the Skill triggers (body weight).
  • Tokens added by reference files the Skill opens during execution.
  • Wall-clock time for the level-3 scripts.
  • Tool-call count per Skill execution.

Codex calls out the level-1 budget explicitly (~2% of the model's context window or 8,000 chars). Anthropic and GitHub don't, but the same pressure exists. Performance failures are usually fixed by pushing detail from level 2 to level 3 (move prose into references the model only opens when needed) or by splitting a fat Skill into two narrower Skills with distinct descriptions.
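
A sketch of the level-1 check, using the ~8,000-character figure above. It assumes the .agents/skills/ layout from §8 and parses frontmatter naively (name and description lines only):

```python
# level1_budget.py -- sum level-1 metadata weight across all Skills and
# report it against the ~8,000-char budget. Frontmatter parsing is
# deliberately naive: name and description lines only.
from pathlib import Path

BUDGET_CHARS = 8_000

total = 0
for skill_md in Path(".agents/skills").glob("*/SKILL.md"):
    frontmatter = skill_md.read_text().split("---")[1]
    total += sum(len(line) for line in frontmatter.splitlines()
                 if line.startswith(("name:", "description:")))

print(f"level-1 total: {total} chars ({total / BUDGET_CHARS:.0%} of budget)")
```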

3. Helper Skills and tooling

Anthropic ships a meta-Skill called skill-creator in its first-party Skills bundle that interactively walks an author through use-case selection, description-writing, body-structure choices, and a packaging step. It is the easiest way to get a first-draft Skill out the door, and the conventions it produces are spec-aligned. The same conventions are good practice on Copilot and Codex even though those vendors don't ship an equivalent helper today.

For testing specifically, three things are worth assembling once and reusing forever:

  • A frozen positive/negative trigger fixture — a Markdown file with two columns: user phrase, expected outcome (see the example after this list). Re-run it on every Skill change.
  • A loaded-Skills logger — three lines of host glue that prints which Skills were loaded on each turn. The exact lines differ across Claude, Copilot, and Codex but the role is identical: turn the black-box trigger decision into a white-box log line.
  • A scripted runner — drives the agent against the fixture, runs the logger, asserts on the result. A 50-line Python script is enough. Keep it next to the Skill in references/test/ or in a parallel test repo.
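
The fixture might look like this (filename, phrases, and Skill name all illustrative):

```markdown
<!-- fixtures/triggers.md -- frozen positive/negative trigger fixture -->
| User phrase               | Expected       |
| ------------------------- | -------------- |
| extract this PDF          | pdf-processing |
| fill in this scanned form | pdf-processing |
| open this Word document   | (no load)      |
| resize this image         | (no load)      |
```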

4. Iteration signals — reading what the test results say

Every Skill failure mode emits a characteristic signal. The table below is a diagnostic chart: failure signal → likely cause → first-pass fix.

  • Skill never triggers on phrases it clearly should → the description doesn't mention the user's wording → add synonyms to the "Use when…" half; re-test.
  • Skill triggers on the wrong turns → the description is too broad or overlaps a neighbour → narrow the "what it does" half; add an explicit not clause if needed ("Not for image-only PDFs — see the ocr Skill").
  • Skill triggers but does nothing useful → the body is too vague or missing numbered steps → rewrite the body as imperative numbered steps; add one worked example.
  • Skill triggers, follows steps 1–2, skips step 3 → step 3 is buried in prose, not in a numbered list → promote it to a numbered step and add a decision rule.
  • Skill picks the wrong reference file → the body cites references by content, not by name → cite references by filename plus a one-line description of when to use each.
  • Skill calls a tool the host doesn't have → tool allowlist mismatch or missing MCP server → make the dependency explicit in SKILL.md; for Codex, declare it in agents/openai.yaml.
  • Skill runs slowly → the body is too fat; level-3 detail is living at level 2 → move long prose to references/ and have the body cite it.
  • Skill works on Claude but not on Copilot or Codex → a vendor-specific extension field was used → move the field to a vendor-extensions section at the bottom; pin the portable core to required fields.
  • Skill works on Codex but not on Claude or Copilot → logic depends on agents/openai.yaml → move the logic into the body or replicate it in vendor-specific docs.
  • Skill silently fails at level 3 → wrong Python / Node / Bash version on the host → specify minimums in compatibility; pin in script shebangs; document the install command.

Three meta-rules that hold for almost every failure:

  1. Always check the loaded-Skills log first. If the wrong Skill loaded (or none loaded), you have a description problem, not a body problem.
  2. Always change one thing at a time. Skills are sensitive to wording. Changing description + body + reference structure in one pass makes the test result uninterpretable.
  3. Always re-run the negative fixture. A fix that solves under-triggering by broadening the description often introduces over-triggering on neighbour Skills. Test both directions every time.

5. The iteration loop

The loop in practice:

  1. Write SKILL.md from the planning sheet.
  2. Run the scripted positive fixture. Fix any non-loads by widening or sharpening the description.
  3. Run the scripted negative fixture. Fix any spurious loads by narrowing the description.
  4. Run the scripted execution fixture. Fix any wrong outputs by tightening the body (numbered steps, decision rules, examples).
  5. Run a programmatic eval on the output quality (if applicable).
  6. Measure tokens and time. If over budget, refactor level 2 → level 3.
  7. Repeat with the cross-vendor fixture if the Skill is intended for more than one host.

The first pass through the loop is the slowest — typically an hour or two for a non-trivial Skill. Subsequent passes (when requirements change, when a vendor updates the spec, when a new neighbour Skill is added) should take minutes if the fixtures are in place.

6. Under-triggering vs over-triggering — the two failure shapes

Almost every triggering failure is one of two shapes:

6.1 Under-triggering

The user says "extract this PDF" and the model never loads the PDF Skill. Almost always a description that is too narrow, too literal, or uses team vocabulary the user wouldn't.

Fixes:

  • Add 3–5 synonyms to the "Use when…" half (PDF → document, scan, form, file).
  • Use the user's natural wording, not your internal vocabulary.
  • Make the "Use when" sentence start with the user's verb ("Use when the user asks to extract, parse, fill…") so the description matches the request literally.
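
Applied to the SKILL.md frontmatter, the before/after might look like this (wording illustrative):

```yaml
# Before: too literal, internal vocabulary.
#   description: Extracts text from PDFs.
# After: the user's verbs first, synonyms added.
description: >
  Use when the user asks to extract, parse, fill, or convert a PDF,
  scanned document, or form. Extracts text and tables as structured data.
```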

6.2 Over-triggering

The user says "open this Word document" and the model loads the PDF Skill anyway. Almost always a description that is too broad ("documents", "files", "office") or that lacks an explicit boundary.

Fixes:

  • Replace generic nouns with specific ones (PDF, .pdf, scanned form — not "document").
  • Add an explicit boundary ("Not for Word, Excel, or other Office formats — see those Skills.").
  • If two Skills genuinely overlap, merge them or split the boundary by use case (form-filling vs text-extraction).
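
The boundary version of the same field; the neighbour Skill name is illustrative:

```yaml
description: >
  Use when the user asks to extract, parse, or fill a PDF (.pdf) or
  scanned form. Not for Word, Excel, or other Office formats; see the
  office-docs Skill.
```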

7. Execution failures — the other failure shape

If a Skill triggers but produces the wrong output, the cause is almost never the description. Common shapes:

  • The body says "do X, then Y" but doesn't say what to do if X fails. The model improvises and breaks the contract. Fix: add an explicit decision rule for the failure case.
  • The body cites a reference but doesn't tell the model when to open it. The model never opens it. Fix: prefix the citation with the trigger ("If the user mentions a vendor-specific field, see references/vendor-map.md.").
  • The body has too many steps. The model truncates or skips. Fix: split the Skill into a parent procedure and one or more sub-procedures, each its own Skill.
  • The body has too few examples. The model invents a shape. Fix: add 1–3 worked examples in the input → reasoning → output structure.

A useful diagnostic: if the model's output is almost right but consistently wrong in one specific way, the body is missing a worked example showing the right shape. If the output is random across turns, the body is missing a decision rule.
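
A worked example in that input → reasoning → output structure, as it might appear in a SKILL.md body (content illustrative):

```markdown
### Example: extract line items from a scanned invoice

Input: "pull the line items out of invoice-0342.pdf"
Reasoning: the file is a scan, so run OCR first (step 2), then table
extraction (step 3); line items map to the `items` field.
Output: a Markdown table with columns `item`, `qty`, `unit_price`.
```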

8. Cross-vendor regression

Once a Skill works on the primary host, run the same fixture on the secondary host(s). The most common cross-vendor regression shapes:

  • A vendor-specific field is being used and the secondary host ignores it.
  • A path under .claude/skills/ works on Claude but the secondary host doesn't see it; symlink or move to .agents/skills/.
  • An MCP server is available on the primary host but not the secondary; the Skill body assumes it.
  • Tool naming differs — a Bash tool called Bash on Claude is called shell or terminal on another host. Avoid hard-coding tool names in the body where possible.

The vendor-portability checklist in the planning sub-page is what you want to revisit when a cross-vendor regression shows up.
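
A sketch of that pass, assuming the hypothetical run_turn() glue from §1.2 grows a host parameter and the fixture cases are shared:

```python
# Re-run the frozen trigger fixture on every host the Skill targets.
for host in ("claude", "copilot", "codex"):
    for phrase, skill, should_load in CASES:   # same fixture as §1.2
        loaded = run_turn(phrase, host=host)
        assert (skill in loaded) == should_load, f"{host}: {phrase!r}"
print("fixture green on all hosts")
```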

9. When a Skill is done

A Skill is done when:

  • Every entry in the positive trigger fixture loads it.
  • Every entry in the negative trigger fixture leaves it un-loaded.
  • Every entry in the execution-result fixture matches the expected output shape.
  • The level-1, level-2, and level-3 token costs are inside budget.
  • The same fixture passes on every host the Skill is intended for.
  • The Skill has a version in metadata and the directory is committed to source control.

"Done" is not "perfect." Skills evolve as the agent platform evolves, as the model's defaults change, and as adjacent Skills are added. Plan for the fixtures to be re-run quarterly, and treat any cross-vendor change (Anthropic, Microsoft, OpenAI publishing a Skills update) as a refresh trigger.

Changelog

  • 2026-05-11 — Page created from the Anthropic authoring guide PDF (chapter 3) generalized across Claude / Copilot / Codex. Confidence 87.