
Agent Skills — planning and design

This page covers the before-you-open-the-editor phase of authoring a Skill: choosing a candidate use case, defining success criteria for it, gathering technical requirements, drafting the directory and SKILL.md skeleton, and — most important — writing a description field that triggers reliably without spilling into other Skills' territory. Everything here is vendor-neutral; the vendor-extensions sub-page covers fields specific to Claude, Copilot, and Codex.

Planning a Skill is not paperwork. The single largest reason Skills fail to deliver value in practice is not bad code in scripts/ and not weak prose in the body — it is a description field that under-triggers (so the agent never picks the Skill at all) or over-triggers (so it loads the Skill on irrelevant turns and wastes context). Spending an hour on planning saves a week of "why is the model ignoring my skill" debugging.

1. Picking a use case

Skills repay investment when the same multi-step task comes up repeatedly and the steps are non-obvious. They underperform on one-off requests, on tasks the base model already handles well, and on anything that is really a tool or an always-loaded convention.

A useful triage triangle when evaluating a candidate use case:

  • Frequency. Does this task come up more than once a week, across the people who would use the Skill? If no, write a runbook or a Prompt File, not a Skill.
  • Procedure depth. Does the right answer require ≥3 steps the model wouldn't infer? If the answer is "the model can already do this in one prompt," the Skill will only add friction.
  • Specificity. Is the right answer team-specific or project-specific (you have a particular fields layout, a particular escalation path, a particular template)? If not, the model's pretraining probably covers it.

Concrete examples that pass all three: "create our team's release notes," "triage a production incident the way our runbook says," "extract structured tables from invoices into our schema." Examples that fail the triage: "format JSON" (general capability — model already does it), "find a bug in this code" (no fixed procedure), "remind me to take breaks" (no procedural depth).

Three categories of well-scoped use case — covered in the hub and repeated here for planning — frame nearly every successful Skill:

| Category | Example | Where the work goes |
| --- | --- | --- |
| Document / asset creation | Generate a slide deck from a brief; render a PDF report; build an Excel model | scripts/ runs the generator; assets/ holds templates |
| Workflow automation | Triage a bug ticket; onboard a new hire; run a release checklist | Body lists steps; scripts/ calls APIs; references/ carries policy detail |
| MCP enhancement | Make a generic Jira/GitHub/Sentry MCP feel like your team's playbook | Body lists the procedure; references/ carries the field maps and label conventions |

If you cannot place the candidate Skill into one of these three, re-check the surface decision matrix in the hub page — there is a good chance it should be AGENTS.md content, an MCP server, a sub-agent, or a Prompt File instead.

2. Defining success criteria

Skills are model-invocable: the model decides whether to load and apply them. That makes "success" a behavioural question, not a code-coverage question. Before writing the Skill, write down — in plain prose — what triggering and executing the Skill should look like in three modes:

  • Positive trigger. Three to five example user messages that should make the model load the Skill. Each one phrased differently (synonym, slang, indirect reference). If you cannot list three, the use case is too narrow — or the description will end up too narrow to be useful.
  • Negative trigger. Three to five example user messages that should not load the Skill, especially ones in the neighbourhood of the use case. ("Convert this image" should not trigger a PDF-extraction Skill even though both are file operations.) Negative triggers are the antidote to over-triggering, which wastes context and pollutes other Skills' surfaces.
  • Execution result. What the Skill should do once triggered: outputs, side effects, intermediate steps. If the Skill mutates files, what guarantees does the procedure offer (idempotency, dry-run, rollback)? If the Skill calls external APIs, what counts as a successful call versus a retryable failure?

These three lists are not optional documentation; they are the test fixture for the iteration loop in the testing sub-page. Keep them in a references/TEST-CASES.md if you want them inside the Skill, or in a private test file otherwise.
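One possible shape for such a file, using the release-notes Skill from later sections — the layout and the individual cases are a suggestion, not a spec requirement:

```markdown
# Test cases: release-notes

## Should trigger
- "cut the release notes for 2.4"
- "can you prep the changelog for this sprint?"
- "we're shipping Friday, draft the notes"

## Should NOT trigger
- "summarize this PR" (single PR, not a release)
- "write a blog post about the launch" (marketing copy, not notes)

## Execution results
- Given a git range, produces Markdown matching references/template.md
- Never pushes or tags; the output is a draft the user reviews
```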

3. Gathering technical requirements

Before writing SKILL.md, answer these questions:

  1. What scripts (if any) does the Skill need? Which languages and runtimes? Pin them in the open-spec compatibility field ("Requires Python 3.14+ and uv"). Skill scripts are run by the host runtime in whatever shell is available; if your script needs Python 3.14 and the host has Python 3.10, the Skill will silently fail at level 3.
  2. What external services / tools does the Skill talk to? If the answer is "several," that is usually a signal to author an MCP server first and then write the Skill on top of it — the Skill becomes the workflow and the MCP server becomes the connectivity. Codex Skills make this explicit via the agents/openai.yaml sidecar's dependencies.mcp_servers list; Claude plugins make it explicit via the plugin manifest's mcpServers entry.
  3. What permissions does the Skill need? Read-only filesystem? Network access? Ability to spawn child processes? These map onto each vendor's allowlisting model — Anthropic allowed-tools, Copilot disable-model-invocation, Codex policy.allow_implicit_invocation — covered in the vendor-extensions sub-page.
  4. What templates / fixtures / fonts does the Skill need? Put them in assets/. Templates live with the Skill, not with each invocation.
  5. What policy / reference detail does the Skill need? Long-form material — compliance text, large field maps, error code dictionaries — goes in references/. The body of SKILL.md cites the references; the model only opens them when needed.

A useful rule of thumb on body weight: if SKILL.md's Markdown body exceeds ~2,000 words you are almost certainly leaving level-3 disclosure unused. Push detail into references/ and have the body say "see references/X for Y."
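This rule of thumb is easy to check mechanically. A minimal sketch, in which the function name and the sample document are illustrative, not part of any vendor tooling:

```python
import re

def body_word_count(skill_md: str) -> int:
    """Count words in the Markdown body of a SKILL.md string,
    excluding the YAML frontmatter between the leading --- fences."""
    match = re.match(r"^---\n.*?\n---\n", skill_md, flags=re.DOTALL)
    body = skill_md[match.end():] if match else skill_md
    return len(body.split())

doc = "---\nname: release-notes\ndescription: Example.\n---\nBody with four words."
print(body_word_count(doc))  # 4
```

Run it over each Skill in CI and warn when the count creeps past ~2,000.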

4. The directory skeleton

A typical Skill directory after planning looks like:

release-notes/
├── SKILL.md
├── scripts/
│   ├── build-notes.py        # writes the Markdown release notes
│   └── lint-notes.sh         # validates against the template
├── references/
│   ├── template.md           # canonical structure of a release note
│   ├── voice-and-tone.md     # team-specific style guidance
│   └── changelog-conventions.md  # commit-message → notes mapping rules
└── assets/
    └── header-logo.svg

For Codex specifically, an optional sibling sidecar file appears:

release-notes/
├── SKILL.md
├── agents/
│   └── openai.yaml          # Codex UI metadata, policy, MCP deps
└── …

The Codex sidecar is purely additive: nothing in SKILL.md changes. Claude and Copilot will ignore it; Codex picks it up at session start.

5. Writing the SKILL.md frontmatter

Frontmatter is YAML between two --- fences at the very top of SKILL.md. The open spec defines two required fields and a small set of optional ones; do not invent fields of your own — every host runtime is allowed to ignore unrecognised fields, so anything custom belongs under metadata.

| Field | Required | Constraint | Purpose |
| --- | --- | --- | --- |
| name | yes | lowercase kebab-case; ≤64 chars; must match the directory name; must not contain the substrings claude or anthropic (per Anthropic guidance) | Identifier for the Skill |
| description | yes | ≤1024 chars, prose | Trigger metadata — the load decision is made against this field |
| license | no | SPDX identifier | Distribution metadata |
| compatibility | no | free text | Runtime requirements (e.g. "Requires Python 3.14+ and uv") |
| metadata | no | key/value map | Author, version, custom fields |
| allowed-tools | no (experimental; Claude Code CLI only) | space-separated tool names | Restricts which tools the Skill may call when active |
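Put together, a frontmatter block for the release-notes example from section 4 might look like this (the field values are illustrative):

```yaml
---
name: release-notes
description: >
  Generates this team's standard release notes from a git range.
  Use when the user says 'release notes', 'changelog', 'cut a release',
  or 'prepare the X.Y release'.
license: MIT
compatibility: Requires Python 3.14+ and uv
metadata:
  author: platform-team
  version: "1.0"
---
```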

Three constraints worth restating because they are the most common cause of a Skill that "won't load":

  • name must match the directory name exactly. A directory called release-notes/ with name: release_notes in frontmatter will silently fail to load on at least one of the three runtimes.
  • name must be kebab-case. Underscores, dots, capitals, and spaces are spec violations.
  • The substring claude or anthropic in name is reserved for first-party Skills in Anthropic's distribution; using them in a third-party Skill is bad form and may be rejected by future plugin-marketplace tooling.
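These three constraints are straightforward to lint before shipping. A sketch, with an illustrative function name (this mirrors the bullets above; it is not part of any spec or vendor tooling):

```python
import re

def skill_name_problems(name: str, dir_name: str) -> list[str]:
    """Return spec violations for a Skill's frontmatter name (empty list = OK)."""
    problems = []
    if name != dir_name:
        problems.append("name does not match the directory name")
    if len(name) > 64:
        problems.append("name exceeds 64 characters")
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append("name is not lowercase kebab-case")
    if "claude" in name or "anthropic" in name:
        problems.append("name contains a reserved vendor substring")
    return problems

print(skill_name_problems("release_notes", "release-notes"))
# flags both the directory mismatch and the underscore
```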

6. Writing the description — the single most important field

The description is the only thing the model sees at session start for your Skill. If the description doesn't match the user's wording, your Skill never gets loaded. If it matches too broadly, your Skill gets loaded when something else should have been. Treat description-writing as a small but high-leverage discipline.

6.1 Anatomy of a good description

A description that triggers reliably contains two halves:

  1. What the Skill does — one sentence, imperative voice, concrete verbs, named outputs. "Extracts text and tables from PDFs and fills PDF forms." Not "helps with PDFs," not "PDF-related operations."
  2. When the Skill applies — one or two sentences starting with "Use when…" or "Trigger on…", listing concrete phrases the user might say. "Use when the user mentions PDFs, forms, document extraction, or 'this scan'."

Both halves matter. Without the first half the model loads the Skill but doesn't know what it is for. Without the second half the model never loads the Skill because the user's wording didn't match.

6.2 Worked examples

| Description | Verdict | Why |
| --- | --- | --- |
| "PDF stuff" | ❌ Vague | Won't trigger on most realistic user messages |
| "PDF processing tool" | ❌ Generic | Lacks both what and when |
| "Extracts text and tables from PDFs, fills PDF forms, merges PDFs. Use when the user mentions PDFs, forms, scans, or document extraction." | ✅ Strong | What + when, concrete verbs, multiple synonyms |
| "Master of all document operations including PDFs, Word, Excel, and anything else the user could possibly need." | ❌ Over-broad | Will over-trigger and crowd out Skills that should be loaded instead |
| "Generates this team's standard release notes from a git range. Use when the user says 'release notes', 'changelog', 'cut a release', or 'prepare the X.Y release'." | ✅ Strong | Specific to a workflow; lists realistic triggers |

6.3 Common pitfalls

  • Don't pretend to be all-purpose. "Use when the user mentions documents" turns the Skill into noise that loads on every conversation about Word, Excel, slides, PDFs, and so on.
  • Don't write XML or HTML inside the description. The open spec forbids markup in this field; some runtimes strip it, others reject the Skill outright.
  • Don't include the vendor name (claude, anthropic, copilot, codex) inside the description. It is not banned, but it makes the Skill look first-party and is often mis-routed by triage layers.
  • Don't repeat the Skill's name in the description. The runtime already shows both side by side; doubling up wastes the 1,024-char budget.
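Most of these pitfalls can be caught with a mechanical pre-check. A hedged sketch (the checks mirror the bullets above; the function name and exact messages are illustrative):

```python
def description_issues(name: str, desc: str) -> list[str]:
    """Flag common description pitfalls before shipping a Skill."""
    issues = []
    if len(desc) > 1024:
        issues.append("exceeds the 1024-character budget")
    if "<" in desc or ">" in desc:
        issues.append("contains markup, which the open spec forbids")
    for vendor in ("claude", "anthropic", "copilot", "codex"):
        if vendor in desc.lower():
            issues.append(f"mentions vendor name '{vendor}'")
    if name.lower() in desc.lower():
        issues.append("repeats the Skill's name, wasting budget")
    return issues

print(description_issues(
    "release-notes",
    "Generates release notes. Use when the user says 'release notes' or 'changelog'.",
))  # []
```

A clean result does not prove the description is good — only the trigger tests in the testing sub-page can do that — but a dirty result is always worth fixing.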

6.4 Description-writing for a multi-Skill suite

When a project has many Skills, descriptions are how the model discriminates between them. Read all descriptions side by side and ask: if I were a user about to write a message, would each description be unambiguously the right or wrong match? If two descriptions overlap, narrow the wording until they don't. If two Skills genuinely overlap, that is usually a sign they should be merged.

7. Writing the body

The body of SKILL.md is Markdown. It is what the model reads when the Skill triggers (level 2 of progressive disclosure). The structure that works in practice:

# <Skill title>

<One-paragraph plain-English summary: what this Skill does and when.>

## Steps

1. <First action; reference scripts or references as needed>
2. <Second action>
3. <Final action and how to report results back to the user>

## Decision rules

- If <condition>, see `references/<file>.md` for the policy
- If the input is missing <field>, ask the user once and then proceed
- If <tool> fails, retry once with <variant>; if still failing, surface the error verbatim

## Examples

- User says: "<example input>" → expected output: <one-line shape>
- User says: "<another example>" → expected output: <one-line shape>

Body-writing rules that pay off:

  • Imperative second-person. "Run scripts/build.py," not "the agent should run scripts/build.py." The model is you when it reads this.
  • Numbered steps for procedures. Lists make sequencing explicit; paragraphs invite the model to skip.
  • Cite references; don't inline them. A reference file containing 50 KB of compliance rules costs zero tokens until the model opens it. The body's job is to tell the model when to open which reference.
  • Cite scripts; don't paste them. A script in scripts/extract.py runs without entering context. Putting its source in the body wastes context every single trigger.
  • Include 1–3 example pairs. Examples disambiguate the description and give the model an in-context template. Two well-chosen examples consistently beat ten lazy ones.

8. Examples — when to include them, when to omit

A Skill body without examples works fine for narrow, literal tasks ("extract a PDF"). It works poorly for judgement-driven tasks ("triage a bug ticket," "write a release note in our voice"). For the latter, three to five concrete example pairs in the body — or, better, in a references/EXAMPLES.md cited by the body — are the single biggest quality lever.

A useful pattern is input → reasoning → output triples:

### Example: medium-severity backend bug

**Input:** "P2 / cart fails on Safari for users with > 50 line items"

**Reasoning:** The bug is reproducible (specific browser + condition), affects checkout (revenue-impacting), and was filed by support (not a customer). It is P2 in our terms.

**Output:** Jira ticket in `CART` project, label `safari` and `large-cart`, assigned to current sprint, linked to the support thread.

That structure shows the model what "good" looks like and how to think on the way there. The model is much better at imitating worked examples than at following abstract instructions.

9. Vendor-portability decisions made at planning time

A few decisions you make now have to last across vendors. Make them deliberately:

  1. Pick a portable location. If the Skill is meant to be cross-vendor, put it under .agents/skills/<name>/ in the repo. Copilot and Codex read this natively; Claude can be pointed at it.
  2. Pick a portable tool-allow model. The three vendor security knobs do not interoperate. Default: write the Skill to need the minimum tools possible; rely on the host runtime's defaults; declare desired allowlists in vendor-specific sections at the bottom of SKILL.md or in the Codex sidecar.
  3. Pick a portable MCP-dependency model. If the Skill needs a particular MCP server, name it. Codex makes this declarative (agents/openai.yaml's dependencies.mcp_servers); for Claude / Copilot, document the dependency in the Skill body so the human installing the Skill knows to install the MCP server too.
  4. Pick a portable invocation model. The open spec assumes model-invocable. If your Skill is destructive (writes files, sends emails, calls billable APIs), think hard about whether you want it to auto-trigger at all — some vendors expose disable-model-invocation or allow_implicit_invocation=false for exactly this case.

10. Output of the planning phase

By the time you stop planning and open the editor, you should have, on paper:

  • A one-sentence Skill purpose and a one-sentence trigger.
  • 3–5 positive trigger examples, 3–5 negative trigger examples, and 1–3 execution-result examples.
  • A list of scripts (with languages) and references the Skill will need.
  • A list of MCP servers, external services, or environmental requirements.
  • A draft description field of the structure "What it does. Use when …"
  • A directory name in kebab-case that matches your intended name.

With those in hand, writing the actual SKILL.md and its supporting files takes minutes rather than days, and the iteration loop in the testing sub-page becomes a check rather than an exploration.

Changelog

  • 2026-05-11 — Page created from the Anthropic authoring guide PDF (chapter 2) generalized across Claude / Copilot / Codex. Confidence 88.