Skills are markdown files Claude reads when relevant. The description field decides whether a skill ever triggers. Opus 4.7 takes instructions literally, runs through a heavier tokenizer, and won’t let you reach for temperature anymore. The trap most teams are missing: a shared character budget that silently truncates skill descriptions past a certain skill count.
You wrote a Skill. Claude ignored it.
Not bad output. No output. The skill sat on disk, the description was on the right topic, and Claude defaulted to its own behaviour anyway. Nine times out of ten it isn’t bad prose. It’s a structural issue you can’t see.
This is the working playbook for Claude Code Skills on Opus 4.7. It synthesises Anthropic’s engineering post, the official API docs, and the Claude Code Skills docs, with a focus on the parts that are easy to get wrong.
What a Skill is, briefly
A Skill is a folder. Inside the folder is a file called SKILL.md. At the top of SKILL.md is YAML frontmatter, and the rest is markdown. That’s it.
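A typical layout looks like this. Only SKILL.md is required; the other directories are the conventions this guide covers below:

```
processing-pdfs/
├── SKILL.md          # YAML frontmatter + markdown body
├── scripts/          # deterministic helpers Claude runs, not reads
├── references/       # detail files loaded only when needed
└── templates/        # boilerplate Claude reads and adapts
```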
Anthropic shipped Skills as an open standard on December 18, 2025. OpenAI, Google, GitHub Copilot, and Cursor adopted the same format within weeks. Simon Willison called it “maybe a bigger deal than MCP.” It probably is.
The reason it scales is progressive disclosure. Three tiers:
| Tier | What loads | When |
|---|---|---|
| 1 | The skill’s name and description | Every session, for every installed skill |
| 2 | The body of SKILL.md | When Claude decides the skill is relevant |
| 3 | Bundled files (scripts, references, templates) | Only when explicitly read or run |
The metadata at tier 1 is cheap. About eighty tokens per skill. You can install forty skills and pay roughly three thousand tokens of startup overhead. The body at tier 2 only loads when triggered. The bundled files at tier 3 only load when Claude reaches for them; a 500-line script that returns a 50-token result is essentially free at the model’s level.
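The tier-1 arithmetic is worth sanity-checking against your own install. A quick sketch, using this article’s ~80-token estimate (the real per-skill cost varies with description length):

```python
# Startup overhead for skill metadata, at ~80 tokens per skill.
# The per-skill figure is this article's estimate, not an exact constant.
PER_SKILL_TOKENS = 80

for n_skills in (10, 40, 100):
    print(f"{n_skills} skills -> ~{n_skills * PER_SKILL_TOKENS} startup tokens")
# 10 skills -> ~800 startup tokens
# 40 skills -> ~3200 startup tokens
# 100 skills -> ~8000 startup tokens
```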
Phil Whittaker reframed this as “progressive discovery.” The skill is a passive resource on disk. Claude is the active reasoner pulling content as needed. Useful framing. Structure your skill so Claude can find what it needs at each layer and decide whether to go deeper.
The description is the whole game
The description is the only field Claude sees at startup. It decides whether the skill ever loads. If the description is wrong, nothing else in the skill matters.
The rules are tight. Third person. Up to 1,024 characters. Both what the skill does and when to use it. Opens with a verb. The name is a separate field; up to 64 characters, lowercase letters and hyphens, gerund-ish (processing-pdfs).
And, as Anthropic’s own skill-creator puts it, the description should be slightly pushy. Claude undertriggers skills more often than it overtriggers. A polite description loses.
Here’s the contrast Anthropic ships in its own docs.
Won’t trigger reliably:
```yaml
description: This skill helps with PDFs and documents.
```

Triggers cleanly:

```yaml
description: >
  Comprehensive PDF manipulation toolkit for extracting text and tables,
  creating new PDFs, merging/splitting documents, and handling forms.
  Use when Claude needs to fill in a PDF form or programmatically process,
  generate, or analyze PDF documents at scale. Use for document workflows
  and batch operations. Not for simple PDF viewing or basic conversions.
```

Four useful signals: specific verbs, concrete use cases, trigger contexts, explicit boundaries. The “not for X” line is what keeps the skill from triggering on the wrong tasks.
One rule that’s easy to miss. Keep the workflow out of the description. Jesse Vincent, author of obra/superpowers, the most-installed skill plugin in the wild, flags it directly in his own writing skill. If you summarise the workflow in the description, Claude follows the description and never reads the skill. The description carries triggers. The body carries workflow. They are different jobs.
The silent-truncation trap
This is the most common cause of “my skill exists but nothing happens.”
Claude Code combines every installed skill’s name and description into a budget that sits in the system prompt. The budget scales as 1% of the context window, with an 8,000-character fallback. Past the budget, descriptions get silently shortened. All names stay. Descriptions get truncated to fit. The keywords Claude needs to match your request quietly disappear.
There is also a per-entry cap of 1,536 characters and, in the /skills listing itself, a 250-character cap on each description.
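To make the failure mode concrete, here is an illustrative model of that packing. The budget formula and caps come from the numbers above; the exact packing order inside Claude Code is an assumption:

```python
# Illustrative model of the skill-metadata budget described above.
# The figures are from this article; the packing order is an assumption.
PER_ENTRY_CAP = 1536      # max characters per skill description
FALLBACK_BUDGET = 8000    # assumed to apply when the window size isn't known

def metadata_budget(context_window_chars: int | None) -> int:
    if context_window_chars is None:
        return FALLBACK_BUDGET
    return int(context_window_chars * 0.01)   # 1% of the context window

def pack_metadata(skills: list[tuple[str, str]], budget: int) -> list[str]:
    """All names are kept; descriptions lose their tails until things fit."""
    entries = [(name, desc[:PER_ENTRY_CAP]) for name, desc in skills]
    overflow = sum(len(n) + len(d) for n, d in entries) - budget
    packed = []
    for name, desc in entries:
        if overflow > 0:
            cut = min(overflow, len(desc))
            desc = desc[:len(desc) - cut]     # the back half goes first
            overflow -= cut
        packed.append(f"{name}: {desc}")
    return packed
```

The point is not the exact algorithm. It’s that whatever the packer does, the tail of a long description is the part at risk.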
Two practical implications.
Front-load the description. The most important trigger phrases (verbs and user-language keywords) go in the first sentence. If the truncator gets you, it cuts the back half, not the front.
Raise the budget when you have many skills. The default is fine for ten or twenty. Past about fifty, you start losing them silently. Override:
```bash
SLASH_COMMAND_TOOL_CHAR_BUDGET=30000 claude
```

If a skill that “should” be triggering isn’t, this is the first thing to check.
The 5,000-token preservation window
Once a skill triggers, its body enters the conversation as a single message and stays there. When the context fills and Claude Code auto-compacts, the first 5,000 tokens of each invoked skill are preserved. The combined budget across re-attached skills is 25,000 tokens, filling from most-recently-invoked. Older skills can drop entirely after compaction.
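A sketch of that re-attachment rule as described here. The 5,000 and 25,000 figures are from the docs; the iteration order and tie-breaking are assumptions:

```python
# Illustrative model of skill re-attachment after auto-compaction.
# per_skill/combined figures are from the docs; the rest is assumed.
def reattach_after_compaction(invoked_skill_tokens: list[list[str]],
                              per_skill: int = 5000,
                              combined: int = 25000) -> list[list[str]]:
    """invoked_skill_tokens is ordered oldest-invoked-first."""
    kept, remaining = [], combined
    for tokens in reversed(invoked_skill_tokens):   # most recent first
        if remaining <= 0:
            break                                   # older skills drop entirely
        take = min(per_skill, len(tokens), remaining)
        kept.append(tokens[:take])                  # only the first N survive
        remaining -= take
    return kept
```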
Translation for the writer of the skill: the first 5,000 tokens of SKILL.md are the standing instructions. Anything past that is for-this-task-only and may not survive compaction in long sessions.
That maps onto a body length most teams overshoot. Anthropic’s own internal guide says under 5,000 words. The practical sweet spot is closer to 2,000–3,000 tokens. Past that, Claude’s attention to the bottom of the file dilutes, whether or not the model technically “has access” to it.
A useful test. Would removing the bottom third of your SKILL.md make the skill measurably worse? If not, the bottom third was filler.
Scripts beat prose for anything deterministic
The under-used part of Skills is the scripts directory. Anything deterministic (validation, parsing, conversion, lint, schema checks) belongs in a script the skill calls, not in prose the skill recites.
The reason is the context window. When Claude runs validate_form.py, the script’s code never enters the conversation. Only its stdout does. A 500-line Python validator that returns “Validation passed” or three specific error messages is essentially free at the model’s level.
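A minimal sketch of what that looks like. The filename comes from the example above, but the fields and rules here are hypothetical:

```python
#!/usr/bin/env python3
"""scripts/validate_form.py -- hypothetical example. Only stdout ever
reaches the model, however long this file gets."""
import json
import sys

REQUIRED_FIELDS = ("name", "email", "date")   # hypothetical schema

def main() -> int:
    with open(sys.argv[1]) as f:
        form = json.load(f)
    errors = [f"Missing required field: {field}"
              for field in REQUIRED_FIELDS if not form.get(field)]
    if form.get("email") and "@" not in str(form["email"]):
        errors.append("Field 'email' is not a valid address")
    if errors:
        print("\n".join(errors))
        return 1
    print("Validation passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The skill body then needs only one imperative line pointing at it, written in the command style covered below.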
This is also where Simon Willison’s framing earns itself. Skills as “the spirit of LLMs”: markdown plus pre-written scripts the model composes when it needs them. You’re not encoding a rule in 200 lines of cautious prose Claude has to read every time. You’re encoding it in a deterministic check the model runs, and using prose only for the parts an LLM is good at.
Bundling decisions in practice:
| Content | Where it goes | Why |
|---|---|---|
| Always-relevant procedure | SKILL.md body | Used every invocation |
| Sometimes-relevant detail | references/foo.md, linked from body | Only loads when needed |
| Deterministic computation | scripts/*.py | Code never enters the context window |
| Large reference (schemas, API docs) | references/*.md | No cost until read |
| Templates, boilerplate | templates/ | Claude reads and adapts |
One detail that catches a lot of people. When you reference a bundled file, write it as a command, not a hint. “If the user requests form filling, read references/FORMS.md before proceeding” gets followed. “For form-filling, see FORMS.md” often doesn’t. The Anthropic docs flag missed connections — references Claude doesn’t follow — as one of the most common iteration findings.
What 4.7 changed for skills
Opus 4.7 shipped on April 16, 2026. It’s materially different from 4.6 in ways that change how skills behave.
It takes instructions literally. A hint that worked on 4.6, something like “produce a summary, you know, the usual,” won’t be filled in on 4.7. The model executes what’s there, not what was implied. If your skill on 4.6 worked partly because Claude generalised your intent, 4.7 will expose that. Spell out audience, format, length, voice. Boris Cherny, who leads Claude Code at Anthropic, said on launch day he needed “a few days to learn how to work with it effectively.” That’s the adjustment.
Default Claude Code effort is now xhigh, a new tier between high and max. For coding and agentic work, design your skill assuming the model is running at xhigh.
The tokenizer changed. English content runs roughly one to one-and-a-third times as many tokens as on 4.6, per Anthropic’s number. Non-Latin scripts are more efficient. The same SKILL.md is more expensive than it was. Context discipline matters more, not less.
Sampling parameters return errors. No more temperature, top_p, top_k. Skills that depended on tuning sampling now break with a 400.
Adaptive thinking is the only thinking-on mode. Thinking content is hidden in the response stream by default. If your skill’s UX showed reasoning, opt in with display: "summarized" explicitly.
Mid-output self-correction is real. The model now flags and fixes its own inconsistencies inline. Instructions like “double-check your ordering before finalising” or “cross-check totals before returning” actually get compliance, where they were aspirational on 4.6. Analysis skills can lean on this.
High-resolution vision: 2576 px / 3.75 MP, up from 1568 px / 1.15 MP, with 1:1 pixel-coordinate mapping. Skills that ask Claude to verify a layout against a screenshot, count UI elements, or click coordinates are substantially more reliable. High-res images cost more tokens; downsample when the fidelity isn’t load-bearing.
Task budgets, in beta, give Claude a rough token budget across an agentic loop. Useful for skills that orchestrate long work: deploys, multi-file refactors, doc pipelines. Anthropic reports a meaningful drop in task abandonment on long horizons.
Don’t reach for the 1M context just because you can. Quality starts degrading well before that, around the 400,000-token mark. Progressive disclosure is more important on 4.7, not less.
Test the way you mean to ship
The most useful testing practice in the working community is the TDD-for-skills loop, popularised by obra/superpowers. Write a pressure scenario. Run it against a subagent without the skill, and watch it fail. That’s the baseline. Then write the skill. Run again. Watch it pass. Refactor to close loopholes.
The principle is uncomfortable and correct. If you didn’t watch an agent fail without the skill, you don’t know whether the skill teaches the right thing.
Test triggering and execution separately. A skill can fail because Claude never picks it up (description problem), or because Claude picks it up and produces the wrong output (body problem). Different fixes. Don’t conflate them.
Cover three classes of prompts: normal usage, edge cases, and out-of-scope. The skill should fire on the first two and stay quiet on the third. Simple one-step queries like “read this PDF” sometimes don’t trigger any skill at all, regardless of description. Complex multi-step queries are a better test.
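A crude harness for the triggering half, assuming Claude Code’s non-interactive mode (claude -p). Grepping the transcript for the skill name is a proxy, not an official signal, and the prompts and skill name here are illustrative:

```python
# Hypothetical trigger-test harness. The prompts, the skill name, and the
# grep-for-name check are all assumptions for illustration.
import subprocess

SKILL_NAME = "processing-pdfs"
CASES = {
    "normal": ["Extract every table from invoices.pdf into a CSV"],
    "edge": ["Merge these forty scanned PDFs, then fill in the cover form"],
    "out_of_scope": ["What does PDF stand for?"],
}

for kind, prompts in CASES.items():
    should_trigger = kind != "out_of_scope"
    for prompt in prompts:
        result = subprocess.run(
            ["claude", "-p", prompt, "--verbose"],
            capture_output=True, text=True,
        )
        triggered = SKILL_NAME in result.stdout
        status = "ok" if triggered == should_trigger else "FAIL"
        print(f"[{status}] {kind}: triggered={triggered} for {prompt!r}")
```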
The build-it-or-not test
The most useful rule in Anthropic’s playbook is also the most boring. Build a skill only if you’ve done the task five times already and expect to do it ten more times.
The shape of the failure when you skip this: a half-finished skill nobody trusts, a description that doesn’t trigger because the use case was hypothetical, prose that hedges every claim because the author wasn’t sure what the right move was. Speculative skills are wasted work.
The corollary. A running list of tasks you’ve done by hand more than five times is the best skill backlog you can have. Watch your own friction. The skill you should write next is the one whose work you’ve already done in this conversation.
A short pre-ship checklist
Anthropic’s full checklist runs about fifty bullets. The five that matter most:
- The description is third person, opens with a verb, names both what the skill does and when to use it, and includes one explicit non-trigger.
- The body is around 500 lines or under. The first 5,000 tokens contain the standing instructions, on the assumption that auto-compaction may drop the rest in long sessions.
- Anything deterministic is in a script in scripts/, not prose in the body.
- References to bundled files are written as commands (“read X before proceeding”), not as hints (“see X”).
- You’ve actually run the skill against three realistic prompts and confirmed it doesn’t trigger on a deliberately out-of-scope one.
The full guide that the Anthropic team published as a PDF is worth reading once. After that, the only thing that improves your skill writing is failure data: skills that didn’t trigger, skills Claude ignored, skills that triggered on the wrong things. Keep the misses. They’re the brief for the next iteration.
The frustrating part is that none of this is hidden. The truncation, the compaction window, the description rules: all of it sits in the docs. The reason skill writing has a learning curve in 2026 is that “write the markdown” is the easy part, and “write the markdown so Claude actually reads it under the constraints of a finite system prompt” is the part that takes practice.
Sources
- Anthropic Engineering — Equipping agents for the real world with Agent Skills (December 18, 2025)
- Anthropic — Agent Skills overview (Claude API docs)
- Anthropic — Extend Claude with skills (Claude Code docs)
- Anthropic — What’s new in Claude Opus 4.7
- Anthropic — The Complete Guide to Building Skills for Claude (PDF)
- Simon Willison — Claude Skills are awesome, maybe a bigger deal than MCP (October 16, 2025)
- Jesse Vincent — Superpowers: How I’m using coding agents and obra/superpowers on GitHub
- Phil Whittaker — Progressive Discovery: A Better Mental Model for Agent Skills
- anthropics/skills — Anthropic’s official skill collection