
Skill Creator: A Toolchain for Creating and Optimizing Skills

skill-creator is the meta-tool of the Anthropic Skills ecosystem — it’s the Skill for creating Skills.

  • 🔄 Guides users through the complete Skill creation lifecycle: capture intent → research → write → evaluate → optimize → publish
  • 📊 Provides a quantitative evaluation framework that automatically runs test cases and collects results
  • 🧪 Supports benchmarking to compare performance differences across iterations
  • ✍️ Includes a description optimizer that automatically improves a Skill's trigger description
  • 📦 Ships a built-in packaging tool that bundles a Skill into a distributable format

Triggers when the user says “create a skill”, “make a skill”, “build a skill”, “optimize skill description”, “evaluate skill”, etc.

skill-creator is the most complex Skill in the repository. Its structure embodies the complete “script-driven” Skill pattern:

  • SKILL.md: about 200 lines describing the full Skill creation workflow in 7 phases
  • scripts/: 8 Python scripts forming a complete toolchain. The evaluation system (run_eval.py + run_loop.py) is the core, supplemented by packaging, validation, and reporting tools
  • agents/: 3 sub-agent definitions for analysis, comparison, and grading, following the classic pattern of decomposing complex evaluation logic into independent agents
  • eval-viewer/: a standalone HTML evaluation viewer that demonstrates when a Skill warrants its own user interface

skill-creator’s SKILL.md is about 200 lines long and has a clear structural hierarchy:

  1. Overview (lines 6-26): Describes full workflow from intent capture to publishing
  2. Communication Guide (lines 32-41): How Claude should interact with users of varying technical levels
  3. Creation Flow (lines 47-68): Detailed 4-step creation process
  4. Skill Writing Guide (lines 71+): Skill structure, best practices, triggering tips

SKILL.md: YAML Frontmatter

```yaml
---
name: skill-creator
description: Create new skills, modify and improve existing skills,
  and measure skill performance. Use when users want to create a
  skill from scratch, edit, or optimize an existing skill, run evals
  to test a skill, benchmark skill performance with variance analysis,
  or optimize a skill's description for better triggering accuracy.
---
```

Code walkthrough

  • name: follows the lowercase-plus-hyphens naming convention and states the core function concisely.
  • description: the key to the trigger mechanism. It enumerates six concrete scenarios (create / edit / optimize / run evals / benchmark / optimize description) so that the skill is triggered correctly in many different contexts. Note the "pushy" style: every likely user intent is spelled out explicitly.
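
To make the two frontmatter fields concrete, here is a minimal sketch of how they could be read and checked. The helper names `read_frontmatter` and `is_valid_name` and the regular expression are illustrative assumptions, not part of skill-creator, and the parser handles only flat `key: value` pairs with indented continuation lines; a real tool would more likely rely on a YAML library.

```python
import re

# Assumed reading of "lowercase plus hyphens": lowercase letters or digits,
# separated by single hyphens. This pattern is illustrative only.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading '---' fences.

    Indented lines are treated as continuations of the previous value.
    Returns an empty dict when the fences are missing. Not a full YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing fence reached
        if line[:1].isspace() and key is not None:
            fields[key] += " " + line.strip()  # wrapped value
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return {}  # closing fence never found


def is_valid_name(name: str) -> bool:
    """Check a skill name against the assumed naming pattern."""
    return bool(NAME_PATTERN.match(name))
```

Calling `read_frontmatter` on the frontmatter shown above would yield a `name` of `skill-creator` and the description joined into a single line.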

The core of skill-creator is the evaluation system, around which scripts and agents are organized:

skill-creator: how scripts and agents relate

graph TD
  SKILL[SKILL.md] -->|guides| Claude
  Claude -->|invokes| run_eval[run_eval.py]
  Claude -->|invokes| run_loop[run_loop.py]
  Claude -->|invokes| quick_validate[quick_validate.py]
  Claude -->|invokes| package_skill[package_skill.py]
  Claude -->|invokes| improve_description[improve_description.py]

  run_eval --> utils[utils.py]
  run_loop --> utils
  quick_validate --> utils
  run_eval --> generate_report[generate_report.py]
  run_loop --> generate_report
  aggregate_benchmark[aggregate_benchmark.py] --> generate_report

  run_eval --> grader[agents/grader.md]
  run_eval --> analyzer[agents/analyzer.md]
  aggregate_benchmark --> comparator[agents/comparator.md]

  generate_report --> generate_review[eval-viewer/generate_review.py]
  generate_review --> viewer[eval-viewer/viewer.html]

  style SKILL fill:#4fc3f7,stroke:#0288d1,color:#000
  style run_eval fill:#81c784,stroke:#388e3c,color:#000
  style run_loop fill:#81c784,stroke:#388e3c,color:#000
  style generate_report fill:#ffb74d,stroke:#f57c00,color:#000
  style grader fill:#ce93d8,stroke:#7b1fa2,color:#000

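| Script | Language | Lines | Complexity | Purpose |
|------|------|------|--------|------|
| run_eval.py | Python | ~120 | ⭐⭐⭐ | Core evaluator: loads the skill, runs test cases, collects results |
| run_loop.py | Python | ~100 | ⭐⭐⭐ | Iteration loop: reruns the evaluation until it converges |
| quick_validate.py | Python | ~45 | ⭐⭐ | Quick validation: checks the skill for basic problems |
| package_skill.py | Python | ~80 | ⭐⭐ | Packager: bundles the skill into a distributable zip archive |
| improve_description.py | Python | ~60 | ⭐⭐ | Description optimizer: analyzes and improves the skill's description field |
| generate_report.py | Python | ~90 | ⭐⭐ | Report generator: summarizes evaluation results as a Markdown report |
| aggregate_benchmark.py | Python | ~70 | ⭐⭐ | Benchmark aggregator: statistical summary across multiple runs |
| utils.py | Python | ~50 | ⭐ | Utilities: logging, file operations, and other shared helpers |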


run_eval.py is the core of skill-creator. It loads a skill, runs a set of test cases against it, collects Claude outputs, and passes results to the grader agent for scoring.

run_eval.py: Core evaluator

```python
"""Run evaluations for a skill."""
import argparse
import json
import os
import subprocess
import sys
from typing import Any

# Add this script's directory to sys.path so the sibling utils module imports
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from utils import log_info, log_error, load_skill, find_test_cases


def main():
    parser = argparse.ArgumentParser(description="Run skill evaluations")
    parser.add_argument("skill_path", help="Path to the skill directory")
    parser.add_argument("--test-dir", default="./tests",
                        help="Directory containing test cases")
    parser.add_argument("--output", default="./results.json",
                        help="Output file for results")
    # Parsed but not used yet: the loop below always runs serially
    parser.add_argument("--parallel", type=int, default=1,
                        help="Number of parallel workers")
    args = parser.parse_args()

    skill = load_skill(args.skill_path)
    test_cases = find_test_cases(args.test_dir)
    log_info(f"Found {len(test_cases)} test cases")

    results = []
    for case in test_cases:
        result = run_single_test(skill, case)
        results.append(result)
        log_info(f"  {case['name']}: {result['status']}")

    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)
    log_info(f"Results written to {args.output}")


def run_single_test(skill: dict, case: dict) -> dict[str, Any]:
    """Run a single test case and return the result."""
    try:
        output = subprocess.run(
            ["claude", "--skill", skill["path"], "--prompt", case["prompt"]],
            capture_output=True,
            text=True,
            timeout=300,
        )
        return {
            "name": case["name"],
            "status": "pass" if output.returncode == 0 else "fail",
            "output": output.stdout,
            "stderr": output.stderr,
            "returncode": output.returncode,
        }
    except subprocess.TimeoutExpired:
        return {
            "name": case["name"],
            "status": "timeout",
            "output": "",
            "stderr": "Test timed out after 300s",
        }
    except Exception as e:
        return {
            "name": case["name"],
            "status": "error",
            "output": "",
            "stderr": str(e),
        }


if __name__ == "__main__":
    main()
```

Code walkthrough

  • Module docstring: a one-line statement of what the script does, in line with Python best practice.
  • Imports: argparse parses command-line arguments; subprocess runs the external command (the Claude CLI).
  • sys.path.insert: adds the script's own directory to sys.path so that the utils module can be imported. This is common in tool scripts.
  • main(): the standard entry point. It uses argparse to define one positional argument (skill_path) and three optional flags.
  • --parallel: reserves room for concurrency, with a default of 1 (serial). The flag is parsed but the loop does not use it yet, a good example of progressive complexity: implement the serial path first, add concurrency later.
  • load_skill() and find_test_cases(): come from utils.py, so reusable loading logic lives in a shared module.
  • Main loop: runs the test cases one by one. Simple and clear, with no premature optimization.
  • run_single_test(): the core execution logic. It calls the Claude CLI through subprocess.run.
  • subprocess.run arguments: capture_output=True captures stdout and stderr, text=True decodes bytes to strings, and timeout=300 keeps a test from hanging forever.
  • Exception handling: timeouts and all other exceptions are handled separately, and each returns structured error information.
  • Return shape: every return value shares the keys name, status, output, and stderr (the success path adds returncode), so callers can process results uniformly. This is good API design.
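
The `--parallel` flag above is parsed but never used by the serial loop. As a rough sketch of how it might later be honored, the snippet below fans test cases out over a thread pool. The function name `run_all` and its `runner` parameter are hypothetical and do not appear in skill-creator; `runner` stands in for `run_single_test` with the skill already bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

Case = dict[str, Any]
Result = dict[str, Any]


def run_all(cases: list[Case],
            runner: Callable[[Case], Result],
            parallel: int = 1) -> list[Result]:
    """Run every case through `runner`, serially or with a thread pool.

    Threads are adequate here because each test mostly waits on a
    subprocess. Results come back in the same order as `cases`.
    """
    if parallel <= 1:
        return [runner(case) for case in cases]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # Executor.map preserves input order even when tasks finish out of order
        return list(pool.map(runner, cases))
```

Keeping the serial branch as the default means behavior is unchanged unless a caller explicitly asks for more workers.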

run_loop.py encapsulates the logic of running evaluations multiple times until results converge or max iterations are reached. This is the core driver of the “evaluate-optimize” loop.

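run_loop.py: Iteration runner (key logic excerpt)

```python
# Excerpt: log_info, run_eval, and improve_skill are defined elsewhere
# in the toolchain and are not shown here.
def run_loop(skill_path: str, test_dir: str, max_iters: int = 5):
    """Run evaluation loop, iterating until convergence."""
    previous_score = 0
    results_history = []

    for i in range(max_iters):
        log_info(f"--- Iteration {i + 1}/{max_iters} ---")

        # Run evaluation
        result = run_eval(skill_path, test_dir)
        current_score = result.get("average_score", 0)
        results_history.append(result)

        log_info(f"Score: {previous_score} -> {current_score}")

        # Check convergence: stop as soon as the score fails to improve
        if current_score <= previous_score:
            log_info("Score stopped improving, stopping.")
            break

        previous_score = current_score

        # Improve the skill based on feedback (skipped after the last round)
        if i < max_iters - 1:
            improve_skill(skill_path, result.get("feedback", []))

    return results_history
```

Code walkthrough

  • max_iters=5: the default cap prevents an infinite loop while leaving enough room to iterate.
  • Logging: each iteration logs its progress so the user can follow along.
  • run_eval(): called from run_eval.py to reuse the core evaluation logic. Scripts interface with each other through function calls.
  • Convergence check: the loop stops when the score no longer improves, a simple and effective early stopping strategy.
  • improve_skill(): improves the skill automatically after each evaluation round, closing the evaluate-improve loop. Note that if i < max_iters - 1 ensures no improvement runs after the final iteration.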
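
To see the stopping rule in isolation, the sketch below replays the same control flow against plain callables instead of real evaluations. `iterate_until_plateau`, `evaluate`, and `improve` are hypothetical stand-ins for `run_loop`, `run_eval`, and `improve_skill`; only the loop logic mirrors the excerpt.

```python
from typing import Callable


def iterate_until_plateau(evaluate: Callable[[], float],
                          improve: Callable[[], None],
                          max_iters: int = 5) -> list[float]:
    """Evaluate, then improve, until the score stops rising or max_iters is hit."""
    history: list[float] = []
    previous = 0.0
    for i in range(max_iters):
        score = evaluate()
        history.append(score)
        if score <= previous:
            break  # no improvement over the best previous round
        previous = score
        if i < max_iters - 1:
            improve()  # never improve after the final evaluation
    return history
```

One consequence of starting from a baseline of 0 is that a first score of exactly 0 ends the loop immediately, which may or may not be the intended behavior.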

Lightweight validation tool that quickly checks a skill’s basic compliance before full evaluation.

quick_validate.py: Quick validation (key logic excerpt)

```python
import os

from utils import parse_frontmatter  # shared helper, assumed to live in utils.py


def validate_skill(skill_path: str) -> list[str]:
    """Quick validation of a skill. Returns list of issues."""
    issues = []

    # Check SKILL.md exists
    skill_md = os.path.join(skill_path, "SKILL.md")
    if not os.path.exists(skill_md):
        issues.append("Missing SKILL.md")
        return issues  # Can't continue without it

    # Parse frontmatter
    with open(skill_md, "r", encoding="utf-8") as f:
        content = f.read()

    frontmatter = parse_frontmatter(content)
    if not frontmatter:
        issues.append("Invalid or missing YAML frontmatter")
        return issues

    # Check required fields
    if "name" not in frontmatter:
        issues.append("Missing required field: name")
    if "description" not in frontmatter:
        issues.append("Missing required field: description")

    # Check description quality
    desc = frontmatter.get("description", "")
    if len(desc) < 50:
        issues.append(f"Description too short ({len(desc)} chars, want >= 50)")

    # Check for scripts directory
    scripts_dir = os.path.join(skill_path, "scripts")
    if os.path.exists(scripts_dir):
        py_files = [f for f in os.listdir(scripts_dir) if f.endswith(".py")]
        if not py_files:
            issues.append("scripts/ directory exists but contains no .py files")

    return issues
```

Code walkthrough

  • SKILL.md check: the first gate. If the file is missing, the function returns immediately because nothing else can be checked.
  • parse_frontmatter(): parses the YAML frontmatter. This function should live in utils.py, which reflects reuse of shared helpers.
  • Required fields: name and description must be present, matching the minimum requirements of the Agent Skills Spec.
  • Description quality: a description shorter than 50 characters is flagged. This is a heuristic that quickly filters out low-quality descriptions.
  • Optional directory check: if scripts/ exists but holds no .py files, an issue is reported because this likely indicates a configuration mistake.
  • Return value: a list of issues. An empty list means the skill passed; a non-empty list names every problem found. This is a clear API contract.

package_skill.py bundles a complete skill directory into a distributable zip archive. It automatically includes SKILL.md and the scripts/, agents/, and references/ directories while excluding __pycache__ and other unnecessary files. The packaged result can be shared directly with other Claude users.
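
The packaging behavior described above can be approximated with the standard `zipfile` module. This is only a sketch under stated assumptions: the function signature, the `INCLUDED_DIRS` and `EXCLUDED_PARTS` constants, and the extra `.pyc` filter are illustrative choices, not the actual contents of package_skill.py.

```python
import zipfile
from pathlib import Path

# Directories named in the description above; anything else is left out.
INCLUDED_DIRS = ("scripts", "agents", "references")
# Path components that should never be archived (assumed exclusion list).
EXCLUDED_PARTS = {"__pycache__"}


def package_skill(skill_dir: str, archive_path: str) -> list[str]:
    """Zip SKILL.md plus the known subdirectories; return the archived names."""
    root = Path(skill_dir)
    candidates = [root / "SKILL.md"]
    for name in INCLUDED_DIRS:
        subdir = root / name
        if subdir.is_dir():
            candidates.extend(sorted(subdir.rglob("*")))

    archived: list[str] = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in candidates:
            if not path.is_file():
                continue  # skip directories and a missing SKILL.md
            rel = path.relative_to(root)
            if EXCLUDED_PARTS & set(rel.parts) or path.suffix == ".pyc":
                continue  # skip bytecode caches
            zf.write(path, rel.as_posix())
            archived.append(rel.as_posix())
    return archived
```

Using an explicit allowlist of directories, rather than zipping everything and subtracting exclusions, keeps stray scratch files out of the archive by default.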


improve_description.py: Description optimizer

improve_description.py analyzes a skill’s YAML frontmatter description field and optimizes triggering accuracy by comparing it against actual user conversation patterns. It uses the grader agent to evaluate current description triggering effectiveness and generates improved versions. This is the only script that uses an LLM to optimize its own metadata.
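
One way to make "triggering accuracy" measurable is to score a description against labeled user messages. The sketch below assumes such a labeled set exists and takes the trigger decision as a pluggable function; `trigger_accuracy` and `naive_trigger` are hypothetical names, and the keyword-overlap rule is a toy stand-in for the model's real decision, not how improve_description.py or the grader agent works.

```python
from typing import Callable

Query = tuple[str, bool]  # (user message, should the skill trigger?)


def trigger_accuracy(description: str,
                     queries: list[Query],
                     would_trigger: Callable[[str, str], bool]) -> dict[str, float]:
    """Compare predicted triggering against labels; return basic metrics."""
    tp = fp = fn = tn = 0
    for message, expected in queries:
        fired = would_trigger(description, message)
        if fired and expected:
            tp += 1
        elif fired and not expected:
            fp += 1
        elif not fired and expected:
            fn += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(queries) if queries else 0.0,
    }


def naive_trigger(description: str, message: str) -> bool:
    """Toy rule: fire when the message shares a word longer than 3 letters."""
    keywords = {w for w in description.lower().split() if len(w) > 3}
    return any(w in keywords for w in message.lower().split())
```

A missed message such as "optimize the trigger text" lowers recall, which is the kind of signal that would motivate adding that intent to the description.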


generate_report.py takes the JSON output from run_eval.py and summarizes it into a structured Markdown report. The report includes pass/fail status per test case, average scores, failure mode categorization, and improvement suggestions. Reports can be embedded directly into Claude’s conversation context to help users understand skill quality.
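
As an illustration of the JSON-to-Markdown step, the sketch below renders a report from result dictionaries shaped like those returned by `run_single_test` (keys `name`, `status`, `stderr`). The function name `render_report` and the report layout are assumptions, and the sketch covers only pass/fail status and failure details, not average scores, failure categorization, or improvement suggestions.

```python
from collections import Counter
from typing import Any


def render_report(results: list[dict[str, Any]]) -> str:
    """Render evaluation results as a small Markdown report."""
    counts = Counter(r["status"] for r in results)
    lines = [
        "# Evaluation report",
        "",
        f"Passed {counts.get('pass', 0)} of {len(results)} test cases.",
        "",
        "| Test case | Status |",
        "| --- | --- |",
    ]
    for r in results:
        lines.append(f"| {r['name']} | {r['status']} |")

    failures = [r for r in results if r["status"] != "pass"]
    if failures:
        lines += ["", "## Failures", ""]
        for r in failures:
            # Show only the first stderr line to keep the report compact
            stderr_lines = (r.get("stderr") or "").strip().splitlines()
            detail = stderr_lines[0] if stderr_lines else "no stderr captured"
            lines.append(f"- {r['name']} ({r['status']}): {detail}")
    return "\n".join(lines) + "\n"
```

Because the output is plain Markdown, it can be pasted into a conversation or written to a file without further processing.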


aggregate_benchmark.py: Benchmark aggregation

aggregate_benchmark.py aggregates statistics across multiple evaluation runs to generate benchmark comparisons. It uses the comparator agent to contrast performance differences across versions and detect regressions. This is critical for ensuring improvements don’t regress during Skill iteration.
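
The statistical side of this step can be sketched with the standard `statistics` module. The example below is a plain numeric comparison under assumed names (`summarize`, `is_regression`) and an assumed rule (flag a candidate whose mean drops more than one baseline standard deviation); it does not reproduce the comparator agent, which the script uses for the actual comparison.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation, and run count for one version."""
    return {
        "mean": mean(scores),
        # stdev needs at least two data points
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "runs": len(scores),
    }


def is_regression(baseline: list[float],
                  candidate: list[float],
                  tolerance: float = 1.0) -> bool:
    """Flag the candidate when its mean falls below the baseline mean
    by more than `tolerance` baseline standard deviations."""
    base = summarize(baseline)
    cand = summarize(candidate)
    return cand["mean"] < base["mean"] - tolerance * base["stdev"]
```

Scaling the threshold by the baseline's own spread keeps ordinary run-to-run noise from being reported as a regression.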


utils.py is a collection of shared utility functions used by all scripts, including: log output (log_info / log_error), skill loading (load_skill), test case discovery (find_test_cases), YAML frontmatter parsing (parse_frontmatter), and more. This “shared utils” pattern demonstrates high cohesion and low coupling — each script focuses on its own logic while common operations are centrally maintained.
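
For orientation, here is what a minimal version of such a module might look like. The function names come from the description above, but every body is an assumption: in particular, the returned skill dictionary shape and the "one JSON file per test case" layout are guesses chosen to match how run_eval.py uses `skill["path"]`, `case["name"]`, and `case["prompt"]`.

```python
import json
import sys
from pathlib import Path
from typing import Any


def log_info(message: str) -> None:
    """Print an informational message to stdout."""
    print(f"[INFO] {message}")


def log_error(message: str) -> None:
    """Print an error message to stderr."""
    print(f"[ERROR] {message}", file=sys.stderr)


def load_skill(skill_path: str) -> dict[str, Any]:
    """Load a skill directory; the returned shape is an assumption."""
    root = Path(skill_path)
    skill_md = root / "SKILL.md"
    if not skill_md.is_file():
        raise FileNotFoundError(f"No SKILL.md in {root}")
    return {"path": str(root), "content": skill_md.read_text(encoding="utf-8")}


def find_test_cases(test_dir: str) -> list[dict[str, Any]]:
    """Collect test cases, assuming one JSON file per case with a 'prompt' key."""
    cases = []
    for path in sorted(Path(test_dir).glob("*.json")):
        case = json.loads(path.read_text(encoding="utf-8"))
        case.setdefault("name", path.stem)  # fall back to the file name
        cases.append(case)
    return cases
```

Sorting the discovered files gives a stable test order, which makes results from different runs easier to compare.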


skill-creator script dependency graph

graph LR
  A[run_eval.py] -->|import| U[utils.py]
  B[run_loop.py] -->|call| A
  B -->|import| U
  C[quick_validate.py] -->|import| U
  D[package_skill.py] -->|import| U
  E[improve_description.py] -->|import| U
  F[generate_report.py] -->|import| U
  G[aggregate_benchmark.py] -->|import| U

  B -->|call| E
  A -->|produces JSON| F
  G -->|produces JSON| F

  A --> J[agents/grader.md]
  A --> H[agents/analyzer.md]
  G --> I[agents/comparator.md]

  style U fill:#ffb74d,stroke:#f57c00,color:#000
  style A fill:#81c784,stroke:#388e3c,color:#000
  style B fill:#81c784,stroke:#388e3c,color:#000
  1. Evaluation-Driven Development: The design philosophy is “tests before skill” — every skill optimization is based on quantitative evaluation results
  2. Agent Decomposition: Complex evaluation logic split into 3 independent Agents (analyzer / comparator / grader), each focused on one dimension
  3. Atomic Scripts: Each Python script does one thing (packaging/validation/evaluation/reporting), composed via function calls
  4. Progressive Complexity: run_eval.py runs test cases serially by default but already accepts a --parallel parameter, avoiding premature optimization while keeping the extension path open

“If you want to create a skill creation tool for domain XXX…”

  1. Keep the core evaluation framework (run_eval.py + agents/) — this is the most valuable part
  2. Replace test case format (from skill tests to your domain’s tests)
  3. Rewrite the workflow in SKILL.md (from “skill creation” to your domain workflow)
  4. Keep package_skill.py packaging logic (generic)
  5. Adjust quick_validate.py check rules as needed
  • ⚠️ Don’t write weak descriptions: description is the only trigger mechanism — too short or vague means the skill won’t trigger
  • ⚠️ Make Agent instructions specific: Without clear grading criteria in grader.md, evaluation results become unstable
  • ⚠️ Always set subprocess timeout: When calling external commands, always set timeout to prevent infinite hangs
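
The timeout pitfall is easy to demonstrate with the standard library alone. In this sketch, `run_with_timeout` is a hypothetical wrapper name, and the commands in the usage note simply launch the current Python interpreter so the example has no external dependencies.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], seconds: float) -> dict[str, object]:
    """Run a command, returning a structured result instead of hanging."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind
        return {"status": "timeout", "stdout": "", "returncode": None}
    return {
        "status": "ok" if proc.returncode == 0 else "fail",
        "stdout": proc.stdout,
        "returncode": proc.returncode,
    }
```

For example, `run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)` reports a timeout after about half a second instead of blocking for five.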
| Pattern | Description | Applies to... |
|------|------|------|
| Agent Decomposition | Split complex evaluation into independent Agents | Any skill needing multi-dimensional judgment |
| Eval-Driven Iteration | Quantify → auto-improve → re-evaluate | Skills requiring continuous optimization |
| Script Toolchain | Each script has one responsibility, composed via imports | Any script-driven skill |
| HTML Viewer | Python generates HTML, browser for interactive viewing | Skills needing visual results |
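
The HTML Viewer pattern in the last row can be sketched in a few lines: Python writes a static page and the browser handles viewing. `render_html` is a hypothetical function, unrelated to the real eval-viewer/generate_review.py, and it assumes results carry string `name` and `status` fields.

```python
from html import escape


def render_html(results: list[dict[str, str]]) -> str:
    """Render results as a static HTML table, escaping all user-provided text."""
    rows = "\n".join(
        f"<tr><td>{escape(r['name'])}</td><td>{escape(r['status'])}</td></tr>"
        for r in results
    )
    return (
        "<!doctype html>\n<html><body>\n<table>\n"
        "<tr><th>Test case</th><th>Status</th></tr>\n"
        f"{rows}\n</table>\n</body></html>\n"
    )
```

Escaping every value matters here because test names and outputs may contain markup produced by the model under evaluation.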