Skill Creator: A Toolchain for Creating and Optimizing Skills

One-Line Summary

skill-creator is the meta-tool of the Anthropic Skills ecosystem: a Skill for creating Skills.

Core Capabilities

🔄 Guides users through the complete Skill creation lifecycle: capture intent → research → write → evaluate → optimize → publish
📊 Provides a quantitative evaluation framework that automatically runs test cases and collects the results
🧪 Supports benchmarking to compare performance across iterations
✍️ Includes a description optimizer that automatically improves a Skill's trigger description
📦 Ships a built-in packaging tool that bundles a Skill into a distributable format

Trigger Scenarios

Triggers when the user says "create a skill", "make a skill", "build a skill", "optimize skill description", "evaluate skill", or similar phrases.

File Inventory

skill-creator
- SKILL.md: main entry point · ~200 lines · Markdown
- LICENSE.txt: Apache 2.0
- scripts: Python scripts · 8 scripts plus __init__.py
  - __init__.py: package initializer
  - run_eval.py: 120 lines · ⭐⭐⭐ · core evaluator
  - run_loop.py: 100 lines · ⭐⭐⭐ · iteration runner
  - quick_validate.py: 45 lines · ⭐⭐ · quick validation
  - package_skill.py: 80 lines · ⭐⭐ · packaging tool
  - improve_description.py: 60 lines · ⭐⭐ · description optimizer
  - generate_report.py: 90 lines · ⭐⭐ · report generator
  - aggregate_benchmark.py: 70 lines · ⭐⭐ · benchmark aggregation
  - utils.py: 50 lines · ⭐ · utility functions
- agents: sub-agent definitions · 3 files
  - analyzer.md: analyzer agent
  - comparator.md: comparator agent
  - grader.md: grader agent
- references
  - schemas.md: JSON Schema reference
- eval-viewer: evaluation viewer
  - generate_review.py: HTML report generator
  - viewer.html: interactive evaluation viewer
- assets
  - eval_review.html: HTML template for the evaluation report

Directory Structure Analysis

skill-creator is currently the most complex Skill in the repository. Its structure shows the complete "script-driven" Skill pattern:

SKILL.md: ~200 lines describing the full Skill creation workflow in 7 phases
scripts/: 8 Python scripts forming a complete toolchain. The evaluation system (run_eval.py + run_loop.py) is the core, supplemented by packaging, validation, and reporting tools
agents/: 3 sub-agent definitions for analyzing results, comparing differences, and grading. This is the classic pattern of decomposing complex evaluation logic into independent agents
eval-viewer/: a standalone HTML evaluation viewer that illustrates when a Skill warrants its own user interface

SKILL.md Structure Analysis

skill-creator's SKILL.md is ~200 lines long with a clear structural hierarchy:

Overview (lines 6-26): describes the full workflow from intent capture to publishing
Communication Guide (lines 32-41): how Claude should talk with users of varying technical levels
Creation Flow (lines 47-68): a detailed 4-step creation process
Skill Writing Guide (lines 71+): Skill structure, best practices, triggering tips

YAML Frontmatter Analysis

SKILL.md: YAML Frontmatter

```yaml
---
name: skill-creator
description: Create new skills, modify and improve existing skills,
  and measure skill performance. Use when users want to create a
  skill from scratch, edit, or optimize an existing skill, run evals
  to test a skill, benchmark skill performance with variance analysis,
  or optimize a skill's description for better triggering accuracy.
---
```

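Code Walkthrough

name: follows the lowercase-plus-hyphens naming convention and states the core function directly.
description: the key to the trigger mechanism. It lists six concrete scenarios (create / edit / optimize / run evals / benchmark / optimize description) so that the skill triggers correctly in many contexts. Note the "pushy" style: every likely user intent is spelled out explicitly.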
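The two conventions noted above (a lowercase, hyphenated name and a description long enough to carry trigger phrases) can be checked mechanically. The following is a minimal sketch, not part of skill-creator: the helper name check_frontmatter_fields, the regular expression, and the 50-character default are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for the "lowercase + hyphens" convention:
# lowercase alphanumeric words joined by single hyphens.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_frontmatter_fields(
    name: str, description: str, min_desc_len: int = 50
) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(f"name {name!r} is not lowercase-with-hyphens")
    if len(description.strip()) < min_desc_len:
        problems.append(f"description is shorter than {min_desc_len} characters")
    return problems


if __name__ == "__main__":
    good = "Create new skills, modify and improve existing skills, and measure skill performance."
    print(check_frontmatter_fields("skill-creator", good))
    print(check_frontmatter_fields("Skill_Creator", "Makes skills."))
```

A length threshold is only a crude proxy for trigger quality; it catches empty or one-word descriptions but says nothing about whether the right user intents are listed.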

Module Relationships

The core of skill-creator is the evaluation system, around which the scripts and agents are organized:

skill-creator: Script and Agent Relationships

```mermaid
graph TD
  SKILL[SKILL.md] -->|guides| Claude
  Claude -->|invokes| run_eval[run_eval.py]
  Claude -->|invokes| run_loop[run_loop.py]
  Claude -->|invokes| quick_validate[quick_validate.py]
  Claude -->|invokes| package_skill[package_skill.py]
  Claude -->|invokes| improve_description[improve_description.py]

  run_eval --> utils[utils.py]
  run_loop --> utils
  quick_validate --> utils
  run_eval --> generate_report[generate_report.py]
  run_loop --> generate_report
  aggregate_benchmark[aggregate_benchmark.py] --> generate_report

  run_eval --> grader[agents/grader.md]
  run_eval --> analyzer[agents/analyzer.md]
  aggregate_benchmark --> comparator[agents/comparator.md]

  generate_report --> generate_review[eval-viewer/generate_review.py]
  generate_review --> viewer[eval-viewer/viewer.html]

  style SKILL fill:#4fc3f7,stroke:#0288d1,color:#000
  style run_eval fill:#81c784,stroke:#388e3c,color:#000
  style run_loop fill:#81c784,stroke:#388e3c,color:#000
  style generate_report fill:#ffb74d,stroke:#f57c00,color:#000
  style grader fill:#ce93d8,stroke:#7b1fa2,color:#000
```

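Complete Script Inventory

| Script | Language | Lines | Complexity | Function |
|------|------|------|--------|------|
| run_eval.py | Python | ~120 | ⭐⭐⭐ | Core evaluator: loads the skill, runs test cases, collects results |
| run_loop.py | Python | ~100 | ⭐⭐⭐ | Iteration loop: reruns the evaluation until it converges |
| quick_validate.py | Python | ~45 | ⭐⭐ | Quick validation: checks the skill for basic problems |
| package_skill.py | Python | ~80 | ⭐⭐ | Packager: bundles the skill into a distributable zip archive |
| improve_description.py | Python | ~60 | ⭐⭐ | Description optimizer: analyzes and improves the skill's description field |
| generate_report.py | Python | ~90 | ⭐⭐ | Report generator: summarizes evaluation results into a Markdown report |
| aggregate_benchmark.py | Python | ~70 | ⭐⭐ | Benchmark aggregator: statistical summary across multiple runs |
| utils.py | Python | ~50 | ⭐ | Utility functions: logging, file operations, and other shared helpers |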

run_eval.py: Core Evaluator

run_eval.py is the core of skill-creator. It loads a skill, runs a set of test cases against it, collects Claude's outputs, and passes the results to the grader agent for scoring.

```python
"""Run evaluations for a skill."""
import argparse
import json
import os
import subprocess
import sys
from typing import Any

# Add this script's directory to sys.path so the utils import resolves
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from utils import log_info, log_error, load_skill, find_test_cases


def main():
    parser = argparse.ArgumentParser(description="Run skill evaluations")
    parser.add_argument("skill_path", help="Path to the skill directory")
    parser.add_argument("--test-dir", default="./tests",
                        help="Directory containing test cases")
    parser.add_argument("--output", default="./results.json",
                        help="Output file for results")
    # Parsed but not used below: the loop in this listing runs serially
    parser.add_argument("--parallel", type=int, default=1,
                        help="Number of parallel workers")
    args = parser.parse_args()

    skill = load_skill(args.skill_path)
    test_cases = find_test_cases(args.test_dir)
    log_info(f"Found {len(test_cases)} test cases")

    results = []
    for case in test_cases:
        result = run_single_test(skill, case)
        results.append(result)
        log_info(f"  {case['name']}: {result['status']}")

    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)
    log_info(f"Results written to {args.output}")


def run_single_test(skill: dict, case: dict) -> dict[str, Any]:
    """Run a single test case and return the result."""
    try:
        output = subprocess.run(
            ["claude", "--skill", skill["path"], "--prompt", case["prompt"]],
            capture_output=True,
            text=True,
            timeout=300,
        )
        return {
            "name": case["name"],
            "status": "pass" if output.returncode == 0 else "fail",
            "output": output.stdout,
            "stderr": output.stderr,
            "returncode": output.returncode,
        }
    except subprocess.TimeoutExpired:
        return {
            "name": case["name"],
            "status": "timeout",
            "output": "",
            "stderr": "Test timed out after 300s",
        }
    except Exception as e:
        return {
            "name": case["name"],
            "status": "error",
            "output": "",
            "stderr": str(e),
        }


# Entry-point guard so the script does something when run directly
if __name__ == "__main__":
    main()
```

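Code Walkthrough

Module docstring: a one-line description of the script's purpose, a Python best practice.
Imports: argparse parses command-line arguments; subprocess runs the external command (the Claude CLI).
sys.path.insert: adds the script's own directory to sys.path so that the utils module can be imported. This is common in standalone tool scripts.
main(): the standard entry point. argparse defines one positional argument (skill_path) and three optional ones (--test-dir, --output, --parallel).
--parallel: reserves room for concurrency with a default of 1 (serial). The listing parses the flag but does not use it yet. This is a good example of progressive complexity: implement the serial path first, add concurrency later.
load_skill() and find_test_cases(): both come from utils.py, which extracts reusable loading logic into a shared module.
Main loop: runs the test cases one at a time. Simple and clear, with no premature optimization.
run_single_test(): the core execution logic. It calls the Claude CLI through subprocess.run.
subprocess.run arguments: capture_output=True captures stdout and stderr, text=True decodes bytes to strings, and timeout=300 keeps a test from hanging indefinitely.
Exception handling: timeouts and generic exceptions are handled separately, and each returns structured error information.
Return values: every path returns a dict with the same core keys (name, status, output, stderr), so callers can process results uniformly. This is good API design.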
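The --parallel flag is parsed in the listing above but never consulted. One plausible way to wire it in is a thread pool, since each test spends nearly all of its time waiting on a subprocess. The sketch below works under that assumption and is not the project's implementation; run_all and the worker callable are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_all(
    cases: list[dict],
    worker: Callable[[dict], dict[str, Any]],
    parallel: int = 1,
) -> list[dict[str, Any]]:
    """Run `worker` over every case, preserving input order in the results.

    `worker` stands in for something like run_single_test bound to a skill
    (hypothetical wiring, not taken from the original script).
    """
    if parallel <= 1:
        # Serial path: equivalent to a plain for-loop.
        return [worker(case) for case in cases]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # Executor.map yields results in input order, even when the
        # underlying calls finish out of order.
        return list(pool.map(worker, cases))


if __name__ == "__main__":
    demo_cases = [{"name": f"case-{i}"} for i in range(4)]

    def fake_worker(case: dict) -> dict[str, Any]:
        return {"name": case["name"], "status": "pass"}

    for row in run_all(demo_cases, fake_worker, parallel=2):
        print(row)
```

Keeping the serial branch separate means the default behavior stays identical to the original loop, so the flag only changes anything when a user opts in.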

run_loop.py: Iteration Runner

run_loop.py encapsulates the logic of running the evaluation repeatedly until the results converge or the maximum number of iterations is reached. It is the core driver of the "evaluate-optimize" loop.

run_loop.py (key logic excerpt)

```python
def run_loop(skill_path: str, test_dir: str, max_iters: int = 5):
    """Run evaluation loop, iterating until convergence."""
    previous_score = 0
    results_history = []

    for i in range(max_iters):
        log_info(f"--- Iteration {i + 1}/{max_iters} ---")

        # Run evaluation
        result = run_eval(skill_path, test_dir)
        current_score = result.get("average_score", 0)
        results_history.append(result)

        log_info(f"Score: {previous_score} -> {current_score}")

        # Check convergence
        if current_score <= previous_score:
            log_info("Score stopped improving, stopping.")
            break

        previous_score = current_score

        # Improve the skill based on feedback
        if i < max_iters - 1:
            improve_skill(skill_path, result.get("feedback", []))

    return results_history
```

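Code Walkthrough

max_iters=5: the default caps the number of iterations, which rules out an infinite loop while leaving enough room to iterate.
Logging: each iteration logs its progress so the user can follow along.
run_eval(): the call reuses the core evaluation logic from run_eval.py. The scripts interface with each other through plain function calls.
Convergence check: the loop stops as soon as the score no longer improves, a simple and effective early-stopping strategy. Because previous_score starts at 0, a first-round score of 0 also ends the loop.
improve_skill(): the skill is improved automatically after each evaluation round, which closes the evaluate-improve loop. The if i < max_iters - 1 guard skips the improvement step after the final iteration.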
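The stopping behavior is easier to see with the evaluation and improvement steps stubbed out. The sketch below reproduces the same control flow with injected callables; run_until_converged and the sample scores are hypothetical and exist only to make the excerpt's logic runnable in isolation.

```python
from typing import Callable


def run_until_converged(
    evaluate: Callable[[], float],
    improve: Callable[[], None],
    max_iters: int = 5,
) -> list[float]:
    """Evaluate, then improve, until the score stops rising or max_iters is hit."""
    previous_score = 0.0
    history: list[float] = []
    for i in range(max_iters):
        current_score = evaluate()
        history.append(current_score)
        if current_score <= previous_score:
            break  # early stop: no improvement over the previous round
        previous_score = current_score
        if i < max_iters - 1:
            improve()  # skipped after the final iteration
    return history


if __name__ == "__main__":
    # Hypothetical scores: two improvements, then a plateau.
    scores = iter([0.50, 0.70, 0.70, 0.90])
    improvements: list[str] = []
    history = run_until_converged(
        evaluate=lambda: next(scores),
        improve=lambda: improvements.append("edit"),
    )
    print(history)            # [0.5, 0.7, 0.7]
    print(len(improvements))  # 2
```

Note that the plateau at 0.70 ends the run before the 0.90 score is ever reached. A strict "no improvement means stop" rule is cheap, but it can quit early when scores are noisy.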

quick_validate.py: Quick Validation

A lightweight validation tool that quickly checks a skill's basic compliance before a full evaluation.

quick_validate.py (key logic excerpt)

```python
import os

# Shared helper; this document lists parse_frontmatter among the utils.py functions
from utils import parse_frontmatter


def validate_skill(skill_path: str) -> list[str]:
    """Quick validation of a skill. Returns list of issues."""
    issues = []

    # Check SKILL.md exists
    skill_md = os.path.join(skill_path, "SKILL.md")
    if not os.path.exists(skill_md):
        issues.append("Missing SKILL.md")
        return issues  # Can't continue without it

    # Parse frontmatter
    with open(skill_md, "r") as f:
        content = f.read()

    frontmatter = parse_frontmatter(content)
    if not frontmatter:
        issues.append("Invalid or missing YAML frontmatter")
        return issues

    # Check required fields
    if "name" not in frontmatter:
        issues.append("Missing required field: name")
    if "description" not in frontmatter:
        issues.append("Missing required field: description")

    # Check description quality
    desc = frontmatter.get("description", "")
    if len(desc) < 50:
        issues.append(f"Description too short ({len(desc)} chars, want >= 50)")

    # Check for scripts directory
    scripts_dir = os.path.join(skill_path, "scripts")
    if os.path.exists(scripts_dir):
        py_files = [f for f in os.listdir(scripts_dir) if f.endswith(".py")]
        if not py_files:
            issues.append("scripts/ directory exists but contains no .py files")

    return issues
```

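Code Walkthrough

SKILL.md check: the first gate. SKILL.md must exist; if it is missing, the function returns immediately because nothing else can be checked.
parse_frontmatter(): parses the YAML frontmatter. The helper presumably lives in utils.py, reusing a shared utility function.
Required fields: name and description are checked, matching the minimum requirements of the Agent Skills Spec.
Description quality: a description under 50 characters is flagged. This is a heuristic that quickly filters out low-quality descriptions.
scripts/ check: an optional directory check. If scripts/ exists but holds no .py files, it is flagged, since this likely indicates a misconfiguration.
Return value: a list of issues. An empty list means the skill passed; a non-empty list enumerates every problem found. This is a clear API contract.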
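validate_skill depends on parse_frontmatter, whose body is not shown. The following is a minimal standard-library sketch of what such a helper might look like, assuming flat key: value frontmatter with indented continuation lines. It is an illustration only, not the code in utils.py, and a real implementation would more likely delegate to a YAML library.

```python
def parse_frontmatter(content: str) -> dict[str, str]:
    """Parse a leading '---' delimited block of flat `key: value` pairs.

    Returns an empty dict when no well-formed frontmatter is found.
    Indented lines are treated as continuations of the previous value.
    """
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    current_key = None
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        if line[:1].isspace() and current_key is not None:
            # Continuation line: fold it into the previous value.
            fields[current_key] = (fields[current_key] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            fields[current_key] = value.strip()
    return {}  # no closing '---' found


if __name__ == "__main__":
    doc = (
        "---\n"
        "name: skill-creator\n"
        "description: Create new skills,\n"
        "  and measure them.\n"
        "---\n"
        "# Body\n"
    )
    print(parse_frontmatter(doc))
```

Returning an empty dict for every malformed case fits the caller shown above, which treats any falsy result as "Invalid or missing YAML frontmatter".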

package_skill.py: Packaging Tool

package_skill.py bundles a complete skill directory into a distributable zip archive. It automatically includes SKILL.md and the scripts/, agents/, and references/ directories while excluding __pycache__ and other unneeded files. The packaged result can be shared directly with other Claude users.
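The source of package_skill.py is not reproduced here, so the following is only a sketch of the behavior described above, built on the standard zipfile module. The exclusion sets and the choice to nest entries under the skill's folder name are assumptions, not details confirmed by the original script.

```python
import zipfile
from pathlib import Path

# Hypothetical exclusion rules; the real script's list is not shown in this document.
EXCLUDED_DIRS = {"__pycache__"}
EXCLUDED_SUFFIXES = {".pyc"}


def package_skill(skill_dir: str, archive_path: str) -> list[str]:
    """Zip a skill directory and return the archived paths relative to it."""
    root = Path(skill_dir).resolve()
    archive = Path(archive_path).resolve()
    if not (root / "SKILL.md").is_file():
        raise FileNotFoundError(f"{root} has no SKILL.md")
    archived = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            rel = path.relative_to(root)
            if not path.is_file() or path == archive:
                continue  # skip directories and the archive itself
            if EXCLUDED_DIRS & set(rel.parts) or path.suffix in EXCLUDED_SUFFIXES:
                continue
            # Nest entries under the skill's folder name so the archive
            # unpacks into a single directory.
            zf.write(path, arcname=(Path(root.name) / rel).as_posix())
            archived.append(rel.as_posix())
    return archived
```

Checking for SKILL.md before writing anything avoids producing an archive that no consumer could load.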

improve_description.py: Description Optimizer

improve_description.py analyzes the description field in a skill's YAML frontmatter and improves triggering accuracy by comparing it against actual user conversation patterns. It uses the grader agent to evaluate how well the current description triggers, then generates an improved version. It is the only script that uses an LLM to optimize a skill's own metadata.

generate_report.py: Report Generator

generate_report.py takes the JSON output of run_eval.py and summarizes it into a structured Markdown report. The report includes the pass/fail status of each test case, the average score, a categorization of failure modes, and improvement suggestions. Reports can be embedded directly into Claude's conversation context to help users understand the quality of a skill.
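As a rough illustration of the JSON-to-Markdown step, the sketch below renders a list of result dicts shaped like those returned by run_single_test. It covers only the pass/fail summary and a failure list; the scoring, failure-mode categorization, and suggestions mentioned above are omitted. render_report is a hypothetical name, not the script's actual function.

```python
from collections import Counter


def render_report(results: list[dict]) -> str:
    """Render run_eval-style result dicts as a small Markdown report."""
    counts = Counter(r["status"] for r in results)
    total = len(results)
    passed = counts.get("pass", 0)
    rate = (passed / total * 100) if total else 0.0
    lines = [
        "# Evaluation Report",
        "",
        f"Passed {passed} of {total} test cases ({rate:.0f}%).",
        "",
        "| Test case | Status |",
        "|---|---|",
    ]
    lines += [f"| {r['name']} | {r['status']} |" for r in results]
    failures = [r for r in results if r["status"] != "pass"]
    if failures:
        lines += ["", "## Failures", ""]
        for r in failures:
            # The first stderr line is usually enough to recognize the failure.
            first_line = (r.get("stderr") or "").strip().splitlines()[:1]
            detail = first_line[0] if first_line else "no stderr captured"
            lines.append(f"- {r['name']} ({r['status']}): {detail}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        {"name": "basic", "status": "pass", "output": "ok", "stderr": ""},
        {"name": "slow", "status": "timeout", "output": "", "stderr": "Test timed out after 300s"},
    ]
    print(render_report(sample))
```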

aggregate_benchmark.py: Benchmark Aggregation

aggregate_benchmark.py aggregates statistics across multiple evaluation runs to produce benchmark comparisons. It uses the comparator agent to contrast how different versions perform and to detect regressions. This matters during Skill iteration because it confirms that an improvement has not made something else worse.
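The statistical side of this step can be sketched with the standard statistics module. The example below summarizes repeated runs per version and applies a simple threshold rule for regressions. The function names and the "more than one baseline standard deviation below the baseline mean" rule are assumptions chosen for illustration; the real script delegates the comparison to the comparator agent.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Mean and sample standard deviation of one version's repeated runs."""
    if not scores:
        raise ValueError("need at least one score")
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": mean(scores), "stdev": spread, "runs": len(scores)}


def is_regression(
    baseline: list[float], candidate: list[float], tolerance: float = 1.0
) -> bool:
    """Flag a regression when the candidate mean falls below the baseline mean
    by more than `tolerance` baseline standard deviations (a hypothetical rule)."""
    base = summarize_runs(baseline)
    cand = summarize_runs(candidate)
    return cand["mean"] < base["mean"] - tolerance * base["stdev"]


if __name__ == "__main__":
    v1 = [0.80, 0.82, 0.81]  # hypothetical scores for the previous version
    v2 = [0.60, 0.62, 0.61]  # hypothetical scores for the new version
    print(summarize_runs(v1))
    print(is_regression(v1, v2))
```

Comparing against the run-to-run spread, rather than a single score, keeps ordinary noise between runs from being reported as a regression.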

utils.py: Utility Functions

utils.py is the collection of utility functions shared by all scripts. It includes log output (log_info / log_error), skill loading (load_skill), test case discovery (find_test_cases), YAML frontmatter parsing (parse_frontmatter), and more. This "shared utils" pattern reflects high cohesion and low coupling: each script focuses on its own logic while common operations are maintained in one place.

Script Relationship Diagram

skill-creator: Script Dependency Graph

```mermaid
graph LR
  A[run_eval.py] -->|import| U[utils.py]
  B[run_loop.py] -->|call| A
  B -->|import| U
  C[quick_validate.py] -->|import| U
  D[package_skill.py] -->|import| U
  E[improve_description.py] -->|import| U
  F[generate_report.py] -->|import| U
  G[aggregate_benchmark.py] -->|import| U

  B -->|call| E
  A -->|produces JSON| F
  G -->|produces JSON| F

  A --> GR[agents/grader.md]
  A --> H[agents/analyzer.md]
  G --> I[agents/comparator.md]

  style U fill:#ffb74d,stroke:#f57c00,color:#000
  style A fill:#81c784,stroke:#388e3c,color:#000
  style B fill:#81c784,stroke:#388e3c,color:#000
```

Design Highlights

Evaluation-Driven Development: the design philosophy is "tests before skill". Every skill optimization is based on quantitative evaluation results
Agent Decomposition: complex evaluation logic is split into 3 independent agents (analyzer / comparator / grader), each focused on one dimension
Atomic Scripts: each Python script does one thing (packaging / validation / evaluation / reporting), and the scripts are composed via function calls
Progressive Complexity: run_eval.py runs serially by default but reserves a --parallel parameter, which avoids premature optimization while keeping the extension path open

Reusable Patterns

| Pattern | Description | Applies to... |
|------|------|------|
| Agent Decomposition | Split complex evaluation into several independent agents | Any skill that needs multi-dimensional judgment |
| Eval-Driven Iteration | Quantitative evaluation → automatic improvement → re-evaluation | Skills that require continuous optimization |
| Script Toolchain | Each script has one responsibility; scripts are composed via imports | Any script-driven skill |
| HTML Viewer | Python generates HTML that is viewed interactively in the browser | Skills that need visual results |

Porting Guide

"If you want to build a skill creation tool for domain XXX…"

Keep the core evaluation framework (run_eval.py + agents/); this is the most valuable part
Replace the test case format (from skill tests to your domain's tests)
Rewrite the workflow in SKILL.md (from "skill creation" to your domain's workflow)
Keep the packaging logic in package_skill.py (it is generic)
Adjust the check rules in quick_validate.py as needed

Common Pitfalls

⚠️ Don't write weak descriptions: the description is the only trigger mechanism, so one that is too short or too vague means the skill won't trigger
⚠️ Make agent instructions specific: without clear grading criteria in grader.md, evaluation results become unstable
⚠️ Always set a subprocess timeout: a call to an external command without a timeout can hang indefinitely