We Got Claude to Build CUDA Kernels and Teach Open Models!
Summary
The `upskill` tool enables the transfer of complex capabilities from large-scale "teacher" models, such as Claude Opus 4.5, to smaller or open-source "student" models using "agent skills." By converting execution traces into structured documentation and automated test cases, developers can upskill smaller models to perform highly specialized tasks like CUDA kernel development.
Key Points
- `upskill` utilizes a standardized directory-based skill format located at `{agent}/skills/{skill_name}/`, consisting of `SKILL.md` for instructions and `skill_meta.json` for metadata and test cases.
- The tool can significantly improve model accuracy on specialized tasks; for example, benchmarks showed an increase from 60% to 95% accuracy for certain models when using a generated skill.
- `upskill` supports evaluation across various providers, including Anthropic, OpenAI, and local models via OpenAI-compatible endpoints.
- Beyond accuracy, the tool can be used to optimize token usage, though developers must monitor for cases where skills increase token consumption without performance gains.
- The `upskill generate` command can create skills from raw agent traces (`--from ./trace.md`) or iteratively refine existing skills (`--from ./existing_skill/`).
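The directory-based format above can be sketched as follows. The path structure and file names (`SKILL.md`, `skill_meta.json`) match the documented format; the agent name, skill name, and the JSON fields shown are placeholder assumptions, since the article does not specify the metadata schema.

```shell
# Illustrative skill layout: {agent}/skills/{skill_name}/ per the documented
# format. "my-agent" and "cuda-kernels" are placeholder names.
mkdir -p my-agent/skills/cuda-kernels

# SKILL.md holds the instructions injected into the student model's context.
cat > my-agent/skills/cuda-kernels/SKILL.md <<'EOF'
# CUDA kernel development
Domain-specific instructions distilled from the teacher model's trace go here.
EOF

# skill_meta.json holds metadata and test cases; these fields are hypothetical.
cat > my-agent/skills/cuda-kernels/skill_meta.json <<'EOF'
{"name": "cuda-kernels", "test_cases": []}
EOF

ls my-agent/skills/cuda-kernels
```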
Technical Details
The upskill workflow relies on capturing a successful execution trace from a high-capacity model. Using the `upskill generate` command, the tool parses this trace to produce a `SKILL.md` file that encodes domain-specific expertise, such as project structures, build configurations (e.g., `build.toml`), and hardware-specific optimizations (e.g., targeting NVIDIA H100 compute capability 9.0). Simultaneously, the tool generates a suite of test cases based on the trace to facilitate automated benchmarking.
To validate the effectiveness of a skill, the `upskill eval` command is used to run the generated test cases against a set of target models, comparing performance "with skill" against a "baseline" (without skill). This allows for the identification of models that benefit from the context injection versus those that may suffer from increased latency or token overhead. The tool is compatible with Python environments via `pip install upskill` or `uvx upskill`.
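The with-skill vs. baseline comparison can be sketched as a small post-processing step. This is not `upskill`'s actual output format; the result structure, field names, and numbers below are illustrative (the accuracy figures echo the 60% to 95% example from the article).

```python
# Hypothetical post-processing of eval results: for each model, compare the
# "with skill" run against the "baseline" run on accuracy and token usage.
def compare(results):
    """Return per-model accuracy delta and token overhead (illustrative)."""
    report = {}
    for model, runs in results.items():
        base, skilled = runs["baseline"], runs["with_skill"]
        report[model] = {
            "accuracy_delta": skilled["accuracy"] - base["accuracy"],
            "token_overhead": skilled["tokens"] - base["tokens"],
        }
    return report

# Placeholder data shaped like a two-condition eval run.
results = {
    "open-model-7b": {
        "baseline":   {"accuracy": 0.60, "tokens": 1200},
        "with_skill": {"accuracy": 0.95, "tokens": 1650},
    },
}
report = compare(results)
```

A large positive `accuracy_delta` with modest `token_overhead` suggests the skill is worth injecting; a near-zero delta with high overhead is the failure mode the article warns about, where a skill adds tokens without adding performance.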
Impact / Why It Matters
This methodology provides a cost-effective alternative to fine-tuning by allowing developers to inject specialized knowledge into smaller, cheaper, or locally hosted models. It enables the deployment of high-performance, domain-specific agents that can handle complex engineering tasks with minimal infrastructure overhead.