From Zero to GPU: Building Production-Ready CUDA Kernels with Kernel Builder
TL;DR
- The Hugging Face kernel-builder streamlines creating, building, and publishing custom CUDA kernels that register as native PyTorch operators.
- Use a reproducible Nix-based workflow (flake.nix, nix develop, nix build) to iterate locally and build for multiple PyTorch/CUDA variants for distribution.
- Register operators with PyTorch APIs so they are visible to torch.compile and can have CPU and CUDA backends managed by PyTorch’s dispatcher.
- Publish build artifacts to the Hub; use semantic version tags and the kernels library to manage versions and pin builds at project level.
Context and background
Custom CUDA kernels are a common way to accelerate model inference and data processing on GPUs. Writing a performant GPU kernel is only part of the work: production use requires reproducible builds, multi-architecture support, robust registration into the PyTorch ecosystem, and an easy distribution mechanism for downstream users. Hugging Face created the kernel-builder library and an associated workflow to make that full lifecycle practical. The toolchain emphasizes reproducible development shells, automated multi-variant builds across supported PyTorch and CUDA versions, and straightforward distribution via the Hugging Face Hub.
What’s new
This guide explains how to move from a single GPU function to a production-ready kernel pipeline using kernel-builder. Key elements demonstrated include:
- Project scaffolding expected by kernel-builder, including the main build orchestration files and a `flake.nix` to pin dependencies.
- Writing a CUDA implementation for an example kernel (RGB to grayscale conversion) and exposing it as a native PyTorch operator.
- Using Nix development shells for fast iteration and explicit builds for particular PyTorch and CUDA combinations, for example PyTorch 2.7 and CUDA 12.6.
- Building for multiple supported variants and publishing the resulting artifacts to a Hub repository so others can load the kernel directly from the Hub.
Why it matters (impact for developers and enterprises)
- Performance: Custom kernels can deliver a large performance advantage when integrated as native operators and fused by PyTorch’s compilation and dispatch systems.
- Reproducibility: `flake.nix` and Nix shells remove “it works on my machine” problems by locking exact toolchain versions and kernel-builder releases.
- Compatibility: Registering the operator via PyTorch’s library macros makes the kernel visible to `torch.compile` and enables multiple backend implementations (CUDA, CPU) to coexist and be selected automatically based on tensor device.
- Distribution and operations: Publishing prebuilt artifacts to the Hugging Face Hub removes typical packaging friction. Semantic versioning and the kernels locking mechanism allow teams to control upgrades and ensure reproducible deployments across users.
Technical details or Implementation
Project layout and build orchestration
A predictable repository structure is the foundation. The kernel-builder expects specific files and directories to orchestrate compilation, registration, and packaging. Example files and their roles:

| File / Directory | Purpose |
|---|---|
| `flake.nix` | Pins kernel-builder and toolchain dependencies for reproducible builds |
| `csrc/img2gray.cu` | CUDA implementation of the example RGB-to-grayscale kernel |
| `torch-ext/torch_binding.cpp` | C++ binding that registers the kernel as a native PyTorch operator |
| `torch-ext/img2gray/__init__.py` | Python package that exposes the registered operator |
| `CMakeLists.txt`, `pyproject.toml`, `setup.py`, `cmake/` | Build orchestration and packaging |
CUDA kernel and PyTorch registration
The example kernel uses a 2D grid of CUDA threads, which aligns naturally with image shapes. More importantly, the kernel is registered as a native PyTorch operator so it appears under `torch.ops`; registration is handled in C++ binding files.
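To make the 2D-grid layout concrete, here is a minimal sketch of the kind of CUDA source the guide describes (a `csrc/img2gray.cu` with a per-pixel kernel plus a host-side launcher). The kernel name, signature, and launch configuration are illustrative assumptions and may differ from the published drbh/img2gray code:

```cuda
// csrc/img2gray.cu -- hedged sketch of an RGB-to-grayscale kernel.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <cstdint>
#include <cuda_runtime.h>

// One thread per output pixel, laid out on a 2D grid that mirrors the image shape.
__global__ void img2gray_kernel(const uint8_t* __restrict__ rgb,
                                uint8_t* __restrict__ gray,
                                int width, int height) {
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < width && y < height) {
    const int pixel = y * width + x;
    // Standard luminance weights for RGB -> grayscale.
    gray[pixel] = static_cast<uint8_t>(0.299f * rgb[pixel * 3] +
                                       0.587f * rgb[pixel * 3 + 1] +
                                       0.114f * rgb[pixel * 3 + 2]);
  }
}

// Host wrapper: validates the input, allocates the output, and launches the
// kernel on PyTorch's current CUDA stream.
at::Tensor img2gray_cuda(const at::Tensor& image) {
  TORCH_CHECK(image.is_cuda(), "image must be a CUDA tensor");
  TORCH_CHECK(image.scalar_type() == at::kByte, "image must be uint8");
  TORCH_CHECK(image.dim() == 3 && image.size(2) == 3, "expected an HxWx3 image");
  auto input = image.contiguous();
  const int height = static_cast<int>(input.size(0));
  const int width = static_cast<int>(input.size(1));
  auto gray = at::empty({height, width}, input.options());

  const dim3 block(16, 16);
  const dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
  img2gray_kernel<<<grid, block, 0, at::cuda::getCurrentCUDAStream().stream()>>>(
      input.data_ptr<uint8_t>(), gray.data_ptr<uint8_t>(), width, height);
  return gray;
}
```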
The guide highlights two registration-related benefits, both visible in the binding sketch below:
- Compatibility with `torch.compile`: by registering the operator via PyTorch library macros (e.g., using TORCH_LIBRARY_EXPAND), the operator is visible to PyTorch’s compilation pipeline, allowing fusion and other optimizations.
- Backend flexibility: the registration pattern allows adding a CPU implementation with a macro like `TORCH_LIBRARY_IMPL(img2gray, CPU, ...)` so PyTorch will dispatch to the right backend according to the input tensor device.
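A hedged sketch of the binding side (`torch-ext/torch_binding.cpp`): the operator schema is declared once, then CUDA and CPU implementations are registered so PyTorch’s dispatcher selects by device. The plain `TORCH_LIBRARY` / `TORCH_LIBRARY_IMPL` macros are shown for clarity; kernel-builder’s TORCH_LIBRARY_EXPAND wrapper serves the same purpose while injecting the per-variant extension name. Function names and the schema string are illustrative assumptions:

```cpp
// torch-ext/torch_binding.cpp -- hedged registration sketch.
#include <ATen/ATen.h>
#include <cstdint>
#include <torch/library.h>

// CUDA entry point, defined in csrc/img2gray.cu (see the kernel sketch above).
at::Tensor img2gray_cuda(const at::Tensor& image);

// Simple CPU fallback so the same operator also works on CPU tensors.
at::Tensor img2gray_cpu(const at::Tensor& image) {
  TORCH_CHECK(image.dim() == 3 && image.size(2) == 3, "expected an HxWx3 uint8 image");
  auto input = image.contiguous();
  const int64_t height = input.size(0);
  const int64_t width = input.size(1);
  auto gray = at::empty({height, width}, input.options());
  auto in = input.accessor<uint8_t, 3>();
  auto out = gray.accessor<uint8_t, 2>();
  for (int64_t y = 0; y < height; ++y) {
    for (int64_t x = 0; x < width; ++x) {
      out[y][x] = static_cast<uint8_t>(
          0.299f * in[y][x][0] + 0.587f * in[y][x][1] + 0.114f * in[y][x][2]);
    }
  }
  return gray;
}

// Declare the operator schema once...
TORCH_LIBRARY(img2gray, m) {
  m.def("img2gray(Tensor image) -> Tensor");
}

// ...then register one implementation per backend; the dispatcher picks
// the right one based on the input tensor's device.
TORCH_LIBRARY_IMPL(img2gray, CUDA, m) {
  m.impl("img2gray", &img2gray_cuda);
}

TORCH_LIBRARY_IMPL(img2gray, CPU, m) {
  m.impl("img2gray", &img2gray_cpu);
}
```

Once registered, the operator appears under `torch.ops` (under the extension’s expanded namespace in a real kernel-builder build) and can participate in `torch.compile` graphs.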
Iterative local development with Nix
For fast iteration, use `nix develop` to enter a development shell that installs the exact CUDA and PyTorch versions you need. This reduces rebuild friction compared to re-invoking full multi-variant builds after each change. Build a single variant locally with `nix build . -L`, or enter the dev shell for editable installs and rapid testing. The example builds for PyTorch 2.7 with CUDA 12.6 in the dev shell.
Multi-variant builds and compliant kernels
To reach a broad audience, build artifacts must be produced for all supported PyTorch and CUDA combinations. kernel-builder automates multi-variant builds and maintains a list of supported build variants; a compliant kernel is one that can be built and run across those supported versions. The multi-variant process can be time-consuming and produces build outputs in a `result` directory, which must be moved into the expected build directory before publishing.
Publishing and distribution
Once builds are ready, create a Hub repository, ensure you are logged in with `huggingface-cli login`, connect the project to the repo, and push. The example kernel and its build variants are published at https://huggingface.co/drbh/img2gray.
Loading kernels in applications
The kernels library does not use traditional local installation. Instead, users load kernels directly from a Hub repository; loading automatically registers the operator in the running Python process. For reproducible deployments and coordinated upgrades, the guide recommends semantic version tags (format `vX.Y.Z`) and version bounds when fetching a kernel: for example, tag releases like `v1.1.2` so downstream code can request a range such as at least 1.1.2 but less than 2.0.0. Project-level kernel requirements can be declared in `pyproject.toml` under `tool.kernels`, and the `kernels` CLI can lock those requirements to a `kernels.lock` file that should be committed.
Versioning and lifecycle management
Using Git tags and the kernels lock file reduces breaking changes for downstream users. Users can also pin to a specific Git commit, but semantic versioning offers clearer upgrade semantics and compatibility guarantees for minor and patch releases.
Key takeaways
- Use kernel-builder and a `flake.nix` to make kernel builds reproducible and portable.
- Register kernels as native PyTorch operators so they are visible to `torch.compile` and the PyTorch dispatcher.
- Use Nix dev shells for fast iteration and kernel-builder multi-variant builds for distribution across PyTorch/CUDA combos.
- Publish artifacts to the Hugging Face Hub and manage compatible versions with semantic tags and the kernels lock workflow.
FAQ
- How do I ensure my kernel is visible to PyTorch's compilation pipeline?
Register the kernel as a native operator using PyTorch library macros (for example via TORCH_LIBRARY_EXPAND). This makes it visible to `torch.compile` and allows fusion within larger computation graphs.
- How can I test locally without rebuilding all variants?
Use `nix develop` to enter a development shell for a specific PyTorch and CUDA combination. Build and install the kernel in editable mode for quick iteration.
- How do I distribute builds to other developers?
Build multi-variant artifacts with kernel-builder, move results into the expected build directory, create a Hub repository, and push using `huggingface-cli login` and Git push. Consumers can then load the kernel from the Hub.
- What versioning strategy is recommended?
Use semantic versioning with Git tags of the form `vX.Y.Z`. Specify version bounds when fetching kernels and use the `kernels` CLI to generate a `kernels.lock` for reproducible project-level pinning.
References
- Kernel builder guide: https://huggingface.co/blog/kernel-builder
- Example Hub repository: https://huggingface.co/drbh/img2gray
- Hugging Face CLI login: use `huggingface-cli login` as referenced in the guide
- Note: The guide includes an accompanying YouTube video, mentioned in the original post