Improving GEMM Kernel Auto-Tuning Efficiency with nvMatmulHeuristics in CUTLASS 4.2
Source: NVIDIA Developer Blog, https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/
Overview
General Matrix Multiply (GEMM) kernel selection on modern NVIDIA GPUs is a multi-parameter optimization problem. Kernel performance depends on a large set of compile-time and runtime meta-parameters, including CTA, warp, and instruction tile sizes; kernel schedules; rasterization strategies; cluster dimensions; and split-k factors. Traditional approaches brute-force thousands of configurations, compile them all, and run exhaustive auto-tuning to identify the fastest option. This workflow can take hours, creating a bottleneck for offline-compiled libraries such as CUTLASS, and it is particularly painful for JIT-compiling stacks like Torch Inductor or OpenAI Triton, where fast model compilation matters.
NVIDIA introduces nvMatmulHeuristics, a GPU kernel meta-parameter optimization module that provides fast heuristics for predicting a small, high-potential set of GEMM configurations. The module analyzes a given GEMM problem and the capabilities of the target hardware, then predicts the configurations most likely to deliver maximum performance. Its integration into CUTLASS 4.2 turns the kernel generation and tuning workflow from a brute-force search into a targeted, efficient path: for a given GEMM problem, nvMatmulHeuristics outputs a concise set of candidate configurations for testing. The feature is positioned as a core part of cuBLAS heuristics and is available in early access for general use, with integration into the CUTLASS library.
A representative demonstration uses a single FP16 GEMM in the tnn layout (A transposed; B and C/D not transposed). The workflow centers on feeding a GEMM problem list in JSON and building CUTLASS with specific CMake options so that only a small set of configurations is emitted per problem. In practice, this approach dramatically shortens end-to-end tuning time: rather than exhaustively compiling and profiling thousands of kernels, you generate a small, targeted candidate set and profile only those. This is particularly valuable in environments like PyTorch, where JIT compilation times are critical, and for offline libraries that want near-optimal performance without long tuning cycles.
Key features
- Predicts a small, top-N set of GEMM kernel configurations per problem
- Reduces end-to-end tuning time by concentrating build and profiling on a focused candidate set
- Integrated with CUTLASS 4.2 and cuBLAS heuristics; available in early access for general use
- Accepts a JSON input describing GEMM problems; easy to feed into existing workflows
- Generates a CSV test list from the emitted configurations for automated profiling
- Compatible with cutlass_profiler to execute the generated configurations
- Enables static cluster-size optimization at compile time, reducing runtime variability
- Demonstrates near-exhaustive performance with substantially less work (e.g., 16 configurations yielding ~96% of peak performance in significantly less time)
- Helps accelerate workflows in DL frameworks, compilers, and kernel libraries by enabling fast, high-quality kernel selection
Table: comparing tuning approaches
| Approach | How it works | Typical benefits | Notes |
|---|---|---|---|
| Exhaustive search | Generate and compile thousands of kernels; run full auto-tuning | Best possible kernel found; maximum theoretical performance | Very long build+tuning times (examples exceed 700 minutes) |
| nvMatmulHeuristics (this feature) | Predicts top-N kernels per GEMM problem; compile a small set and profile | Substantial time savings; near-peak performance with far fewer builds | Requires a problem-list JSON and build integration |
| Precompiled/Blackwell-style | Use static/dynamic cluster sizes from precompiled kernels | Fast initial setup; reduced runtime tuning | May lock in less flexibility; performance depends on compile-time choices |
Common use cases
- DL framework integration: accelerate JIT paths (e.g., PyTorch) by reducing kernel-tuning latency while preserving high performance for GEMM workloads.
- Offline libraries: ship precompiled, well-tuned kernel sets with minimal tuning overhead during deployment.
- Model workloads with large GEMM footprints (e.g., Llama- or DeepSeek-scale models) where traditional exhaustive tuning is prohibitive.
- Scenarios requiring repeatable profiling results; the approach supports reproducible builds by emphasizing a small, designated set of kernel candidates.
Setup & installation (exact commands)
The workflow hinges on preparing a GEMM problem list in JSON, building CUTLASS with nvMatmulHeuristics enabled, and then running the generated profiles. The key build-time flags shown in the source, which drive the heuristic workflow, are:
cmake .. \
-DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=<path/to/problems.json> \
-DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=N
After configuring with these options, a standard build emits a CSV test list as part of the build process. You can then use the produced CSV to drive auto-tuning through a profiling tool (cutlass_profiler) that consumes the CSV and executes the configurations. For consistent profiling results, run with locked GPU clocks (see the end-to-end sketch after the notes below). Notes:
- Build CUTLASS as you normally would, but with the two CUTLASS_LIBRARY_HEURISTICS flags shown above.
- The exact problem JSON format is not specified here; the workflow refers to a list of GEMM problems expressed in JSON.
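For orientation, a minimal end-to-end sketch of the configure, build, and clock-locking sequence might look like the following. The problems.json path, the CUTLASS_NVCC_ARCHS value, the choice of 8 configurations per problem, and the clock frequency passed to nvidia-smi are illustrative assumptions for this example, not values taken from the source:
mkdir -p build && cd build
# Configure with the heuristics flags; paths and values below are placeholders.
cmake .. \
  -DCUTLASS_NVCC_ARCHS=90a \
  -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=/path/to/problems.json \
  -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=8
# Build the profiler target; the CSV test list for the heuristics-selected
# kernels is emitted during this build.
make cutlass_profiler -j"$(nproc)"
# Lock GPU clocks so profiling results are repeatable (frequency is a
# placeholder; choose one supported by your GPU).
sudo nvidia-smi --lock-gpu-clocks=1410,1410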
Quick start (minimal runnable example)
- Prepare a GEMM problem list in JSON (the source example uses a single FP16 GEMM in the tnn layout: A transposed, B and C/D not transposed).
- Build CUTLASS with heuristics enabled using the commands shown in the previous section.
- The build emits a CSV test list describing the generated configurations for each GEMM problem.
- Run the profiling flow with cutlass_profiler on the generated CSV to execute the candidate kernels (see the illustrative invocation after this list). For consistent results, lock GPU clocks during profiling.
- Compare performance against an exhaustive baseline if you have one. In published cases, 16 candidate configurations reached 96% of peak performance in a fraction of the tuning time, and larger candidate sets approached the exhaustive baseline while still saving significant time.
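As a hedged illustration of the profiling step, a single candidate could be exercised with cutlass_profiler roughly as follows. The problem shape, data types and layouts, kernel-name filter, and output file are placeholders; in practice the CSV test list emitted by the build is the authoritative set of runs:
# Hypothetical single-candidate run; all values below are illustrative.
./tools/profiler/cutlass_profiler \
  --operation=Gemm \
  --m=4096 --n=4096 --k=4096 \
  --A=f16:column --B=f16:column --C=f16:column \
  --kernels="cutlass3x_sm90*" \
  --warmup-iterations=10 --profiling-iterations=100 \
  --output=heuristics_candidates.csv
The --output option writes the measured results to a CSV report, which can then be compared against an exhaustive baseline if one is available.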
Pros and cons
- Pros
- Huge reductions in build+tuning time vs exhaustive search
- Near-peak performance with a small candidate set
- Better experience for JIT and on-device/deployed pipelines
- Static cluster-size options can improve compile-time predictability and performance
- Cons
- Requires formatting and integrating a GEMM problem list in JSON
- Early-access feature; availability and APIs may evolve
- Still dependent on hardware capabilities and problem characterization accuracy
Alternatives (brief comparisons)
- Exhaustive search: guarantees finding the best kernel but at very high cost in time and resources (examples show >700 minutes in some workloads).
- Heuristic-based with static cluster sizes (precompiled): can achieve strong performance but relies on compile-time choices and may reduce runtime flexibility.
- Dynamic/Blackwell-style precompiled kernels: commonly used in CUTLASS; these allow run-time adaptation but may incur runtime tuning penalties. nvMatmulHeuristics aims to shift more of this work to static, compile-time decisions.
Pricing or License
- nvMatmulHeuristics is described as a core part of cuBLAS heuristics and is available in early access for general use, with integration into CUTLASS. No pricing information is provided in the source.