Improving GEMM Kernel Auto-Tuning Efficiency with nvMatmulHeuristics in CUTLASS 4.2
Source: NVIDIA Developer Blog, https://developer.nvidia.com/blog/improving-gemm-kernel-auto-tuning-efficiency-on-nvidia-gpus-with-heuristics-and-cutlass-4-2/
Overview
General Matrix Multiply (GEMM) kernel selection on modern NVIDIA GPUs is a multi-parameter optimization problem. Kernel performance depends on a large set of compile-time and runtime meta-parameters, including CTA, warp, and instruction tile sizes; kernel schedules; rasterization strategies; cluster dimensions; and split-k factors. Traditional approaches brute-force thousands of configurations, compile them all, and run exhaustive auto-tuning to identify the fastest option. This workflow can take hours, creating a bottleneck for offline-compiled libraries such as CUTLASS, and it is particularly painful for JIT-compiling stacks like Torch Inductor or OpenAI Triton, where fast model compilation matters.
NVIDIA introduces nvMatmulHeuristics, a GPU kernel meta-parameter optimization module that provides fast heuristics for predicting a small, high-potential set of GEMM configurations. The module analyzes a given GEMM problem and the capabilities of the target hardware, then predicts the configurations most likely to deliver maximum performance. Its integration into CUTLASS 4.2 turns the kernel generation and tuning workflow from a brute-force search into a targeted, efficient path: for a given GEMM problem, nvMatmulHeuristics outputs a concise set of candidate configurations for testing. The feature is positioned as a core part of cuBLAS heuristics and is available in early access for general use, with integration into the CUTLASS library.
A representative demonstration uses a single FP16 GEMM in the tnn layout (A transposed; B and C/D not transposed). The workflow centers on feeding a GEMM problem list in JSON and building CUTLASS with specific CMake options so that only a small set of configurations is emitted per problem. In practice, this approach dramatically shortens end-to-end tuning time: rather than exhaustively compiling and profiling thousands of kernels, you generate a small, targeted candidate set and profile only those. This is particularly valuable in environments like PyTorch, where JIT compilation times are critical, and for offline libraries that want near-optimal performance without long tuning cycles.
Key features
- Predicts a small, top-N set of GEMM kernel configurations per problem
- Reduces end-to-end tuning time by concentrating build and profiling on a focused candidate set
- Integrated with CUTLASS 4.2 and cuBLAS heuristics; available in early access for general use
- Accepts a JSON input describing GEMM problems; easy to feed into existing workflows
- Generates a CSV test list from the emitted configurations for automated profiling
- Compatible with cutlass_profiler to execute the generated configurations
- Enables static cluster-size optimization at compile time, reducing runtime variability
- Demonstrates near-exhaustive performance with substantially less work (e.g., 16 configurations yielding ~96% of peak performance in significantly less time)
- Helps accelerate workflows in DL frameworks, compilers, and kernel libraries by enabling fast, high-quality kernel selection
Table: comparing tuning approaches
| Approach | How it works | Typical benefits | Notes |
|---|---|---|---|
| Exhaustive search | Generate and compile thousands of kernels; run full auto-tuning | Best possible kernel found; maximum theoretical performance | Very long build+tuning times (examples exceed 700 minutes) |
| nvMatmulHeuristics (this feature) | Predicts top-N kernels per GEMM problem; compile a small set and profile | Substantial time savings; near-peak performance with far fewer builds | Requires a problem-list JSON and build integration |
| Precompiled/Blackwell-style | Use static/dynamic cluster sizes from precompiled kernels | Fast initial setup; reduced runtime tuning | May lock in less flexibility; performance depends on compile-time choices |
Common use cases
- DL framework integration: accelerate JIT paths (e.g., PyTorch) by reducing kernel-tuning latency while preserving high performance for GEMM workloads.
- Offline libraries: ship precompiled, well-tuned kernel sets with minimal tuning overhead during deployment.
- Model workloads with large GEMM footprints (e.g., Llama- or DeepSeek-scale models) where traditional exhaustive tuning is prohibitive.
- Scenarios requiring repeatable profiling results; the approach supports reproducible builds by emphasizing a small, designated set of kernel candidates.
Setup & installation (exact commands)
The workflow hinges on preparing a GEMM problem list in JSON, building CUTLASS with nvMatmulHeuristics enabled, and then running the generated profiles. The key build-time flags shown in the source, which drive the heuristic workflow, are:
cmake .. \
-DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=<path/to/problems.json> \
-DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=N
After configuring with these options, a standard build emits a CSV test list as part of the build process. You can then use the produced CSV to drive auto-tuning through a profiling tool (cutlass_profiler) that consumes the CSV and executes the configurations. For consistent profiling results, run with locked GPU clocks (see the end-to-end sketch after the notes below). Notes:
- Build CUTLASS as you normally would, but with the two CUTLASS_LIBRARY_HEURISTICS flags shown above.
- The exact problem JSON format is not specified here; the workflow refers to a list of GEMM problems expressed in JSON.
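For orientation, a minimal end-to-end sketch of the configure, build, and clock-locking sequence might look like the following. The problems.json path, the CUTLASS_NVCC_ARCHS value, the choice of 8 configurations per problem, and the clock frequency passed to nvidia-smi are illustrative assumptions for this example, not values taken from the source:
mkdir -p build && cd build
# Configure with the heuristics flags; paths and values below are placeholders.
cmake .. \
  -DCUTLASS_NVCC_ARCHS=90a \
  -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=/path/to/problems.json \
  -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=8
# Build the profiler target; the CSV test list for the heuristics-selected
# kernels is emitted during this build.
make cutlass_profiler -j"$(nproc)"
# Lock GPU clocks so profiling results are repeatable (frequency is a
# placeholder; choose one supported by your GPU).
sudo nvidia-smi --lock-gpu-clocks=1410,1410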
Quick start (minimal runnable example)
- Prepare a GEMM problem list in JSON (the source example uses a single FP16 GEMM in the tnn layout: A transposed, B and C/D not transposed).
- Build CUTLASS with heuristics enabled using the commands shown in the previous section.
- The build emits a CSV test list describing the generated configurations for each GEMM problem.
- Run the profiling flow with cutlass_profiler on the generated CSV to execute the candidate kernels (see the illustrative invocation after this list). For consistent results, lock GPU clocks during profiling.
- Compare performance against an exhaustive baseline if you have one. In published cases, 16 candidate configurations reached 96% of peak performance in a fraction of the tuning time, and larger candidate sets approached the exhaustive baseline while still saving significant time.
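As a hedged illustration of the profiling step, a single candidate could be exercised with cutlass_profiler roughly as follows. The problem shape, data types and layouts, kernel-name filter, and output file are placeholders; in practice the CSV test list emitted by the build is the authoritative set of runs:
# Hypothetical single-candidate run; all values below are illustrative.
./tools/profiler/cutlass_profiler \
  --operation=Gemm \
  --m=4096 --n=4096 --k=4096 \
  --A=f16:column --B=f16:column --C=f16:column \
  --kernels="cutlass3x_sm90*" \
  --warmup-iterations=10 --profiling-iterations=100 \
  --output=heuristics_candidates.csv
The --output option writes the measured results to a CSV report, which can then be compared against an exhaustive baseline if one is available.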
Pros and cons
- Pros
- Huge reductions in build+tuning time vs exhaustive search
- Near-peak performance with a small candidate set
- Better experience for JIT and on-device/deployed pipelines
- Static cluster-size options can improve compile-time predictability and performance
- Cons
- Requires formatting and integrating a GEMM problem list in JSON
- Early-access feature; availability and APIs may evolve
- Still dependent on hardware capabilities and problem characterization accuracy
Alternatives (brief comparisons)
- Exhaustive search: guarantees finding the best kernel but at very high cost in time and resources (examples show >700 minutes in some workloads).
- Heuristic-based with static cluster sizes (precompiled): can achieve strong performance but relies on compile-time choices and may reduce runtime flexibility.
- Dynamic/Blackwell-style precompiled kernels: commonly used in CUTLASS; these allow run-time adaptation but may incur runtime tuning penalties. nvMatmulHeuristics aims to shift more of this work to static, compile-time decisions.
Pricing or License
- nvMatmulHeuristics is described as a core part of cuBLAS heuristics and is available in early access for general use, with integration into CUTLASS. No pricing information is provided in the source.