
Make your ZeroGPU Spaces go brrr with PyTorch ahead-of-time compilation

Sources: https://huggingface.co/blog/zerogpu-aoti, huggingface.co

TL;DR

  • ZeroGPU Spaces let you run on NVIDIA H200 hardware in Hugging Face Spaces without dedicating a GPU to idle traffic.
  • PyTorch ahead-of-time (AoT) compilation exports a compiled model you can reload instantly, eliminating cold starts that plague on-demand GPUs.
  • Speedups of about 1.3x–1.8x are observed on models like Flux, Wan, and LTX, with additional gains from FP8 quantization and dynamic shapes.
  • The workflow uses aoti_capture, aoti_compile, and aoti_apply, and benefits from wrapping the work in a @spaces.GPU context.
  • On FLUX.1-dev, the approach yielded around 1.7x faster inference; FA3 integration can accelerate workloads further.

Context and background

Hugging Face Spaces enables ML practitioners to publish demo apps, but traditional deployments typically reserve a GPU for the Space's lifetime, even during lean traffic periods. When user code calls .to('cuda'), PyTorch initializes the NVIDIA driver and locks the process to CUDA for the duration of the task. ZeroGPU adopts a just-in-time approach to GPU usage: it forks the process, initializes CUDA only as needed, executes the GPU task, and releases the GPU when traffic subsides. This improves resource efficiency in environments with highly sparse and bursty workloads. ZeroGPU currently allocates an MIG slice of an H200 (3g.71gb profile); additional MIG sizes, including a full slice (7g.141gb profile), are planned for late 2025.

Modern ML frameworks like PyTorch and JAX support compilation that can reduce inference latency. From 2.0 onward, PyTorch offers two major compilation interfaces. torch.compile works well in standard environments: it compiles on the first run and reuses the optimized version afterwards. On ZeroGPU, however, processes are ephemeral, so torch.compile cannot efficiently reuse compilations and must rely on its filesystem cache, which can add tens of seconds to minutes of cold-start time. Ahead-of-time (AoT) compilation, by contrast, exports a compiled program once and reloads it instantly in any subsequent process. This aligns well with ZeroGPU's short-lived tasks, reducing framework overhead and eliminating cold starts.

To illustrate the workflow, we build on a ZeroGPU base example and focus on the transformer component of the pipeline, the bottleneck in many generative models: the transformer or denoiser is the heaviest component to optimize.
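
For reference, a minimal ZeroGPU Space along the lines of that base example looks roughly like the sketch below (the Gradio wiring and prompt handling are illustrative, not taken verbatim from the example):

```python
import gradio as gr
import spaces
import torch
from diffusers import DiffusionPipeline

MODEL_ID = "black-forest-labs/FLUX.1-dev"

# The pipeline is loaded at startup; on ZeroGPU the GPU is only attached
# while a @spaces.GPU-decorated function is running.
pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
pipe.to("cuda")

@spaces.GPU  # request an H200 MIG slice just for the duration of this call
def generate(prompt: str):
    return pipe(prompt).images[0]

gr.Interface(generate, inputs=gr.Text(), outputs=gr.Image()).launch()
```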

What’s new

AoT on ZeroGPU is implemented by following a few key steps, demonstrated with the black-forest-labs/FLUX.1-dev model and the transformer component of the pipeline:

  • Use spaces.aoti_capture as a context manager to intercept calls to the transformer (e.g., pipe.transformer). This captures the input arguments (args and kwargs) without executing the call.
  • Export the model to a PyTorch ExportedProgram via torch.export.export using the captured arguments, producing a computation graph together with its parameter values.
  • Compile the exported program with spaces.aoti_compile, a tiny wrapper around torch._inductor.aot_compile that handles saving and lazy-loading of the compiled artifact.
  • Bind the compiled transformer back to the pipeline with spaces.aoti_apply. This patches pipe.transformer.forward and removes old parameters from memory to avoid OOM, while preserving essential attributes such as dtype and config.
  • Run the compilation and patch inside a @spaces.GPU function to ensure true hardware-dependent compilation and optimal results.

With these steps, compiled_transformer becomes an AoT-compiled binary ready for inference. In practice, patching the transformer naively (e.g., replacing pipe.transformer outright or patching only forward) can fail or cause memory issues; spaces.aoti_apply provides the safe, complete patching path. In the FLUX.1-dev transformer case, the approach yielded roughly a 1.7x speedup, demonstrating the potential of AoT to improve the user experience in ZeroGPU Spaces, especially for demo-style workloads where latency and responsiveness matter more than raw throughput.

AoT can be combined with quantization to deliver even greater speedups. FP8 post-training dynamic quantization schemes are well suited to image and video generation and can provide additional gains. FP8 requires a CUDA compute capability of at least 9.0; with H200-based ZeroGPUs it is available and can be leveraged in the AoT workflow via TorchAO APIs, typically adding another ~1.2x speedup in supported scenarios.

Dynamic shapes add another axis of optimization. The export step uses torch.export primitives to specify which inputs should be treated as dynamic. For FLUX.1-dev, the dynamic-shape configuration involved defining a range of latent image resolutions and marking which forward arguments are dynamic: a dynamic-shapes object mirrors the structure of the example inputs, with non-dynamic inputs set to None using PyTorch's tree_map utilities, and the resulting transformer_dynamic_shapes is passed to torch.export.export. For more aggressive dynamic behavior, others have explored compiling one model per resolution (while sharing parameters) and dispatching the appropriate compiled model at runtime; a minimal example is zerogpu-aoti-multi.py, and a fully working implementation appears in the Wan 2.2 Space.

Finally, because ZeroGPU hardware and CUDA drivers are compatible with Flash-Attention 3 (FA3), FA3 can be used inside ZeroGPU Spaces to push performance further. Compiling FA3 from source is a hardware-dependent, multi-minute process, so it pays to wrap FA3-enabled components within the AoT workflow that runs in a GPU context.

Why it matters (impact for developers/enterprises)

The AoT workflow unlocks a more responsive demo experience on ZeroGPU Spaces by reducing the overhead associated with just-in-time compilation and cold starts. By exporting a compiled model and reloading it instantly, developers can deliver snappier demos without sacrificing the flexibility of an on-demand GPU tier. The integration also opens pathways to higher throughput for interactive applications that rely on large transformer blocks and denoisers.

From an enterprise perspective, AoT brings predictable latency characteristics and reduces the risk of idle-resource waste. ZeroGPU's quota model already offers flexibility, and AoT complements this by ensuring that GPU time is used only when necessary. For Pro, Team, and Enterprise tiers, the platform provides up to 8x more ZeroGPU quota, which can be a meaningful factor when planning demonstrations, pilots, or customer-facing experiments.

Quantization with FP8 and the handling of dynamic shapes further enhance the deployment story. FP8 can boost speed by another ~1.2x in suitable workloads, while dynamic shapes allow models to accommodate variable image resolutions or video dimensions without recompiling. This combination of AoT, FP8, and dynamic shapes enables ZeroGPU Spaces to deliver faster, more scalable demos and experiments with less upfront provisioning.

Technical details or Implementation

This section outlines the core implementation workflow and the key APIs involved, focusing on the steps shown in the FLUX.1-dev example and related notes:

  • Intercept inputs with spaces.aoti_capture to capture example arguments for the transformer component (pipe.transformer) without executing it.
  • Export the captured inputs to a PyTorch ExportedProgram using torch.export.export, which yields a computation graph tied to the model’s parameters.
  • Compile the exported program with spaces.aoti_compile, which wraps torch._inductor.aot_compile and manages the save/load lifecycle of the compiled artifact.
  • Attach the compiled transformer back to the pipeline using spaces.aoti_apply. This patch replaces the necessary forward path while ensuring essential attributes like dtype and config are preserved and old parameters are cleaned to avoid memory issues.
  • Place the entire process inside a @spaces.GPU function; AoT compilation is hardware-sensitive and benefits from executing within a GPU-enabled context. A sketch of the full flow follows this list.

Together, these steps deliver a compiled, rapidly reloadable transformer for inference, avoiding the long cold starts caused by on-demand CUDA initialization on ZeroGPU. In practice, the approach yields notable speedups (e.g., ~1.7x for FLUX.1-dev) and an improved user experience in ZeroGPU Spaces. Dynamic shapes and quantization options further enrich the toolkit:
  • FP8 quantization: Enable FP8 with TorchAO in the AoT flow to gain additional speedups, subject to compute-capability constraints (FP8 requires CUDA compute capability 9.0 or higher; H200 hardware supports it). See the FP8 sketch below.
  • Dynamic shapes: Identify which inputs have dynamic dimensions (e.g., image resolutions) and construct a dynamic-shapes map for export. The dynamic-shapes object mirrors the input signature and uses None for non-dynamic inputs, which lets AoT specialize the compiled graph to the expected variability. See the dynamic-shapes sketch below.
  • Multi-model per resolution: For workloads with widely varying input shapes (e.g., Wan family video generation models), it can be beneficial to compile separate AoT models per resolution while sharing parameters and dispatching the right model at runtime.
  • FA3 integration: Flash-Attention 3 (FA3) can be used with AoT to push performance further, though compiling FA3 is time-consuming and hardware-dependent.

The practical upshot is that with minimal code changes and within a GPU-enabled context, ZeroGPU Spaces can deliver faster demo experiences, reduced startup overhead, and more consistent latency characteristics for transformer-heavy workloads.
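
Putting the steps together, the flow looks roughly like the sketch below. It follows the description above and assumes the pipe object from the base example; the .args/.kwargs attributes on the captured call and the duration value used to extend the GPU task window for the compile step are assumptions rather than a verbatim reference.

```python
import spaces
import torch

@spaces.GPU(duration=1500)  # allow extra time: AoT compilation can take minutes
def compile_transformer():
    # 1. Intercept a call to the transformer and capture its args/kwargs
    #    without actually executing it.
    with spaces.aoti_capture(pipe.transformer) as call:
        pipe("arbitrary example prompt")

    # 2. Export the transformer to a PyTorch ExportedProgram using the
    #    captured example inputs.
    exported = torch.export.export(
        pipe.transformer,
        args=call.args,
        kwargs=call.kwargs,
    )

    # 3. AoT-compile the exported program (thin wrapper around
    #    torch._inductor.aot_compile handling save/lazy-load of the artifact).
    return spaces.aoti_compile(exported)

compiled_transformer = compile_transformer()

# 4. Patch the pipeline: replaces pipe.transformer.forward, frees the old
#    parameters to avoid OOM, and preserves attributes such as dtype and config.
spaces.aoti_apply(compiled_transformer, pipe.transformer)
```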
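
For the FP8 path, quantization is applied to the transformer before export so that the exported graph already contains the quantized ops. A minimal sketch, assuming a recent torchao release that exposes Float8DynamicActivationFloat8WeightConfig (older releases provide the float8_dynamic_activation_float8_weight() helper instead):

```python
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, quantize_

# In-place FP8 post-training dynamic quantization of the transformer, run
# before torch.export.export. Requires CUDA compute capability >= 9.0, which
# the H200-backed ZeroGPU hardware satisfies.
quantize_(pipe.transformer, Float8DynamicActivationFloat8WeightConfig())
```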
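
For dynamic shapes, the export call receives a structure that mirrors the captured inputs. The sketch below assumes the captured call passes everything as keyword arguments and that only the sequence dimension of hidden_states varies with the latent image resolution; the actual dynamic arguments and bounds depend on the model.

```python
from torch.export import Dim
from torch.utils._pytree import tree_map

# Illustrative bounds for the latent sequence length (tied to the range of
# supported image resolutions).
seq_len = Dim("seq_len", min=1024, max=4096)

# Mirror the captured kwargs with None (i.e., static) leaves ...
transformer_dynamic_shapes = tree_map(lambda _: None, call.kwargs)
# ... then mark only the dimensions that should remain dynamic.
transformer_dynamic_shapes["hidden_states"] = {1: seq_len}

# Re-export with the dynamic-shapes specification before compiling as above.
exported = torch.export.export(
    pipe.transformer,
    args=call.args,
    kwargs=call.kwargs,
    dynamic_shapes=transformer_dynamic_shapes,
)
```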

Key takeaways

  • AoT compilation enables instant reloads of pre-compiled models in ZeroGPU Spaces, addressing cold-start limitations.
  • By capturing inputs, exporting a PyTorch program, compiling with AoT, and patching the pipeline via spaces.aoti_apply, you can achieve meaningful speedups with low incremental code changes.
  • Quantization (FP8) and dynamic shapes further boost performance, though FP8 requires compute capability 9.0+.
  • Wrapping the entire process in a @spaces.GPU function ensures hardware-specific optimization and proper lifecycle management in ZeroGPU.
  • For workloads with extreme shape dynamism, strategies such as compiling per-resolution models or using FA3 can help maximize gains.

FAQ

  • What is AoT compilation in this workflow?

    Ahead-of-time (AoT) compilation exports a compiled PyTorch program that can be loaded instantly in subsequent runs, reducing cold-start times and framework overhead on ZeroGPU.

  • Why not rely on torch.compile for ZeroGPU?

    torch.compile typically compiles the model on the first run and reuses the optimized version, but ZeroGPU processes are short-lived and spun up for each task, which makes caching less effective and leads to long cold-start times.

  • What are the practical steps to implement AoT in ZeroGPU Spaces?

    Capture inputs with spaces.aoti_capture, export to a PyTorch ExportedProgram, compile with spaces.aoti_compile, and patch the pipeline with spaces.aoti_apply inside a @spaces.GPU function.

  • What speedups and limitations should I expect?

    Observed speedups typically range from 1.3x–1.8x across models like Flux, Wan, and LTX; FP8 quantization can add about 1.2x where supported. FP8 requires CUDA compute capability 9.0+. Dynamic shapes require careful export-time configuration.

  • Can quantization be used with AoT in this setup?

    Yes, FP8 quantization can be leveraged within the AoT workflow using TorchAO APIs to achieve additional speedups, subject to hardware capabilities.
