Make your ZeroGPU Spaces faster with PyTorch ahead-of-time (AoT) compilation

Sources: https://huggingface.co/blog/zerogpu-aoti, Hugging Face Blog

TL;DR

  • ZeroGPU Spaces can now leverage PyTorch ahead-of-time (AoT) compilation to reduce latency and cold-start times for GPU tasks.
  • AoT enables exporting a compiled model once and reloading it instantly in new processes, delivering snappier demos on Hugging Face Spaces running on Nvidia H200 hardware.
  • Real-world speedups of about 1.3×–1.8× have been observed on models such as Flux.1-dev, Wan, and LTX; FP8 quantization can add approximately 1.2× more speedup.
  • The workflow relies on the spaces.aoti_capture and spaces.aoti_compile utilities, plus spaces.aoti_apply to safely patch the pipeline for inference without keeping heavy model parameters in memory.
  • Current MIG slice available on ZeroGPU is 3g.71gb; full slice 7g.141gb is expected in late 2025.
  • AoT composes with Flash-Attention 3 (FA3) for additional speedups; because compilation is tuned on the actual hardware, running the workflow on the target GPU with a compatible CUDA driver is essential. For a deeper dive, see the original Hugging Face blog post.

Context and background

Hugging Face Spaces lets ML practitioners publish demo apps with managed hosting, including powerful Nvidia H200 GPUs. Traditional GPU deployments reserve capacity for the Space's lifetime, even during idle periods: calling .to('cuda') on a PyTorch model initializes the NVIDIA driver and pins a CUDA process, which wastes resources given the sparse, spiky traffic typical of demos.

ZeroGPU addresses this inefficiency with a just-in-time approach to GPU initialization. Instead of holding a long-lived CUDA process from startup, ZeroGPU forks a process when a task arrives, sets up CUDA for that task, executes the GPU work, and releases the GPU when the task completes. This behavior is enabled through the Python spaces package and the @spaces.GPU decorator, requiring minimal changes to existing code while achieving significantly better resource utilization for short-lived tasks.

This post focuses on accelerating inference with ahead-of-time compilation, which complements the ZeroGPU model: it avoids the repeated compilation costs that on-the-fly (JIT) compilation incurs in an environment built around short-lived processes. PyTorch 2.0+ supports multiple compilation interfaces, and AoT stands out for ZeroGPU because it exports a compiled program once and reloads it instantly in subsequent runs, reducing framework overhead and eliminating the cold-start penalties typical of just-in-time approaches. The explanations and demonstrations referenced here come from Hugging Face's ZeroGPU AoT blog post.
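For context, a minimal ZeroGPU Space looks like an ordinary Gradio demo whose GPU-bound function is wrapped in the decorator. The sketch below is illustrative: the model choice and pipeline call are assumptions, not code from the original post.

    import gradio as gr
    import spaces                      # ZeroGPU helper package
    import torch
    from diffusers import DiffusionPipeline

    # Illustrative model; any diffusers pipeline follows the same pattern.
    pipe = DiffusionPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")  # under ZeroGPU, actual CUDA setup is deferred to task time

    @spaces.GPU  # a GPU is attached only while this function executes
    def generate(prompt: str):
        return pipe(prompt).images[0]

    gr.Interface(fn=generate, inputs="text", outputs="image").launch()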

What’s new

The core idea is to compile only the transformer component of the model (the denoiser portion in many generative architectures) ahead of time, since that part tends to dominate compute for these models. The workflow integrates several building blocks from the spaces package:

  • aoti_capture: Used as a context manager to intercept calls to the transformer, preventing execution and recording input arguments for later export. This yields example args and kwargs that reflect real inference-time inputs.
  • torch.export.export: Exports the captured transformer as a PyTorch ExportedProgram, which is a computation graph plus parameter values. This exported program forms the basis for AoT compilation.
  • spaces.aoti_compile: A small wrapper around torch._inductor.aot_compile that handles saving the compiled artifact and lazy-loading it as needed. The result is a compiled_transformer ready for inference.
  • spaces.aoti_apply: Patches the original pipeline by substituting the transformer’s forward with the compiled model while cleaning up old parameters to avoid memory pressure.
  • Wrapping in a @spaces.GPU function: All steps—input interception, export, and compilation—must occur inside a GPU-enabled block to ensure the compilation leverages the actual hardware and micro-benchmark tuning that hardware-dependent compilation requires.
  • Patch-safe integration: A naive replacement (pipe.transformer = compiled_transformer) can drop important attributes (dtype, config, etc.), and patching only the forward method can leave the original parameters in memory. spaces.aoti_apply provides a safe, complete patch.

In practice, starting from the ZeroGPU base example and compiling the Flux.1-dev transformer, users can achieve notable speedups: about 1.7× faster under AoT. This demonstrates how a single, well-tuned AoT compilation run can deliver immediate, repeatable benefits in a ZeroGPU Spaces workflow. A condensed sketch of the capture, export, compile, and apply sequence is shown at the end of this section.

Dynamic shapes and quantization broaden the optimization landscape. AoT can be combined with FP8 post-training dynamic quantization (via TorchAO) to accelerate image and video generation workloads; FP8 requires CUDA compute capability of at least 9.0, which the Nvidia H200 GPUs used by ZeroGPU satisfy. PyTorch's export tooling also supports shape dynamism by letting you specify which inputs should be treated as dynamic. For Flux.1-dev, the process involves defining a range of latent image resolutions, mapping the corresponding dynamic dimensions, and exporting with transformer_dynamic_shapes to capture those variations.

In scenarios with substantial dynamism, such as Wan's video generation models, a practical strategy is to compile a separate model per resolution while sharing parameters and to dispatch the correct one at runtime. A minimal example and a fully working Wan-based implementation illustrate this approach.

Finally, since Flash-Attention 3 (FA3) is compatible with AoT, ZeroGPU Spaces can incorporate FA3 to speed up attention computations. FA3's own compile/build time is hardware-dependent and can take several minutes, but it complements AoT by delivering additional throughput on the same hardware. For a hands-on path, a complete demonstration is available in the zerogpu-aoti-multi.py example and a Wan 2.2 Space. The combination of AoT with FP8 quantization and dynamic shapes provides a practical, end-to-end workflow for building snappier demo apps on Spaces.
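The blog post walks through this sequence in code; the sketch below condenses it, with the model ID, GPU duration, and example prompt as illustrative choices.

    import spaces
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    @spaces.GPU(duration=1500)  # leave enough time for export + compilation
    def compile_transformer():
        # 1. Intercept one real call to the transformer to record example inputs.
        with spaces.aoti_capture(pipe.transformer) as call:
            pipe("arbitrary example prompt")

        # 2. Export the transformer as an ExportedProgram (graph + parameters).
        exported = torch.export.export(
            pipe.transformer,
            args=call.args,
            kwargs=call.kwargs,
        )

        # 3. AoT-compile the exported program (wraps torch._inductor.aot_compile,
        #    persisting the artifact so later processes reload it instantly).
        return spaces.aoti_compile(exported)

    # 4. Patch the pipeline safely: the compiled forward is swapped in and the
    #    original parameters are released to avoid memory pressure.
    compiled_transformer = compile_transformer()
    spaces.aoti_apply(compiled_transformer, pipe.transformer)

    @spaces.GPU
    def generate(prompt: str):
        return pipe(prompt).images[0]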

Why it matters (impact for developers/enterprises)

This work matters for developers and enterprises that rely on ZeroGPU Spaces to demonstrate ML capabilities without committing to long-running GPU leases. The AoT workflow reduces cold-start latency and runtime overhead by pre-exporting compiled models and reloading them instantly for each GPU task. This can translate into faster, more responsive demos and a smoother user experience for end users exploring generative models. From an architectural perspective, the ability to compile once and reuse across processes aligns well with ZeroGPU’s goal of efficient, on-demand GPU usage. The approach minimizes idle-time GPU reservation, which is particularly valuable for demo-driven workloads where user traffic is sparse and volatile. Enterprises and teams using Spaces gain several practical advantages:

  • Faster demos and iteration cycles due to reduced startup times.
  • Better resource utilization by not holding a GPU and its pinned CUDA process during idle periods.
  • A path to higher throughput for model inference when combined with FP8 quantization and dynamic shapes.
  • The option to run multiple model resolutions by compiling separate AoT artifacts, dispatched at runtime as needed.

The Spaces platform structure, including role-based access, also factors in: Pro users and Team/Enterprise organizations can create ZeroGPU Spaces, while any user can still access them, and Pro and Team/Enterprise users receive 8× more ZeroGPU quota, enabling broader testing and demonstration deployments. The blog post and its demonstrations provide additional context for teams evaluating AoT adoption in Space-based demos.

Technical details or Implementation

The implementation described above hinges on a few concrete steps and considerations that practitioners can follow within a ZeroGPU Spaces workflow:

  • Prepare example inputs using spaces.aoti_capture to represent the real inference-time tensors that will flow through the transformer. This step is critical to ensure the exported program captures the right tensor shapes and dynamic ranges for deployment.
  • Export the transformer to a PyTorch ExportedProgram with torch.export.export. The exported program encapsulates the computations and parameters needed for AoT compilation.
  • Compile the exported program with spaces.aoti_compile, which wraps torch._inductor.aot_compile and manages persistence and lazy loading of the compiled artifact. This yields compiled_transformer ready for use in inference.
  • Patch the pipeline safely with spaces.aoti_apply to replace the transformer's forward with the compiled artifact, while cleaning up old parameters to avoid out-of-memory (OOM) errors at runtime. This avoids common pitfalls of straightforward in-place replacements.
  • Wrap the entire sequence inside a @spaces.GPU function. This ensures that the compilation and tuning can leverage actual CUDA hardware and micro-benchmark data, as opposed to emulation outside GPU contexts.
  • Inference with AoT: The compiled transformer is reintroduced into the pipeline, enabling instant reuse in subsequent runs without re-exporting, thus eliminating the cold-start delays typically incurred by JIT compilers such as torch.compile in short-lived processes.

Dynamic shapes and quantization go hand in hand with AoT. Here are a few practical knobs; a combined code sketch appears after the compatibility notes below:
  • FP8 quantization with TorchAO: Enables post-training dynamic quantization for image/video generation. FP8 requires CUDA compute capability 9.0 or higher and is well-supported on H200-based ZeroGPU hardware. This path adds about a 1.2× speedup on top of AoT’s improvements.
  • Dynamic shapes: Export with transformer_dynamic_shapes to define which input dimensions are dynamic. This allows the AoT compiler to generate optimized code paths for a range of input resolutions and shapes. In Flux.1-Dev, the example inputs and dynamic ranges are determined by inspecting hidden_states across varied resolutions.
  • Multi-model per resolution (Wan): When dynamism is too great for a single compiled model, a practical approach is to compile one AoT artifact per target resolution and dispatch the appropriate one at runtime while sharing model parameters; a dispatch sketch also follows below.

Compatibility and limitations:
  • Hardware: AoT compilation relies on actual GPU hardware to perform micro-benchmarks and tuning; it cannot be fully emulated by CPU-based CUDA emulation. Therefore, the AoT workflow must run inside a @spaces.GPU block to ensure hardware correctness.
  • MIG sizes: ZeroGPU currently allocates a 3g.71gb MIG slice of an H200 (71 GB of GPU memory and three of the seven compute slices); a full 7g.141gb slice is planned for late 2025. This affects how large the compiled components can be and what parts of the model can be loaded into GPU memory during inference.
  • FA3: Flash-Attention 3 is compatible with AoT in this workflow, providing additional speedups for attention calculations, though building FA3 from source and tuning it alongside AoT can take several minutes and remains hardware-dependent.

To ground the discussion, Flux.1-dev, Wan, and LTX are the models where speedups have been observed; in the demonstrative Flux.1-dev case, a ~1.7× speedup was achieved with AoT on ZeroGPU. The approach is designed to be practical for Spaces-based demos where latency, responsiveness, and resource usage are critical. For more on the AoT workflow and its benefits, readers can consult the Hugging Face blog post directly.
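To make the FP8 and dynamic-shape knobs concrete, here is a minimal sketch that reuses the pipe object from the earlier example. The Dim name, its min/max bounds, the choice of dimension index 1 on hidden_states, and the quantization config class (from recent TorchAO releases) are assumptions and would need to be checked against the model's actual input signature.

    import torch
    import spaces
    from torch.utils._pytree import tree_map
    from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig

    @spaces.GPU(duration=1500)
    def compile_transformer_fp8_dynamic():
        # FP8 post-training dynamic quantization (needs compute capability >= 9.0, e.g. H200).
        quantize_(pipe.transformer, Float8DynamicActivationFloat8WeightConfig())

        # Capture representative inputs as before; assumes the transformer
        # receives all inputs as keyword arguments (typical in diffusers).
        with spaces.aoti_capture(pipe.transformer) as call:
            pipe("arbitrary example prompt")

        # Start from an all-static spec mirroring the captured kwargs, then mark
        # the latent sequence dimension of hidden_states as dynamic.
        dynamic_shapes = tree_map(lambda _: None, call.kwargs)
        seq_len = torch.export.Dim("seq_len", min=256, max=4096)  # illustrative bounds
        dynamic_shapes["hidden_states"] = {1: seq_len}

        exported = torch.export.export(
            pipe.transformer,
            args=call.args,
            kwargs=call.kwargs,
            dynamic_shapes=dynamic_shapes,
        )
        return spaces.aoti_compile(exported)

    spaces.aoti_apply(compile_transformer_fp8_dynamic(), pipe.transformer)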
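For the per-resolution strategy used with Wan-style models, a hypothetical dispatcher can be sketched as follows; here pipe stands for a video pipeline rather than the Flux pipeline above, and the resolution list, pipeline call signature (height, width, num_frames), and forward-patching dispatch are all illustrative. The working implementation lives in the zerogpu-aoti-multi.py example and the Wan 2.2 Space.

    # Compile one AoT artifact per target resolution; the underlying module and
    # its parameters are shared, only the compiled graphs differ.
    RESOLUTIONS = [(480, 832), (720, 1280)]  # illustrative target sizes
    compiled_by_resolution = {}

    @spaces.GPU(duration=1500)
    def compile_all_resolutions():
        for height, width in RESOLUTIONS:
            with spaces.aoti_capture(pipe.transformer) as call:
                pipe("example prompt", height=height, width=width, num_frames=16)
            exported = torch.export.export(
                pipe.transformer, args=call.args, kwargs=call.kwargs
            )
            compiled_by_resolution[(height, width)] = spaces.aoti_compile(exported)

    compile_all_resolutions()

    @spaces.GPU
    def generate(prompt: str, height: int, width: int):
        # Dispatch: route the transformer's forward to the artifact compiled for
        # the requested resolution (a simplified stand-in for the real dispatcher).
        pipe.transformer.forward = compiled_by_resolution[(height, width)]
        return pipe(prompt, height=height, width=width, num_frames=16).frames[0]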

Key takeaways

  • AoT compilation unlocks instant re-use of compiled models across ZeroGPU processes, reducing cold-start latency.
  • A minimal integration with spaces.aoti_capture, spaces.aoti_compile, and spaces.aoti_apply makes it feasible to retrofit existing Spaces demos for AoT.
  • FP8 quantization and dynamic shapes extend speedups, with FP8 potentially adding around 1.2× and dynamic shapes enabling accurate handling of input variability.
  • Wan’s per-resolution compilation approach demonstrates how to maintain speed across highly variable inputs.
  • FA3 compatibility further accelerates attention mechanisms, enhancing end-to-end throughput.

FAQ

  • How do I enable AoT in a ZeroGPU Space?

    The workflow uses aoti_capture to capture transformer inputs, torch.export.export to export the model, spaces.aoti_compile to compile it, and spaces.aoti_apply to patch the pipeline safely. This sequence must run inside a @spaces.GPU function.

  • What kinds of models benefit most from AoT on ZeroGPU?

    Generative models where the transformer or denoiser is the most compute-heavy component (e.g., Flux.1-dev, Wan, LTX) see the largest benefits due to reduced inference latency after compilation.

  • Can I use FP8 quantization with AoT?

    Yes. FP8 quantization with TorchAO can be combined with AoT to yield additional speedups, provided the hardware supports FP8 (CUDA compute capability 9.0+ on H200).

  • Are there memory considerations I should know about?

    Patch-based integration via spaces.aoti_apply helps avoid keeping the full original parameters in memory, mitigating OOM risks during runtime.

References

  • Make your ZeroGPU Spaces faster with PyTorch ahead-of-time compilation (Hugging Face Blog): https://huggingface.co/blog/zerogpu-aoti