Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training

Source: https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/ (NVIDIA Developer Blog)

Overview

Fine-tuning gpt-oss for accuracy and performance uses a two-stage workflow: supervised fine-tuning (SFT) at higher precision, followed by quantization aware training (QAT) to recover accuracy at the target low precision. The approach centers on upcasting to a higher precision to stabilize gradients, then using QAT to adapt the weights back to FP4 so that deployment efficiency is preserved. The workflow is demonstrated on gpt-oss, the open-source foundation model family with a mixture-of-experts (MoE) architecture, a 128K context length, and a largest variant, gpt-oss-120B, that achieves competitive results on open benchmarks.

The complete recipe is implemented in the NVIDIA Model Optimizer repository and was adapted from Hugging Face's gpt-oss-recipes to integrate QAT and related components. The central challenge it addresses is recovering FP4 accuracy while keeping the efficiency benefits of low-precision inference: by upcasting to BF16 for SFT and then applying QAT targeting MXFP4, the workflow reinforces task-specific behavior and aligns the weights with the target low-precision format. The results show significant gains on downstream tasks and point toward even tighter convergence once NVFP4 is supported.
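
To make the two-stage flow concrete, here is a minimal sketch in the style of the Model Optimizer QAT APIs. It is not the article's exact script: the model ID, the MXFP4 config name, and the omitted training loops are illustrative assumptions, and the real recipe lives in the Model Optimizer repository.

# Hedged sketch of the SFT + QAT flow (assumes transformers and nvidia-modelopt are installed).
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Stage 1: upcast to BF16 for stable gradients, then run supervised fine-tuning.
# (A smaller gpt-oss variant is used here purely for illustration.)
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype=torch.bfloat16)
# ... SFT loop in BF16 to reinforce task-specific behavior (omitted) ...

# Stage 2: insert fake quantization for the FP4 target, then keep training (QAT)
# so the BF16 weights adapt to the format they will be deployed in.
qat_cfg = mtq.MXFP4_DEFAULT_CFG  # assumed config name; check modelopt.torch.quantization for the exact identifier
model = mtq.quantize(model, qat_cfg, forward_loop=None)  # a short calibration forward_loop is typically supplied
# ... continue the same training loop for a brief QAT phase (omitted) ...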

Key features

  • Two-stage fine-tuning: high-precision SFT followed by quantization aware training (QAT) to FP4.
  • Upcasting mechanism: upcast to BF16 for stable gradient accumulation before applying QAT.
  • FP4 target formats: MXFP4 as the initial low-precision target, with NVFP4 as a forthcoming higher-accuracy FP4 variant.
  • Model scope: gpt-oss with MoE architecture, 128K context, up to gpt-oss-120B.
  • Code availability: the complete recipe is provided via the NVIDIA Model Optimizer repository.
  • Practical improvements: two downstream evaluation tasks improved from 16% and 30% pass rates to 98% pass rates after the recipe.
  • NVFP4 benefits: NVFP4 shows better convergence and 2–3% lower validation loss than MXFP4 in the same workflow.
  • Ecosystem readiness: upcoming NVFP4 support in NVIDIA TensorRT-LLM and priority enablement across other open-source inference frameworks.
  • Deployment workflow: after FP4 fine-tuning, a convenience script exports BF16-trained checkpoints to MXFP4, with validation across upstream SGLang, TensorRT-LLM, and vLLM; deployment demonstrated with TensorRT-LLM 1.1.0rc1.
  • Future trajectory: NVFP4 aims to deliver tighter convergence and improved margins for stricter thresholds and deeper reasoning.

Common use cases

  • Improve non-English reasoning and other task-specific behaviors using multilingual data (OpenAI Cookbook dataset).
  • Reduce unnecessary refusals of safe user prompts (FalseReject dataset from Amazon).
  • Deploy large open-source models in production environments with low tolerance for errors (e.g., healthcare, finance).
  • Enable higher-quality task performance on low-precision hardware without sacrificing deployment efficiency.
  • Prepare models for forward-looking hardware and frameworks (NVFP4 readiness in TensorRT-LLM and other inference frameworks).
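
For the first two use cases, data preparation amounts to pulling the referenced datasets from the Hugging Face Hub. A hedged sketch follows; the dataset IDs are assumptions based on the datasets' public names and are not quoted in the article, so verify them against the article's links.

# Hedged sketch: loading candidate SFT data for the multilingual and over-refusal use cases.
from datasets import load_dataset

multilingual = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")  # assumed ID for the OpenAI Cookbook multilingual data
false_reject = load_dataset("AmazonScience/FalseReject", split="train")            # assumed ID for Amazon's FalseReject dataset; split names may differ
print(len(multilingual), len(false_reject))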

Setup & installation

Setup and installation details live in the NVIDIA Model Optimizer repository (TensorRT-Model-Optimizer on GitHub). The article does not include exact commands; consult the repository for the code and scripts that implement the SFT + QAT workflow and the FP4 exports.

# Setup commands not provided in the source. See the Model Optimizer repository for exact steps.

Quick start

The workflow is designed as a two-stage process: upcast to a higher precision for SFT, then apply QAT to return to the target FP4 precision, followed by exporting the checkpoint for deployment. A minimal, high-level outline is provided here (the source emphasizes that exact commands belong in the Model Optimizer repository and accompanying docs).

  1. Start from a gpt-oss baseline checkpoint (e.g., gpt-oss-120B).
  2. Upcast to BF16 and run supervised fine-tuning (SFT) to reinforce task-specific behavior.
  3. Apply quantization aware training (QAT) to align weights to MXFP4 (and later NVFP4).
  4. Export the resulting FP4-weighted checkpoint to a PyTorch-compatible format using the provided export tooling.
  5. Validate on downstream tasks and prepare for deployment with TensorRT-LLM.

Note: the article states that skipping the high-precision SFT step and going straight to QAT yields lower accuracy, so the two-stage approach is recommended.

# Quick-start placeholder (conceptual)
print("Refer to the NVIDIA Model Optimizer repo for exact runnable steps.")

Pros and cons

  • Pros
      • Restores post-training accuracy while preserving FP4 deployment efficiency.
      • Upcasting enables stable gradient accumulation during SFT before QAT.
      • MXFP4 and NVFP4 provide practical paths for FP4-based inference with improved convergence (NVFP4 shows 2–3% lower validation loss in comparisons).
      • The workflow yields high downstream pass rates (e.g., 98% on two targeted tasks).
      • Convenience tooling exists to export BF16-trained checkpoints to MXFP4 for deployment and validation across frameworks.
  • Cons
      • Requires a two-stage workflow (not just QAT alone), which can increase setup complexity.
      • NVFP4 support is upcoming, with full integration across TensorRT-LLM and other frameworks not yet universal.
      • Exact setup commands and code live in the Model Optimizer repository rather than the article itself, adding a dependency on external documentation.

Alternatives (brief comparisons)

| Approach | Notes | Pros | Cons |
|---|---|---|---|
| MXFP4 with SFT + QAT (current proven path) | Two-stage FP4 recovery via upcast and QAT | Restores accuracy; maintains FP4 efficiency; validated against open benchmarks | Requires upcasting and QAT workflow; may need model-specific tuning |
| NVFP4 with SFT + QAT (upcoming) | FP4 format built for training and inference on Blackwell, with up to 15 PFLOPs of FP4 compute | Potentially tighter convergence; better validation loss; easier path to deeper reasoning | Availability depends on TensorRT-LLM and framework support; code changes may be needed (a one-line adaptation is noted) |
| SFT alone (no QAT), upcast only | Not the recommended path in the source; upcast without FP4 targeting | Simpler workflow | Lower likelihood of FP4-accurate deployment; accuracy recovery incomplete |
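
The "one-line adaptation" noted for NVFP4 can be read as swapping the QAT quantization config. A minimal hedged illustration follows; both config names are assumptions rather than identifiers from the article.

# Hedged illustration of the one-line MXFP4 -> NVFP4 switch (config names are assumed;
# check modelopt.torch.quantization for the identifiers shipped with your version).
import modelopt.torch.quantization as mtq

qat_cfg = mtq.MXFP4_DEFAULT_CFG    # current proven path (assumed name)
# qat_cfg = mtq.NVFP4_DEFAULT_CFG  # forthcoming NVFP4 target (assumed name)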

Pricing or License

Pricing or licensing information is not provided in the source article.
