NVIDIA Nemotron Nano 2 9B Tops Leaderboard with 6x Throughput for Edge AI
Sources: https://huggingface.co/blog/nvidia/supercharge-ai-reasoning-with-nemotron-nano-2
TL;DR
- NVIDIA Nemotron Nano 2 9B is an open, enterprise‑grade reasoning model targeting edge deployments with high accuracy and efficiency. It combines a Hybrid Transformer–Mamba backbone and a configurable thinking budget.
- NVIDIA reports about 6x higher throughput versus the next best open model in its size class, and a potential saving of up to 60% in inference costs when using the thinking budget.
- The model is built from a post‑training process starting from a 12B teacher, then pruned to 9B with distillation to preserve accuracy, and it fits within A10G memory with a 128k context window.
- It supports two thinking modes: ON (chain‑of‑thought tokens included) and OFF (direct answer), with ON as the default; thinking can be controlled with special tags such as `</think>` to end the reasoning trace.
- NVIDIA is releasing the Nemotron family as open weights/datasets, aiming to support the open‑source community and customizable deployments via NVIDIA NIM and vLLM servers.
Context and background
AI agents are increasingly deployed from edge to cloud, where sophisticated reasoning and iterative planning are required to autonomously solve multi‑step problems. To maximize performance in edge environments, models must be both accurate and efficient. The Nemotron family is designed to provide open weights, open datasets, and training techniques for enterprise‑level reasoning. Nemotron Nano 2 9B is the newest Nano model in this family and is purpose‑built for enterprise‑grade reasoning and agentic AI. It introduces a configurable thinking budget that lets developers dial in how much internal reasoning the model performs, and it uses a hybrid Transformer–Mamba backbone to raise throughput while preserving accuracy. This combination suits PC and edge footprints while keeping inference costs under control. NVIDIA positions the Nemotron family as a platform for open science and practical deployments, encouraging developers to adopt individual components or the full Nemotron stack as their use cases require. The model is described as leading in accuracy within its size category across reasoning tasks such as math, coding, and science, while retaining effective instruction following and tool calling. The architecture is paired with a high‑throughput design that helps sustain reasoning tokens at low latency in edge environments.
What’s new
Nemotron Nano 2 9B introduces several capabilities and design choices intended to improve edge reasoning. It is an open Nano model in the Nemotron family, aimed at enterprise‑grade reasoning and agentic AI. The hybrid Transformer–Mamba backbone is central to its performance: a majority of Mamba‑2 selective state‑space modules run in linear time with constant memory per token, while a few Transformer attention islands preserve global, content‑based jumps. This enables higher tokens per second while keeping memory use in check. A notable feature is the configurable thinking budget, which lets users cap internal reasoning and reduce token generation when appropriate.

The training flow includes supervised fine‑tuning (SFT) on a balanced mix of reasoning and non‑reasoning data spanning mathematics, science, programming, tool use, general conversation, and safety. This is followed by focused reinforcement learning and preference‑based optimization to ensure alignment across a broad task spectrum.

The distillation step begins from the 12B base model NVIDIA‑Nemotron‑Nano‑12B‑v2‑Base, which was post‑trained and aligned for various tasks. The 12B model consumes 22.9 GiB of memory for its weights in bf16, which exceeds the 22 GiB memory capacity of the NVIDIA A10G GPU. To fit the Nano 2 memory budget, NVIDIA pruned the model to 9B parameters and ensured the compressed model can run within A10G memory with a 128k context. The memory budget for the compressed model is set to 19.66 GiB, leaving a 5% cushion for frameworks like vLLM and about 1.3 GiB for a vision encoder. The development team extended the Minitron model compression framework with a Neural Architecture Search (NAS) module to search for the best architecture under the memory budget: depth was reduced from 62 to 56 layers, and width pruning was applied across embedding channels, FFN dimension, and Mamba heads. After pruning, logit‑based knowledge distillation from the original 12B teacher produced the 9B Nano 2 model. The technical report covers the full details.

The thinking budget is implemented by inserting a special tag such as `</think>` to indicate that the model should stop thinking. This feature lets developers keep accuracy high while meeting response‑time targets, which is particularly useful for customer support, autonomous agent steps, and edge devices where latency matters. On the client side, sample code demonstrates restricting the budget to 32 tokens when connecting to a vLLM server. The model works in two thinking modes: ON, with a reasoning chain of thought, and OFF, with no thinking tokens; ON is enabled by default. The overall message is that Nano 2 9B achieves best‑in‑class accuracy for its size, offers substantially higher throughput than similar open models, and enables significant inference cost savings through the thinking budget. NVIDIA also opened several technical artifacts, including post‑training and pre‑training datasets.
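To make the thinking budget concrete, here is a minimal client sketch against a vLLM server exposing the OpenAI‑compatible API. It assumes the reasoning trace is delimited by a `</think>` tag and uses a two‑pass flow (generate up to the budget, close the trace, then request the final answer); the model id, endpoint, and budget handling are illustrative assumptions rather than NVIDIA's published sample code.

```python
# Hypothetical sketch: cap the thinking budget at 32 tokens via a vLLM server's
# OpenAI-compatible completions API. The two-pass flow and all names below
# (BASE_URL, MODEL, THINKING_BUDGET) are illustrative assumptions.
from openai import OpenAI
from transformers import AutoTokenizer

BASE_URL = "http://localhost:8000/v1"        # assumed local vLLM endpoint
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed Hub model id
THINKING_BUDGET = 32                         # cap on reasoning tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL)
client = OpenAI(base_url=BASE_URL, api_key="unused")

messages = [{"role": "user",
             "content": "A train covers 120 km in 90 minutes. Average speed in km/h?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)

# Pass 1: generate the reasoning trace, truncated at the thinking budget.
first = client.completions.create(model=MODEL, prompt=prompt,
                                  max_tokens=THINKING_BUDGET, stop=["</think>"])
reasoning = first.choices[0].text

# Pass 2: close the trace explicitly and request the final answer.
final = client.completions.create(model=MODEL,
                                  prompt=prompt + reasoning + "</think>",
                                  max_tokens=256)
print(final.choices[0].text)
```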
Why it matters (impact for developers/enterprises)
For developers and enterprises, Nemotron Nano 2 9B represents an open, hardware‑friendly path to deploying reasoning agents at scale. The model’s open weights and training artifacts lower barriers to adaptation, while the hybrid architecture delivers the high throughput needed for low‑latency decisions in edge environments. The 128k context support and memory budgeting enable runs on GPUs like the NVIDIA A10G within their memory constraints, which is critical for long thinking traces and complex multi‑step tasks. The ability to adjust the thinking budget helps providers balance accuracy, latency, and token costs, potentially driving meaningful reductions in inference spend. The openness of the Nemotron family aligns with a broader move toward open science and practical enterprise deployments, and NVIDIA notes that it is designed to bolster the open‑source community through open datasets and training techniques, with deployment pathways via NVIDIA NIM and vLLM servers. The claim of 6x higher throughput versus the next best open model in the same size class is a key differentiator for edge AI workflows that require rapid reasoning, such as RAG lookups, tool use, and real‑time decision making. The ability to run 128k tokens of context reduces the need for frequent off‑device interactions, further cutting latency. Enterprises also benefit from the thinking budget as a way to right‑size the model for their domain, aligning response‑time targets with accuracy expectations. The combination of high throughput, lower per‑token cost, and open accessibility makes Nano 2 9B a compelling option for teams building agentic AI systems and deploying them on edge devices or in private data centers. The official NVIDIA blog post serves as the primary source for these claims.
Technical details or Implementation
Architecture and memory efficiency
Nemotron Nano 2 9B uses a Hybrid Transformer–Mamba backbone designed for reasoning‑heavy, long‑output workloads. Most layers are Mamba‑2 selective state‑space modules that run in linear time and maintain constant memory per token because they do not accumulate a growing KV‑cache. Interleaved among them are a small number of attention islands that preserve the Transformer strength in content‑based global jumps, useful for linking distant facts or instructions. In practice, the hybrid approach preserves Transformer‑grade accuracy while leaning on Mamba for throughput gains. The model supports a 128k context window and is engineered to fit within the A10G memory limits with a precise budget of 19.66 GiB for weights and associated buffers.
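The memory argument behind the hybrid layout can be illustrated with a quick back‑of‑the‑envelope calculation: at 128k context, every attention layer must keep a KV cache that grows with sequence length, while a Mamba‑2 layer carries only a fixed‑size recurrent state. The dimensions below are placeholder assumptions, not Nano 2's actual configuration.

```python
# Rough per-layer decode-time memory at 128k context: growing KV cache for an
# attention layer vs. a fixed recurrent state for a Mamba-2 layer. All sizes
# are illustrative assumptions, not Nemotron Nano 2's real dimensions.
BYTES_BF16 = 2
context_len = 128_000           # 128k-token context
hidden = 4096                   # assumed model width
n_kv_heads, head_dim = 8, 128   # assumed grouped-query attention layout
state_dim = 128                 # assumed Mamba-2 state size per channel

# Attention layer: K and V vectors cached for every past token.
kv_cache_bytes = 2 * context_len * n_kv_heads * head_dim * BYTES_BF16

# Mamba-2 layer: a single recurrent state, independent of sequence length.
ssm_state_bytes = hidden * state_dim * BYTES_BF16

print(f"KV cache per attention layer @ 128k: {kv_cache_bytes / 2**30:.2f} GiB")
print(f"SSM state per Mamba-2 layer:         {ssm_state_bytes / 2**20:.2f} MiB")
```

Under these assumptions each attention layer would need roughly half a GiB of cache at full context while each Mamba‑2 layer needs about a MiB, which is why limiting attention to a few islands leaves room for the weights inside the 19.66 GiB budget.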
Post‑training process and distillation
The Nano 2 training flow includes supervised fine‑tuning on a balanced mixture of reasoning and non‑reasoning data spanning mathematics, science, programming, tool use, general conversation, and safety. Post‑training refinement continues with focused reinforcement learning and preference‑based optimization to ensure alignment across a broad task spectrum. The distillation step begins from the 12B base model NVIDIA‑Nemotron‑Nano‑12B‑v2‑Base, which is pruned and then retrained through logit‑based knowledge distillation to produce the 9B Nano 2 model. The 12B teacher model consumes 22.9 GiB of memory in bf16, which exceeds the target hardware memory, motivating the pruning and distillation pipeline.
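The blog does not spell out the distillation objective, but logit‑based knowledge distillation is conventionally a KL‑divergence loss between the teacher's and student's next‑token distributions. The sketch below is that generic formulation in PyTorch; the temperature, weighting, and cross‑entropy mix are assumptions, not NVIDIA's actual recipe.

```python
# Generic logit-based knowledge distillation: train the pruned 9B student to
# match the 12B teacher's next-token distribution. Hyperparameters here are
# illustrative assumptions, not NVIDIA's published training recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """logits: [batch, seq, vocab]; labels: [batch, seq] with -100 for padding."""
    # Soft targets: KL between temperature-scaled teacher and student distributions.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

    # Hard targets: standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    return alpha * kd + (1.0 - alpha) * ce
```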
Model compression and memory budgeting
To produce Nano 2, NVIDIA employs pruning along several axes: depth, embedding channels, FFN dimension, and Mamba heads. The depth is reduced from 62 to 56 layers after a NAS‑driven search, followed by width pruning to locate the best configuration at that depth. A two‑stage distillation process recovers performance from the teacher. The final memory budget is 19.66 GiB, with a 5% margin and 1.3 GiB reserved for a vision encoder, reflecting an end‑to‑end design that fits on typical edge hardware. This approach aims to deliver higher tokens per second than pure Transformer models while maintaining accuracy.
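The figures in this section can be sanity‑checked with simple bf16 arithmetic (2 bytes per parameter). The parameter counts below are rounded assumptions used only to reproduce the reported numbers; the 19.66 GiB budget itself is NVIDIA's.

```python
# Sanity-check of the quoted memory figures, assuming 2 bytes per bf16 parameter.
# Parameter counts are rounded assumptions for illustration.
GIB = 2 ** 30
BYTES_BF16 = 2

teacher_params = 12.3e9   # assumed rounded size of the 12B teacher
student_params = 8.9e9    # assumed rounded size of the 9B student

print(f"12B teacher weights: {teacher_params * BYTES_BF16 / GIB:.1f} GiB")  # ~22.9 GiB
print(f"9B student weights:  {student_params * BYTES_BF16 / GIB:.1f} GiB")  # ~16.6 GiB

# Budgeting on an A10G-class GPU: take the ~22 GiB capacity, keep a ~5% cushion
# for serving frameworks such as vLLM, and reserve ~1.3 GiB for a vision encoder.
capacity_gib = 22.0
usable_gib = capacity_gib * 0.95 - 1.3
print(f"Approximate usable budget: {usable_gib:.1f} GiB (reported: 19.66 GiB)")
```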
Inference considerations and client integration
The model is designed for speed and efficiency in reasoning applications, including low‑latency scenarios common to edge deployments. The team demonstrates a vLLM server integration and provides a client example showing how to apply a thinking budget, for instance restricting thinking to 32 tokens. Two thinking modes exist: Reasoning ON with a chain of thought and Reasoning OFF that returns the final answer without thinking tokens. The design also emphasizes cost control, with the potential to reduce inference costs by up to 60% when using the thinking budget. The project makes the thinking budget an integral part of the downloadable NIM artifact, enabling developers to tailor accuracy and latency to their domain needs.
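For the two thinking modes, a minimal chat‑completions sketch against the same vLLM endpoint is shown below. The `/think` and `/no_think` system‑prompt controls are assumptions based on common Nemotron conventions; the model card documents the exact toggle syntax.

```python
# Hypothetical sketch: toggle Reasoning ON vs. OFF through the system prompt
# when calling a vLLM server's OpenAI-compatible chat API. The control strings
# "/think" and "/no_think" are assumed, not confirmed by the source post.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed Hub model id

def ask(question: str, reasoning_on: bool) -> str:
    system = "/think" if reasoning_on else "/no_think"
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

print(ask("Summarize the Pythagorean theorem in one sentence.", reasoning_on=False))
print(ask("Prove that the square root of 2 is irrational.", reasoning_on=True))
```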
Key specifications (at a glance)
| Item | Detail |
|---|---|
| Model size | 9B Nano 2 (distilled from 12B teacher) |
| Base architecture | Hybrid Transformer–Mamba backbone |
| Pruned layers | 56 (from 62) |
| Context window | 128k |
| Memory budget | 19.66 GiB for weights and buffers |
| Vision encoder memory reservation | 1.3 GiB |
| Throughput relative to peers | 6x higher than the next best open model in its size class |
| Thinking budget effect | Up to 60% inference cost reduction potential |
Key takeaways
- Nano 2 9B is the newest open Nano model in NVIDIA's Nemotron family, optimized for enterprise reasoning and edge deployments.
- The Hybrid Transformer–Mamba backbone delivers Transformer-grade accuracy with Mamba-driven throughput and linear-time, constant-memory operation per token.
- The model uses a configurable thinking budget to right‑size accuracy, latency, and cost, supported by a post‑training and distillation pipeline from a 12B teacher.
- It supports 128k context inference and is designed to fit within A10G memory budgets with a dedicated memory plan that includes buffers for frameworks like vLLM.
- NVIDIA open sourced multiple technical artifacts and aims to make the Nemotron family broadly accessible to developers via open datasets and NVIDIA NIM deployments.
FAQ
- What is Nemotron Nano 2 9B? An open Nano model in the Nemotron family built for enterprise‑grade reasoning, with a Hybrid Transformer–Mamba backbone and a configurable thinking budget.
- What is the thinking budget? A user‑defined limit on internal reasoning that can reduce token generation and inference costs while preserving accuracy; it is enforced on the client side by closing the reasoning trace with a tag such as `</think>`.
- How does the model achieve high throughput? The Hybrid Transformer–Mamba backbone uses Mamba‑2 selective state‑space modules for most layers, with linear‑time operation and constant per‑token memory, plus a few attention islands that retain global context.
- What are the memory and hardware constraints? The 12B teacher uses 22.9 GiB in bf16; Nano 2 is compressed to 9B to fit a 19.66 GiB budget on the A10G, with 1.3 GiB reserved for a vision encoder and support for a 128k context.
- How can developers access and deploy Nano 2? NVIDIA is releasing open weights and datasets, with deployment paths via NVIDIA NIM and vLLM; read the official post and try the model at build.nvidia.com.