Connecting Distributed Data Centers into Large AI Factories with Scale-Across Networking
Source: https://developer.nvidia.com/blog/how-to-connect-distributed-data-centers-into-large-ai-factories-with-scale-across-networking/ (NVIDIA Dev Blog)
TL;DR
- Spectrum-XGS Ethernet enables scale-across networking to connect distributed data centers into a single AI factory over long distances (beyond 500 meters).
- It uses NVIDIA Spectrum-X Ethernet platform hardware (Spectrum-X switches and ConnectX-8 SuperNICs) with telemetry-based congestion control and distance-aware adaptive routing to minimize latency.
- In NCCL tests at 10 km, Spectrum-XGS delivers up to 1.9x higher all-reduce bandwidth than off-the-shelf Ethernet, especially for large messages.
- The technology unifies data centers regardless of proximity, enhancing fungibility of AI infrastructure and enabling large-scale single-job training and disaggregated inference.
- It addresses the latency and jitter issues of deep-buffer long-haul Ethernet, providing the predictable performance that synchronous AI workloads require.
Context and background
AI scaling is incredibly complex, and new techniques in training and inference continually demand more from the data center. While data center capabilities are scaling quickly, the underlying infrastructure is subject to fundamental physical limits that do not constrain algorithms and models: power availability, cooling capacity, and space all cap the physical footprint of an AI factory. To keep growing, new data centers are built, and connectivity over distance becomes the key to pooling these resources so they can work in tandem on a single training or disaggregated-inference workload.

Traditionally, when data centers were connected with long-haul Ethernet built from off-the-shelf merchant silicon, the principal objective was simply to ensure that data reached its destination. Because distances can be long and latencies high, the potential for congestion is also high, and its impact can be severe. To prevent dropped packets, off-the-shelf Ethernet vendors employ deep packet buffers capable of absorbing large bursts of network traffic. While deep-buffer switches are a workable solution for long-haul service providers and telecoms, they introduce problems for AI. Switches with deep buffers inherently suffer from higher latency, and when a buffer starts to fill, it must drain. For AI workloads this draining is unpredictable, causing large jitter, that is, variance in data delivery. The high latency and unpredictability of this shock-absorber approach are problematic for training and disaggregated inference, which are synchronous in nature and require predictable performance from the network.

This post explains how NVIDIA Spectrum-XGS Ethernet for scale-across networking enables inter-data-center connectivity with the high performance needed for AI.
Scale-across networking is a new category of AI compute fabric connectivity, a new dimension orthogonal to the existing options of scale-up and scale-out. With Spectrum-XGS Ethernet for scale-across networking, multiple data centers of varying sizes and distances can be unified into one large AI factory. For the first time, the network can deliver the performance needed to run a single large AI training or inference job across geographically separated data centers.

Spectrum-XGS Ethernet is a new technology addition to the NVIDIA Spectrum-X Ethernet platform. It is based on the same hardware combination of Spectrum-X Ethernet switches and ConnectX-8 SuperNICs, and it leverages the same stack of software and libraries used for scale-out connectivity within the data center. With Spectrum-XGS Ethernet, the connectivity is between AI factories over long distances, meaning beyond 500 meters: between buildings on a campus, or across tens or hundreds of miles, spanning cities or even states and countries. To make scale-across connectivity feasible, the algorithms responsible for ensuring high effective bandwidth and performance isolation had to evolve.

One challenge of moving data over long distances is the added latency, even for data traversing an optical fiber as light. Data propagates through the glass strands at roughly 5 nanoseconds per meter, so traveling 1 kilometer takes about 5 microseconds. These numbers may seem small in absolute terms, but for GPU-to-GPU communication, every microsecond counts. Spectrum-XGS Ethernet features modified telemetry-based congestion control and adaptive routing algorithms that are optimized for the distance between communicating devices. Whenever a connection is initiated, the network notes whether the two devices sit inside the same data center or not. This tells the switch how best to load-balance for adaptive routing, and tells the SuperNIC how to pace its injection rate for congestion control. At the network level, this lets Spectrum-XGS Ethernet handle communications holistically without incurring additional latency.

Key benefits of Spectrum-XGS Ethernet technology for scale-across networking include:
- To show the impact of NVIDIA Spectrum-XGS Ethernet on scale-across performance, NVIDIA engineers ran NCCL primitives across multiple sites 10 km apart and compared the results to off-the-shelf Ethernet. The results were significant: Spectrum-XGS Ethernet delivers up to 1.9x higher NCCL all-reduce bandwidth than off-the-shelf Ethernet, with the greatest speedup at the larger message sizes most common in AI training workloads. These NCCL improvements translate into faster job-completion times for AI applications.
- Spectrum-XGS Ethernet enhances the fungibility of AI infrastructure. By enabling data centers to communicate over any distance without performance degradation, it creates a common architecture shared between scale-out and scale-across networking. Ethernet data centers built on Spectrum-XGS Ethernet can readily be combined to act as one, regardless of proximity. This allows mission-critical AI infrastructure to pool resources and consistently deliver value for advanced AI workloads. To learn more about the technical innovations underpinning NVIDIA Spectrum-X Ethernet, see NVIDIA Spectrum-X Network Platform Architecture.
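The propagation-delay arithmetic above (5 ns per meter of fiber, so 5 µs per kilometer) can be sketched with a small calculation. The helper name below is illustrative, not part of any NVIDIA API:

```python
# One-way light propagation delay in optical fiber.
# Light travels through glass at roughly 5 ns per meter,
# as stated in the article (about 2/3 the speed of light in vacuum).

NS_PER_METER = 5  # nanoseconds of propagation delay per meter of fiber

def propagation_delay_us(distance_km: float) -> float:
    """One-way fiber propagation delay in microseconds (illustrative helper)."""
    return distance_km * 1000 * NS_PER_METER / 1000  # ns -> us

# 1 km of fiber adds 5 us each way; the 10 km test described above adds 50 us.
print(propagation_delay_us(1))   # 5.0
print(propagation_delay_us(10))  # 50.0
```

At the 10 km separation used in the NCCL tests, physics alone adds roughly 50 µs of one-way latency before any switching or buffering, which is why the distance-aware congestion control and routing matter.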
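For context on the benchmark numbers above: the nccl-tests suite reports an algorithm bandwidth (bytes moved divided by elapsed time) and a derived bus bandwidth; for all-reduce the standard conversion is busbw = algbw × 2(n−1)/n for n ranks. The sketch below applies that conversion together with the up-to-1.9x factor reported above. The 40 GB/s baseline and 64-rank count are made-up illustrative inputs, not measurements from the article:

```python
# Illustrative conversion between NCCL all-reduce algorithm bandwidth
# and bus bandwidth, per the nccl-tests convention:
#   busbw = algbw * 2 * (n - 1) / n   for n ranks.
# Input values are hypothetical, used only to show the 1.9x scaling.

def allreduce_busbw(algbw_gbps: float, n_ranks: int) -> float:
    """Bus bandwidth implied by an all-reduce algorithm bandwidth."""
    return algbw_gbps * 2 * (n_ranks - 1) / n_ranks

# Hypothetical: off-the-shelf Ethernet sustaining 40 GB/s algbw across
# 64 ranks, versus a 1.9x higher algbw on the same collective.
baseline = allreduce_busbw(40.0, 64)        # 78.75 GB/s busbw
improved = allreduce_busbw(40.0 * 1.9, 64)  # 149.625 GB/s busbw
print(f"{baseline:.2f} -> {improved:.2f} GB/s ({improved / baseline:.1f}x)")
```

Because the 2(n−1)/n factor is the same for both runs, a 1.9x gain in algorithm bandwidth carries through as a 1.9x gain in bus bandwidth, which is the headline figure quoted above.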
More news
First look at the Google Home app powered by Gemini
The Verge reports Google is updating the Google Home app to bring Gemini features, including an Ask Home search bar, a redesigned UI, and Gemini-driven controls for the home.
NVIDIA HGX B200 Reduces Embodied Carbon Emissions Intensity
NVIDIA HGX B200 lowers embodied carbon intensity by 24% vs. HGX H100, while delivering higher AI performance and energy efficiency. This article reviews the PCF-backed improvements, new hardware features, and implications for developers and enterprises.
Shadow Leak shows how ChatGPT agents can exfiltrate Gmail data via prompt injection
Security researchers demonstrated a prompt-injection attack called Shadow Leak that leveraged ChatGPT’s Deep Research to covertly extract data from a Gmail inbox. OpenAI patched the flaw; the case highlights risks of agentic AI.
Predict Extreme Weather in Minutes Without a Supercomputer: Huge Ensembles (HENS)
NVIDIA and Berkeley Lab unveil Huge Ensembles (HENS), an open-source AI tool that forecasts low-likelihood, high-impact weather events using 27,000 years of data, with ready-to-run options.
Scaleway Joins Hugging Face Inference Providers for Serverless, Low-Latency Inference
Scaleway is now a supported Inference Provider on the Hugging Face Hub, enabling serverless inference directly on model pages via the JS and Python SDKs, with access to popular open-weight models and scalable, low-latency AI workflows.
Google expands Gemini in Chrome with cross-platform rollout and no membership fee
Gemini AI in Chrome gains access to tabs, history, and Google properties, rolling out to Mac and Windows in the US without a fee, and enabling task automation and Workspace integrations.