Often overlooked by the public, the data center network carries all communication between nodes. NVIDIA expects data centers with millions of GPUs to be on the horizon, and the fastest AI models will need those GPUs interconnected, even across multiple facilities. That is why NVIDIA today introduced Spectrum-XGS Ethernet, an extension of the Spectrum-X networking platform designed to interconnect multiple geographically dispersed data centers into one AI super factory.
The company says Spectrum-XGS eliminates the capacity limitations of a single facility by introducing distance-aware networking that delivers predictable, low-latency performance across campuses, cities and continents.

The technology is primarily delivered through software and firmware updates to existing Spectrum-X switches and ConnectX SuperNICs, rather than through new silicon. Spectrum-XGS provides self-adjusting congestion control optimized for long-distance links, precise latency management that minimizes jitter, and comprehensive end-to-end telemetry, allowing operators to visualize and control network traffic across multiple sites.
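To see why congestion control has to account for distance, it helps to look at the bandwidth-delay product: the amount of data that must be in flight to keep a link full grows linearly with round-trip time. The sketch below is a textbook calculation, not NVIDIA's algorithm. The 400 Gb/s link rate, the example distances, and the function names are illustrative assumptions, and the fiber propagation speed is the usual approximation of about two-thirds the speed of light.

```python
# Illustrative only: a textbook bandwidth-delay product estimate,
# not a description of how Spectrum-XGS works internally.

# Assumption: light in optical fiber travels at roughly 2/3 of c.
FIBER_SPEED_M_PER_S = 2.0e8


def round_trip_time_s(distance_km: float) -> float:
    """Propagation-only round-trip time over a fiber path of this length."""
    return 2 * distance_km * 1000 / FIBER_SPEED_M_PER_S


def bandwidth_delay_product_bytes(link_gbps: float, distance_km: float) -> float:
    """Bytes that must be in flight to keep a link of this rate fully used."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * round_trip_time_s(distance_km)


if __name__ == "__main__":
    # Hypothetical distances standing in for campus, city and continent scale.
    scenarios = [("same campus", 1), ("across a city", 50), ("across a continent", 4000)]
    for label, km in scenarios:
        rtt_ms = round_trip_time_s(km) * 1e3
        bdp_mb = bandwidth_delay_product_bytes(400, km) / 1e6  # assumed 400 Gb/s link
        print(f"{label:>20}: RTT ~ {rtt_ms:.3f} ms, in-flight data ~ {bdp_mb:.1f} MB")
```

Under these assumptions, the in-flight data needed per link grows by several orders of magnitude between a campus link and a continental one, which is the kind of gap that congestion control tuned for a single building would not anticipate.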
NVIDIA reports that these improvements nearly double NVIDIA Collective Communications Library (NCCL) throughput for multi-GPU, multi-node training jobs and large-scale experiments, making distributed AI workloads more efficient. NVIDIA positions Spectrum-XGS as a new axis of growth for AI infrastructure: after scaling up within servers and scaling out within data centers, scaling across connects separate facilities into a unified computing fabric.

Hyperscale operators are preparing to adopt this approach. CoreWeave will be among the first companies to connect multiple facilities with Spectrum-XGS. The company will operate its distributed sites as a single supercomputer, providing customers with greater aggregate capacity and streamlined operations for giga-scale experiments and production training runs.
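Spectrum-XGS is part of the Spectrum-X platform and was demonstrated at Hot Chips. More details are expected to be announced at the conference, but large, continent-spanning training runs are no longer a pipe dream. With solutions like Spectrum-XGS, only the sky (and the power grid) is the limit.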