Table of Contents
Introduction: Understanding Network Fabric Through All-Reduce
Overview of the All-Reduce Operation
Starting the Journey: GPU A and Intra-Node Communication
Preparing for Cross-Node Travel: GPUDirect RDMA and Network Interfaces
The Long-Haul: Cross-Node Communication via InfiniBand and RDMA
Traffic Control: Switching Fabric and Routing Between Nodes
Arriving at the Destination: Receiving Data at GPU B
Putting It All Together: Unified Network Fabric in Action
Conclusion: Reflecting on Our Journey from GPU A to GPU B
1. Introduction: Understanding Network Fabric Through All-Reduce
Training large language models (LLMs) with trillions of parameters requires an intricate network of high-performance computing resources. This article explores how technologies like NVLink, NVSwitch, InfiniBand, RDMA, NCCL, and GPUDirect work together to enable efficient, scalable LLM training across massive AI clusters.
The scale of modern LLMs, often encompassing billions or even trillions of parameters, necessitates distributed training techniques across hundreds or thousands of GPUs. This distributed approach introduces a critical challenge: efficient communication between GPUs to synchronize model updates.
To comprehensively understand the complex network fabric that enables this massive-scale training, we'll use the all-reduce operation as our guide. The all-reduce operation is a fundamental component of distributed training, responsible for synchronizing gradients across all GPUs to ensure consistent model updates. We've chosen this operation for several reasons:
Ubiquity: All-reduce occurs frequently during training, making it a representative example of network traffic in AI clusters.
Comprehensive coverage: It involves both intra-node and inter-node communication, allowing us to explore all levels of the network fabric.
Performance critical: The efficiency of all-reduce directly impacts training speed, making it a key focus for optimization efforts.
Throughout this article, we'll trace the journey of a single gradient update as it travels from GPU A in one NVIDIA GB200 NVL72 node to GPU B in another NVIDIA GB200 NVL72 node during an all-reduce operation. This trace will serve as a throughline, allowing us to examine each component of the network fabric in context:
We'll start within a single node, exploring how NVLink and NVSwitch facilitate high-speed communication between GPUs.
We'll then follow the data as it moves to the node's network interface, examining the role of technologies like GPUDirect RDMA and the BlueField-3 SuperNIC.
From there, we'll trace the inter-node communication path, examining the capabilities of the NVIDIA Quantum InfiniBand network fabric.
Finally, we'll see how the data arrives at GPU B, completing our end-to-end journey through the network fabric.
By following this concrete example, we'll gain a practical understanding of how each technology contributes to the overall performance of the AI supercluster. This approach will illuminate not just the individual components, but also how they interact to create a unified, high-performance network capable of supporting the most demanding AI workloads.
Let's begin our journey through the network fabric, tracing the path of our gradient update from GPU A to GPU B, and uncover the intricate technologies that make training trillion-parameter models possible.
2. Overview of the All-Reduce Operation
Before we embark on our journey from GPU A to GPU B, let's understand the all-reduce operation and its significance in LLM training.
The All-Reduce Operation Explained
The all-reduce operation is a collective communication process used during the training of large models to synchronize the gradients—the adjustments to model parameters calculated by each GPU during backpropagation. After each GPU computes its gradients based on the training data it processes, these gradients must be shared and summed across all GPUs to ensure the model's parameters are updated consistently.
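To make the semantics concrete, here is a minimal sketch of an all-reduce call using PyTorch's torch.distributed API with the NCCL backend. It assumes a launcher such as torchrun that sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables, and the tensor contents are placeholders for real gradients:

```python
# Minimal all-reduce sketch (assumes PyTorch + NCCL and a torchrun-style launcher
# that sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables).
import os
import torch
import torch.distributed as dist

def main():
    # NCCL is the backend that actually drives the NVLink/InfiniBand transports.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a gradient shard computed during backpropagation.
    grad = torch.full((4,), float(dist.get_rank()), device="cuda")

    # Element-wise sum across every rank; all ranks end up with identical data.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {grad.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=8 allreduce_demo.py (the script name is arbitrary), every rank prints the same summed values, which is exactly the property gradient synchronization relies on.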
Why All-Reduce is Vital for Massive Model Training
For a trillion-parameter model, each GPU calculates gradients that can reach gigabytes in size per training step. Both the forward pass (where predictions are made) and the backward pass (where gradients are computed) must run efficiently, and high-bandwidth communication is necessary to avoid stalls during the all-reduce that follows. Any lag in synchronizing gradients can significantly slow down the overall training time.
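A rough back-of-envelope sketch makes the scale tangible. The parameter count, gradient precision, and single-link speed below are illustrative assumptions, not measurements of any particular system:

```python
# Back-of-envelope gradient-synchronization estimate (illustrative assumptions only).
params = 1.0e12          # assumed model size: one trillion parameters
bytes_per_grad = 2       # assumed fp16/bf16 gradients
payload_bytes = params * bytes_per_grad          # ~2 TB of raw gradients per step

link_gbps = 400                                   # assumed NDR InfiniBand port speed
link_bytes_per_s = link_gbps / 8 * 1e9            # 400 Gb/s is roughly 50 GB/s

# A ring all-reduce moves roughly twice the payload per participant at large scale.
transfer_bytes = 2 * payload_bytes
seconds = transfer_bytes / link_bytes_per_s
print(f"~{seconds:.0f} s per step if a single 400 Gb/s link carried the full payload")
```

In real deployments the payload is sharded across data-, tensor-, and pipeline-parallel groups, and many NVLink and InfiniBand links operate in parallel; that machinery is precisely what the rest of this article walks through.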
Networking Requirements for LLM Training
Large-scale LLM training depends on fast and efficient communication between GPUs. The all-reduce operation requires each GPU to share its computed gradients with every other GPU in the cluster, regardless of their location within the same node or across different nodes.
Key requirements for networking in LLM training include:
High bandwidth to handle the large volume of data transferred during gradient synchronization
Low latency to ensure training iterations proceed quickly, minimizing wait times for data traversal
Scalability to maintain efficient communication as the number of GPUs increases to hundreds or thousands
As we trace the path of a gradient update from GPU A to GPU B, we'll see how various technologies address these requirements.
3. Starting the Journey: GPU A and Intra-Node Communication
Our journey begins with GPU A, located within an NVIDIA GB200 NVL72 node. Before the gradient data can be sent to GPU B in another node, it first needs to interact with other GPUs within its own node. This is where NVLink and NVSwitch come into play.
NVLink: The Foundation of Intra-Node Communication
NVLink technology has evolved to meet the increasing demands of LLM training:
DGX H100 systems: fourth-generation NVLink offers up to 900 GB/s of bidirectional bandwidth per GPU
GB200 NVL72 systems: fifth-generation NVLink offers up to 1.8 TB/s of bidirectional bandwidth per GPU
This enhancement allows GPU A to share its gradients with other GPUs in the same node more efficiently, reducing the likelihood of bottlenecks during intra-node synchronization.
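A quick way to confirm that the GPUs in a node can address each other directly is to query peer access from PyTorch. This is a minimal sketch; on NVLink/NVSwitch systems every pair normally reports True (note that PCIe peer-to-peer also reports True, so this is a rough proxy rather than a definitive NVLink check):

```python
# Report which GPU pairs in this node support direct peer-to-peer access.
import torch

n = torch.cuda.device_count()
for src in range(n):
    peers = [dst for dst in range(n)
             if dst != src and torch.cuda.can_device_access_peer(src, dst)]
    print(f"GPU {src} can directly access: {peers}")
```

From the command line, nvidia-smi topo -m gives a similar picture, including which GPU pairs are connected by NVLink.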
NVSwitch: Optimizing Multi-GPU Communication
NVSwitch technology in the GB200 NVL72 systems provides significant improvements:
DGX H100: third-generation NVSwitch chips connect all eight GPUs in the node
GB200 NVL72: fourth-generation NVSwitch chips, exposed as the NVLink Switch fabric, connect all 72 GPUs in the rack and double the per-GPU bandwidth
With roughly 130 TB/s of aggregate NVLink bandwidth across the 72-GPU NVLink domain, this switch fabric enables faster gradient aggregation, further optimizing the all-reduce operation.
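To make the aggregation step concrete, the following is a purely illustrative, pure-Python simulation of the ring all-reduce algorithm that collective libraries such as NCCL commonly use. Plain lists stand in for GPU buffers, and no claim is made that this mirrors NCCL's internal implementation:

```python
# Toy ring all-reduce: each "GPU" holds a vector, and after a reduce-scatter
# phase followed by an all-gather phase, every GPU holds the element-wise sum.
def ring_all_reduce(buffers):
    n = len(buffers)                      # number of simulated GPUs
    chunks = [list(b) for b in buffers]   # copy so inputs are not mutated
    size = len(chunks[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    step = size // n

    # Phase 1: reduce-scatter. After n-1 steps, GPU i owns the fully
    # reduced chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            c = (i - s) % n               # chunk index being forwarded
            lo, hi = c * step, (c + 1) * step
            for k in range(lo, hi):
                chunks[dst][k] += chunks[src][k]

    # Phase 2: all-gather. Each fully reduced chunk circulates to every GPU.
    for s in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            c = (i + 1 - s) % n
            lo, hi = c * step, (c + 1) * step
            chunks[dst][lo:hi] = chunks[src][lo:hi]
    return chunks

# Four "GPUs", each contributing a vector of four values equal to its rank.
gpus = [[float(rank)] * 4 for rank in range(4)]
print(ring_all_reduce(gpus))   # every GPU ends with [6.0, 6.0, 6.0, 6.0]
```

The appeal of the ring formulation is that each participant sends and receives only about twice its share of the data regardless of how many GPUs take part, which is what lets the same algorithm scale across NVLink within a node and InfiniBand between nodes.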
At this stage, GPU A has shared its gradient data with other GPUs in its node. The next step is to prepare this data for its journey to GPU B in another node.
4. Preparing for Cross-Node Travel: GPUDirect RDMA and BlueField-3 SuperNIC
As our gradient data from GPU A prepares to leave its node, it needs to be efficiently transferred to the network interface. This is where GPUDirect RDMA and the BlueField-3 SuperNIC play crucial roles.
GPUDirect RDMA: Streamlining GPU-to-GPU Communication
NVIDIA’s high-performance computing solutions, especially in AI and data center applications, heavily rely on efficient data transfer mechanisms to maximize GPU performance. Key technologies facilitating these high-speed data transfers include RDMA, GPUDirect RDMA, RoCE, and the combination of RoCE + GPUDirect RDMA.
1. RDMA (Remote Direct Memory Access)
RDMA is a networking protocol that allows direct memory access between two systems over a network, bypassing the CPU to minimize latency and free up processing resources. In the context of NVIDIA, RDMA enables rapid data exchange in high-performance computing environments, such as AI model training. RDMA is a general-purpose protocol and can be used for CPU-to-CPU communication, CPU-to-storage, or GPU-to-GPU data transfers.
2. GPUDirect RDMA
GPUDirect RDMA is an NVIDIA-specific technology that extends the capabilities of RDMA to GPU memory. It allows data to move directly between GPUs across different nodes in a network without involving the CPU or main system memory. This feature is crucial for distributed deep learning workloads, where multiple GPUs need to share data with minimal latency. For example, during model training, GPUs can directly exchange gradients or weights over a high-speed network, significantly enhancing overall throughput.
Supported Products:
NVIDIA GPUs: A100, H100, GH200, and GB200
NVIDIA InfiniBand Network Adapters: ConnectX-5, ConnectX-6, ConnectX-7, ConnectX-8
NVIDIA SuperNICs
NVIDIA Quantum InfiniBand switches
NVIDIA BlueField DPUs
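From a framework user's point of view, GPUDirect RDMA is usually enabled and observed through NCCL's environment settings rather than invoked directly. The sketch below uses real NCCL variable names, but the values are illustrative assumptions for a hypothetical InfiniBand cluster and would normally be tuned per deployment:

```python
# Illustrative NCCL settings for an InfiniBand + GPUDirect RDMA cluster.
# The variable names are real NCCL knobs; the values are assumptions for a
# hypothetical cluster, not a recommended configuration.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")         # log which transport each connection uses
os.environ.setdefault("NCCL_IB_HCA", "mlx5")        # restrict NCCL to the InfiniBand HCAs
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")  # allow GPUDirect RDMA at system scope

# torch.distributed / NCCL must be initialized *after* these are set, e.g.:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")
```

With NCCL_DEBUG=INFO, NCCL reports at startup which transport it selected for each peer connection, which is a convenient way to check whether GPUDirect RDMA is actually in use.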
3. RoCE (RDMA over Converged Ethernet)
RoCE enables RDMA functionality over standard Ethernet networks, providing a cost-effective way to achieve low-latency, high-throughput data transfers similar to InfiniBand. In NVIDIA's networking products, RoCE is often used in data center environments to facilitate fast memory access, whether for server-to-server communication, accessing remote storage, or integrating with GPUs. However, RoCE by itself does not imply GPU-to-GPU communication; it is simply a protocol that can transport RDMA operations over Ethernet.
4. RoCE + GPUDirect RDMA
When RoCE is combined with GPUDirect RDMA, it becomes a powerful solution for direct GPU-to-GPU communication over Ethernet. This combination allows GPUs across different nodes to exchange data directly via Ethernet without involving the CPU, minimizing latency and maximizing data transfer efficiency. This setup is ideal for modern data center environments that leverage Ethernet infrastructure for their AI and high-performance computing workloads.
Supported Products:
NVIDIA GPUs: A100, H100, GH200, and GB200
NVIDIA ConnectX Network Adapters (in Ethernet/RoCE mode): ConnectX-5, ConnectX-6, ConnectX-7, ConnectX-8
NVIDIA SuperNICs
NVIDIA Spectrum-X Ethernet Switches
NVIDIA BlueField DPUs
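The RoCE + GPUDirect RDMA case is configured along the same lines, except that NCCL also needs to know which Ethernet interface and RoCE GID index to use. Again, the variable names are real NCCL settings, while the interface name and GID index below are assumptions that depend entirely on the host's network configuration:

```python
# Illustrative NCCL settings for GPUDirect RDMA over RoCE (Ethernet).
# Variable names are real NCCL knobs; the interface name and GID index are
# site-specific assumptions, not defaults.
import os

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")   # interface for bootstrap traffic
os.environ.setdefault("NCCL_IB_HCA", "mlx5")          # RoCE NICs still appear as IB devices
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")       # RoCEv2 GID index (varies by host)
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")    # permit GPUDirect RDMA paths
```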
BlueField-3 SuperNIC: The Gateway to Inter-Node Communication
As our gradient data reaches the edge of GPU A's node, it encounters the BlueField-3 SuperNIC. This advanced network interface controller plays a crucial role in managing RDMA traffic, ensuring low-latency data transfers between GPU A and its destination, GPU B.
In GB200 NVL72 systems, the BlueField-3 SuperNIC supports the InfiniBand NDR 400 standard, providing up to 400 Gbps of bandwidth. This ensures that our gradient data can begin its inter-node journey at high speed, setting the stage for efficient cross-node communication.
The BlueField-3 SuperNIC also offers additional benefits:
Integrated ARM cores for offloading networking tasks
Hardware acceleration for AI and security workloads
Advanced virtualization capabilities
These features contribute to reducing the overall system load and improving the efficiency of inter-node communication during the all-reduce operation.
5. The Long-Haul: Cross-Node Communication via InfiniBand and RDMA
With our gradient data now at the network interface of GPU A's node, it's ready to begin its journey across nodes to reach GPU B. This is where InfiniBand and RDMA technologies come into play.
InfiniBand: The Highway for Inter-Node Communication
The evolution of InfiniBand technology has significantly improved cross-node communication:
Previous-generation DGX A100 systems: InfiniBand HDR provides up to 200 Gbps of bandwidth per adapter
DGX H100, DGX GH200, and GB200 NVL72 systems: InfiniBand NDR doubles the available bandwidth to 400 Gbps
This improvement effectively halves the time required for our gradient data to travel from GPU A's node to GPU B's node, directly addressing the cross-node communication bottleneck.
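A hedged illustration of what this doubling means in wall-clock terms, using an arbitrary assumed 10 GB payload purely to make the units concrete:

```python
# Time to push an assumed 10 GB gradient payload through a single port.
payload_gb = 10
for name, gbps in [("HDR (200 Gb/s)", 200), ("NDR (400 Gb/s)", 400)]:
    seconds = payload_gb * 8 / gbps            # GB -> Gb, divided by line rate
    print(f"{name}: ~{seconds:.2f} s per 10 GB")   # ~0.40 s vs ~0.20 s
```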
GPUDirect RDMA: Ensuring Efficient Data Transfer
GPUDirect RDMA allows our gradient data to be sent directly from GPU A's memory to GPU B's memory, reducing overhead and accelerating communication.
NCCL: Orchestrating the All-Reduce Operation
As our gradient data travels between nodes, the NVIDIA Collective Communications Library (NCCL) coordinates the overall all-reduce operation. NCCL is optimized for both NVLink and InfiniBand, selecting the most efficient communication path to minimize latency.
In GB200 NVL72 systems, NCCL leverages the improved bandwidth of fifth-generation NVLink, the NVLink Switch fabric, and InfiniBand NDR to distribute gradients across nodes with reduced delay. This ensures that our gradient data, along with data from other GPUs, is efficiently combined and distributed as part of the all-reduce operation.
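In a typical training script this orchestration is invisible: wrapping the model in PyTorch's DistributedDataParallel is enough for NCCL all-reduce calls to be issued automatically during the backward pass. The following minimal sketch assumes the process group has already been initialized as in the earlier all-reduce example:

```python
# Minimal sketch: NCCL all-reduce is issued automatically during backward()
# once the model is wrapped in DistributedDataParallel. Assumes the process
# group was initialized as in the earlier all-reduce example.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model)                      # registers gradient-sync hooks backed by NCCL

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 1024, device="cuda")

loss = model(x).square().mean()
loss.backward()                         # gradients are bucketed and all-reduced here
optimizer.step()
```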
6. Traffic Control: Switching Fabric and Routing Between Nodes
As our gradient data travels from GPU A's node to GPU B's node, it needs to be efficiently routed through the network. This is where the switching fabric comes into play.
Quantum InfiniBand and Spectrum-X Switches: Handling Large-Scale Communication
The evolution of the InfiniBand switching fabric has significantly improved cross-node communication:
DGX A100-era clusters: Quantum InfiniBand switches provide HDR links with 200 Gbps of bandwidth per port
DGX H100, DGX GH200, and GB200 NVL72 clusters: Quantum-2 InfiniBand switches support NDR links with 400 Gbps per port
The next-generation Quantum-X800 promises another 2x improvement, to 800 Gb/s per port
This generation-over-generation doubling of per-port bandwidth between nodes steadily accelerates the journey of our gradient data from GPU A to GPU B.
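To give a rough sense of how switch radix translates into cluster scale, here is a simple two-tier fat-tree estimate. The 64-port radix matches NVIDIA's published Quantum-2 figure; treating every endpoint as a single NIC and assuming a non-blocking design are simplifications:

```python
# Rough two-tier (leaf/spine) fat-tree sizing with 64-port switches.
# A non-blocking leaf uses half its ports for endpoints and half for uplinks,
# so a two-tier fabric can attach roughly radix**2 / 2 endpoints.
radix = 64                               # ports per Quantum-2 class switch
endpoints = radix * radix // 2           # 2048 NICs in a non-blocking two-tier fabric
print(f"Up to ~{endpoints} 400 Gb/s endpoints behind a two-tier {radix}-port fabric")
```

Larger clusters add a third switching tier, trading extra hops for additional scale.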
7. Arriving at the Destination: Receiving Data at GPU B
As our gradient data approaches its destination, it goes through a similar process to what it experienced when leaving GPU A, but in reverse.
BlueField-3 SuperNIC: The Receiving Gateway
When our gradient data reaches GPU B's node, it's first received by the node's BlueField-3 SuperNIC. The SuperNIC then uses GPUDirect RDMA to transfer the data directly into GPU B's memory, bypassing the host CPU and system memory. This ensures rapid delivery of our gradient data to its final destination.
The BlueField-3 SuperNIC's advanced features come into play here as well, potentially using its integrated ARM cores to manage the receiving process efficiently and leveraging its hardware acceleration capabilities to optimize the data transfer.
Intra-Node Distribution at the Destination
Once our gradient data is in GPU B's memory, it may need to be shared with other GPUs within the same node. This is again handled by the NVLink Switch fabric, which efficiently distributes the received data among the node's GPUs, with fifth-generation NVLink providing the high-speed GPU-to-GPU links.
8. Putting It All Together: Unified Network Fabric in Action
Now that we've traced the journey of our gradient data from GPU A to GPU B, let's consider how all these technologies work together to create a unified network fabric.
The combination of NVLink, NVSwitch, InfiniBand, and the BlueField-3 SuperNIC creates a unified network fabric that facilitates efficient communication both within and across nodes in GB200 NVL72 systems. The increased bandwidth of fifth-generation NVLink, the NVLink Switch fabric, and InfiniBand NDR, together with the advanced capabilities of the BlueField-3 SuperNIC, further mitigates potential bottlenecks.
This unified fabric ensures that both intra-node and inter-node communication—whether between GPU A and its peers or between GPU A and GPU B—occurs rapidly and efficiently. As a result, all-reduce operations remain seamless even as clusters scale to thousands of GPUs, supporting the training of increasingly large and complex language models.
9. Conclusion: Reflecting on Our Journey from GPU A to GPU B
By tracing the path of an all-reduce operation between GPU A in NVIDIA GB200 NVL72 node 1 and GPU B in NVIDIA GB200 NVL72 node 2, we've seen how NVLink, NVSwitch, InfiniBand, RDMA, NCCL, GPUDirect, and the BlueField-3 SuperNIC work in concert to enable efficient, scalable LLM training. These technologies address critical bottlenecks in both intra-node and inter-node communication, paving the way for the development of even larger and more sophisticated language models.
Let's recap our journey through the network fabric:
Intra-Node Communication: We started with GPU A, where fifth-generation NVLink provided 1.8 TB/s of bidirectional bandwidth per GPU for sharing gradient data with other GPUs in the same NVLink domain. The NVLink Switch fabric carried this communication with roughly 130 TB/s of aggregate bandwidth across the 72-GPU domain.
Preparing for Cross-Node Travel: GPUDirect RDMA prepared the data for inter-node transfer, eliminating the need for CPU involvement and reducing latency. The BlueField-3 SuperNIC, supporting the InfiniBand NDR 400 standard, provided a 400 Gbps bandwidth gateway for our data to begin its inter-node journey.
Cross-Node Communication: As our data left GPU A's node, InfiniBand NDR technology provided a 400 Gbps bandwidth highway for efficient cross-node travel. RDMA ensured that the data could be sent directly from GPU A's memory to GPU B's memory, minimizing overhead.
Network Fabric Navigation: The Quantum-2 InfiniBand switches, with their 400 Gbps NDR links, efficiently routed our data through the network, ensuring rapid transit between nodes.
Arrival at Destination: Upon reaching GPU B's node, the process mirrored the sending operation. The BlueField-3 SuperNIC received the data and used GPUDirect RDMA to transfer it directly into GPU B's memory. The NVLink Switch fabric and fifth-generation NVLink then facilitated the distribution of this data to other GPUs within GPU B's node as necessary.
Throughout this journey, NCCL orchestrated the entire all-reduce operation, ensuring that our gradient data was efficiently combined with data from other GPUs and distributed appropriately.
This end-to-end view demonstrates how each networking technology, including the advanced BlueField-3 SuperNIC, contributes to the overall efficiency of the all-reduce operation, and by extension, to the training of trillion-parameter models.
It's the orchestration of these myriad data transfers, enabled by the technologies we've explored, that makes the training of today's most advanced AI models possible. As we look to the future, the continued evolution of this network fabric, including advancements in technologies like the BlueField SuperNIC series, will be key to unlocking the next generation of AI capabilities, potentially enabling models with tens or even hundreds of trillions of parameters.
Next, we go deeper into the fabric in Article 9, “Networking Convergence, InfiniBand, and Converged Ethernet”.