9/20. AI Supercluster: Networking Convergence, InfiniBand, and Converged Ethernet
Introduction
In Article 8, we explored the network fabric underpinning AI superclusters, focusing on technologies like NVLink, NVSwitch, InfiniBand, and RDMA. These technologies, especially in NVIDIA's DGX systems, have established a foundation for efficient, high-speed communication in AI clusters. This article examines Ethernet's position in this landscape, its evolution, convergence with high-performance computing (HPC) standards, and its potential to rival or complement InfiniBand within AI infrastructure. We'll ground this discussion in NVIDIA's networking products, exploring how they enable high-throughput, low-latency data transfer crucial for training trillion-parameter language models in AI superclusters.
Historical Context of InfiniBand and Ethernet
Why Did NVIDIA Embrace InfiniBand?
In the early 2000s, high-performance computing (HPC) and AI research faced a significant challenge: network latency. This latency limited distributed computing applications that required fast, synchronous communication across nodes. Overcoming this challenge drove the development of InfiniBand.
Latency: The Critical Bottleneck in Early Ethernet
In the early 2000s, Gigabit Ethernet (1 Gbps) was the fastest Ethernet technology in widespread deployment. While this bandwidth sufficed for general-purpose data center applications, it struggled to provide the low latency necessary for HPC workloads. Latencies in early Ethernet networks typically ranged from 30 to 100 microseconds, a consequence of Ethernet's best-effort design, CPU-bound protocol processing, and lack of advanced congestion management.
This high latency directly impacted the speed and efficiency of distributed computing tasks, such as the all-reduce operations critical to AI model training. InfiniBand, introduced in 2001 by the InfiniBand Trade Association (IBTA), was designed to remove this bottleneck. The IBTA's members have included industry giants such as Intel, IBM, Hewlett-Packard (HP), Dell, Sun Microsystems, Microsoft, Oracle, Mellanox Technologies, and Cisco.
InfiniBand's initial versions offered latencies of less than 10 microseconds, significantly outperforming Gigabit Ethernet. This reduction in communication delays enabled synchronous operations across large numbers of compute nodes, improving the scalability of AI training. InfiniBand's ability to deliver ultra-low latency and lossless data transfer set it apart.
Mellanox Technologies played a leading role in bringing RDMA (Remote Direct Memory Access) to market. A defining feature of the InfiniBand architecture, RDMA allows direct memory-to-memory data transfers between nodes without CPU intervention.
InfiniBand's sub-10-microsecond latency, enabled by RDMA, priority flow control, and hardware-level congestion management, established it as the preferred network fabric for large-scale AI workloads. In distributed AI training, where frequent and real-time synchronization of gradient data is essential, RDMA facilitated the near-instantaneous communication required to efficiently scale training across thousands of GPUs.
Mellanox's Role in Early InfiniBand Development
In the early 2000s, Mellanox introduced some of the first InfiniBand products, including switches, host channel adapters (HCAs), and network interface cards (NICs). Its early InfiniBand links ran at 2.5 Gbps per lane (single data rate). By the time NVIDIA acquired Mellanox in 2020 for $6.9 billion, the company's InfiniBand line had advanced to 200 Gbps HDR, and under NVIDIA it has continued to evolve into solutions like the Quantum-2 switches, which deliver up to 400 Gbps per link with latencies approaching 1 microsecond, significantly outpacing contemporary Ethernet.
Ethernet's Evolution: From Standard to Converged
What is Converged Ethernet and Its Benefits?
While InfiniBand emerged for low-latency networking, Ethernet technology continued to evolve. Over the past two decades, Ethernet has made significant progress, particularly with the development of Converged Ethernet. Traditional Ethernet, though known for its versatility and cost-effectiveness, faced performance bottlenecks, congestion issues, and high latency in data-intensive applications. Converged Ethernet integrates enhancements like lossless packet delivery, priority flow control, and RDMA support to make Ethernet viable for high-performance, low-latency communication in AI and HPC environments.
Is Converged Ethernet an Open Standard?
Converged Ethernet is built on open standards, including the IEEE Data Center Bridging (DCB) enhancements and the IBTA's specification for RDMA over Converged Ethernet (RoCE). This openness has driven widespread adoption and interoperability across vendors, allowing flexibility and ease of integration into a variety of network environments.
Who Pioneered Converged Ethernet?
Converged Ethernet's development involved collaboration among major networking companies, including Cisco, Broadcom, and Mellanox (now part of NVIDIA). NVIDIA is a prominent proponent of Converged Ethernet: its networking products, including Spectrum Ethernet switches and ConnectX SmartNICs, support it, enabling high-performance networking for AI workloads.
How Ethernet Fits into the AI Supercluster Network Fabric
Ethernet vs. InfiniBand for All-Reduce Operations
The all-reduce operation, a critical communication pattern in distributed training, illustrates the role of Ethernet in AI superclusters. InfiniBand has traditionally excelled in such operations due to its ultra-low latency and deterministic performance. NVIDIA's Quantum-2 InfiniBand switches, featured in DGX H100 and GH200 systems, support direct memory-to-memory data transfers at 400 Gbps with sub-1-microsecond latency, which is particularly beneficial for training large language models.
However, Converged Ethernet, particularly with the support of RoCE, has made significant advancements. NVIDIA's Spectrum switches and ConnectX SmartNICs provide the bandwidth and latency characteristics required to handle the massive data flows of all-reduce operations. Modern Converged Ethernet can achieve latencies as low as 2-3 microseconds, making it suitable for large-scale gradient synchronization, though InfiniBand still maintains an edge in tightly-coupled HPC applications.
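To make the all-reduce pattern concrete, the minimal sketch below uses PyTorch's distributed package with the NCCL backend, which rides on InfiniBand verbs or RoCE transparently depending on the fabric NCCL detects at startup. The tensor size and launch details are illustrative assumptions, not a configuration taken from any particular DGX deployment.

```python
# Minimal sketch: gradient all-reduce with PyTorch's NCCL backend.
# NCCL picks the underlying transport (InfiniBand verbs or RoCE) at
# runtime; the application code is identical on either fabric.
import os

import torch
import torch.distributed as dist


def main():
    # A launcher such as torchrun sets RANK, WORLD_SIZE, LOCAL_RANK,
    # MASTER_ADDR, and MASTER_PORT for each process.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in for a gradient shard produced by the backward pass
    # (size chosen arbitrarily for illustration: ~256 MB of fp32).
    grad = torch.randn(64 * 1024 * 1024, device="cuda")

    # Sum the shard across every rank in the cluster, then average.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun across the nodes of a cluster, the same script runs unchanged whether the inter-node links are Quantum-2 InfiniBand or Spectrum-X Ethernet carrying RoCE traffic; only the measured latency and bus bandwidth differ.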
How Does Converged Ethernet Support Efficient Cross-Node Communication?
Converged Ethernet, using technologies like RoCE, facilitates direct GPU-to-GPU communication similarly to InfiniBand's RDMA capabilities. In NVIDIA's Spectrum-X Ethernet platform, ConnectX SmartNICs support RoCEv2, allowing for zero-copy, kernel-bypass communication. This reduces CPU involvement, minimizing latency and maximizing throughput. For cross-node communication in superclusters, RoCE over Converged Ethernet offers a viable alternative to InfiniBand, especially when considering cost, interoperability, and network flexibility. This capability is particularly relevant for scaling out the training of large language models across multiple nodes in an AI supercluster.
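As a rough illustration of how an engineer points NCCL at the RDMA path, the sketch below sets a few documented NCCL environment variables before the process group is created. The device and interface names (mlx5_0, eth0) and the GID index are placeholders; the correct values depend on the cluster's NICs and fabric, so treat this as a hedged example rather than a recommended configuration.

```python
# Hedged sketch: steering NCCL toward its RDMA transport before any
# communicators are created (these must be set before
# torch.distributed.init_process_group is called in this process).
# Device names and the GID index below are placeholder assumptions.
import os

# RDMA-capable ConnectX device(s) to use; the same knob applies to both
# native InfiniBand and RoCE, since both are exposed via the verbs API.
os.environ["NCCL_IB_HCA"] = "mlx5_0"

# For RoCEv2, select the GID index that maps to the port's RoCEv2/IPv4
# address (commonly 3 on NVIDIA/Mellanox NICs, but verify on your nodes,
# e.g. with the show_gids script shipped with the NIC drivers, or sysfs).
os.environ["NCCL_IB_GID_INDEX"] = "3"

# Network interface used for NCCL's TCP bootstrap/out-of-band exchange.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# Print transport selection at startup; "NET/IB" lines in the log confirm
# that RDMA (InfiniBand or RoCE) is in use rather than plain TCP sockets.
os.environ["NCCL_DEBUG"] = "INFO"
```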
Impact of Ethernet's Scalability and Flexibility on AI Clusters
Ethernet's flexibility in network design makes it an attractive option for large AI clusters. While InfiniBand excels in dedicated HPC environments, Ethernet's versatility allows it to fit into heterogeneous settings, supporting both HPC and general-purpose workloads. This flexibility is particularly advantageous in research labs and hyperscale data centers where workloads vary widely.
Comparing Converged Ethernet and InfiniBand
Benefits and Trade-Offs: Converged Ethernet vs. InfiniBand
Bandwidth and Latency: InfiniBand leads in raw latency, offering around 1 microsecond with Quantum-2 switches, while Converged Ethernet achieves 2-3 microseconds using RoCEv2. Both, however, provide comparable bandwidths of up to 400 Gbps. The choice between the two can significantly affect the training time of large language models, where each microsecond of latency is paid again and again across the enormous number of collective operations in a training run (a rough back-of-the-envelope estimate follows this comparison).
Scalability and Flexibility: Ethernet's openness and support for various topologies provide a flexibility advantage. InfiniBand, optimized for HPC, often requires a specialized infrastructure. Ethernet's compatibility with existing networks simplifies integration, which can be a crucial factor when scaling up AI infrastructure.
Cost Considerations: Ethernet components, such as Spectrum switches and ConnectX SmartNICs, are generally more cost-effective and widely available than their InfiniBand counterparts. This cost-efficiency, combined with Ethernet's flexibility, makes it appealing for organizations building or expanding AI superclusters, potentially allowing for larger deployments within a given budget.
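To put those latency figures in perspective, here is a deliberately naive back-of-the-envelope estimate. Every number in it, the step count, the collectives per step, and the latency-bound stages per collective, is an illustrative assumption rather than a measurement, and real frameworks overlap much of this communication with compute.

```python
# Naive estimate of cumulative network latency over a training run.
# All inputs are illustrative assumptions; treat the result as an
# order-of-magnitude intuition, not a performance model.

training_steps = 500_000        # optimizer steps for a large model (assumed)
collectives_per_step = 100      # all-reduce/all-gather calls per step (assumed)
latency_stages = 1_000          # latency-bound stages per collective; a ring
                                # all-reduce over ~500 ranks pays roughly
                                # 2 * (ranks - 1) such stages (assumed)


def latency_hours(per_stage_us: float) -> float:
    """Total wall-clock hours attributable to per-stage network latency."""
    total_us = training_steps * collectives_per_step * latency_stages * per_stage_us
    return total_us / 1e6 / 3600


ib = latency_hours(1.0)     # ~1 microsecond per stage (InfiniBand-class)
roce = latency_hours(2.5)   # ~2.5 microseconds per stage (RoCE-class)
print(f"InfiniBand-class: ~{ib:.0f} h, RoCE-class: ~{roce:.0f} h, "
      f"difference: ~{roce - ib:.0f} h")
```

Under these assumptions the gap is on the order of a day of wall-clock time, which is why latency-sensitive deployments still lean toward InfiniBand even when the two fabrics offer equal bandwidth.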
How Does RoCE (pronounced “Rocky”) Compare to InfiniBand RDMA?
RoCE (RDMA over Converged Ethernet) replicates InfiniBand's RDMA functionality over Ethernet networks, allowing direct memory access across nodes and reducing latency in data transfers.
NVIDIA's ConnectX SmartNICs fully support RoCEv2, enabling high-throughput, low-latency communication suitable for AI workloads. InfiniBand RDMA still provides a more predictable latency profile, which can be crucial in tightly-coupled, high-performance applications such as distributed training of large language models.
Ultra Ethernet Consortium (UEC) and the Future of Ethernet
What is the UEC and Its Vision for Ultra Ethernet Transport (UET)?
The Ultra Ethernet Consortium (UEC) aims to advance Ethernet's capabilities to support the growing demands of AI and HPC. Ultra Ethernet Transport (UET) is envisioned to enhance Ethernet's bandwidth, latency, and congestion management to create a fabric capable of supporting future AI models with tens or even hundreds of trillions of parameters. UEC members include NVIDIA, AMD, Broadcom, Cisco, Meta, Microsoft, Google, and others.
UET introduces advanced congestion control, enhanced RDMA capabilities, and higher port densities in Ethernet fabrics. For AI superclusters, UET represents an Ethernet network fabric that approaches InfiniBand's performance while offering greater flexibility and cost-effectiveness. These advancements could potentially reduce the performance gap between Ethernet and InfiniBand, making Ethernet an increasingly viable option for high-performance AI workloads.
Practical Considerations for AI Infrastructure Engineers
Roles of ConnectX SmartNICs and BlueField DPUs in InfiniBand and Converged Ethernet Networks
NVIDIA's ConnectX SmartNICs and BlueField Data Processing Units (DPUs) enhance the performance of both InfiniBand and Converged Ethernet networks within AI superclusters. These components facilitate high-speed data transfers, reduce CPU load, and offload network-intensive tasks, providing flexible support for the specific requirements of AI workloads.
ConnectX SmartNICs: Bridging InfiniBand and Converged Ethernet
ConnectX SmartNICs support both InfiniBand and Converged Ethernet protocols, providing a unified solution for AI infrastructure engineers. With support for port speeds up to 400 Gbps, these SmartNICs enable high-throughput, low-latency communication crucial for distributed AI training across large GPU clusters.
InfiniBand Support: In InfiniBand mode, ConnectX SmartNICs leverage native Remote Direct Memory Access (RDMA) capabilities, allowing direct GPU-to-GPU memory transfers without CPU involvement and achieving sub-1-microsecond latencies. This direct access is essential for tightly coupled applications and real-time data synchronization in AI superclusters.
Converged Ethernet Support: For Converged Ethernet environments, ConnectX SmartNICs utilize RDMA over Converged Ethernet (RoCE) to replicate InfiniBand's low-latency, high-bandwidth characteristics. RoCE enables zero-copy, kernel-bypass communication, reducing latency to as low as 2-3 microseconds. By supporting features like priority flow control and congestion management, ConnectX SmartNICs ensure smooth, lossless data transfer across Ethernet networks, making them suitable for large-scale AI training and inference tasks.
In both modes, ConnectX SmartNICs offload critical network processing tasks, reducing CPU load and freeing system resources for AI computation. This offloading is particularly beneficial in environments where network congestion or CPU overhead could otherwise hinder performance.
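For engineers who want to confirm which mode a ConnectX port on a given node is operating in, one quick check (assuming a Linux host with the RDMA/verbs stack loaded, which exposes adapters under /sys/class/infiniband) is to read each port's link_layer attribute. The sketch below is a minimal illustration, not a supported NVIDIA tool.

```python
# Hedged sketch: list RDMA-capable adapters on a Linux node and report
# whether each port runs as InfiniBand or Ethernet (the RoCE case).
# Assumes the NIC drivers are loaded so devices appear under
# /sys/class/infiniband; paths may differ on other systems.
from pathlib import Path

RDMA_ROOT = Path("/sys/class/infiniband")


def list_rdma_ports() -> None:
    if not RDMA_ROOT.exists():
        print("No RDMA devices found (is the verbs stack loaded?)")
        return
    for dev in sorted(RDMA_ROOT.iterdir()):
        for port in sorted((dev / "ports").iterdir()):
            link_layer = (port / "link_layer").read_text().strip()
            state = (port / "state").read_text().strip()
            print(f"{dev.name} port {port.name}: "
                  f"link_layer={link_layer}, state={state}")


if __name__ == "__main__":
    list_rdma_ports()
```

A port reporting Ethernet is the RoCE configuration, while InfiniBand indicates native InfiniBand mode; the ibv_devinfo utility from rdma-core surfaces the same information.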
BlueField DPUs: Intelligent Controllers for Advanced Network Offloading
BlueField DPUs support both InfiniBand and Converged Ethernet protocols, providing seamless interoperability across different network fabrics, and they extend the functionality of ConnectX SmartNICs by acting as intelligent controllers for network traffic. These DPUs integrate advanced processing capabilities directly onto the network interface card, enabling a range of functions that enhance the performance and security of both InfiniBand and Converged Ethernet networks.
In InfiniBand environments, BlueField DPUs leverage hardware-accelerated RDMA to minimize latency and maximize throughput, ensuring rapid exchange of data between GPUs. In Converged Ethernet networks, they utilize RoCE to facilitate high-speed data transfers, incorporating advanced congestion control to maintain performance even under heavy loads.
Unifying Network Fabrics with NVIDIA's SmartNICs and DPUs
By supporting both InfiniBand and Converged Ethernet protocols, ConnectX SmartNICs and BlueField DPUs enable a unified networking fabric within AI superclusters. This dual support allows AI infrastructure engineers to design networks that balance the ultra-low latency of InfiniBand with the flexibility and interoperability of Converged Ethernet. Together, these products facilitate seamless data transfer, offload CPU-intensive tasks, and maintain high performance across diverse AI workloads.
NVIDIA's Spectrum-X and InfiniBand Quantum Switches: Parallel Product Lines Supporting Converged Ethernet and InfiniBand
NVIDIA's networking products are designed to operate seamlessly within diverse data center environments. Recognizing that AI workloads have varying network requirements, NVIDIA has developed two parallel switch product lines: Spectrum-X for Converged Ethernet and InfiniBand Quantum for traditional InfiniBand networks. These product lines offer comparable levels of performance, scalability, and flexibility, enabling AI infrastructure engineers to choose the networking fabric best suited to their specific needs.
Spectrum-X Ethernet Switches: High Bandwidth and Low Latency for Converged Ethernet
NVIDIA's Spectrum-X family of Ethernet switches represents an advanced solution for modern data centers requiring high-bandwidth, low-latency networking. With support for speeds up to 400 Gbps per port, Spectrum-X switches are designed for data-intensive AI applications, facilitating massive data flows while maintaining low latencies, which are critical for tasks like all-reduce operations in AI training.
A key feature of the Spectrum-X platform is its support for Remote Direct Memory Access over Converged Ethernet (RoCE). By leveraging RoCE, Spectrum-X switches allow data to bypass the CPU during transfers, reducing processing overhead, minimizing latency, and maximizing throughput. This capability is crucial in large AI clusters where direct GPU-to-GPU communication across nodes is necessary for rapid gradient synchronization and model training.
Additionally, Spectrum-X switches support advanced networking functions such as congestion management, adaptive routing, telemetry, and priority flow control, further enhancing their suitability for complex AI workloads. These features make Spectrum-X switches a versatile choice for data centers that require a flexible network fabric capable of scaling with both AI and general-purpose computing tasks.
InfiniBand Quantum Switches: Ultra-Low Latency and High Throughput for Traditional InfiniBand Networks
In parallel with its Spectrum-X line, NVIDIA offers the InfiniBand Quantum switch family, optimized for ultra-low latency and high-throughput networking in HPC and AI environments. The latest in this line, Quantum-2 switches, provide port speeds of up to 400 Gbps, matching the bandwidth capabilities of the Spectrum-X Ethernet switches. Quantum switches are designed to fully leverage InfiniBand's native RDMA capabilities, achieving sub-1-microsecond latencies for extremely tight coupling in distributed training workloads.
NVIDIA's InfiniBand switches incorporate advanced features like hardware-level congestion management, end-to-end Quality of Service (QoS), and Adaptive Routing, ensuring efficient data transfer even under high-load conditions. These switches excel in environments where deterministic, low-latency communication is paramount, such as in large-scale AI superclusters running tightly-coupled applications.
Product Parallels: Spectrum-X and InfiniBand Quantum
NVIDIA's Ethernet and InfiniBand switch offerings parallel each other in several key respects:
Bandwidth: Both Spectrum-X and InfiniBand Quantum switches offer similar maximum port speeds of up to 400 Gbps, ensuring that engineers can achieve high throughput regardless of the protocol they choose.
Latency: While Spectrum-X Ethernet switches, with the use of RoCE, achieve latencies as low as 2-3 microseconds, Quantum-2 InfiniBand switches provide even lower latencies, approaching the sub-1-microsecond mark. This comparison helps engineers select the appropriate switch depending on whether ultra-low latency or more generalized flexibility is the priority.
RDMA Support: Spectrum-X switches support RDMA over Converged Ethernet (RoCE), allowing direct memory access similar to InfiniBand's native RDMA. In contrast, InfiniBand Quantum switches offer native RDMA with hardware acceleration, delivering the lowest possible latency for RDMA operations. This difference may influence the choice of switch depending on the workload's sensitivity to latency and CPU involvement in data transfers.
NVIDIA Invests in Both Technologies
NVIDIA's development of both Spectrum-X and InfiniBand Quantum product lines indicates the company's recognition that both Converged Ethernet and InfiniBand have important roles to play in AI infrastructure. While InfiniBand continues to lead in environments where ultra-low latency and deterministic performance are crucial, Converged Ethernet has evolved into a strong alternative for applications that benefit from its flexibility, cost-effectiveness, and wide interoperability.
By developing these parallel product lines, NVIDIA continues to advance both Ethernet and InfiniBand networking technologies. This approach provides AI infrastructure engineers with a range of tools, allowing them to architect network fabrics tailored to their specific workloads and performance requirements. Whether an AI cluster prioritizes the flexibility and broader ecosystem support of Ethernet or the ultra-low-latency performance of InfiniBand, NVIDIA's portfolio of switches is designed to support these needs at high levels of throughput and efficiency.
How to Choose: Spectrum-X or InfiniBand Quantum?
For AI infrastructure engineers, the choice between Spectrum-X and InfiniBand Quantum often comes down to workload characteristics and infrastructure priorities. If the primary concern is driving latency to the lowest possible levels, InfiniBand Quantum switches are usually the preferred choice. Conversely, if flexibility, integration with existing Ethernet networks, or support for a wider variety of traffic types is the priority, Spectrum-X switches provide a competitive solution. Both product lines offer comparable bandwidth, so the decision hinges on latency requirements, networking protocol compatibility, and deployment preferences.
By developing both technologies, NVIDIA aims to provide networking solutions for various AI supercluster configurations, addressing the diverse needs of modern AI workloads. This approach reflects an understanding that there is no one-size-fits-all solution in AI networking, and provides AI infrastructure engineers with options to build scalable, efficient, and high-performance AI networks.
Conclusion: Ethernet in the Unified Network Fabric
As AI superclusters evolve, Ethernet—particularly in its converged form—plays an increasingly important role. While InfiniBand retains its lead in certain performance aspects, Ethernet's improvements in latency, bandwidth, and cost-effectiveness make it a strong contender. NVIDIA's Spectrum-X Ethernet platform, along with technologies like RoCE and BlueField DPUs, illustrates Ethernet's potential to complement or, in some cases, rival InfiniBand in AI supercluster environments.
The convergence of Ethernet technologies, ongoing innovations from the UEC, and products like NVIDIA's Spectrum-X contribute to Ethernet's position as an option for future AI infrastructure. As AI models grow larger and demand more efficient, flexible networking, Ethernet's role in the network fabric is likely to become more prominent. This evolution in networking technology will play a crucial role in enabling the next generation of AI superclusters, capable of training and deploying increasingly complex models with trillions of parameters.
Looking ahead, the continued development of both Ethernet and InfiniBand technologies will be critical in meeting the escalating demands of AI workloads. As we move towards exascale computing and beyond, the choice of networking fabric will remain a key consideration for AI infrastructure engineers, balancing the need for ultra-low latency, high bandwidth, scalability, and cost-effectiveness. The next article in this series, Article 10, "Overcoming Communications Bottlenecks," will explore how these networking technologies address the bandwidth and latency challenges posed by large-scale AI training and inference workloads.