4/20. AI Supercluster: NVIDIA DGX Platform
Introduction
The NVIDIA DGX H100 is a high-performance, 8-GPU system designed to accelerate AI computing at scale. This article provides a comprehensive overview of the DGX H100's hardware architecture and its implications for AI workloads. We examine the system's core hardware components, networking architecture, and capabilities, from a standalone node to deployment in large-scale AI clusters. Note: while this article focuses on the DGX H100, the high-level architecture carries over to its successors, the DGX H200 and DGX B200.
Table of Contents
What is the DGX H100?
Cost
Unboxing the DGX H100
Physical Characteristics
Core Components
Memory and Storage Hierarchy
DGX H100 Internal Fabric
Unboxing the Box: Inside Physical Layout
Reliability, Availability, and Serviceability (RAS)
Energy Requirements: Power, Efficiency, and Cooling
Scalability and Cluster Architecture
Rent or Buy?
Conclusion
What is the DGX H100?
The DGX H100 integrates 8 NVIDIA H100 GPUs, 2 CPUs, networking hardware, and storage into a 4U rackmount chassis. It is optimized for AI model training, inference, and deployment, and serves as a single node in larger supercomputing clusters.
Cost: ~$200K
Price varies. One source says:
Base price: $199,000
Support (1 year): Approximately $39,800 (20% of base price)
Total with 1-year support: Approximately $238,800.
That’s the list price. Some discount is typically applied.
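For reference, here is a minimal Python sketch that reproduces the support arithmetic above; the base price and the 20% support rate are simply the figures quoted by that source, not official pricing.

```python
# Reproduce the quoted pricing arithmetic (figures from the source cited above).
base_price = 199_000          # USD, quoted base price
support_rate = 0.20           # 1 year of support, quoted as 20% of base price

support = base_price * support_rate
total = base_price + support

print(f"Support (1 year):          ${support:,.0f}")   # ~$39,800
print(f"Total with 1-year support: ${total:,.0f}")      # ~$238,800
```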
Unboxing the DGX H100
The system arrives in a sturdy wooden crate, wrapped in protective packaging material, packed securely to withstand long journeys, even across international shipping routes.
Once uncrated, you'll see a sleek, industrial-grade black chassis ready for integration into a standard 19-inch data center rack. Weighing in at 271.5 lbs (123.2 kg), it's designed for immediate deployment into data center infrastructure; the full dimensions are listed under Physical Characteristics below.
Physical Characteristics
Form Factor: 4U rackmount chassis
Dimensions: 19" (W) x 7" (H) x 32.8" (D) (482.6 mm x 178 mm x 833.1 mm)
Weight: Approximately 271.5 lbs (123.2 kg)
Core Components
GPUs
8 NVIDIA H100 GPUs
80 GB HBM3 memory per GPU (3 TB/s memory bandwidth)
Up to 1,000 TFLOPS FP8 performance
18 NVLink 4.0 ports per GPU, providing aggregate 900 GB/s bidirectional bandwidth for intra-node GPU communication.
CPUs
2 Intel Xeon Platinum 8480C processors, 56 cores each (112 cores total), 3.8 GHz max boost clock
105 MB L3 cache per CPU
The CPUs manage system-level tasks such as data orchestration and I/O operations and serve as a shared resource pool across all GPUs.
Memory & Storage
80 GB of HBM3 memory (on each H100 GPU)
2 TB DDR5-4800 ECC memory
30 TB NVMe SSD storage (8x 3.84 TB NVMe SSDs)
Networking
4 NVIDIA ConnectX-7 VPI adapters, each featuring dual 400 Gb/s ports, supporting both InfiniBand and Ethernet protocols.
2 NVIDIA NVSwitch devices for low-latency, intra-node GPU communication. Each NVSwitch supports up to 18 NVLink 4.0 connections.
Power Supply
4x 3000W redundant power supplies
200-240V AC input voltage
Maximum power consumption: 10.2 kW
Memory and Storage Hierarchy
The DGX H100 system implements a tiered memory and storage structure:
HBM3: Fast, on-chip memory for active computations. Each H100 GPU contains 80 GB of HBM3 memory with 3 TB/s bandwidth, which is essential for handling data-intensive AI computations.
DDR5 system memory: Acts as a high-speed intermediary, staging data as it moves between NVMe storage and the GPUs' HBM3.
NVMe storage: Provides the capacity to hold entire models and datasets, with portions swapped in and out of HBM3 as needed (sketched in code after this list).
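The following sketch stages a stand-in weight tensor from NVMe, through CPU (DDR5) memory, into GPU HBM3 and back. PyTorch is used purely as an illustration and is not implied by the DGX hardware itself; the file path and tensor size are arbitrary placeholders, and a CUDA-capable GPU plus local scratch space are assumed.

```python
import torch

# Illustrative walk through the three tiers described above.
# Assumptions (mine): a CUDA-capable GPU and writable local NVMe scratch space.
ckpt_path = "weights.pt"                 # hypothetical file on the NVMe tier

# Stand-in for model weights, written to the NVMe tier first.
torch.save(torch.randn(1024, 1024), ckpt_path)

# NVMe -> DDR5: load the checkpoint into CPU system memory.
cpu_weights = torch.load(ckpt_path, map_location="cpu")

# DDR5 -> HBM3: copy into the GPU's on-package memory for computation.
gpu_weights = cpu_weights.to("cuda")

# Compute against HBM3-resident data (placeholder operation).
print(gpu_weights.sum().item())

# HBM3 -> DDR5 (-> NVMe): evict when HBM3 capacity is needed elsewhere.
torch.save(gpu_weights.to("cpu"), ckpt_path)
```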
DGX H100 Internal Fabric
Intra-node Fabric: NVLink, NVSwitch, PCIe Bus
NVLink
Each H100 GPU is equipped with 18 NVLink 4.0 ports, with each link providing 50 GB/s of bidirectional bandwidth (25 GB/s in each direction). The total aggregate bandwidth of 900 GB/s is distributed across direct GPU-to-GPU connections and NVSwitch connections.
Specifically (tallied in the sketch after this list):
14 NVLink ports for direct GPU-to-GPU connections with the other 7 GPUs in the node, with each GPU pair connected via 2 NVLinks, providing 100 GB/s of bidirectional bandwidth per pair (2 links × 50 GB/s per link).
4 NVLink ports connect each H100 to the two NVSwitch devices in the system, with 2 NVLinks per NVSwitch (100 GB/s per switch).
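The constants in the short Python sketch below simply restate the figures above, so it reproduces the per-GPU link budget rather than querying real hardware.

```python
# Per-GPU NVLink budget, using the figures described above.
LINK_BW_GBPS = 50           # bidirectional GB/s per NVLink 4.0 link
PEER_GPUS = 7               # other GPUs in the node
LINKS_PER_PEER = 2          # direct links to each peer GPU
NVSWITCHES = 2              # NVSwitch devices in the system
LINKS_PER_SWITCH = 2        # links from each GPU to each NVSwitch

direct_links = PEER_GPUS * LINKS_PER_PEER         # 14
switch_links = NVSWITCHES * LINKS_PER_SWITCH      # 4
total_links = direct_links + switch_links         # 18

print(f"Links per GPU:        {total_links}")
print(f"Per-pair bandwidth:   {LINKS_PER_PEER * LINK_BW_GBPS} GB/s")
print(f"Per-switch bandwidth: {LINKS_PER_SWITCH * LINK_BW_GBPS} GB/s")
print(f"Aggregate per GPU:    {total_links * LINK_BW_GBPS} GB/s")   # 900 GB/s
```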
NVSwitch
Two NVSwitch devices are integrated within the DGX H100 system. While NVLink handles direct intra-node GPU-to-GPU communication, the NVSwitch acts as a crossbar switch, allowing any GPU to communicate with any other GPU in the node when direct NVLink paths are fully utilized or indirect routing is necessary.
PCIe Bus
While intra-node communication primarily uses NVLink, the PCIe Gen5 bus is critical for connecting GPUs to the CPU, peripherals, and storage, such as the NVMe SSDs. PCIe Gen5 x16 offers 128 GB/s of bidirectional bandwidth.
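The 128 GB/s figure can be derived from the per-lane signaling rate. The back-of-the-envelope sketch below accounts only for the 128b/130b line encoding and ignores other protocol overhead, which is why it lands slightly below the commonly rounded number.

```python
# Back-of-the-envelope PCIe Gen5 x16 bandwidth derivation.
GT_PER_S_PER_LANE = 32        # PCIe Gen5 raw signaling rate per lane (GT/s)
ENCODING = 128 / 130          # 128b/130b line-encoding efficiency
LANES = 16

per_lane_gbytes = GT_PER_S_PER_LANE * ENCODING / 8     # GB/s, one direction
one_direction = per_lane_gbytes * LANES                # ~63 GB/s
bidirectional = 2 * one_direction                      # ~126 GB/s, quoted as 128

print(f"Per direction: {one_direction:.1f} GB/s")
print(f"Bidirectional: {bidirectional:.1f} GB/s")
```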
GPU-to-GPU vs. GPU-to-CPU Bandwidth (Intra-node)
For AI model training, particularly for trillion-parameter models, the NVLink bandwidth is well-matched to the needs of GPU-to-GPU communication. The PCIe bandwidth for GPU-to-CPU communication is generally not a bottleneck, since the GPUs are the primary workhorses for AI computations. The workload for GPU-to-CPU interactions (like data loading or orchestration) does not require the same intense bandwidth as GPU-to-GPU synchronization.
GPU-to-GPU Communication (Intra-node)
For workloads such as training trillion-parameter models, GPU-to-GPU communication is critical for synchronizing model parameters and gradients. AI training involves frequent all-reduce operations, where GPUs exchange intermediate results (e.g., gradients) across the entire node. The 100 GB/s bandwidth available between any two GPUs in the DGX H100 (via NVLink 4.0) is well-suited to handle the intensive data exchange required by such large models. Additionally, the low-latency nature of NVLink ensures minimal delays during these operations.
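As a concrete illustration of this pattern, here is a minimal single-node, 8-GPU gradient all-reduce written with PyTorch's NCCL backend, which routes intra-node traffic over NVLink/NVSwitch automatically. PyTorch, the torchrun launch, and the synthetic gradient tensor are all my assumptions for the sketch; nothing here is DGX-specific software.

```python
import os
import torch
import torch.distributed as dist

# Minimal single-node all-reduce sketch. Assumed launch (one process per GPU):
#   torchrun --nproc_per_node=8 allreduce_demo.py

def main():
    dist.init_process_group(backend="nccl")      # NCCL handles the GPU transport
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Stand-in for a gradient shard: ~1 GiB of fp32 values per GPU.
    grad = torch.full((256 * 1024 * 1024,), float(local_rank), device="cuda")

    # Sum gradients across all 8 GPUs; NCCL moves the data over NVLink/NVSwitch.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: value after all-reduce = {grad[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```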
GPU-to-CPU Communication (Intra-node)
In contrast, GPU-to-CPU communication is less frequent in typical AI training workloads. The CPUs in a DGX H100 primarily handle data orchestration, I/O management, and coordination tasks, while the GPUs handle the bulk of the computational workload. The 128 GB/s bandwidth offered by the PCIe Bus is significantly lower than the total aggregate 900 GB/s GPU-to-GPU bandwidth provided by NVLink. However, this is generally not a bottleneck for AI training because:
Most training data and operations are GPU-bound, and the CPUs are not heavily involved in the core mathematical computations or parameter updates that require high-bandwidth communication.
The CPUs primarily manage auxiliary tasks such as data preprocessing or moving batches into GPU memory, which do not demand the same level of bandwidth as GPU-to-GPU operations (a short sketch of this host-to-device path follows).
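The sketch below illustrates that host-side role under simple assumptions (one GPU, a synthetic in-memory dataset, PyTorch as the illustrative framework): a DataLoader prepares batches in pinned CPU memory, and non-blocking copies move them over PCIe into GPU memory while the GPU does the heavy computation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic dataset standing in for real training data (assumption for the sketch).
dataset = TensorDataset(torch.randn(10_000, 1024), torch.randint(0, 10, (10_000,)))

# CPU workers handle batching/preprocessing; pin_memory=True stages batches in
# page-locked host RAM so the PCIe copy to the GPU can run asynchronously.
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

device = torch.device("cuda")
for inputs, labels in loader:
    # Host (DDR5) -> GPU (HBM3) transfer over the PCIe Gen5 link.
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... the forward/backward pass would run here, entirely on the GPU ...
```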
Inter-node Fabric: NVSwitch, PCIe, and ConnectX-7 NICs
NVSwitch and PCIe
To bridge the internal (intra-node) high-speed network to the external network infrastructure, GPUs connect via NVLink to the NVSwitches; data flows from NVLink to NVSwitch, then over PCIe to the ConnectX-7 NICs, which connect to the external networking fabric. The ConnectX-7 NICs manage all inter-node data exchanges, routing GPU traffic from within the node to the external network.
ConnectX-7 NICs
Each DGX H100 is equipped with 4 ConnectX-7 VPI adapters (NICs), each featuring dual 400 Gb/s ports. VPI stands for Virtual Protocol Interconnect, a technology that allows ConnectX-7 NICs to support both InfiniBand and Ethernet protocols on the same adapter. You can dynamically choose between InfiniBand and Ethernet without requiring separate hardware for each protocol.
Two ConnectX-7 NICs are connected to each of the two NVSwitches in the DGX H100 system, providing 800 Gb/s of egress bandwidth per switch. This configuration delivers an aggregate inter-node communication bandwidth of 1.6 Tb/s.
When training trillion-parameter models across multiple DGX systems, distributed operations such as gradient synchronization, model state updates, and all-reduce collectives traverse the ConnectX-7 NICs between nodes.
The aggregate 1.6 Tb/s inter-node bandwidth is generally well-matched for training large models, especially when combined with RDMA (Remote Direct Memory Access), a feature that allows direct GPU to GPU communication between nodes. However, as the number of nodes and GPUs grows, network congestion and latency can become limiting factors, making advanced network topologies and congestion-aware routing strategies important for scaling further. (To be covered in future articles.)
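In code, scaling the earlier single-node all-reduce sketch across nodes mostly changes the launcher arguments and, optionally, environment variables that steer NCCL onto the InfiniBand adapters. The two-node layout, hostnames, and interface names in the comments below are placeholders; NCCL_IB_HCA, NCCL_SOCKET_IFNAME, and NCCL_DEBUG are standard NCCL settings rather than DGX-specific ones, and PyTorch remains my illustrative choice.

```python
import os
import torch
import torch.distributed as dist

# Hypothetical two-node, 16-GPU job. Assumed launch on each node:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node-ip>:29500 train.py
# Optional NCCL hints (placeholder device/interface names):
#   NCCL_IB_HCA=mlx5  NCCL_SOCKET_IFNAME=eth0  NCCL_DEBUG=INFO

def main():
    dist.init_process_group(backend="nccl")      # NCCL uses RDMA over InfiniBand
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Gradient stand-in: inter-node hops cross the ConnectX-7 fabric,
    # intra-node hops stay on NVLink/NVSwitch.
    grad = torch.ones(64 * 1024 * 1024, device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print(f"world size = {dist.get_world_size()}, sum = {grad[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```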
Unboxing the Box: Inside Physical Layout
Baseboard: The Heart of the System
At the core of the DGX H100 is its baseboard, which acts as the central hub connecting several critical components.
Dual Intel Xeon CPUs: These two powerful processors coordinate system-level tasks and offload some computation, helping manage and schedule the workload across all GPUs.
System RAM Slots: Surrounding the CPUs, you'll find slots for high-speed system memory. This RAM is critical for handling the data passed between the CPUs and GPUs.
NVMe SSD Connectors: Alongside the CPUs and RAM, you'll see connectors for the NVMe SSDs housed at the front of the chassis.
PCIe Slots for ConnectX-7 NICs: This is where the ConnectX-7 NICs are slotted in. The PCIe Gen5 connections here provide up to 128 GB/s of bandwidth per slot, ensuring the NICs can handle the vast amounts of data required for inter-node communication in a clustered environment.
GPU Baseboards: Where the Real Processing Power Lives
Moving towards the center of the system, we encounter the GPU baseboards, the powerhouses of the DGX H100.
H100 GPUs: Spread across these baseboards are 8 H100 GPUs. These GPUs are arranged across multiple baseboards, each of which typically holds 2 to 4 GPUs. This modular configuration allows for optimal cooling and power distribution.
Intra-node GPU Communication: Each GPU is connected via NVLink 4.0 for fast, low-latency data transfers, using a total of 14 NVLinks per GPU to connect directly to the other GPUs, with an additional 4 NVLinks connecting to the NVSwitches for more complex communication paths.
NVSwitch Board: The Data Highway
The NVSwitch board sits next to the GPU baseboards, forming the data routing hub that connects the GPUs internally and to external components via PCIe. Each of the two NVSwitch chips, roughly 2-3 inches across, is mounted on this board, and multiple NVLink traces run from the chips to each of the eight GPUs in the DGX H100 system.
Power Distribution Board: Keeping Everything Running
In the rear section, you'll find the power distribution board.
Redundant Power Supplies: The system is designed with redundant power supplies, ensuring that if one power unit fails, the others can continue to power the system, minimizing downtime.
Optimized Power Delivery: The power distribution board efficiently manages power to each component, ensuring that the GPUs, CPUs, and storage devices receive adequate power without any drop-offs, even under heavy computational loads.
Internal Layout: Cooling and Performance
The DGX H100 system is designed with a focus on both performance and efficient cooling to manage the significant heat generated by its high-performance components. The internal layout is organized into three key sections, each serving a role in maintaining optimal airflow and cooling efficiency.
In the front section, NVMe drives are positioned for easy access while also supporting airflow throughout the system. The front-mounted fans play a critical role in drawing cool air into the chassis, forming the first layer of the air-cooled system.
The middle section houses the GPUs and NVSwitches, which are the primary sources of heat within the system. This area is equipped with a liquid cooling distribution system to handle the substantial thermal output of the GPUs, ensuring sustained performance and preventing thermal throttling.
In the rear section, the dual Intel Xeon CPUs and memory are located away from the heat-intensive GPUs, benefiting from additional airflow. This section also contains the ConnectX-7 network adapters and redundant power supplies, with additional fans to expel hot air, completing the system's cooling architecture.
Reliability, Availability, and Serviceability (RAS)
In the next article, we will cover Reliability, Availability, and Serviceability (RAS): the features that ensure minimal downtime, proactive monitoring, and quick recovery from hardware or software issues.
For more details, see Article 5: NVIDIA DGX H100, Reliability, Availability, and Serviceability (RAS).
Energy Requirements: Power, Efficiency, and Cooling
The NVIDIA DGX H100 is designed for high-performance AI workloads and thus requires a well-structured power and cooling infrastructure. To manage the heat generated by the powerful GPUs, the system uses a hybrid cooling approach—combining liquid cooling for the GPUs and air cooling for CPUs and other components. For large deployments, external cooling solutions from third-party vendors are often integrated to ensure the system operates within optimal thermal conditions.
For more details, see Article 6: NVIDIA DGX H100, Power and Cooling requirements.
Scalability and Cluster Architecture
The DGX H100 is designed to scale both "vertically" and "horizontally". This scalability is essential for tasks ranging from intra-node performance optimization to the creation of clusters and superclusters that handle distributed workloads across multiple systems.
Vertical Scaling (Intra-node)
The DGX H100 utilizes NVLink and NVSwitch technologies to scale within a single node, enabling fast intra-node GPU communication for data-intensive tasks.
Horizontal Scaling (Inter-node)
For workloads that exceed the capabilities of a single node, multiple DGX H100 systems can be clustered together using high-bandwidth networking such as InfiniBand or Ethernet to form a distributed computing cluster.
What is a Cluster?
A cluster is a collection of interconnected computers (or nodes) that function together as a unified computational resource. In the case of the DGX H100, clusters are formed by linking multiple DGX nodes using high-speed networking (InfiniBand or Ethernet), creating a distributed environment capable of processing massive datasets and training AI models that would be impossible to handle on a single machine.
Building a DGX H100 Cluster
Building a DGX H100 cluster involves connecting multiple DGX H100 systems through high-speed networking interfaces. Each node contributes its 8 GPUs, CPUs, memory, and storage to the shared computational pool. Clusters can be scaled by adding additional nodes to handle more data and larger AI models, ensuring efficient parallel processing.
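One practical detail when assembling such a cluster is how individual GPUs are addressed: distributed launchers typically assign each GPU a global rank derived from its node and its position within the node. The sketch below shows that common convention for a hypothetical 4-node, 32-GPU cluster; the cluster size is purely illustrative.

```python
# Rank layout for a hypothetical cluster of DGX H100 nodes.
# Common launcher convention: global_rank = node_rank * gpus_per_node + local_rank
GPUS_PER_NODE = 8
NUM_NODES = 4                    # illustrative cluster size

world_size = NUM_NODES * GPUS_PER_NODE
print(f"world size: {world_size} GPUs")

for node_rank in range(NUM_NODES):
    ranks = [node_rank * GPUS_PER_NODE + local for local in range(GPUS_PER_NODE)]
    print(f"node {node_rank}: global ranks {ranks[0]}-{ranks[-1]}")
```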
Rent or Buy?
In addition to purchasing a DGX H100, organizations can access H100 GPU power through cloud service providers. For example, Microsoft Azure offers “ND_H100_v5-series” virtual machines with a configuration very similar to the DGX H100 system. While Azure does not explicitly label these instances as DGX H100, the specifications and architecture, including the use of 8 H100 GPUs, suggest a closely related design. This cloud-based option allows users to tap into H100 GPU infrastructure without the capital expenditure of acquiring physical systems.
Renting vs. Buying DGX H100 Resources
Renting: Cloud-based solutions like Azure's ND_H100_v5-series offer flexible, pay-as-you-go pricing models for variable workloads, trials, or short-term projects.
Buying: For organizations with high, consistent utilization, purchasing a DGX H100 may be more cost-effective in the long run, despite the high upfront costs.
Cost Comparison: Cloud vs. On-Premise
As of 2024, renting an 8-GPU H100 instance on Azure costs approximately $100 per hour, or about $72,000 per month for 24/7 usage. Over six months, this totals around $432,000, well above the roughly $240K list price of a DGX H100, although on-prem ownership adds its own costs for power, cooling, rack space, and staffing. For short-term needs, renting is cost-effective, but for sustained, full-capacity usage, purchasing may save money over time.
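A simple break-even estimate follows from these figures: divide the purchase cost by the monthly rental cost. The sketch below is a template using the numbers quoted in this article (about $100/hour on Azure, roughly $238,800 for hardware plus one year of support); prices change and on-prem operating costs are ignored, so treat it as a starting point rather than a quote.

```python
# Rent-vs-buy break-even template using the figures quoted in this article (2024).
hourly_rate = 100.0                   # USD/hour for an 8-GPU H100 cloud instance
hours_per_month = 24 * 30
monthly_rent = hourly_rate * hours_per_month         # ~$72,000

purchase_price = 238_800              # DGX H100 list price + 1 year of support

breakeven_months = purchase_price / monthly_rent
print(f"Monthly rent at 24/7 usage: ${monthly_rent:,.0f}")
print(f"Hardware-only break-even:   {breakeven_months:.1f} months")

# Note: this ignores on-prem power, cooling, rack space, networking, and
# staffing, all of which push the real break-even point later.
```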
Technical Implications of Cloud vs. On-Premise Deployments
When deciding between cloud and on-premise (on-prem) deployments, consider these technical factors:
Network Latency: On-prem typically has lower latency, which can be crucial for distributed training across multiple DGX H100 systems.
Customization: On-prem allows for greater hardware and software customization, potentially enabling optimizations specific to your AI workloads.
Scaling: Cloud offers more flexible scaling, allowing you to rapidly increase or decrease your GPU resources based on demand.
Maintenance: On-prem requires in-house IT expertise for maintenance and upgrades, while cloud handles this for you.
Conclusion
In conclusion, the NVIDIA DGX H100 is a versatile system that can operate as a standalone node or be integrated into larger-scale infrastructures, such as clusters and superclusters. Its core hardware—featuring 8 NVIDIA H100 GPUs interconnected via NVLink and NVSwitch—ensures high-bandwidth, low-latency communication within a single node, making it well-suited for demanding computational tasks like training large language models.
For larger-scale deployments, the ConnectX-7 NICs enable high-speed inter-node communication, allowing the system to scale efficiently across multiple nodes. The DGX H100 supports both vertical scaling (within a single node) and horizontal scaling (across nodes), providing flexibility for deployment scenarios that range from a single high-performance node to large-scale infrastructure for AI training and inference.
In the next article in this series, Article 5, “NVIDIA DGX Reliability, Availability, and Serviceability (RAS)”, we’ll dive into how the DGX H100 is designed to ensure continuous operation, minimize downtime, and facilitate efficient maintenance in high-demand environments.