Table of Contents
Introduction
Architectural Evolution and System Variants
Performance and Scalability
Power and Cooling Advancements
Physical Configuration and Data Center Requirements
Inter-Node Networking
Memory Innovations
Software Evolution: CUDA and NCCL
Security Features
Conclusion
1. Introduction
NVIDIA's AI platforms like the DGX series are at the forefront of AI computing infrastructure, enabling advancements in natural language processing, computer vision, and scientific simulations. The progression from DGX H100 to DGX B200 to GB200 NVL72 represents a major advancement in AI hardware capabilities, directly affecting the scale and complexity of models that researchers and companies can train and deploy.
Each generation introduces improvements to memory architecture, interconnect technologies, and software stack. These advancements collectively enhance the capabilities of AI computing infrastructure, allowing for faster training of larger models and more efficient inference at scale.
2. Architectural Evolution and System Variants
The NVIDIA DGX system line has seen considerable evolution since the introduction of the DGX H100. This evolution has taken two primary paths: iterative updates to standalone systems, each incorporating more powerful compute, and a shift toward integrated CPU-GPU architectures with the modularity required for truly massive data center and supercomputing applications.
Iterative Updates of Standalone Systems: DGX H100 → DGX B200
The first evolutionary path involves iterative updates to the GPU architecture of standalone systems while maintaining overall system compatibility. This path includes:
DGX H100: The baseline system, featuring NVIDIA H100 GPUs.
DGX H200: An iteration introducing H200 GPUs with improved memory capacity and bandwidth.
DGX B200: The latest iteration, incorporating B200 GPUs with further enhancements.
These systems maintain architectural continuity with the H100, allowing for easier upgrades and integration into existing clusters. The primary improvements in each iteration focus on:
Increased GPU memory capacity
Higher memory bandwidth
Improved FP8 performance for AI workloads
Enhanced energy efficiency
For organizations with existing DGX H100 clusters, upgrading to DGX H200 or DGX B200 systems can provide a performance boost without requiring a complete infrastructure overhaul. These upgrades are particularly beneficial for workloads that are memory-bound or require higher precision calculations.
Integrated CPU-GPU Architecture: DGX GH200 & GB200 NVL72
The second evolutionary path represents a more fundamental shift in system design:
DGX GH200: Introduces the Grace Hopper architecture, combining NVIDIA Grace CPUs with Hopper GPUs.
GB200 NVL72: The latest iteration, pairing Grace CPUs with Blackwell GPUs in a modular system.
This path marks a departure from traditional discrete CPU-GPU designs, offering a tightly integrated architecture that provides several key innovations:
Unified memory architecture between CPU and GPU
Reduced latency for CPU-GPU communication
Improved energy efficiency through tighter integration
Optimized memory coherence and data sharing
For new "greenfield" deployments, the DGX GH200 can accelerate AI workloads that require frequent CPU-GPU interaction or benefit from large shared memory pools. For the most massive deployments, the GB200 NVL72 architecture offers modularity and the potential to interconnect eight GB200 NVL72 systems, totaling 576 GPUs in a single NVLink domain.
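To make the unified-memory idea concrete, here is a minimal sketch using CUDA managed memory through CuPy. This is a generic CUDA illustration, not DGX-specific code, and assumes a CUDA-capable GPU with CuPy installed; on Grace-based systems, the coherent NVLink-C2C link is what makes this style of shared CPU-GPU addressing particularly efficient.

```python
# Minimal sketch: unified (managed) memory with CuPy.
# Generic CUDA illustration, not DGX-specific; assumes CuPy is installed.
import cupy as cp

# Route CuPy allocations through cudaMallocManaged, so the same pointer is
# valid on both CPU and GPU and pages migrate on demand.
cp.cuda.set_allocator(cp.cuda.malloc_managed)

x = cp.arange(1_000_000, dtype=cp.float32)  # allocated in managed memory
y = x * 2.0                                  # GPU kernel, no explicit copies
print(float(y.sum()))                        # host access migrates pages back
```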
3. Performance and Scalability
To illustrate the evolution across DGX generations, consider the following comparison table:
[Table: per-system comparison of DGX H100, DGX H200, DGX B200, DGX GH200, and GB200 NVL72 across GPU architecture, GPU count, total FP8 petaFLOPS, and performance per dollar.]
This table outlines the evolution of NVIDIA's DGX systems, showcasing significant advancements in GPU technology, system architecture, and performance metrics. Note: petaFLOPS figures are for AI training workloads assuming FP8.
Key observations include:
A steady increase in total petaFLOPS (FP8) from 100 in the DGX H100 to 3300 in the GB200 NVL72, representing a 33-fold improvement.
Performance efficiency (FLOPS/$) represents how much computational power you get for each dollar spent on the system; a higher value indicates better cost efficiency. It improves consistently across generations, with the GB200 NVL72 offering the best value at 2224 MFLOPS per dollar (see the sanity-check sketch after this list):
DGX B200: 1283 MFLOPS/$ (87.3% improvement over H200)
GB200 NVL72: 2224 MFLOPS/$ (28.8% improvement over GH200)
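As a quick sanity check, the snippet below back-computes the implied H200 and GH200 performance-per-dollar values from the improvement percentages quoted above; all figures come from this article's comparison data and should be treated as approximate.

```python
# Sanity-check the performance-per-dollar figures quoted above (MFLOPS/$).
def improvement_pct(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0

b200, gb200_nvl72 = 1283.0, 2224.0

# Baselines implied by the quoted 87.3% and 28.8% improvements:
h200 = b200 / 1.873          # ~685 MFLOPS/$
gh200 = gb200_nvl72 / 1.288  # ~1727 MFLOPS/$

print(f"implied H200:  {h200:.0f} MFLOPS/$")
print(f"implied GH200: {gh200:.0f} MFLOPS/$")
print(f"B200 vs H200:  +{improvement_pct(b200, h200):.1f}%")
```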
This progression demonstrates NVIDIA's focus on scaling up AI computing power, with each new generation offering substantial improvements in raw performance, energy efficiency, computational density, and performance-cost ratio.
4. Power and Cooling Advancements
As DGX systems have evolved to deliver increased computational power, managing their energy consumption and thermal output has become increasingly critical. NVIDIA has made strides in power efficiency and cooling technologies to address these challenges.
Power Requirements and Efficiency
The total power required per DGX system has increased with each generation, reflecting the growing computational capabilities and GPU counts. Here's an approximate breakdown of power requirements (totaled programmatically in the sketch after this list):
1. DGX H100 (8 GPUs):
GPUs and host system: ~10-15 kW
Networking switches: ~1-2 kW
PDUs and UPS: ~1-2 kW
Cooling systems: ~3-5 kW (varies based on cooling method)
Total per system: ~15-24 kW
2. DGX H200/B200 (8 GPUs):
Similar to H100, with potential 5-10% increase due to higher performance
Total per system: ~16-26 kW
3. DGX GH200 (32 GPUs):
GPUs, CPUs, and host system: ~80-100 kW
Networking and integrated switches: ~5-7 kW
PDUs and UPS: ~8-10 kW
Cooling systems: ~20-25 kW
Total per system: ~113-142 kW
4. GB200 NVL72 (72 GPUs):
GPUs, CPUs, and host system: ~180-220 kW
Networking and integrated switches: ~10-15 kW
PDUs and UPS: ~18-22 kW
Cooling systems: ~45-55 kW
Total per system: ~253-312 kW
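The breakdowns above can be totaled with a few lines of code, which also makes them easy to adapt to site-specific estimates. The component figures below mirror this article's approximations and are planning assumptions, not vendor specifications.

```python
# Sketch: total per-system power from component (min, max) kW ranges.
# Figures mirror the approximate breakdowns above; planning assumptions only.
SYSTEMS = {
    "DGX H100": {"compute": (10, 15), "network": (1, 2),
                 "pdu_ups": (1, 2), "cooling": (3, 5)},
    "GB200 NVL72": {"compute": (180, 220), "network": (10, 15),
                    "pdu_ups": (18, 22), "cooling": (45, 55)},
}

for name, parts in SYSTEMS.items():
    low = sum(lo for lo, _ in parts.values())
    high = sum(hi for _, hi in parts.values())
    print(f"{name}: ~{low}-{high} kW per system")
```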
Despite the substantial increase in total power consumption, each generation has seen improvements in power efficiency:
DGX H100 → H200: ~10-15% improvement in FLOPS per watt
DGX H200 → B200: Projected 20-30% improvement in FLOPS per watt
DGX GH200 → GB200 NVL72: Estimated 30-40% improvement in FLOPS per watt
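Compounding these per-generation gains gives a sense of the cumulative trend. The sketch below multiplies the midpoints of the quoted ranges; these are projections, not measured values.

```python
# Sketch: compound the per-generation FLOPS-per-watt gains quoted above,
# using midpoints of the quoted ranges (projections, not measurements).
gains = {"H100 -> H200": 0.125, "H200 -> B200": 0.25}

cumulative = 1.0
for step, gain in gains.items():
    cumulative *= 1.0 + gain
    print(f"{step}: +{gain:.0%}, cumulative {cumulative:.2f}x vs DGX H100")
# ~1.41x: under these assumptions, a DGX B200 delivers roughly 40% more
# FLOPS per watt than a DGX H100.
```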
These efficiency gains play a critical role in data center operations, enabling the maximization of computational density while addressing energy costs and environmental concerns.
Despite the escalating power needs, each successive generation demonstrates a marked improvement in FLOPS per watt. This trend underscores NVIDIA's commitment to advancing performance while simultaneously enhancing energy efficiency, a crucial factor for sustainable AI infrastructure scaling.
Evolution of Cooling
NVIDIA has introduced tailored cooling innovations for each DGX generation to manage their increasing thermal output:
1. DGX H100:
Air cooling: Utilizes 4U chassis with 8 high-efficiency, variable-speed fans
Airflow: Front-to-rear, 1,200 LFM (linear feet per minute)
Liquid cooling option: Supports direct-to-chip cold plate solutions
TDP: 700W per GPU, 10.2kW total system power
2. DGX B200 (based on available information, some details may be projected):
Air cooling: Likely 4U chassis with upgraded fan system
Expected airflow: >1,500 LFM to accommodate higher TDP
Liquid cooling: Cold plate solutions standard, with potential for facility water integration
3. DGX GH200:
Primarily designed for liquid cooling
Cooling solution: Integrated direct-to-chip liquid cooling
Supports facility water cooling with 18°C to 32°C inlet temperature
4. GB200 NVL72:
Exclusively liquid-cooled design
Two-phase liquid cooling or cold plate with sub-ambient liquid
Likely to require facility water temperatures below 15°C for optimal performance
Across all generations:
Immersion cooling compatibility: All systems support single-phase and two-phase immersion cooling
Thermal management: NVIDIA BMC (Baseboard Management Controller) with dynamic power capping and GPU thermal slowdown at 85°C
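These thermal and power behaviors can be observed in software through NVML. Below is a minimal sketch using the nvidia-ml-py (pynvml) bindings to read temperature, current draw, and the enforced power cap on GPU 0; the 85°C threshold is the figure cited above, and real deployments should query the device's own thresholds.

```python
# Sketch: read GPU temperature and power via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0        # mW -> W
cap_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # mW -> W

print(f"GPU0: {temp_c} C, {power_w:.0f} W (enforced cap {cap_w:.0f} W)")
if temp_c >= 85:  # slowdown threshold cited above; query per device in practice
    print("warning: at or above thermal slowdown threshold")

pynvml.nvmlShutdown()
```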
This evolution shows a clear trend towards more sophisticated cooling solutions to handle the increasing power density. The DGX H100 could operate efficiently with air cooling in many environments, while the GB200 NVL72 is explicitly designed for advanced liquid cooling infrastructure to manage its higher thermal output.
5. Physical Configuration and Data Center Requirements
The evolution of DGX systems has not only impacted their computational capabilities but also their physical characteristics and data center requirements. Understanding these changes is crucial for effective deployment and scaling of AI infrastructure.
Rack Unit (RU) Requirements, Form Factor, and Layout
The physical dimensions and rack space requirements of DGX systems have evolved across generations:
DGX H100: 8 RU per system. Up to 5 systems fit in a standard 42U rack, but 4 systems are recommended for airflow
DGX H200/B200: Similar form factor to the DGX H100, maintaining compatibility with existing rack configurations (note that the DGX B200 datasheet lists it at 10 RU versus the H100's 8 RU)
DGX GH200: 27 RU per system (32 GPUs), reflecting the larger form factor of the integrated CPU-GPU design
GB200 NVL72: 42 RU, occupying a full rack (72 GPUs)
The trend towards higher compute density has led to more powerful systems occupying similar or slightly larger rack spaces, resulting in increased power and cooling demands per rack.
Typical rack configuration for DGX systems might include (budgeted in the sketch after this list):
4 DGX H100/H200/B200 systems or 1 DGX GH200 system
1-2 RU for high-speed networking switches
2-4 RU for PDUs and cable management
Remaining space for supporting systems or left empty for airflow management
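A layout like this is easy to sanity-check with simple arithmetic. The sketch below budgets rack units for a 42U rack using the configuration above; the per-component RU figures are this article's approximations.

```python
# Sketch: budget rack units (RU) in a 42U rack for the layout above.
RACK_RU = 42
layout = {
    "4x DGX H100 (8 RU each)": 4 * 8,  # DGX B200 is 10 RU, so only 3 would fit
    "high-speed networking switches": 2,
    "PDUs / cable management": 4,
}

used = sum(layout.values())
for item, ru in layout.items():
    print(f"{item}: {ru} RU")
print(f"used {used} / {RACK_RU} RU; {RACK_RU - used} RU left for airflow")
```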
6. Inter-Node Networking
The evolution of DGX systems has been closely tied to advancements in networking technologies, enabling scalability for AI workloads. This section explores the networking innovations across DGX generations and their impact on system scalability.
NVSwitch to NVLink Switch System: Expanding the Boundaries of GPU Interconnect
The evolution from NVSwitch to the NVLink Switch System represents a significant leap in NVIDIA's GPU interconnect technology:
1. DGX H100/H200/B200 (NVSwitch):
Enables high-bandwidth, low-latency communication between 8 GPUs within a single node
Facilitates full-mesh connectivity, allowing each GPU to access other GPUs' memory at high speed
Limited to intra-node communication
2. DGX GH200 (NVLink Switch System):
Expands the concept to 32 GPUs within a single system, and up to 256 GPUs in a single NVLink domain
Enables direct GPU-to-GPU communication across multiple nodes
Creates a "pooled" GPU environment that appears as a single, larger system
3. GB200 NVL72 (Enhanced NVLink Switch System):
Further scales the technology to support 72 GPUs per system and, when multiple GB200 NVL72 systems are interconnected, up to 576 GPUs in a single NVLink domain
Cross-node direct GPU communication at a larger scale
Key benefits of this progression:
Dramatically reduced communication overhead for distributed training
Simplified programming model for large-scale AI applications
More efficient utilization of GPU resources across an entire cluster
Ability to scale models across hundreds or thousands of GPUs with improved efficiency
The NVLink Switch System effectively blurs the line between node-level and cluster-level GPU communication, creating a more unified and efficient compute environment for AI workloads.
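From the application side, this fabric is typically exercised through NCCL via a framework such as PyTorch. The sketch below is a generic multi-GPU all-reduce, the core collective in data-parallel training; it assumes launch via a tool like torchrun, which sets the rank and world-size environment variables.

```python
# Sketch: gradient-style all-reduce over the NCCL backend (generic PyTorch).
# Launch with e.g.: torchrun --nproc_per_node=8 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL uses NVLink/NVSwitch intra-node
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Each rank contributes a tensor; all_reduce sums it in place on every rank.
t = torch.full((1024, 1024), float(rank), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)

if rank == 0:
    print(f"after all_reduce: t[0,0] = {t[0,0].item()}")
dist.destroy_process_group()
```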
Networking Progression for each DGX System
DGX H100
Inter-GPU Communication: NVLink 4.0 (3200 GB/s GPU-to-GPU bandwidth)
Host-to-GPU Communication: PCIe Gen5 (32 GT/s)
3rd generation NVSwitch
Cluster Connectivity: NVIDIA Quantum-2 InfiniBand (NDR 400Gb/s)
16x NVIDIA Quantum-2 InfiniBand ports
DGX H200/B200
Inter-GPU Communication: NVLink 4.0+ (enhanced 3600 GB/s GPU-to-GPU bandwidth)
Host-to-GPU Communication: PCIe Gen5 (32 GT/s)
Cluster Connectivity: NVIDIA Quantum-2 InfiniBand (NDR 400Gb/s)
16x NVIDIA Quantum-2 InfiniBand ports
DGX GH200 and GB200 NVL72
Inter-GPU Communication: NVLink Switch System (6400 GB/s GPU-to-GPU bandwidth)
CPU-GPU Communication: Integrated NVLink-C2C (Chip-to-Chip)
NVLink Switch System: Extends NVLink connectivity across multiple nodes, enabling GPU-to-GPU communication at the rack and pod level
Cluster Connectivity: NVIDIA Quantum-2 InfiniBand (NDR 400Gb/s)
32x NVIDIA Quantum-2 InfiniBand ports
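These link speeds translate directly into collective-operation times. A common model for a ring all-reduce of S bytes across n workers is that each link carries roughly 2(n-1)/n × S bytes. The sketch below applies this model with illustrative bandwidth figures (900 GB/s for an NVLink-class GPU link, 50 GB/s for a 400 Gb/s InfiniBand port); both the model and the numbers ignore latency and protocol overhead.

```python
# Sketch: rough ring all-reduce time for a gradient buffer, ignoring latency.
def ring_allreduce_seconds(size_bytes: float, n: int, link_gb_s: float) -> float:
    """Ring all-reduce moves ~2*(n-1)/n * size bytes over each link."""
    traffic = 2.0 * (n - 1) / n * size_bytes
    return traffic / (link_gb_s * 1e9)

grad_bytes = 70e9 * 2  # e.g., 70B parameters in BF16 (2 bytes each) -- assumption
nvlink = ring_allreduce_seconds(grad_bytes, n=72, link_gb_s=900)  # NVLink-class
ib = ring_allreduce_seconds(grad_bytes, n=72, link_gb_s=50)       # 400 Gb/s IB

print(f"NVLink-class: {nvlink * 1e3:.0f} ms per all-reduce")
print(f"400 Gb/s InfiniBand: {ib * 1e3:.0f} ms per all-reduce")
```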
7. Memory Innovations
The concept of NVLink-connected memory has improved memory access in DGX systems. For AI training purposes, NVLink-connected memory can be considered a "single memory space" or "contiguous memory," impacting model parallelism and distributed training approaches.
Note: The GH200 Superchip pairs 1 Grace CPU with 1 Hopper GPU, while the GB200 Superchip pairs 1 Grace CPU with 2 Blackwell GPUs. Hence, a DGX GH200 has 32 GH200 Superchips, and a GB200 NVL72 has 36 GB200 Superchips.
The DGX GH200 and GB200 NVL72 models mark a significant shift in memory architecture with the introduction of Unified Memory. Unlike earlier generations (DGX H100, H200, and B200) that focused solely on GPU memory, the GH200 and GB200 treat both CPU and GPU memory as a single, contiguous NVLink-connected memory space.
The DGX GH200 offers 36TB of Unified Memory (4.5TB GPU memory across 32 GPUs, plus CPU memory), while the GB200 NVL72 provides 72TB (13.5TB GPU memory across 72 GPUs, plus CPU memory).
This architecture has profound implications for training large language models:
1. Capacity: Assuming FP8 (1 byte per parameter) and 4x memory overhead for training:
DGX GH200 can accommodate a 9 trillion parameter model.
GB200 NVL72 can handle an 18 trillion parameter model.
Note: 4x the model parameter count is a rule of thumb for estimating training memory for LLMs, covering overheads such as optimizer states, gradients, forward activations for backpropagation, and temporary buffers for computations (see the estimator sketch at the end of this section).
2. Reduced Model Splitting: Large models can fit entirely within the unified memory space, simplifying the training process.
3. Improved Model Coherence: Keeping larger portions of the model in a single memory space simplifies coherence and data movement during training.
4. Dynamic Memory Utilization: More efficient resource use during different training phases.
The NVLink Switch System enables seamless data sharing between CPUs and GPUs, eliminating explicit data transfers and simplifying memory management.
Note: Despite the increased capacity, distributed computing (i.e., many interconnected DGX nodes) is still necessary for extremely large models to shorten training times to weeks or months instead of years. Additionally, software frameworks and AI algorithms will need to evolve to fully leverage this new memory paradigm.
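The capacity arithmetic above generalizes into a one-line estimator. The sketch below uses this article's assumptions (1 byte per parameter at FP8 and a 4x training overhead); both are rules of thumb, not exact requirements.

```python
# Sketch: largest trainable model for a given memory pool, using the rule of
# thumb above: bytes_needed ~= params * bytes_per_param * training_overhead.
def max_params(memory_tb: float, bytes_per_param: float = 1.0,
               overhead: float = 4.0) -> float:
    return memory_tb * 1e12 / (bytes_per_param * overhead)

print(f"DGX GH200 (36 TB unified):   {max_params(36) / 1e12:.0f}T parameters")
print(f"GB200 NVL72 (72 TB unified): {max_params(72) / 1e12:.0f}T parameters")
```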
8. Software Evolution: CUDA and NCCL
CUDA and NCCL implementations have evolved in parallel with DGX hardware advancements. Each new generation of GPUs requires software optimizations to fully utilize improved hardware capabilities.
The transition from DGX H100 to H200 saw CUDA evolve from version 11.8 to 12.x. This software update enabled:
Improved utilization of increased HBM3e memory bandwidth
Optimized memory management for larger GPU memory capacities
Enhanced support for new AI-specific instructions
Similarly, NCCL advancements from version 2.16 to 2.18 improved multi-GPU and multi-node communication, which is essential for distributed AI training on DGX clusters.
The introduction of the Grace Hopper architecture in DGX GH200 necessitated more extensive software changes. CUDA 12.3 and NCCL 2.19.3 introduced:
Support for the unified memory architecture between CPU and GPU
Optimizations for the NVLink Switch System
New algorithms for collective operations leveraging the tighter CPU-GPU integration
These software advancements work in conjunction with hardware improvements to deliver performance gains that exceed what hardware alone could achieve.
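On a running system, the CUDA and NCCL versions a PyTorch build was compiled against can be checked directly, which is useful when matching software versions to DGX hardware generations; a minimal sketch:

```python
# Sketch: report the CUDA and NCCL versions a PyTorch build is using.
import torch

print(f"PyTorch: {torch.__version__}")
print(f"CUDA (build): {torch.version.cuda}")
if torch.cuda.is_available():
    major, minor, patch = torch.cuda.nccl.version()
    print(f"NCCL: {major}.{minor}.{patch}")
    print(f"device 0: {torch.cuda.get_device_name(0)}")
```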
9. Security Features
As DGX systems have evolved, NVIDIA has continuously enhanced their security features to protect valuable AI workloads and sensitive data. Here's how security capabilities have progressed across generations:
DGX H100
Secure boot process with cryptographically signed firmware
Support for self-encrypted drives (SEDs)
DGX B200
Enhanced firmware resilience with NVIDIA-developed secure boot chain
Improved GPU memory encryption capabilities
Introduction of confidential computing features for sensitive AI workloads
DGX GH200
Hardware-accelerated encryption for CPU-GPU data transfers
Enhanced secure enclave support for protecting AI models and data in use
Improved isolation capabilities for multi-tenant environments
GB200 NVL72
Further advancements in hardware-based security, leveraging Blackwell GPU architecture
Enhanced support for homomorphic encryption, allowing computations on encrypted data (illustrated in the sketch after this list)
Improved secure multi-party computation capabilities for federated learning
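To ground the homomorphic-encryption point, the sketch below shows the basic idea of computing on encrypted data using the generic TenSEAL library (CKKS scheme). This is a CPU-side illustration of the concept only; it is not an NVIDIA or DGX-specific API.

```python
# Sketch: arithmetic on encrypted data with CKKS homomorphic encryption
# (pip install tenseal). Conceptual illustration; not a DGX-specific API.
import tenseal as ts

ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()

enc = ts.ckks_vector(ctx, [1.0, 2.0, 3.0])   # encrypt a vector
enc_result = enc * 2 + [1.0, 1.0, 1.0]       # computed entirely on ciphertext

print(enc_result.decrypt())  # ~[3.0, 5.0, 7.0]; inputs stayed encrypted
```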
Throughout these generations, NVIDIA has maintained a focus on providing a secure foundation for AI workloads, with each iteration building upon and enhancing the security capabilities of its predecessors.
10. Conclusion
The evolution of NVIDIA's DGX systems from DGX H100 to DGX B200 to GB200 NVL72 has addressed key challenges in large-scale AI infrastructure: computational power, memory capacity, and data movement efficiency. These improvements enable AI researchers and engineers to train larger, more complex models with improved speed and efficiency.
Key advancements include:
Advances in GPU technology from H100 to H200 to B200 to GH200 to GB200.
Expansion in memory, from 640GB of GPU memory per DGX H100 to 72TB of unified memory per GB200 NVL72
Shift towards integrated CPU-GPU architectures for improved efficiency
Advanced cooling solutions to manage increasing power densities
Introduction of the NVLink Switch System, allowing more and more GPUs to operate in a single NVLink domain: from 8 GPUs in the DGX H100 to 576 GPUs across interconnected GB200 NVL72 systems.
In the next article, "Article 8. Traversing the Network Fabric," we will explore the world of high-performance interconnects, including NVLink, NVSwitch, ConnectX SmartNIC, InfiniBand, and RDMA, which form the backbone of efficient communication in modern AI clusters.