2/20. AI Supercluster: NVIDIA GPU Architecture & Evolution
From H100 to B200 / GH200 / GB200 "Superchips"
Introduction
Graphics Processing Units (GPUs) have emerged as the cornerstone of AI supercomputing. Originally designed for rendering high-resolution graphics and later repurposed for cryptocurrency mining, GPUs have found a third life as AI compute engines.
Their highly parallel architecture, capable of performing millions—and now billions—of simultaneous calculations, is ideal for matrix operations, which form the basis of machine learning algorithms.
Let’s explore NVIDIA's GPU architecture and examine its evolution across recent generations.
Key GPU Performance Dimensions
To understand the capabilities of modern GPUs in AI supercomputing, let's examine these key performance dimensions:
Compute Power: Measured in TFLOPS (trillion floating-point operations per second), this represents the raw computational capability of the GPU.
Memory Capacity: The amount of on-board memory available for storing model parameters and intermediate results. Higher memory capacity allows for processing larger models and datasets, reducing the need for off-chip memory access, which can be a bottleneck.
Memory Bandwidth: The rate at which data can be transferred to and from the GPU's memory. Higher memory bandwidth minimizes bottlenecks and ensures that the GPU cores are consistently fed with data.
Process Node: The manufacturing process technology, measured in nanometers (nm). Smaller process nodes generally lead to higher transistor density, improved performance, and reduced power consumption.
Interconnects: High-speed technologies like NVLink and PCIe that enable efficient communication between GPUs and between GPU and CPU. Higher bandwidth allows for faster data exchange and improved scalability in multi-GPU systems.
Specialized Hardware: Components like Tensor Cores and Transformer Engines, designed to accelerate specific operations common in AI workloads.
Evolution of NVIDIA GPUs
NVIDIA's GPUs have evolved consistently across generations, with each new architecture delivering significant advances in compute power, memory capacity, and interconnect bandwidth. The sketch below gathers the headline figures discussed throughout this article for a quick side-by-side comparison.
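To make that comparison concrete, here is a minimal Python sketch, illustrative only, that collects the figures cited later in this article plus the A100's approximate memory bandwidth from NVIDIA's public spec sheet (an assumption, not a value from this text), and computes the improvement ratios:

```python
# Headline figures cited in this article; the A100 memory bandwidth (~2 TB/s)
# is an approximate figure from NVIDIA's public spec sheet (assumption).
GPU_SPECS = {
    "A100": {"memory_gb": 80,  "mem_bw_tbps": 2.0, "nvlink_gbps": 600, "pcie_gbps": 64},
    "H100": {"memory_gb": 80,  "mem_bw_tbps": 3.0, "nvlink_gbps": 900, "pcie_gbps": 128},
    "H200": {"memory_gb": 141, "mem_bw_tbps": 4.8, "nvlink_gbps": 900, "pcie_gbps": 128},
}

def improvement(metric: str, old: str = "A100", new: str = "H200") -> float:
    """Ratio of a given spec between two generations."""
    return GPU_SPECS[new][metric] / GPU_SPECS[old][metric]

if __name__ == "__main__":
    for metric in ("memory_gb", "mem_bw_tbps", "nvlink_gbps", "pcie_gbps"):
        print(f"{metric:12s}: A100 -> H200 improved {improvement(metric):.1f}x")
```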
Decoding NVIDIA's Codenames
NVIDIA uses codenames for its GPU architectures during development, often inspired by prominent scientists and researchers. These codenames are then translated into official product names upon release:
Ampere: Named after André-Marie Ampère, a French physicist and mathematician. Official product name: A100.
Hopper: Named after Grace Hopper, a pioneer in computer programming. Official product names: H100, H200, and GH200 (1 Hopper GPU plus 1 Grace CPU).
Grace: Also named after Grace Hopper, this refers to NVIDIA's CPU architecture designed for high-performance computing and AI workloads. Official product names: GH200 and GB200.
Blackwell: Named after David Blackwell, an American statistician and mathematician. Official product names are the B200 (1 Blackwell GPU) and GB200 (2 Blackwell GPUs plus 1 Grace CPU).
Note: there's some naming confusion due to internal or supplier-specific documents leaking to the press (e.g., "B100"). However, the official NVIDIA product names are as listed above.
Optimizing GPU Performance for Trillion-Parameter Models
When training trillion-parameter models, AI engineers strive to optimize two key aspects of system performance:
Speed: Faster training enables more experimentation, iteration, and ultimately, quicker deployment of AI solutions. GPUs with higher compute power, memory bandwidth, and efficient interconnects contribute to faster training times.
Energy Efficiency: The energy consumption of large-scale AI training can be substantial. Power-efficient GPUs with features like Dynamic Voltage and Frequency Scaling (DVFS), Multi-Instance GPU (MIG), and specialized hardware accelerators help reduce energy costs and environmental impact.
Each new generation of NVIDIA GPUs brings advancements that improve system performance on multiple fronts:
A100 to H100: The H100 significantly enhances compute power with improved Tensor Cores, increases HBM3 memory bandwidth to 3 TB/s, boosts NVLink interconnect speed to 900 GB/s, and supports FP8, leading to faster training times and improved energy efficiency.
H100 to H200: The H200 upgrades to HBM3e, increasing memory capacity to 141GB and bandwidth to 4.8 TB/s, resulting in even faster and more efficient training and inference for large models.
B200: The B200 continues the progression to denser compute, higher GPU memory, and higher bandwidth to GPU memory and other GPUs.
GH200: By integrating one Hopper GPU and one Grace CPU on the same package, connected by the high-speed NVLink-C2C interconnect, the GH200 optimizes the training of massive AI models that require tight CPU-GPU collaboration.
GB200: By integrating two B200 GPUs and one Grace CPU on the same package, also connected by NVLink-C2C, the GB200 further increases compute density.
This article delves into each of these key generational leaps in GPU technology and the performance improvements they deliver.
Key Components of a Modern GPU
The GPU is a sophisticated piece of hardware, packed with components that work together to deliver high performance for AI workloads. Some of the key components of an NVIDIA GPU include:
Tensor Cores: These are specialized processing units designed to accelerate matrix operations, which are fundamental to deep learning. The A100 features 432 third-generation Tensor Cores, while both the H100 and H200 are equipped with 528 fourth-generation Tensor Cores.
Transformer Engine: Introduced with the Hopper architecture, this combines dedicated Tensor Core hardware with software that dynamically manages FP8 and FP16 precision for transformer layers. Unlike the A100, which relied on standard Tensor Core operations for transformer workloads, the H100 and H200 include this dedicated engine; the Blackwell generation adds a second-generation Transformer Engine.
HBM3 Memory: High Bandwidth Memory is a technology that stacks memory chips vertically and places them very close to the GPU die.
NVLink and NVLink-C2C: High-speed interconnect technologies. NVLink (4.0 on Hopper) enables efficient communication between GPUs within a node and, via the NVLink Switch System, across nodes in a cluster, while NVLink-C2C connects a GPU to the Grace CPU within a Superchip package.
PCIe 5.0 Interface: This connects the GPU to the rest of the system, providing a pathway for data transfer between the GPU and the CPU or other peripherals.
Sidebar: Calculating GPU Memory Requirements for LLMs
GPUs require increasingly larger memory to accommodate the growing size of AI models. Let's examine the memory requirements for large language models and how GPU technology is evolving to meet these demands.
Memory requirements for large language models:
500 billion parameter model (FP8): 500 billion * 1 byte = 500 GB
1 trillion parameter model (FP8): 1 trillion * 1 byte = 1000 GB = 1 TB
Additional memory considerations include optimizer state, activation memory, intermediate computations and gradients. So in total, memory requirements could be 2-4x the model size.
Total memory needs:
500 billion parameter model: ~2 TB
1 trillion parameter model: ~4 TB
Even state-of-the-art GPUs cannot hold these models in memory on a single device, necessitating parallelism techniques that split the model across multiple GPUs. The sketch below works through this arithmetic.
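Here is a minimal sketch of that arithmetic, assuming 1 byte per FP8 parameter, the sidebar's 2-4x overhead multiplier for optimizer state, activations, and gradients, and an H200-class 141 GB of HBM per GPU; the GPU count considers capacity only, ignoring compute and communication:

```python
import math

def training_memory_gb(params_billion: float,
                       bytes_per_param: int = 1,   # FP8 weights: 1 byte each
                       overhead: float = 4.0) -> float:
    """Rough total training memory: weights times a 2-4x overhead factor."""
    weights_gb = params_billion * bytes_per_param   # 1e9 params * 1 byte = 1 GB
    return weights_gb * overhead

def min_gpus_for_capacity(total_gb: float, gpu_memory_gb: float = 141) -> int:
    """Minimum GPUs needed if memory capacity were the only constraint."""
    return math.ceil(total_gb / gpu_memory_gb)

if __name__ == "__main__":
    for params_billion in (500, 1000):
        total = training_memory_gb(params_billion)
        print(f"{params_billion}B params: ~{total / 1000:.0f} TB total, "
              f"at least {min_gpus_for_capacity(total)} GPUs just to hold it")
```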
HBM Evolution: Powering GPU Memory Demands
High Bandwidth Memory (HBM) is a type of 3D-stacked DRAM commonly referred to as "GPU memory," distinguishing it from system RAM connected to the CPU.
HBM is physically instantiated as multiple DRAM dies stacked vertically on the same package as the GPU die, connected via an interposer, enabling higher bandwidth and lower power consumption compared to traditional GDDR memory.
Recent HBM advancements include:
Increased die stacking for higher capacity
Improved manufacturing processes for denser memory cells
Enhanced interconnect technology for faster data transfer
Transition from HBM3 to HBM3e, offering higher speeds and capacity
These improvements have significantly boosted both capacity and bandwidth. For example, NVIDIA's transition from the H100 to the H200 increased memory capacity by over 75% (80GB to 141GB) and bandwidth by 60% (3 TB/s to 4.8 TB/s).
The impact of these advancements is twofold:
Larger capacity allows more of the model to reside on each GPU, reducing communication overhead and improving training efficiency.
Higher bandwidth accelerates data transfer between memory and GPU, leading to more rapid iterations during training and faster response times during inference.
These enhancements are crucial for supporting the growing demands of large AI models, enabling more efficient training and inference processes.
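To illustrate the bandwidth side of this, the short sketch below estimates how long it takes simply to stream a fixed amount of data (80 GB, roughly one full set of H100-resident weights) at the peak bandwidths quoted above. Real workloads overlap compute with memory traffic, so treat this as a lower bound rather than a benchmark:

```python
def stream_time_ms(data_gb: float, bandwidth_tbps: float) -> float:
    """Time in milliseconds to move data_gb gigabytes at a peak bandwidth in TB/s."""
    return data_gb / (bandwidth_tbps * 1000) * 1000   # GB / (GB/s) -> seconds -> ms

if __name__ == "__main__":
    data_gb = 80  # e.g. a full sweep over weights resident in HBM
    for name, bw_tbps in (("H100 @ 3.0 TB/s", 3.0), ("H200 @ 4.8 TB/s", 4.8)):
        print(f"{name}: {stream_time_ms(data_gb, bw_tbps):.1f} ms to stream {data_gb} GB")
```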
Next, let’s examine how advancements in the manufacturing process node contribute to GPU performance.
Impact of Process Node Advancements
The transition from a 7nm process node (Ampere) to TSMC's custom 4N node (Hopper) and further to the enhanced 4NP node (Blackwell) has several significant impacts (a back-of-envelope density comparison follows this list):
Transistor Density: Smaller process nodes allow more transistors to be packed into the same area, leading to increased compute density and higher performance.
Power Efficiency: Smaller transistors generally consume less power, contributing to improved energy efficiency and reduced operating costs.
FLOPS per GPU: The combination of increased transistor density and architectural improvements leads to higher FLOPS per GPU, enabling faster and more efficient AI computations.
Custom Integration: The 4N and 4NP processes, versions of TSMC's 4nm-class technology customized for NVIDIA, enable the integration of multiple chips and high-bandwidth interconnects within a single package.
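As a back-of-envelope illustration of the density trend, the sketch below uses NVIDIA's publicly reported transistor counts and approximate die areas; these figures are not taken from this article and are included here as assumptions:

```python
# Publicly reported transistor counts and approximate die areas (assumptions):
# A100 (GA100): ~54.2B transistors on ~826 mm^2; H100 (GH100): ~80B on ~814 mm^2.
DIES = {
    "A100 (7nm)": (54.2e9, 826.0),
    "H100 (4N)":  (80.0e9, 814.0),
}

for name, (transistors, area_mm2) in DIES.items():
    density_millions = transistors / area_mm2 / 1e6
    print(f"{name}: ~{density_millions:.0f}M transistors per mm^2")

# Blackwell continues the trend differently: ~208B transistors spread across two
# reticle-sized dies in one package, sidestepping the single-die size limit.
```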
GPU to GPU Communications: High-Speed Interconnect (NVLink)
NVLink is NVIDIA's high-speed interconnect for linking multiple GPUs within a node and, through the NVLink Switch System, for scaling to larger clusters. As large AI models are split across many GPUs during training, efficient GPU-to-GPU communication becomes crucial.
In the latest NVIDIA GPUs, NVLink is integrated directly into the GPU's silicon die, with dedicated controllers and high-speed circuits on-chip. These NVLink interfaces are physically manifested as gold-finger connectors on the GPU board, interfacing with the system's NVLink interconnect fabric.
Number of NVLinks and Bandwidth:
A100: 12 third-generation NVLink connections, providing 600 GB/s of total GPU-to-GPU bandwidth. (NVLink 3.0)
H100 and H200: 18 fourth-generation NVLink connections, providing 900 GB/s of total GPU-to-GPU bandwidth. (NVLink 4.0)
This increased NVLink bandwidth in newer generations significantly improves multi-GPU communication, reducing bottlenecks and enabling more efficient parallel processing for AI workloads.
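The aggregate numbers above are simply the per-link bandwidth multiplied by the link count; NVLink 3.0 and 4.0 links are both commonly quoted at 50 GB/s of bidirectional bandwidth per link, an assumption stated here for the arithmetic:

```python
def nvlink_total_gbps(num_links: int, per_link_gbps: float = 50.0) -> float:
    """Aggregate bidirectional NVLink bandwidth: links times per-link bandwidth."""
    return num_links * per_link_gbps

if __name__ == "__main__":
    print(f"A100 (NVLink 3.0): 12 links -> {nvlink_total_gbps(12):.0f} GB/s total")
    print(f"H100/H200 (NVLink 4.0): 18 links -> {nvlink_total_gbps(18):.0f} GB/s total")
```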
GPU to CPU Communications: PCIe
The PCIe (Peripheral Component Interconnect Express) interface is crucial for connecting the GPU to the CPU and other components within a node:
Within the Node: PCIe facilitates communication between the GPU and CPU, and connects to other PCIe devices like storage and network adapters.
Between Nodes: In multi-node H100 clusters, PCIe links GPUs to InfiniBand adapters, enabling inter-node communication over the InfiniBand network.
During multi-GPU training, PCIe serves two primary functions within the node:
Data Transfer: Moving data between CPU memory and GPU memory, as well as exchanging data between GPUs within the node.
Synchronization and Coordination: Communicating gradients, model updates, and other control information between GPUs and the CPU during training.
The primary performance metric is bandwidth, and as the NVIDIA GPU architecture has evolved, the per-GPU PCIe bandwidth has doubled (the short calculation after this list shows where these figures come from):
A100 (PCIe 4.0): ~64 GB/s bidirectional
H100 and H200 (PCIe 5.0): ~128 GB/s bidirectional
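The calculation promised above: a PCIe x16 link's bidirectional bandwidth follows from the per-lane transfer rate and the 128b/130b encoding used since PCIe 3.0. The raw arithmetic lands slightly under the round 64 GB/s and 128 GB/s figures:

```python
def pcie_x16_bidir_gbps(gigatransfers_per_s: float) -> float:
    """Approximate bidirectional bandwidth of a PCIe x16 link.

    Each lane moves gigatransfers_per_s GT/s with 128b/130b encoding,
    i.e. about (GT/s / 8) * (128/130) GB/s per lane per direction.
    """
    per_lane_per_dir = gigatransfers_per_s / 8 * (128 / 130)
    return per_lane_per_dir * 16 * 2   # 16 lanes, both directions

if __name__ == "__main__":
    print(f"PCIe 4.0 x16: ~{pcie_x16_bidir_gbps(16):.0f} GB/s bidirectional")
    print(f"PCIe 5.0 x16: ~{pcie_x16_bidir_gbps(32):.0f} GB/s bidirectional")
```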
While not the primary conduit for GPU-to-GPU data transfer, PCIe plays a vital role in:
Facilitating essential CPU-GPU communication
Connecting high-speed storage devices
Interfacing with other system components
This CPU-GPU interaction is critical for orchestrating the overall training process.
NVIDIA GB200 Superchip: State of the Art (For now)
The NVIDIA Grace Blackwell GB200 Superchip represents a significant advancement in GPU technology by integrating two B200 GPUs and one Grace CPU on a single package. This tight coupling creates a high-bandwidth computing solution that addresses traditional CPU-GPU bottlenecks, reducing latency and increasing data transfer speeds.
The integrated design also improves power efficiency and enables enhanced CPU-GPU coordination. A single GB200 Superchip, comprising one Grace CPU and two B200 GPUs connected via NVLink-C2C, can be considered a node.
Key components and features of a GB200 Superchip node include:
Grace CPU:
Custom ARM-based silicon optimized for AI and HPC workloads
Features high core count and large cache sizes
Includes up to 480 GB of LPDDR5X CPU memory
B200 GPU:
Latest generation NVIDIA GPU designed for AI and HPC tasks
Tightly integrated with the Grace CPU
NVLink-C2C (Chip-to-Chip):
High-speed, low-latency interconnect between Grace CPU and B200 GPU
Resides within the package, providing up to 900 GB/s of bidirectional bandwidth
NVLink (external):
18 fifth-generation NVLink connections per GPU for multi-GPU and multi-node networking
Providing up to 1.8 TB/s of total bidirectional bandwidth per GPU
Compared to previous generations, in which an external CPU communicated with the GPU over PCIe, the GB200's integrated design offers several advantages, starting with NVLink-C2C.
NVLink-C2C: Unifying GPU and CPU Resources on GB200
Introducing NVLink-C2C
NVLink-C2C in NVIDIA's GB200 Grace Blackwell Superchip is a GPU-CPU interconnect technology that enables seamless memory oversubscription and direct, high-bandwidth CPU memory access from the GPU.
This technology delivers 900 GB/s of bidirectional bandwidth through an on-package interconnect, roughly 7x the bandwidth of the PCIe Gen5 x16 link used by the H100 (128 GB/s bidirectional). It also significantly reduces latency for GPU-initiated CPU memory operations, approaching that of local GPU memory accesses.
NVLink-C2C effectively creates a unified memory pool combining GPU HBM and CPU DRAM. The GPU can access CPU memory almost as if it were additional local high-bandwidth memory, eliminating many traditional GPU memory capacity constraints. This particularly benefits large AI models or datasets exceeding GPU memory limits.
Latency Improvements Enabled by NVLink-C2C
The direct, on-package connection between GPU and CPU via NVLink-C2C significantly reduces communication delays compared to traditional PCIe connections. NVIDIA has publicly stated that the GB200's CPU-to-GPU memory access latency is less than 20 nanoseconds.
In contrast, the H100 GPU, which uses PCIe Gen5 for CPU-GPU communication, has a typical latency of around 400-600 nanoseconds for CPU-to-GPU memory access. This means the GB200 achieves a latency reduction of approximately 95-97% compared to the H100's PCIe-based connection.
This dramatic reduction in latency, combined with the increased bandwidth, enables much more efficient processing for AI and HPC workloads that require frequent communication between CPU and GPU. It's particularly beneficial for:
Tasks involving rapid, small data transfers
Processes requiring tight CPU-GPU coordination (see the back-of-envelope example below)
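Here is that back-of-envelope example, taking the latency figures above at face value: if a training step issues many small, latency-bound CPU-GPU accesses, the accumulated latency alone differs by more than an order of magnitude (the access count is hypothetical):

```python
def accumulated_latency_ms(num_accesses: int, latency_ns: float) -> float:
    """Total time spent purely on access latency, ignoring payload transfer time."""
    return num_accesses * latency_ns / 1e6

if __name__ == "__main__":
    n = 100_000  # hypothetical number of small, latency-bound CPU-GPU accesses
    print(f"PCIe-attached (~500 ns each): {accumulated_latency_ms(n, 500):.1f} ms")
    print(f"NVLink-C2C    (~20 ns each):  {accumulated_latency_ms(n, 20):.1f} ms")
```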
Unified Memory Architecture Enabled by NVLink-C2C
The GB200 features a unified memory architecture combining:
192GB of HBM3e per B200 GPU, for a total of 384GB across the GB200's two B200 GPUs
Up to 480GB of system memory (LPDDR5X) connected to the Grace CPU
This architecture enables coherent access to the entire memory space for both CPU and GPU, with transparent hardware-managed data movement between the two memory types.
In training trillion-parameter models:
HBM3e handles active computations and caches frequently accessed data
System memory stores the full model parameters and larger datasets
The system dynamically moves data between HBM3e and system memory based on computational needs during training
This unified memory system leverages HBM3e's high bandwidth for critical GPU operations while utilizing the larger capacity of system memory. It accommodates the extensive requirements of trillion-parameter models, even when parallelism divides the model across multiple GPUs.
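A minimal sketch of the capacity side of this unified pool, using the figures above (384 GB of HBM3e plus up to 480 GB of LPDDR5X per GB200). It only checks how much of a per-superchip working set fits in HBM and how much spills to CPU-attached memory; on the real hardware, placement and migration are managed transparently by the system, and the working-set sizes below are hypothetical:

```python
HBM_GB = 384      # 2 x 192 GB HBM3e across the two B200 GPUs
LPDDR_GB = 480    # Grace CPU-attached LPDDR5X

def placement(working_set_gb: float):
    """Split a working set between HBM and CPU memory by capacity alone."""
    in_hbm = min(working_set_gb, HBM_GB)
    spill = max(0.0, working_set_gb - HBM_GB)
    fits = spill <= LPDDR_GB
    return in_hbm, spill, fits

if __name__ == "__main__":
    for working_set_gb in (300, 700, 1000):   # hypothetical per-superchip shares
        in_hbm, spill, fits = placement(working_set_gb)
        status = "fits in unified pool" if fits else "exceeds unified pool"
        print(f"{working_set_gb} GB working set: {in_hbm:.0f} GB in HBM, "
              f"{spill:.0f} GB in CPU memory ({status})")
```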
How Lower GPU Precision Drives AI Model Scaling
As GPUs have advanced, they've increasingly supported lower precision floating-point (FP) numerical formats (from FP32 to FP16 to FP8) for AI computations. Yes, lower. While counterintuitive, this shift towards lower precision offers several advantages:
Reduced Memory Footprint: Lower precision formats require less memory for model weights and activations, enabling larger models to fit within available GPU memory.
Improved Performance: Specialized hardware like Tensor Cores can execute lower precision computations faster, leading to improved training and inference speeds.
Energy Efficiency: Processing lower precision data consumes less power, enhancing overall system efficiency.
The movement towards larger parameter counts in AI models has been a crucial driver in making quantization both possible and necessary. Quantization is the process of reducing numerical precision by representing information with fewer bits.
Larger models inherently exhibit redundancy, making them more resilient to minor inaccuracies introduced by lower precision formats. This resilience enables effective application of quantization techniques without significant accuracy loss, allowing optimization of memory usage and performance.
Leveraging advancements in AI model quantization techniques, NVIDIA has introduced:
FP16 and BF16 (half-precision) support on the A100's third-generation Tensor Cores
FP8 support on the H100's enhanced Tensor Core
This progression to lower precision accelerates training and inference for large models; each halving of precision roughly doubles peak FLOPS (floating-point operations per second) under equivalent conditions.
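The memory side of this trade-off is easy to quantify: the same parameter count stored at FP32, FP16, and FP8 widths shrinks linearly with the byte width. The byte sizes are standard; the 1-trillion-parameter count echoes the sidebar above:

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "FP8": 1}

def weights_footprint_tb(params_billion: float, precision: str) -> float:
    """Memory needed for the weights alone at a given numerical precision."""
    return params_billion * BYTES_PER_PARAM[precision] / 1000   # GB -> TB

if __name__ == "__main__":
    params_billion = 1000   # a 1-trillion-parameter model
    for precision in ("FP32", "FP16", "FP8"):
        print(f"{precision}: {weights_footprint_tb(params_billion, precision):.1f} TB of weights")
```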
Improving Power Efficiency
As GPUs have grown more powerful, managing their power consumption and heat generation has become increasingly important. Several factors contribute to power efficiency in NVIDIA GPUs:
Transistor Density: Smaller process nodes enable packing more transistors into the same area, leading to higher compute density and lower power consumption per transistor.
DVFS (Dynamic Voltage and Frequency Scaling): This allows the GPU to dynamically adjust its voltage and frequency based on workload demands, minimizing power consumption during idle or less demanding periods.
Multi-Instance GPU (MIG): This feature partitions a single GPU into multiple smaller instances, allowing for better resource utilization and power savings when workloads don't require the full GPU's capabilities.
Architectural Optimizations: Advancements in Tensor Cores, sparse matrix handling, and specialized hardware accelerators contribute to improved performance per watt.
NVIDIA has consistently improved GPU power efficiency from generation to generation, as the rough comparison below illustrates.
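The sketch below divides publicly quoted dense FP16 Tensor Core throughput by chip TDP; both figures are approximate public numbers rather than values from this article, and chip TDP ignores system-level power, so treat the result as directional only:

```python
# Approximate public figures (assumptions): dense FP16 Tensor Core TFLOPS and TDP.
POWER_SPECS = {
    "A100": {"fp16_tflops": 312, "tdp_w": 400},
    "H100": {"fp16_tflops": 990, "tdp_w": 700},
}

for name, spec in POWER_SPECS.items():
    tflops_per_watt = spec["fp16_tflops"] / spec["tdp_w"]
    print(f"{name}: ~{tflops_per_watt:.2f} TFLOPS per watt (dense FP16, chip TDP only)")
```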
Cooler Cooling Solutions
As GPUs become increasingly powerful to handle massive AI workloads, their power consumption and heat output naturally increase. This necessitates advancements in cooling solutions, not only to prevent overheating but also to ensure reliability and maintain peak performance. Some consequences of GPU overheating include:
Performance degradation: Thermal throttling reduces clock speeds, leading to decreased GPU performance and efficiency.
System instability: Overheating can cause data corruption, system crashes, and unexpected shutdowns.
Hardware damage: Prolonged exposure to high temperatures can permanently damage GPU components, reducing lifespan or causing complete failure.
NVIDIA's journey from the A100 to the H100 and now the H200/GH200 showcases a clear progression in addressing thermal challenges.
Note: TDP (Thermal Design Power) is a key metric for GPU cooling requirements, measured in watts. It represents the expected maximum heat output under typical workloads. Manufacturers use TDP to design cooling solutions. This standardized measure allows for GPU comparison and determines the necessary cooling capacity.
Air Cooling (A100)
With a TDP of 400W, the A100 could be effectively cooled with air-cooled solutions in most configurations. Key components of this cooling system include:
Heat sink: A structure that attaches to the heat spreader, providing a larger surface area for heat dissipation.
Heat spreader: A layer of material (usually copper or aluminum) that absorbs heat from the GPU die.
Thermal interface material (TIM): A substance that fills the gap between the heat spreader and the GPU die, enhancing heat transfer.
Liquid Cooling (H100/H200)
The H100 and H200, each with a TDP of up to 700W, significantly increased cooling demands, marking a turning point where liquid cooling becomes the primary solution. Advancements include:
Optimized Fin Design: The heat sink fins on the cold plate are meticulously designed to maximize surface area and promote optimal coolant flow.
Larger and Thicker Copper Heat Spreader: Increased surface area allows for better heat dissipation and more efficient transfer of thermal energy to the liquid cooling solution.
Improved Thermal Interface Materials (TIMs): Materials with superior thermal conductivity are used between the GPU die and the heat spreader, as well as between the heat spreader and the cold plate.
Standardized Liquid Cooling Connectors: Compatibility with industry-standard liquid cooling connectors, making it easier to integrate into liquid cooling infrastructure.
Direct-to-Chip Cooling (GB200)
The integrated design of the GB200 (two B200 GPUs and one Grace CPU in a single package) generates a combined TDP of approximately 3kW, demanding cutting-edge cooling solutions. This led to the introduction of direct-to-chip liquid cooling, in which coolant circulates through cold plates mounted directly on the GPU and CPU packages rather than relying on air moving across large heat sinks. This approach:
Minimizes thermal resistance
Maximizes heat transfer efficiency
Enables the GB200 to handle its high TDP and maintain peak performance even under the most demanding workloads
The Future of NVIDIA GPUs
NVIDIA's future GPU generations are likely to advance along several key dimensions:
Increased Compute Power: Expect continued improvements in transistor density and architectural optimizations. Future GPUs will likely feature more specialized hardware accelerators, particularly for AI workloads. The focus will be on increasing TFLOPS/mm², maximizing compute for a given die size as manufacturing approaches physical limits.
Larger and Faster Memory: Continued evolution of HBM technology will provide GPUs with larger memory capacities and higher bandwidth.
Enhanced Interconnects: Enhancements to NVLink and PCIe technologies will be critical for scaling AI workloads across multiple GPUs and nodes. The emphasis will be on increasing GPU-GPU bandwidth and reducing latency for distributed training scenarios.
GPU-CPU Integration: Building on the GB200's design, future generations may explore more unified architectures. This could lead to new programming models that blur the lines between traditional compute and AI acceleration.
Thermal Management: As power density increases, innovations in cooling technology will become essential. Future designs may incorporate advanced on-chip cooling solutions or explore novel system-level thermal management approaches.
Power Efficiency: Improving FLOPS/watt will remain a priority as AI models grow in size and complexity. This may involve more sophisticated power management techniques at both the chip and system levels.
Conclusion
The evolution of NVIDIA's GPU architecture represents a pivotal force in the advancement of AI supercomputing. From the A100 to the H100 and the GB200 Superchip, each generation has brought significant improvements in compute power, memory capacity and bandwidth, interconnect speeds, and energy efficiency. As we look to the future, the challenges of training increasingly large and complex AI models will continue to drive innovation in GPU design and architecture.
To fully leverage the power of NVIDIA's GPU architecture, a robust software ecosystem is essential. In Article 3, "Software Ecosystem", we'll explore the AI Supercluster Software Ecosystem, including NVIDIA CUDA, NCCL, and popular deep learning frameworks like TensorFlow and PyTorch, which enable seamless scaling and optimization of AI workloads across superclusters.