16/20. AI Supercluster: High-Performance Storage Systems
Introduction
In AI SuperClusters, high-performance storage systems are the backbone of large-scale AI model training. The vast datasets, complex computations, and intense I/O demands of models with trillions of parameters call for storage solutions that deliver exceptional speed, scalability, and reliability. This article explores the advanced storage technologies, products, and practices that support environments with 10,000+ GPUs, focusing on their role in enabling seamless AI model training.
1. The Role of High-Performance Storage in Distributed Training
The process of training large-scale AI models, especially those with trillions of parameters, involves distributed operations across thousands of GPUs. One critical operation during this process is All-Reduce, which occurs during the backpropagation phase of model training. In this phase, gradients computed on each GPU must be aggregated and shared across all GPUs to synchronize the model's parameters. This operation primarily happens in high-bandwidth memory (HBM) on the GPUs and relies on fast network fabrics (e.g., NVIDIA NVLink & InfiniBand). However, high-performance storage plays a crucial supporting role throughout the training pipeline.
How Storage Supports Distributed Training
Several key aspects illustrate how storage systems support distributed training:
Data Loading and Prefetching: Input datasets, often spanning terabytes to petabytes, reside on high-performance storage systems (such as NVMe arrays and distributed file systems). These datasets must be read into GPU memory before training can begin. High-throughput storage ensures that data can be preloaded or prefetched quickly into HBM, reducing I/O bottlenecks. By keeping a steady flow of data, GPUs are never starved during intensive compute operations, including All-Reduce. This ties into the data management strategies discussed in Article 14, which optimize the placement and movement of data to align with compute demands; a minimal prefetching sketch follows this list.
Checkpointing and Fault Recovery: During large-scale model training, periodic checkpointing saves the current state of the model (including weights, gradients, and optimizer states) to high-performance storage. This is critical for fault tolerance; if a failure occurs, training can resume from the last checkpoint, minimizing lost progress. Article 15 delves into advanced checkpointing techniques, such as incremental checkpointing and asynchronous checkpointing, which leverage parallel file systems to reduce I/O burden and maintain uninterrupted training. In this context, NVMe storage plays a crucial role by providing the bandwidth needed for fast checkpointing, integrating seamlessly with these strategies; a checkpointing sketch appears at the end of this section.
Data Partitioning and Shuffling: Distributed training requires partitioning or shuffling input data across multiple GPUs. High-performance storage facilitates rapid data access and partitioning, allowing efficient data distribution in the training pipeline. This process aligns with model partitioning and load balancing techniques from Article 15, which describe how data movement and partitioning strategies minimize inter-node communication during the All-Reduce phase.
Model Storage and Intermediate Data Management: While active training uses GPU memory, the model's parameters, intermediate results, and checkpoints are stored on high-performance storage. This storage is essential for managing intermediate computations that may need to be offloaded due to memory constraints during large-scale training. Techniques like activation checkpointing and mixed-precision training from Article 15 reduce memory footprint, which in turn affects the demands placed on storage systems for checkpointing and data retrieval.
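To make the data-loading path in the first item above concrete, here is a minimal prefetching sketch in PyTorch. It assumes the training shard has already been staged as a flat float32 file on a local NVMe mount; the path, sample shape, worker count, and batch size are illustrative placeholders rather than a prescribed configuration. Background worker processes and pinned host buffers keep batches flowing toward GPU memory while the GPUs compute.

# Minimal prefetching sketch (assumes PyTorch and NumPy are installed and the
# dataset has been staged as a flat float32 file on a local NVMe mount;
# the path, shapes, and sizes are illustrative placeholders).
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class NVMeShardDataset(Dataset):
    """Reads fixed-size samples from a memory-mapped file on local NVMe."""
    def __init__(self, path: str, sample_dim: int = 1024):
        flat = np.memmap(path, dtype=np.float32, mode="r")
        self.samples = flat.reshape(-1, sample_dim)

    def __len__(self):
        return self.samples.shape[0]

    def __getitem__(self, idx):
        # Copy out of the memmap so each worker returns an independent tensor.
        return torch.from_numpy(np.array(self.samples[idx]))

loader = DataLoader(
    NVMeShardDataset("/nvme/train/shard0.bin"),  # hypothetical NVMe-resident shard
    batch_size=64,
    num_workers=8,        # parallel reader processes hide storage latency
    pin_memory=True,      # page-locked host buffers speed up host-to-GPU copies
    prefetch_factor=4,    # each worker keeps several batches in flight
    shuffle=True,
)

for batch in loader:
    batch = batch.cuda(non_blocking=True)  # overlap the copy with GPU compute (requires a GPU)
    # ... forward/backward pass and All-Reduce would run here ...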
By enabling rapid data access, efficient checkpointing, and seamless data partitioning, high-performance storage ensures that distributed training processes like All-Reduce can proceed without bottlenecks.
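The checkpointing path can be sketched in a similar spirit. The example below shows one common pattern, stated as an illustration rather than the method used by any particular framework: model and optimizer state are written synchronously to fast local NVMe, then copied to a shared parallel file system in a background thread so the GPUs can resume work immediately. The /nvme and /pfs paths are placeholders.

# Hedged checkpointing sketch (assumes PyTorch; the /nvme and /pfs paths are
# placeholders for local NVMe and a shared parallel file system).
import os
import shutil
import threading
import torch

def save_checkpoint(model, optimizer, step,
                    local_dir="/nvme/ckpt",        # placeholder: fast local NVMe
                    shared_dir="/pfs/job/ckpt"):   # placeholder: parallel file system
    os.makedirs(local_dir, exist_ok=True)
    os.makedirs(shared_dir, exist_ok=True)
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    local_path = os.path.join(local_dir, f"step_{step}.pt")
    # Fast synchronous write to local NVMe keeps the GPUs idle as briefly as possible.
    torch.save(state, local_path)

    # Copy to the shared parallel file system in the background so training
    # resumes while the slower, durable write completes.
    def _upload():
        shutil.copy(local_path, os.path.join(shared_dir, f"step_{step}.pt"))

    threading.Thread(target=_upload, daemon=True).start()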
2. Challenges of AI Storage Systems in SuperClusters
The role of high-performance storage in AI training must address several critical challenges:
I/O Throughput: Storage must sustain extremely high I/O operations per second (IOPS) and aggregate throughput reaching hundreds of gigabytes per second.
Latency: Low-latency access is crucial for data prefetching, checkpointing, and model recovery, preventing stalls in training due to data unavailability. The optimization algorithms and parallel training techniques discussed in Article 15 emphasize the importance of minimizing delays caused by data access and synchronization.
Scalability: Storage systems must efficiently scale with the growth of AI datasets, which can expand to petabyte scales.
These challenges necessitate a combination of advanced storage technologies, such as NVMe, in-memory caching, SANs, and distributed file systems, along with optimized orchestration of data movement.
3. Key Storage Technology: NVMe (Non-Volatile Memory Express)
NVMe storage is a cornerstone in AI SuperClusters due to its exceptional speed and efficiency. NVMe drives leverage high-speed PCIe interfaces, providing hundreds of thousands of IOPS, throughput rates of several gigabytes per second, and latency as low as 20 microseconds. This performance is vital for parallel GPU training, which requires rapid and low-latency data retrieval.
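As a rough way to sanity-check such figures on a given node, the short sketch below measures sequential read throughput from a file on a local NVMe mount using plain buffered reads. The file path is a placeholder, and a serious benchmark would use a dedicated tool with direct I/O and explicit queue-depth control; this sketch also includes page-cache effects unless the file is much larger than RAM.

# Rough sequential-read throughput check for a local NVMe mount (the file
# path is a placeholder; results include page-cache effects unless the file
# is larger than RAM or the cache is dropped first).
import time

def measure_read_throughput(path: str, block_size: int = 8 * 1024 * 1024) -> float:
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e9  # GB/s

if __name__ == "__main__":
    print(f"{measure_read_throughput('/nvme/train/shard0.bin'):.2f} GB/s")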
NVMe in NVIDIA DGX Systems
The NVIDIA DGX H100 features a built-in NVMe SSD storage array connected via PCIe Gen5, providing high-speed local storage. This storage is primarily used for caching training data, intermediate results, and model checkpoints, enabling rapid access to frequently used information. While this local storage is not part of the GPU memory fabric (HBM), the system relies on NVLink and NVSwitch for high-speed GPU-to-GPU transfers within a node and on InfiniBand networking for data movement between nodes in multi-node configurations.
The NVIDIA GB200 NVL72 utilizes a distributed NVMe storage architecture where NVMe drives are directly attached to each compute node for rapid data access. These nodes leverage high-speed NVLink interconnects to facilitate extremely fast communication and data transfer between GPUs and CPUs, both within and across nodes.
NVMe Integration with External Storage Solutions: DGX systems can access external NVMe storage arrays using the NVMe-over-Fabrics (NVMe-oF) protocol. This extends NVMe's high-speed, low-latency performance over a network using RDMA (Remote Direct Memory Access) over Ethernet or InfiniBand. This method aligns with the data-aware job scheduling discussed in Article 14, which emphasizes the strategic placement of data to optimize access times.
Vendors and Products: Solutions such as Pure Storage FlashArray, Dell EMC PowerStore, and NetApp AFF A-Series provide high-speed NVMe storage options. These products offer features such as proprietary flash modules (e.g., DirectFlash), integrated data management tools, and support for NVMe-oF architectures, enhancing the training pipeline's performance and efficiency.
4. Storage Architectures: SAN and Distributed File Systems
In addition to NVMe and in-memory caching, AI SuperClusters rely on sophisticated storage architectures to facilitate data access and distribution.
Storage Area Networks (SANs)
SANs provide block-level storage with low latency and high throughput, critical for multi-node AI training environments. Using NVMe-over-Fabrics (NVMe-oF), SANs can extend NVMe's performance across the network, supporting large-scale data transfer.
Solutions like Dell EMC PowerMax and HPE Primera support NVMe-oF and provide fast, scalable storage solutions tailored for AI training workloads.
NVIDIA DGX systems connect to SANs using high-speed network interfaces, such as Mellanox InfiniBand adapters, and access NVMe storage arrays over the network.
Distributed File Systems
Distributed file systems, such as IBM Spectrum Scale (GPFS) and Lustre, facilitate shared access to large datasets across thousands of nodes. These systems handle massive I/O workloads with low latency, supporting data loading, checkpointing, and synchronization in multi-node AI training setups. The checkpointing techniques covered in Article 15, including incremental checkpointing, benefit from the capabilities of distributed file systems to offload data effectively.
NVIDIA DGX systems mount distributed file systems through NICs and InfiniBand interfaces, using software like NVIDIA GPUDirect Storage to enable direct data transfer between storage and GPU memory, bypassing the CPU to reduce latency.
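A hedged sketch of this direct storage-to-GPU path is shown below, assuming the RAPIDS kvikio and cupy packages are installed and the underlying file system supports GPUDirect Storage (kvikio falls back to a bounce-buffer path when it does not). The file path and buffer size are placeholders.

# Hedged GPUDirect Storage sketch using kvikio (assumes kvikio and cupy are
# installed; /pfs/... is a placeholder path on a GDS-capable file system).
import cupy as cp
import kvikio

def read_into_gpu(path: str, nbytes: int) -> cp.ndarray:
    buf = cp.empty(nbytes, dtype=cp.uint8)   # destination buffer in GPU memory
    f = kvikio.CuFile(path, "r")
    try:
        future = f.pread(buf)                # asynchronous read, storage -> GPU
        bytes_read = future.get()            # block until the read completes
    finally:
        f.close()
    return buf[:bytes_read]

# Example: pull a 256 MiB dataset shard straight into device memory.
shard = read_into_gpu("/pfs/dataset/shard0.bin", 256 * 1024 * 1024)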
Network Interfaces and Storage Expansion
NVIDIA DGX systems use InfiniBand adapters (200 Gb/s+) for ultra-fast data transfers between systems and storage. High-speed Ethernet (10/40/100 GbE) connections also provide access to NAS systems or distributed file systems over NFS. At the rack level, additional NVMe storage appliances (e.g., Pure Storage FlashArray) expand NVMe capacity, providing high-speed access to all nodes within the cluster.
5. Caching, Prefetching, and Data Movement
Caching and Prefetching
Caching strategies utilizing NVMe drives strike a balance between the high cost of RAM and the need for fast access speeds. Intelligent prefetching predicts data access patterns and preloads data into faster storage tiers (NVMe, RAM) to ensure smooth data availability during training operations like All-Reduce. This approach minimizes waiting times for data and supports the parallel training techniques and 3D parallelism approaches discussed in Article 15.
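The following sketch illustrates the idea with a small, hypothetical cache manager: shards requested by the training loop are served from a local NVMe cache directory when present and otherwise copied in from slower shared storage, with least-recently-used eviction keeping the cache within a fixed budget. The paths and capacity figure are placeholders.

# Hypothetical NVMe caching tier with LRU eviction (paths and capacity are
# placeholders; a real system would also track in-flight copies and validate
# cached shards with checksums).
import os
import shutil
from collections import OrderedDict

class NVMeCache:
    def __init__(self, slow_dir="/pfs/dataset", cache_dir="/nvme/cache",
                 capacity_bytes=2 * 1024**4):  # 2 TiB budget (placeholder)
        self.slow_dir = slow_dir
        self.cache_dir = cache_dir
        self.capacity = capacity_bytes
        self.used = 0
        self.lru = OrderedDict()              # shard name -> size in bytes
        os.makedirs(cache_dir, exist_ok=True)

    def fetch(self, shard: str) -> str:
        """Return a local NVMe path for `shard`, copying it in if needed."""
        cached = os.path.join(self.cache_dir, shard)
        if shard in self.lru:
            self.lru.move_to_end(shard)       # mark as recently used
            return cached
        src = os.path.join(self.slow_dir, shard)
        size = os.path.getsize(src)
        while self.used + size > self.capacity and self.lru:
            old, old_size = self.lru.popitem(last=False)   # evict coldest shard
            os.remove(os.path.join(self.cache_dir, old))
            self.used -= old_size
        shutil.copy(src, cached)              # promote: slow tier -> NVMe tier
        self.lru[shard] = size
        self.used += size
        return cached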
Data Movement and Orchestration
High-performance storage systems facilitate seamless data movement between tiers (e.g., NVMe, RAM) based on usage patterns. Orchestrating data in this way is essential for optimizing training workflows, ensuring rapid checkpointing, and integrating distributed data management strategies, echoing the data locality principles from Article 14.
6. Emerging Storage Technology: NVMe-over-Fabrics (NVMe-oF)
NVMe-oF extends the performance of NVMe storage over a network, allowing remote NVMe devices to be accessed by compute nodes as if they were local. This technology provides low-latency, high-throughput access to external NVMe storage, enhancing the flexibility and scalability of storage resources within a SuperCluster environment. Note: NVMe-oF does not replace NVMe within the rack but instead extends and complements the capabilities of NVMe storage in the data center.
By utilizing NVMe-oF, NVIDIA systems like the DGX H100 or GB200 NVL72 can tap into centralized NVMe storage pools housed in separate storage nodes. This setup bypasses the need for extensive local NVMe storage within each compute node, allowing for more efficient distribution and utilization of storage resources across the entire cluster.
Interface: NVMe-oF enables DGX systems to connect to external storage using high-speed network interfaces, such as InfiniBand adapters or 100/200 GbE Ethernet. Using RDMA (Remote Direct Memory Access), DGX systems can perform high-speed reads and writes to remote NVMe SSDs with minimal latency, effectively treating these external devices as an extension of their local storage.
Vendors: Pure Storage, NetApp, and Dell EMC provide NVMe-oF-enabled storage solutions, offering centralized, high-speed storage arrays that can be accessed by multiple compute nodes across the network.
NVMe-oF addresses the need for scalable, low-latency access to storage in distributed training environments. It allows AI workloads to scale without being limited by the internal storage capacity of individual compute nodes.
Hyperscalers like AWS, Microsoft Azure, and Google Cloud have incorporated NVMe-oF into their cloud offerings, providing customers with fast, network-attached NVMe storage options that can support large-scale AI training workloads.
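To ground the interface description above, the sketch below drives the standard nvme-cli connect command from Python to attach a remote NVMe-oF namespace. The transport, address, and NQN values are placeholders that would come from the storage array's configuration; the command requires root privileges, nvme-cli, and the appropriate NVMe transport kernel module (nvme-rdma or nvme-tcp).

# Hedged sketch: attaching a remote NVMe-oF namespace with nvme-cli, driven
# from Python (requires root and nvme-cli; the transport, address, and NQN
# below are placeholders for values supplied by the storage array).
import subprocess

def connect_nvme_of(traddr: str, nqn: str, transport: str = "rdma",
                    trsvcid: str = "4420") -> None:
    subprocess.run(
        ["nvme", "connect",
         "-t", transport,      # rdma (InfiniBand/RoCE) or tcp
         "-a", traddr,         # target IP address
         "-s", trsvcid,        # target service id / port
         "-n", nqn],           # NVMe Qualified Name of the subsystem
        check=True,
    )
    # After a successful connect, the remote namespace appears as a local
    # block device and shows up in the namespace listing below.
    subprocess.run(["nvme", "list"], check=True)

connect_nvme_of("192.0.2.10", "nqn.2016-06.io.example:storage-pool-1")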
Conclusion
High-performance storage systems are essential for AI SuperClusters, providing the infrastructure necessary to manage large-scale datasets and ensure smooth, efficient training. Technologies such as NVMe storage, SANs, and distributed file systems, complemented by emerging solutions like NVMe-oF, address the challenges of I/O throughput, latency, and scalability.
This article connects the storage strategies discussed in Article 14 on scaling data management with the training optimization techniques in Article 15, offering a comprehensive view of how storage underpins multi-trillion-parameter model training. The upcoming Article 17, “Orchestrating Training”, will examine how storage, data movement, and resource allocation are managed dynamically to maintain training efficiency at scale in real-time environments.