6/20. AI Supercluster: NVIDIA DGX Power, Cooling, and Efficiency
Introduction
This article provides a comprehensive overview of the power and cooling requirements for the NVIDIA DGX H100. We will explore:
Power system requirements and specifications
Power distribution and supply
Cooling systems
Power Usage Effectiveness (PUE) and Performance Efficiency
Understanding these aspects is essential for effectively deploying and managing DGX H100 systems in data center environments.
Power System and Requirements
The NVIDIA DGX H100 is engineered to deliver exceptional performance for AI and high-performance computing (HPC) workloads, necessitating a robust power delivery system. Understanding its power requirements is crucial for proper integration into data center environments.
Power Input Specifications
The DGX H100 is designed to connect directly to the data center's power infrastructure:
Input Voltage: 200-240V AC, 50/60 Hz
Maximum Power Consumption: Up to 10.2 kW at full load
Typical Power Consumption: Between 6.5 and 8.5 kW, varying based on workload intensity
Power Connections: Four separate power inputs, one for each internal 3000W power supply
Note: The DGX H100 does not come with its own Power Distribution Units (PDUs). Data center operators are responsible for providing appropriate power distribution infrastructure.
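As a rough sanity check, these figures translate directly into per-feed circuit sizing. The sketch below is a minimal illustration assuming 208V single-phase outlets and the common 80% continuous-load derating for breakers; actual voltages and derating rules depend on the facility and local electrical code.

```python
# Minimal circuit-sizing sketch for the four DGX H100 power feeds.
# Assumptions (not from the DGX spec): 208 V single-phase outlets,
# 80% continuous-load derating on branch circuits.

PSU_RATING_W = 3000      # each of the four internal PSUs
FEED_VOLTAGE_V = 208     # assumed outlet voltage within the 200-240 V range
DERATING = 0.80          # assumed continuous-load derating factor

current_per_feed_a = PSU_RATING_W / FEED_VOLTAGE_V
min_breaker_a = current_per_feed_a / DERATING

print(f"Current per feed at full PSU load: {current_per_feed_a:.1f} A")
print(f"Minimum breaker rating (80% rule): {min_breaker_a:.1f} A")
# -> roughly 14.4 A per feed, so a 20 A branch circuit per PSU input is a
#    typical choice; a single DGX H100 therefore needs four such circuits.
```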
Internal Power Supply System
4x 3000W Internal Power Supply Units (PSUs): The DGX H100 is equipped with four 3000-watt power supplies housed within its 8U chassis, which share the system load during normal operation.
Redundant Configuration: The PSUs operate in a redundant configuration; if one fails, its load shifts automatically to the remaining supplies, preserving operational continuity.
Hot-Swappable Design: The internal PSUs can be replaced or serviced while the system is running, minimizing downtime for critical AI workloads.
Power Distribution and Supply in DGX H100 Deployment
The power infrastructure for DGX H100 systems involves connecting the data center power distribution to the DGX H100 system's internal power supplies. Data centers typically use three-phase power, consisting of three alternating currents offset by one-third of a cycle, for its efficiency and high power density.
Power Distribution Units (PDUs) in the data center distribute the three-phase power and provide appropriate single-phase 200-240V AC outlets for each of the DGX H100's four 3000W Power Supply Units (PSUs). These PSUs then convert the AC input into the DC power required by the DGX H100's internal components.
The power flow follows this path: Main Data Center Power (three-phase) → PDU → DGX H100 PSUs → DGX H100 Internal Components. For optimal redundancy and load balancing, the four PSUs of a DGX H100 should ideally connect to at least two separate PDUs, with their loads distributed across the three power phases. This configuration ensures fault tolerance at both the data center level (through redundant PDUs) and the system level (through multiple PSUs).
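To make the redundancy and phase-balancing guidance concrete, here is a small illustrative sketch that assigns the four PSU feeds round-robin across two PDUs and three phases. The PDU and phase labels are hypothetical; a real deployment would follow the facility's own naming and balancing policy.

```python
from itertools import cycle

# Hypothetical labels -- real sites use their own PDU and phase naming.
pdus = ["PDU-A", "PDU-B"]
phases = ["L1", "L2", "L3"]
psu_feeds = [f"PSU-{i}" for i in range(1, 5)]   # four feeds per DGX H100

pdu_cycle, phase_cycle = cycle(pdus), cycle(phases)
assignment = {psu: (next(pdu_cycle), next(phase_cycle)) for psu in psu_feeds}

for psu, (pdu, phase) in assignment.items():
    print(f"{psu} -> {pdu}, phase {phase}")
# Each PDU ends up carrying two feeds and no phase carries more than two,
# spreading the electrical load across the infrastructure as described above.
```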
DGX H100 Cooling System
The NVIDIA DGX H100 employs a sophisticated hybrid cooling system to manage its significant heat output efficiently. This system combines direct-to-chip liquid cooling and traditional air cooling, working in tandem to ensure optimal thermal management across all components.
The primary method for cooling high-heat components, especially the GPUs, is direct-to-chip liquid cooling. This advanced system is capable of handling up to 5,600W of heat dissipation from the GPUs alone. It utilizes cold plates mounted directly on the GPUs, ensuring efficient and immediate heat transfer from these critical components.
Complementing the liquid cooling, the DGX H100 incorporates an air cooling system with a capacity of up to 4,600W that uses high-efficiency fans positioned at the front of the system. These fans pull cool air through the chassis to maintain safe operating temperatures for non-GPU components such as CPUs, RAM, and storage.
The DGX H100's thermal management is further enhanced by several intelligent features. Dynamic fan speed control adjusts operation based on system load and internal temperatures, optimizing cooling efficiency and reducing unnecessary noise and power consumption. Continuous temperature monitoring of key components ensures that the system always operates within safe thermal limits. In extreme conditions, the DGX H100 can employ thermal throttling, adjusting performance dynamically to maintain safe operating temperatures.
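The dynamic fan control described above can be pictured as a simple proportional curve in which fan duty rises between an idle temperature and a maximum temperature. This is only a conceptual sketch; the actual DGX firmware uses its own control logic, and the thresholds below are illustrative assumptions.

```python
def fan_duty_percent(temp_c: float,
                     idle_temp_c: float = 40.0,   # assumed: below this, fans idle
                     max_temp_c: float = 85.0,    # assumed: at or above this, fans run flat out
                     idle_duty: float = 30.0) -> float:
    """Illustrative proportional fan curve, not the DGX firmware's actual policy."""
    if temp_c <= idle_temp_c:
        return idle_duty
    if temp_c >= max_temp_c:
        return 100.0
    # Linear ramp between idle duty and full speed.
    span = (temp_c - idle_temp_c) / (max_temp_c - idle_temp_c)
    return idle_duty + span * (100.0 - idle_duty)

for t in (35, 50, 70, 90):
    print(f"{t} C -> {fan_duty_percent(t):.0f}% fan duty")
```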
Data Center Cooling Integration
Integrating the DGX H100's advanced cooling system with existing data center infrastructure presents both challenges and opportunities for optimization. Many existing data centers are not originally designed to accommodate liquid cooling systems, necessitating significant modifications to incorporate the required plumbing, heat exchangers, and coolant distribution units.
One of the primary challenges is implementing a coolant distribution system that can efficiently serve multiple DGX H100 units while maintaining proper flow rates and temperatures across all systems. This often requires careful planning and may involve the installation of new, dedicated cooling loops within the data center.
Heat rejection is another significant consideration. The substantial heat load generated by a cluster of DGX H100 systems may necessitate upgrades to existing chiller systems or the addition of cooling towers or dry coolers. These upgrades must be carefully planned to ensure they can handle both the current heat load and potential future expansions.
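For capacity planning, the aggregate heat load of a DGX H100 cluster can be estimated directly from the power figures given earlier, since essentially all electrical power drawn by the systems ends up as heat. The sketch below assumes a hypothetical 16-system cluster and a 10% planning margin; both values are illustrative.

```python
# Rough heat-rejection sizing for a hypothetical DGX H100 cluster.
NUM_SYSTEMS = 16          # assumed cluster size
MAX_POWER_KW = 10.2       # per-system maximum from the specifications above
SAFETY_MARGIN = 1.10      # assumed 10% planning margin

heat_load_kw = NUM_SYSTEMS * MAX_POWER_KW * SAFETY_MARGIN
tons_refrigeration = heat_load_kw / 3.517   # 1 ton of refrigeration ~ 3.517 kW

print(f"Design heat load: {heat_load_kw:.0f} kW "
      f"(~{tons_refrigeration:.0f} tons of refrigeration)")
# -> roughly 180 kW, or about 51 tons, before accounting for room losses
#    and headroom for future expansion.
```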
Power Usage Effectiveness (PUE) and Performance Efficiency
In the realm of high-performance computing and data center operations, two critical metrics stand out: Power Usage Effectiveness (PUE) and Performance Efficiency. While distinct in their focus, both play crucial roles in evaluating and optimizing the overall efficiency of systems like the NVIDIA DGX H100.
Understanding Power Usage Effectiveness (PUE)
Power Usage Effectiveness is the cornerstone metric for assessing data center infrastructure efficiency. It provides insight into how effectively a facility delivers power to its IT equipment. Calculated as the ratio of total facility energy to IT equipment energy, PUE offers a clear picture of a data center's overhead energy consumption.
A perfect PUE of 1.0 would indicate that all energy consumed by the facility is used directly by IT equipment - an ideal scenario rarely achieved in practice. Most data centers operate with PUE values between 1.1 and 3.0, with lower values signifying better efficiency. For instance, a PUE of 1.5 implies that for every 1.5 watts of power drawn by the facility, 1 watt reaches the IT equipment.
Several factors influence PUE, with cooling systems often being the largest contributor to overhead energy use. Efficient power delivery systems and innovative cooling solutions can significantly impact a data center's PUE. However, it's crucial to understand that PUE does not reflect the computational efficiency of the IT equipment itself; it solely measures the efficiency of power delivery to that equipment.
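Because PUE is a simple ratio, it can be computed directly from metered facility and IT energy. The sketch below reuses the 1.5 example from above; the energy figures are illustrative, not measurements from any particular facility.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness = total facility energy / IT equipment energy."""
    return total_facility_kwh / it_equipment_kwh

# Illustrative figures matching the PUE 1.5 example in the text.
total_kwh = 1_500_000    # assumed total facility energy over some period
it_kwh = 1_000_000       # assumed IT equipment energy over the same period

value = pue(total_kwh, it_kwh)
print(f"PUE = {value:.2f}")                                   # 1.50
print(f"Overhead energy = {value - 1:.0%} of the IT load")    # 50%
```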
Decoding Performance Efficiency
While PUE focuses on facility-level efficiency, Performance Efficiency zeros in on the computational output of IT equipment relative to its power consumption. For high-performance systems like the DGX H100, this is typically quantified in TFLOPS/kW (Tera Floating Point Operations Per Second per Kilowatt).
Performance Efficiency provides a direct measure of an IT system's computational capabilities per unit of power consumed. Higher TFLOPS/kW values indicate greater computational power for the energy invested. Importantly, the relationship between power consumption and computational output is not linear, leading to varying efficiency at different power levels.
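Performance Efficiency follows directly from peak throughput and power draw. The sketch below reproduces the maximum-power figure quoted in the list that follows, assuming a system-level peak of roughly 8,000 TFLOPS of dense FP8 throughput; treat that throughput value as an assumption rather than an official specification.

```python
def tflops_per_kw(peak_tflops: float, power_kw: float) -> float:
    """Performance efficiency: computational throughput per unit of power."""
    return peak_tflops / power_kw

PEAK_FP8_TFLOPS = 8_000   # assumed system-level dense FP8 throughput (illustrative)
MAX_POWER_KW = 10.2       # maximum power consumption from the spec above

print(f"{tflops_per_kw(PEAK_FP8_TFLOPS, MAX_POWER_KW):.0f} TFLOPS/kW at full load")
# -> ~784 TFLOPS/kW, matching the maximum-power figure quoted below.
```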
The DGX H100 showcases this nuanced efficiency profile:
At maximum power consumption (10.2 kW): Up to 784 TFLOPS/kW for AI workloads (FP8)
In typical usage scenarios: Efficiency can often be higher, potentially between 941 and 1,230 TFLOPS/kW
This non-linear efficiency curve demonstrates that the DGX H100 often achieves its best performance efficiency at power levels below its maximum. This characteristic aligns with the system's energy-proportional design, allowing it to optimize efficiency across various workload intensities.
Energy-proportional design ensures that the system's power consumption scales dynamically with its computational load, taking advantage of the higher efficiency often found at lower power levels. For less intensive tasks, the DGX H100 doesn't unnecessarily draw maximum power, instead adjusting its energy use to match the demands of the current workload. This dynamic adjustment significantly improves overall energy efficiency and reduces power waste during periods of lower computational intensity.
Specific features of the DGX H100 that contribute to its energy-proportional design include:
Dynamic voltage and frequency scaling (DVFS) of its GPUs and CPUs
Intelligent power management that can selectively power down unused components
Adaptive cooling system that adjusts based on current thermal output
These features allow the DGX H100 to maintain high performance efficiency (TFLOPS/kW) across a wide range of workloads, from low-intensity tasks to full-scale AI training operations.
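The non-linear efficiency curve described above can be illustrated with a toy DVFS model in which throughput scales with clock frequency while dynamic power scales roughly with the cube of the clock fraction. The model and its coefficients are purely illustrative and are not derived from DGX measurements.

```python
# Toy DVFS model (illustrative only, not measured DGX behavior).
STATIC_POWER_KW = 2.0     # assumed load-independent power floor
DYNAMIC_POWER_KW = 8.2    # assumed dynamic power at full clocks (10.2 kW total)
PEAK_TFLOPS = 8_000       # assumed full-clock throughput (see earlier sketch)

def power_kw(clock_fraction: float) -> float:
    return STATIC_POWER_KW + DYNAMIC_POWER_KW * clock_fraction ** 3

def efficiency(clock_fraction: float) -> float:
    return (PEAK_TFLOPS * clock_fraction) / power_kw(clock_fraction)

for f in (1.0, 0.9, 0.8):
    print(f"clock fraction {f:.1f}: {power_kw(f):.1f} kW, {efficiency(f):.0f} TFLOPS/kW")
# Under these assumptions, backing off from full clocks lowers power faster
# than throughput, which is why typical-load efficiency can exceed the
# full-load figure.
```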
The Interplay Between PUE and Performance Efficiency
While PUE and Performance Efficiency are separate metrics, they both contribute significantly to overall data center efficiency. Improved Performance Efficiency means more computation per watt of IT equipment energy, while better PUE ensures more of the facility's energy reaches the IT equipment.
High-performance, efficient systems like the DGX H100 can indirectly contribute to PUE improvements. By generating less heat for a given computational output, these systems can reduce cooling requirements, potentially lowering the total facility energy consumption. However, it's important to note that this is an indirect effect; PUE itself remains a measure of power delivery efficiency, not computational efficiency.
The synergy between PUE and Performance Efficiency becomes evident in modern data center design and operation. By focusing on both metrics, facilities can:
Maximize computational output
Minimize overall energy consumption
Optimize cooling and power delivery systems
Achieve a balance between infrastructure efficiency and computational power
In conclusion, while PUE provides insights into facility-level efficiency, Performance Efficiency metrics like TFLOPS/kW offer a window into the capabilities of individual systems like the DGX H100. Optimizing both metrics is key to achieving maximum computational power with minimal energy footprint.
DGX A100 vs. DGX H100: Power and Performance Efficiency Comparison
Power Consumption
DGX A100: Maximum power consumption of 6.5 kW
DGX H100: Maximum power consumption of 10.2 kW
Difference: The DGX H100 consumes about 57% more power at peak load
Performance Efficiency
DGX A100: Approximately 52 TFLOPS/kW (FP16)
DGX H100: Approximately 78 TFLOPS/kW (FP16)
Improvement: The DGX H100 offers about 50% better performance per watt
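The comparison percentages above follow directly from the listed figures; a quick sketch of the arithmetic:

```python
# Derive the comparison figures from the values listed above.
a100_power_kw, h100_power_kw = 6.5, 10.2
a100_eff, h100_eff = 52, 78               # TFLOPS/kW (FP16), as listed

power_increase = (h100_power_kw / a100_power_kw - 1) * 100
efficiency_gain = (h100_eff / a100_eff - 1) * 100

print(f"Peak power increase: {power_increase:.0f}%")                  # ~57%
print(f"Performance-per-watt improvement: {efficiency_gain:.0f}%")    # 50%
```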
Overall Efficiency Gain
While the DGX H100 uses more power, its performance efficiency gain means it can deliver significantly more computational power for a given energy input.
This efficiency is crucial for data centers aiming to maximize computational output while managing power and cooling constraints.
Real-World PUE Considerations for DGX H100 Deployments
Cooling Efficiency
The DGX H100's liquid cooling system can handle up to 5,600W of GPU heat more efficiently than air cooling.
This can lead to a cooling power reduction of up to 30% compared to traditional air cooling for the same heat load.
Power Supply Efficiency
DGX H100 power supplies are up to 94% efficient at typical loads.
This high efficiency can help maintain a lower PUE by reducing waste heat from power conversion.
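The 94% conversion efficiency has a direct thermal consequence: the difference between wall power and delivered DC power becomes heat inside the chassis and counts against PUE. A quick sketch, assuming an 8.5 kW wall draw taken from the high end of the typical-consumption range given earlier:

```python
# Conversion-loss sketch using the typical-load figures given earlier.
WALL_POWER_KW = 8.5        # assumed AC draw at the high end of typical load
PSU_EFFICIENCY = 0.94      # efficiency figure quoted above

dc_power_kw = WALL_POWER_KW * PSU_EFFICIENCY
conversion_loss_kw = WALL_POWER_KW - dc_power_kw

print(f"Delivered DC power: {dc_power_kw:.2f} kW")
print(f"Heat from AC/DC conversion: {conversion_loss_kw:.2f} kW")
# -> roughly half a kilowatt of conversion heat per system at typical load,
#    which the cooling system must also remove.
```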
Workload Density
A single DGX H100 can replace multiple previous-generation systems for the same workload.
While individual system power consumption is higher, overall data center efficiency may improve due to reduced cooling and infrastructure needs per unit of computation.
Conclusion
The NVIDIA DGX H100 represents a significant leap in AI and HPC capabilities, but it also presents unique challenges and opportunities for data center infrastructure. Key takeaways from this overview include:
The DGX H100's power requirements are substantial, with a maximum consumption of 10.2 kW per unit, necessitating robust power distribution systems.
Despite higher power draw, the DGX H100 offers improved performance efficiency at 78 TFLOPS/kW, a 50% increase over its predecessor.
The hybrid cooling system, combining direct-to-chip liquid cooling for GPUs and air cooling for other components, is crucial for managing the system's thermal output.
Integrating DGX H100 systems into existing data centers requires careful planning, particularly in adapting to liquid cooling infrastructure.
By addressing these aspects holistically, AI Infrastructure Engineers can fully leverage the DGX H100's capabilities while maintaining efficient and reliable data center operations.
Looking ahead to our next article, "Article 7: Evolution of NVIDIA DGX Platforms," we'll explore how the DGX platform has evolved from generation to generation.