Presented by Supermicro
The team at xAI, partnering with Supermicro and NVIDIA, is building the largest liquid-cooled GPU cluster deployment in the world. It’s a massive AI supercomputer that encompasses over 100,000 NVIDIA HGX H100 GPUs, exabytes of storage and lightning-fast networking, all built to train and power Grok, a generative AI chatbot developed by xAI.
The multi-billion-dollar data facility in Memphis, TN, went from an empty building, lacking the necessary power generators, transformers and multi-hall structure, to a production AI supercomputer in just 122 days. To help the world understand the extraordinary achievement of the xAI Colossus cluster, VentureBeat is excited to share this exclusive, detailed video tour, made possible by Supermicro and produced by ServeTheHome.
[Embedded video tour via Supermicro (@Supermicro_SMCI) on X, October 25, 2024]
Here is a run-down of the highlights for this massive undertaking.
Inside the data hall
When you set out to build the largest AI supercomputer, it is clear from the start that a massive amount of computing power will be needed, ready to install and operational on day one, and that the overall solution will need to be customized to xAI's unique requirements.
The design starts out using a fairly common raised floor data hall, with power located above, and liquid cooling pipes leading to the facility chiller below. Each of the four compute halls has about 25,000 NVIDIA GPUs, plus all the storage, fiber optic high-speed networking, and power built in.

From there, things get more specialized. Every cluster is built from the basic building block of Colossus: the Supermicro liquid-cooled rack. Each rack holds eight Supermicro 4U Universal GPU systems, each with a liquid-cooled NVIDIA HGX H100 8-GPU board and two liquid-cooled x86 CPUs, for a total of 64 NVIDIA Hopper GPUs per rack. Those eight GPU servers, together with a Supermicro coolant distribution unit (CDU) and coolant distribution manifolds (CDMs), make up one rack. The racks are arranged in groups of eight, or 512 GPUs, plus a networking rack, forming miniclusters within the much larger system.
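For readers who want to check the topology math, here is a minimal sketch of the rack and minicluster arithmetic described above. The per-rack and per-group numbers come from the article; the total of roughly 100,000 GPUs is the article's "over 100,000" figure, so the derived rack and group counts are simple estimates, not figures confirmed by xAI.

```python
# Rough topology arithmetic for the Colossus building blocks described above.
import math

GPUS_PER_SERVER = 8        # one NVIDIA HGX H100 8-GPU board per 4U server
SERVERS_PER_RACK = 8       # eight Supermicro 4U Universal GPU systems per rack
RACKS_PER_GROUP = 8        # racks arranged in groups of eight, plus a networking rack

gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK       # 64
gpus_per_group = gpus_per_rack * RACKS_PER_GROUP         # 512 (one "minicluster")

total_gpus = 100_000                                     # "over 100,000" H100 GPUs
racks_needed = math.ceil(total_gpus / gpus_per_rack)     # ~1,563 racks
groups_needed = math.ceil(total_gpus / gpus_per_group)   # ~196 miniclusters

print(f"GPUs per rack:        {gpus_per_rack}")
print(f"GPUs per minicluster: {gpus_per_group}")
print(f"Racks for {total_gpus:,} GPUs:        ~{racks_needed:,}")
print(f"Miniclusters for {total_gpus:,} GPUs: ~{groups_needed:,}")
```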
The Supermicro 4U Universal GPU liquid-cooled servers in the xAI Colossus data center are the densest, most advanced AI servers available today, with a sophisticated liquid-cooling system and the ability to be serviced without removing the systems from the rack.

Next-level liquid-cooled server and rack design
The horizontal 1U rack coolant distribution manifold (CDM) above each server brings in cool liquid and carries away the warmed liquid; quick disconnects make it fast and simple to remove or reinstall the liquid-cooling lines one-handed, giving access to the two trays below. The rack features eight of the Supermicro 4U Universal GPU Systems for liquid-cooled NVIDIA HGX H100 and HGX H200 Tensor Core GPUs. Each system's top tray holds the NVIDIA HGX H100 8-GPU complex, with cold plates on the NVIDIA HGX board cooling the GPUs. The bottom tray holds the motherboard, CPUs, RAM and PCIe switches, with cold plates on the dual-socket CPUs.
Uniquely, Supermicro's motherboard in the bottom tray integrates the four Broadcom PCIe switches found in almost every NVIDIA HGX server today directly on the right-hand side of the board, instead of putting them on a separate board. And unlike other AI servers in the industry, which add liquid cooling to an air-cooled design after it is manufactured, Supermicro's servers are designed from the ground up to be liquid-cooled, with a custom liquid-cooling block. This combination of compact power, accessibility and serviceability makes these systems incredibly scalable and singularly sets Supermicro apart in the industry.

Plus, each CDU has its own monitoring system to keep tabs on flow rate, temperature and other critical functions, all tied into the central management interface. Each CDU also has redundant pumps and power supplies, so if one fails it can be serviced or replaced in minutes without interrupting the running system.
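As a rough illustration of how that kind of telemetry-driven redundancy check can work, here is a minimal Python sketch. The field names, thresholds and telemetry structure are hypothetical, illustrative stand-ins, not part of Supermicro's actual management interface.

```python
# Hypothetical sketch of a CDU health check; fields and thresholds are illustrative
# only and do not reflect Supermicro's real monitoring or management interface.
from dataclasses import dataclass

@dataclass
class CduTelemetry:
    flow_lpm: float        # coolant flow rate, liters per minute
    supply_temp_c: float   # coolant supply temperature, Celsius
    pumps_online: int      # redundant pumps currently running
    psus_online: int       # redundant power supplies currently running

def check_cdu(t: CduTelemetry, min_flow_lpm: float = 100.0, max_temp_c: float = 45.0) -> list[str]:
    """Return a list of alerts to forward to a central management interface."""
    alerts = []
    if t.flow_lpm < min_flow_lpm:
        alerts.append(f"low coolant flow: {t.flow_lpm:.1f} L/min")
    if t.supply_temp_c > max_temp_c:
        alerts.append(f"supply temperature high: {t.supply_temp_c:.1f} C")
    # Redundancy means one failed pump or PSU degrades the CDU without stopping it.
    if t.pumps_online < 2:
        alerts.append("pump redundancy lost: schedule hot-swap service")
    if t.psus_online < 2:
        alerts.append("power supply redundancy lost: schedule hot-swap service")
    return alerts

# Example: one pump has failed, but the rack keeps running while it is replaced.
print(check_cdu(CduTelemetry(flow_lpm=180.0, supply_temp_c=32.5, pumps_online=1, psus_online=2)))
```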
The Supermicro servers still use system fans to cool components like DIMMs, power supplies, low-power baseboard management controllers, NICs and other electronics. To keep each rack cooling-neutral, the server fans pull cooler air from the front and exhaust the warmer air through liquid-cooled rear-door heat exchangers, which also remove the residual heat from Supermicro's liquid-cooled GPU servers as well as from the storage, CPU compute and networking racks. The fans draw far less power than they would in a fully air-cooled server, lowering the total power needed per server.
Networking the Colossus
The data center's gargantuan network runs on the NVIDIA Spectrum-X Ethernet networking platform, which lets it scale its massive AI clusters to an extent no other technology can touch. Spectrum-X is a cutting-edge networking platform that provides the fast, reliable data transfer needed for demanding AI workloads. It offers smarter routing of data, reduced delays and better control of network traffic, along with enhanced AI fabric visibility and monitoring, making it ideal for large AI projects in shared infrastructure environments.
Each cluster uses NVIDIA BlueField-3 SuperNICs, which provide 400 gigabit per second networking. It's the same base Ethernet technology that a desktop network cable uses, but here each optical connection runs at 400GbE, roughly 400 times faster than a typical desktop link. Nine such links per system add up to 3.6Tbps of bandwidth per GPU compute server. The RDMA (Remote Direct Memory Access) network for the GPUs makes up the majority of this bandwidth; each GPU is paired with its own NVIDIA BlueField-3 SuperNIC and Spectrum-X networking technology.
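To make the per-server bandwidth figure concrete, here is a small arithmetic sketch. The link count and speed come from the article; the 1GbE desktop baseline is just the usual point of comparison behind the "400 times faster" claim.

```python
# Back-of-the-envelope per-server network bandwidth, using the figures above.
LINK_SPEED_GBPS = 400      # each BlueField-3 SuperNIC / optical link runs 400GbE
LINKS_PER_SERVER = 9       # nine 400GbE links per GPU compute server
DESKTOP_GBPS = 1           # typical 1GbE desktop link, for comparison only

per_server_tbps = LINK_SPEED_GBPS * LINKS_PER_SERVER / 1000
print(f"Per-server bandwidth: {per_server_tbps:.1f} Tbps")                     # 3.6 Tbps
print(f"Each link vs. desktop Ethernet: {LINK_SPEED_GBPS // DESKTOP_GBPS}x faster")
```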

Beyond the GPU RDMA network, the CPUs also get a 400GbE connection on an entirely separate switch fabric. xAI runs one network for its GPUs and another for the rest of the cluster, a very common design in high-performance computing.
The NVIDIA Spectrum SN5600, a 64-port 800GbE switch, can split its ports to run 128 400-gigabit Ethernet links, ensuring the NVIDIA GPUs and the entire cluster run and scale at maximum performance. It can offload various security protocols, uses advanced flow management to avoid network congestion and handles all of the CPU supercomputer's traffic in the cluster, in one of the first deployments of this type of switch in the world.
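As a quick sanity check on the switch numbers, here is a minimal sketch of the port-breakout arithmetic. The port count and speeds are from the article; the aggregate bandwidth is simply their product.

```python
# SN5600 port-breakout arithmetic: 64 x 800GbE ports split into 2 x 400GbE each.
PORTS = 64
PORT_SPEED_GBPS = 800
BREAKOUT = 2                                   # each 800GbE port splits into two 400GbE links

links_400g = PORTS * BREAKOUT                  # 128 links, matching the figure above
aggregate_tbps = PORTS * PORT_SPEED_GBPS / 1000

print(f"400GbE links after breakout: {links_400g}")
print(f"Aggregate switch bandwidth:  {aggregate_tbps:.1f} Tbps")   # 51.2 Tbps
```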
All told, this massive undertaking eclipses in scale any supercomputer attempted before. We’ll be watching as xAI, together with Supermicro and NVIDIA, continue to push boundaries even further in a new era of supercomputing.
Get a good look inside Colossus! Don’t miss the detailed walk-through in the video above.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact