In recent years, the AI wave has swept the world and become the focus of attention across society. When people discuss AI, they often mention AI computing clusters. The three core elements of AI are computing power, algorithms, and data. The AI computing cluster is currently the most important source of computing power. It functions like a super power plant, providing a steady stream of energy for the AI boom.
So what exactly does an AI computing cluster consist of? Why can it deliver such powerful computing performance? What is its internal structure? What key technologies does it involve?
What Is an AI Computing Cluster?
An AI computing cluster, as the name implies, is a cluster system that provides computing power for AI tasks. A “cluster” refers to a group of independent devices connected via a high-speed network.
There is also a definition online that describes an AI computing cluster as “a distributed computing system formed by interconnecting a large number of high-performance computing nodes (such as GPU or TPU servers) through a high-speed network.”
Understanding AI Computing Power: From Chips to Clusters
AI computing mainly involves two kinds of tasks: training and inference. Training involves far more computation and is much more demanding, so it needs powerful computing resources. Inference, by comparison, requires less computation and less power.
Both training and inference involve heavy matrix operations, such as convolution and tensor multiplication. These tasks can be split into independent subtasks for parallel processing. That’s why chips designed for parallel computing like GPUs, NPUs, and TPUs have become the main tools for AI computing. These are collectively known as AI chips.
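To make the idea concrete, here is a minimal sketch in Python with NumPy (purely illustrative; the matrix sizes and the 8-way split are arbitrary) of how one large matrix multiplication breaks into independent tiles that could each be handed to a different compute unit:

```python
import numpy as np

# Two large matrices whose product we want to compute.
A = np.random.rand(1024, 1024)
B = np.random.rand(1024, 1024)

# Split A into horizontal tiles. Each tile's product with B is an
# independent subtask: no tile needs any other tile's result, so the
# work can run in parallel on many small compute units.
tiles = np.split(A, 8, axis=0)            # 8 independent row blocks
partial_results = [tile @ B for tile in tiles]

# Stitching the partial results back together gives the full product.
C = np.vstack(partial_results)
assert np.allclose(C, A @ B)
```

GPUs, NPUs, and TPUs are built around exactly this kind of massively parallel workload.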
AI chips are the basic units that provide AI computing power. A single chip alone isn’t enough; it must be integrated onto a circuit board.
When AI chips are embedded into smartphone motherboards or built into the SoC main chip, they provide AI computing power to the device. Similarly, integrating them into IoT modules enables AI power for cars, robotic arms, AGVs, cameras, and other smart devices. These are examples of end-side computing.
When AI chips are built into base stations, routers, and gateways, they provide edge computing power.
These devices are typically small, with just one AI chip and limited computing power. They’re mainly used for basic inference tasks.
To handle more complex training tasks, we need a hardware platform that can host more AI chips.
This is done by installing AI chips onto computing boards and placing multiple boards into a server, creating what we call an AI server. Strictly speaking, there is no purpose-built AI server as such; an AI server is simply a regular server equipped with AI computing boards.
Typically, AI servers host about eight boards, though some configurations go up to twenty. However, due to heat and power constraints, going beyond that is rarely practical.
At this stage, the computing power of the AI server increases significantly. It’s ideal for inference tasks and can handle smaller training jobs.
Thanks to DeepSeek’s architectural and algorithmic optimizations, large models now require less computing power. As a result, many manufacturers have developed single-rack computing devices, which include several AI servers, storage units, and power supplies. These all-in-one machines meet the private deployment needs of many enterprise users and are selling well.
Whether it is an AI server or an all-in-one machine, the available computing power is still limited. Training large models with truly massive parameter counts (hundreds of billions or even trillions) demands far more.
That’s why we need to build systems that include many more AI chips. In other words, truly large-scale AI computing clusters.
These days, we often hear terms like “10,000-card scale” or “100,000-card scale”. This refers to building AI computing clusters that include ten thousand or even one hundred thousand AI computing boards or chips.
So, what do we do about this? The answer is simple: Scale Up and Scale Out.
What Is Scale Up?
“Scale” means expansion. Those who have worked in cloud computing are likely familiar with this term.
Scale Up refers to upward expansion, also known as vertical scaling, where you increase the resources within a single node.
Scale Out, or horizontal scaling, means increasing the number of nodes.
In cloud computing, the reverse of Scale Up is Scale Down (vertical reduction), and the reverse of Scale Out is Scale In (horizontal reduction).
As mentioned earlier, putting more AI computing boards into a single server is considered Scale Up. In this context, a server represents a single node.
Connecting multiple computers (nodes) through a communication network is Scale Out. The main difference between the two lies in how the AI chips are interconnected.
Scale Up involves connections inside a node, which offer higher link speeds, lower latency, and stronger performance.
Previously, components inside a computer communicated mainly over the PCIe protocol. PCIe itself arrived in 2003 as the successor to the PCI bus of the early 1990s, when personal computers first became widespread. Although PCIe has been upgraded over the years, its speed and latency have struggled to meet modern demands.
To address this, NVIDIA introduced the NVLink bus protocol in 2014. NVLink enables GPUs to communicate directly with each other in a point-to-point fashion, delivering much faster speeds and lower latency than PCIe. Initially, NVLink was only used for internal communication within a machine.
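From the software side, this point-to-point capability is what frameworks expose as peer-to-peer GPU access. Here is a hedged PyTorch sketch (assuming a machine with at least two GPUs and PyTorch installed; whether the copy actually travels over NVLink or falls back to PCIe depends on the hardware):

```python
import torch

# Check whether GPU 0 can address GPU 1 directly (peer-to-peer access).
if torch.cuda.device_count() >= 2 and torch.cuda.can_device_access_peer(0, 1):
    x = torch.randn(4096, 4096, device="cuda:0")
    # A device-to-device copy: with peer access enabled, the data moves
    # directly between the two GPUs (over NVLink on NVLink-equipped
    # systems) without bouncing through host memory.
    y = x.to("cuda:1")
    torch.cuda.synchronize()
```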
In 2022, NVIDIA moved the NVSwitch chip out of the server and built it into a standalone NVLink switch, allowing GPUs in different servers to be connected. This advancement meant that a single node could now consist of multiple servers and network devices.
These grouped devices form what’s known as an HBD (High Bandwidth Domain). NVIDIA refers to a Scale Up system with more than 16 GPUs interconnected by ultra-high bandwidth as a supernode.
After years of development, NVLink has reached its fifth generation. Each GPU now supports 18 NVLink connections, and the total NVLink bandwidth of a Blackwell GPU reaches 1,800 GB/s, far beyond what PCIe Gen6 can offer.
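As a quick back-of-the-envelope check on these figures (treating the 18 links and 1,800 GB/s as given, and assuming the total is simply the sum of the per-link rates):

```latex
\frac{1800~\text{GB/s}}{18~\text{links}} = 100~\text{GB/s per link},
\qquad
1800~\text{GB/s} \times 8~\text{bits/byte} \approx 14.4~\text{Tbps per GPU}
```

That roughly 14 Tbps per-GPU figure is also in line with the "10 Tbps-level" Scale Up bandwidth discussed later in this article.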
In March 2024, NVIDIA released the NVL72, a liquid-cooled cabinet system that integrates 36 Grace CPUs and 72 Blackwell GPUs. This setup achieves 720 Pflops of AI training performance and 1440 Pflops of inference performance.
NVIDIA is undoubtedly the leader in AI computing. They have the most popular AI chips (GPUs), a dominant software ecosystem (CUDA), and have also explored the most effective way to scale up.
Later, as AI continued to develop, more companies began launching their own AI chips. Because NVLink is a proprietary protocol, these companies had to develop their own ways of building AI computing clusters.
AMD, one of Nvidia’s main international competitors, spearheaded UALink. In China, companies like Tencent, Alibaba, and China Mobile introduced ETH-X, ALS, and OISA, respectively. These are all open standards. They cost less than proprietary protocols, help lower industry barriers, promote technological equality, and align with the open, decoupled trend of the Internet.
It’s worth noting that most of these standards are based on Ethernet technology (ETH), which is mature, open, and supported by a well-established industrial ecosystem. Another notable technical route is Huawei’s proprietary protocol UB (Unified Bus).
In recent years, Huawei has been actively building the Ascend ecosystem. Ascend is Huawei’s AI chip series, now evolved to the Ascend 910C. To unleash the full power of the 910C and push it into the market, Huawei also needed its own AI computing cluster solution.
In April 2025, Huawei launched the CloudMatrix384 supernode, which integrates 384 Ascend 910C computing cards and provides up to 300 Pflops of dense BF16 computing power, nearly double that of Nvidia’s GB200 NVL72 system. CloudMatrix384 is powered by UB technology. Specifically, it includes three different network planes: the UB plane, the RDMA plane, and the VPC plane.
These three planes complement each other, enabling strong inter-card communication within CloudMatrix384 and enhancing the computing power of the entire supernode. Due to space limitations, more detailed technical specifications will be introduced separately next time.
It’s worth noting that, facing increasing competition from open standards, Nvidia announced the NVLink Fusion Plan not long ago. This plan opens NVLink to eight partners to help them build customized AI systems by connecting multiple chips together.
However, according to some media reports, some critical components of NVLink remain closed, suggesting that Nvidia is still not entirely open-handed.
What Is Scale Out?
Scale Out is actually quite similar to traditional data communication networks. Technologies like fat-tree architecture, leaf-spine network architecture, TCP/IP, and Ethernet form the basic framework of Scale Out.
Of course, AI computing places much higher demands on network performance, so traditional technologies must be upgraded to meet those expectations.
At present, the two main networking technologies used for Scale Out are InfiniBand (IB) and RoCEv2.
Both are based on the RDMA (Remote Direct Memory Access) protocol and offer significantly higher speed, lower latency, and better load balancing than conventional Ethernet.
InfiniBand was originally created to replace the PCI bus, but its adoption went through ups and downs. Eventually, Mellanox, the company behind InfiniBand, was acquired by Nvidia. Today, IB is part of Nvidia’s proprietary technology stack. Its performance is excellent, but it comes at a high cost, and it is an essential piece of Nvidia’s overall computing strategy.
RoCEv2, by contrast, is an open standard. It blends traditional Ethernet with RDMA and emerged as an industry-backed alternative to challenge InfiniBand’s dominance. It’s more affordable, and the performance gap between it and IB continues to narrow.
Compared with the many competing standards in the Scale Up field, Scale Out standards are relatively concentrated, mainly RoCEv2, and the technical route is clear. After all, Scale Up operates within the node and is closely tied to chip products, whereas Scale Out operates outside the node and emphasizes compatibility.
Bandwidth and Latency Differences
As mentioned earlier, Scale Up and Scale Out differ mainly in how the chips are interconnected, and the clearest difference is in bandwidth.
IB and RoCEv2 provide Tbps-level bandwidth, while Scale Up can achieve 10 Tbps-level interconnection across hundreds of GPUs.
There’s also a significant latency gap. IB and RoCEv2 have latency as high as 10 microseconds, while Scale Up requires latency in the range of 100 nanoseconds (0.1 microseconds), which is drastically lower.
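To get a feel for what these numbers mean, here is a rough, purely illustrative estimate (using the round figures above) of the time to move 1 GB, i.e. 8 Gb, between two GPUs, modeled as latency plus size divided by bandwidth:

```latex
\underbrace{10~\mu\text{s} + \frac{8~\text{Gb}}{1~\text{Tbps}}}_{\text{Scale Out}} \approx 8010~\mu\text{s}
\qquad
\underbrace{0.1~\mu\text{s} + \frac{8~\text{Gb}}{10~\text{Tbps}}}_{\text{Scale Up}} \approx 800~\mu\text{s}
```

For large transfers like this, the roughly 10x bandwidth gap dominates; for the many small synchronization messages exchanged during training, the roughly 100x latency gap matters even more.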
During AI training, various parallel computing strategies are used—like TP (tensor parallelism), EP (expert parallelism), PP (pipeline parallelism), and DP (data parallelism). Generally, PP and DP involve smaller data exchanges, which are typically handled by Scale Out. In contrast, TP and EP require massive inter-GPU communication, which is more efficiently handled by Scale Up (within the supernode).
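A toy NumPy sketch (illustrative only; the shapes and the 4-way split are invented) shows why TP is so much more communication-hungry than DP: in data parallelism each worker holds the whole weight matrix and only exchanges gradients once per step, whereas in tensor parallelism the weight matrix itself is split, so partial results must be exchanged inside every layer.

```python
import numpy as np

X = np.random.rand(32, 512)      # a batch of activations
W = np.random.rand(512, 512)     # one layer's weight matrix

# Data parallelism: split the batch. Each worker computes its shard
# with a full copy of W; workers only communicate when averaging
# gradients, once per training step.
x_shards = np.split(X, 4, axis=0)
Y_dp = np.vstack([x @ W for x in x_shards])

# Tensor parallelism: split the weight matrix by columns. Each worker
# produces only a slice of the output, so the slices must be gathered
# (communicated between GPUs) before the next layer can run. This
# happens inside every single layer, which is why TP wants the
# bandwidth and latency of a Scale Up fabric.
w_shards = np.split(W, 4, axis=1)
Y_tp = np.hstack([X @ w for w in w_shards])

assert np.allclose(Y_dp, X @ W) and np.allclose(Y_tp, X @ W)
```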
Supernodes: The Ideal for Scale Up
Supernodes, the most effective Scale Up solution to date, connect GPUs using high-speed internal buses. They support fast parameter exchange, data synchronization, and significantly reduce the training time for large models.
Another benefit: supernodes typically support memory semantics, allowing one GPU to read another’s memory directly, something that isn’t possible with Scale Out.
From a network and operations perspective, larger-scale Scale Up also brings clear advantages.
The larger the supernode’s HBD (high-bandwidth domain), the more GPUs are connected internally, and the simpler the Scale Out networking becomes. This significantly reduces network complexity. A Scale Up system is essentially a compact, highly integrated cluster with pre-connected internal buses, which makes deployment faster and easier. Future maintenance is also far more manageable.
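To make that concrete with made-up but plausible numbers: for a hypothetical cluster of 98,304 GPUs, the number of endpoints the Scale Out network has to stitch together shrinks dramatically as the HBD grows:

```latex
\frac{98{,}304~\text{GPUs}}{8~\text{GPUs per HBD}} = 12{,}288~\text{nodes}
\qquad \text{vs.} \qquad
\frac{98{,}304~\text{GPUs}}{384~\text{GPUs per HBD}} = 256~\text{supernodes}
```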
That said, Scale Up isn’t limitless; cost remains a key consideration. The ideal size of a Scale Up system depends on specific workload demands and resource availability.
In short, Scale Up vs Scale Out reflects a constant trade-off between performance and cost. As technology evolves, supernodes will continue to grow, and the line between Scale Up and Scale Out will blur.
Interestingly, many of the new open Scale Up standards, such as ETH-X, are built on Ethernet technology. Technically speaking, Ethernet boasts:
The largest commercial switching chip capacity (up to 51.2 Tbps)
The fastest SerDes speeds (up to 112 Gbps)
Low switching chip latency (around 200 nanoseconds)
All these factors make Ethernet well suited to Scale Up performance needs. And since Scale Out also relies on Ethernet, one might ask: isn’t this the great unification?
Development Trend of AI Computing Clusters
Finally, let me talk about some trends in AI computing clusters. At present, AI computing clusters show the following trends:
Delocalization of Physical Space
AI computing clusters are moving toward the 10,000- and 100,000-chip scale. A single Nvidia NVL72 rack holds 72 chips, while a Huawei CM384 system spans 16 racks and holds 384 chips. Building on CM384, 432 systems would provide 384 × 432 = 165,888 chips, comfortably past the 100,000-chip mark, but would require 432 × 16 = 6,912 racks.
It is difficult for a single data center to accommodate so many racks. Power supply will also become a problem.
Therefore, the industry is exploring the possibility of geographically separate data centers forming a single AI computing cluster to jointly complete training tasks. This places heavy demands on long-distance, high-bandwidth, low-latency DCI (data center interconnect) optical technology, and will accelerate the adoption of cutting-edge technologies such as hollow-core optical fiber.
Customization of Node Architecture
When we introduced AI clusters, we discussed how to gather a large number of AI chips. In fact, in addition to the number of chips, AI computing clusters are increasingly focusing on the in-depth design of the architecture.
Pooling of computing resources (GPU, NPU, CPU, and even memory and storage) has become a trend. The cluster needs to adapt closely to the architecture of large AI models (such as the MoE architecture) and provide customized designs to complete computing tasks more effectively.
In other words, it is not enough to simply provide AI chips, but also to provide tailor-made designs.
Intelligent Operation and Maintenance Capabilities
Everyone knows that training large AI models is error-prone; in serious cases, a failure occurs every few hours. When a failure happens, part of the training has to be redone, which is very time-consuming: it prolongs the training cycle and increases the training cost.
Therefore, when enterprises build AI computing clusters, they pay more attention to the reliability and stability of the system. It has become a trend to introduce various AI technologies to predict potential failures and replace sub-healthy equipment or modules in advance.
These technologies help reduce failure and interruption rates, enhance system stability, and thereby effectively increase the usable computing power.
Greening of Computing Power
AI computing requires a lot of power and generates high energy consumption. Major manufacturers are working hard to reduce the energy consumption of AI computing clusters and increase the proportion of green energy use, which is also conducive to the long-term development of AI computing.