Collective Operations and AI Infrastructure
The performance of AI clusters, especially ML training clusters, is greatly influenced by the ability to scale computations across a large number of identical neural processing units (NPUs) – a common name for GPUs or TPUs. In these so-called scale-up and scale-out designs, the properties of the network between the NPUs become one of the key factors defining the performance of the overall system. During AI model training, the NPUs need to periodically exchange data among themselves so that the training job can progress through the layers of the AI model or adjust the model weights to better predict the outcomes.
Depending on the model partitioning schema and the specifics of the ML job, the requirements for these data exchanges can vary. In general, the movement of data among NPUs is called a Collective Operation. The theory of collective operations formalizes them into several types, depending on the initial and final location of the data and on whether a mathematical function needs to be applied to it along the way. Commonly used types are Broadcast and Gather, ReduceScatter and AllGather, AllReduce and AlltoAll. The presence of the Reduce keyword in the name of an operation signifies that it performs computations on the data.
Collective Operations Algorithms
A collective operation can be implemented using any number of algorithms. Some are naïve and as a result perform poorly; others take advantage of the properties of the operation, and often of the network topology, and complete much faster. Well-known algorithms for AllReduce are the unidirectional and bidirectional ring, the double binary tree, and halving-doubling, and each demonstrates better or worse performance depending on the number of NPUs and how they are interconnected.
Figure 1. AllReduce Unidirectional Ring between 8 ranks
Figure 2. AllReduce Halving-Doubling between 8 ranks
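To make the ring algorithm from Figure 1 concrete, below is a minimal sketch in plain Python that simulates a unidirectional ring AllReduce (sum) among n ranks, with no real network or communication library involved: a reduce-scatter pass followed by an all-gather pass, each taking n - 1 steps.

```python
# Minimal, illustrative simulation of a unidirectional ring AllReduce (sum).
# "Ranks" are just lists and one "chunk" is a single number, so the data flow
# of the algorithm is easy to follow; no real network is involved.

def ring_allreduce(inputs):
    """inputs: list of n lists, each holding n chunk values (one per rank)."""
    n = len(inputs)
    chunks = [list(data) for data in inputs]  # working copy per rank

    # Phase 1: reduce-scatter. In each of n - 1 steps, rank r sends chunk
    # (r - step) % n to rank (r + 1) % n, which adds it to its own copy.
    # Afterwards, rank r holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            idx, dst = (r - step) % n, (r + 1) % n
            chunks[dst][idx] += chunks[r][idx]

    # Phase 2: all-gather. In each of n - 1 steps, rank r forwards chunk
    # (r + 1 - step) % n to rank (r + 1) % n, which overwrites its copy.
    for step in range(n - 1):
        for r in range(n):
            idx, dst = (r + 1 - step) % n, (r + 1) % n
            chunks[dst][idx] = chunks[r][idx]

    return chunks  # every rank now holds the element-wise sum of all inputs


if __name__ == "__main__":
    ranks = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1, 1, 1, 1]]
    print(ring_allreduce(ranks))  # every rank: [112, 223, 334, 445]
```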
Ranks and Collective Size
Each endpoint exchanging data messages in a collective operation is called a Rank. In most practical cases, each NPU involved in the operation has one rank. The total number of ranks in the collective is called the Collective Size and is usually denoted n. Ranks are sequentially numbered with integer IDs, starting with id = 0; thus, the maximum rank id is n - 1. Collective operations like Broadcast and Gather, which have a dedicated sender or receiver called the root rank, use rank id 0 as the root by default.
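As a hedged illustration of rank numbering and the root rank (using PyTorch's torch.distributed rather than a CCL API directly, and assuming one process per NPU launched with a tool such as torchrun), a Broadcast rooted at rank 0 could look like this:

```python
# Illustrative sketch: rank IDs and the root rank in a Broadcast, shown with
# PyTorch's torch.distributed. Assumes one process per NPU launched by
# torchrun, which sets RANK, WORLD_SIZE and MASTER_ADDR in the environment.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")     # NCCL backend, one process per GPU
rank = dist.get_rank()                       # rank id: 0 .. n - 1
n = dist.get_world_size()                    # collective size n

device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
tensor = torch.zeros(4, device=device)
if rank == 0:                                # rank 0 acts as the root rank
    tensor += 42.0                           # only the root holds the input data

dist.broadcast(tensor, src=0)                # after this, every rank holds 42s
print(f"rank {rank} of {n}: {tensor.tolist()}")
dist.destroy_process_group()
```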
Collective Communications Library
The software that implements collective operations in AI clusters is usually distributed under the name Collective Communication Library. One of the original libraries – NCCL (pronounced "nickel") – was developed by NVIDIA. There are multiple forks of NCCL; some are public, like RCCL, while others are proprietary.
Collectives Benchmark
Because the performance of scale-out AI clusters depends so heavily on the networking, it is paramount to be able to measure and compare the performance of collective operations. Collectives Benchmark is a methodology that allows us to perform such measurements, and report their results, in a repeatable and well-understood manner. There are multiple tools that can perform Collectives Benchmarking. Some of them are open source: nccl-tests and OSU Micro-Benchmarks. Keysight provides a commercial implementation via the Keysight AI Collective Benchmarks app in the AI Data Center Validation Solution software.
Collective Completion Time
The faster a cluster can complete a Collective Operation, the sooner the job can continue with the next cycle of computations on the expensive NPUs. Ultimately, the completion time of the operation is what we look to improve. We will call this the Collective Completion Time (CCT) and measure it in seconds (s).
Data Size
Since the objective of a collective operation is to move data, the size of the data will greatly influence how long the operation takes. The Collectives Benchmark methodology uses the size of the data one rank has as input to the operation as the Data Size S, measured in bytes (B).
Depending on the operation and implementation, the collective algorithm splits the data size S into algorithmic data chunks of size c, and each chunk follows a specific path as it moves among the ranks of the collective. The notion of chunks is very useful for expressing and understanding the logic of collective operation algorithms.
The data size S parameter has a few notable properties:
- For most collective operations, S is the same across all the ranks. But there are exceptions: in Broadcast, only one rank has input data, and in AlltoAll-v the data size S can be different on each NPU.
- The amount of payload D that each rank sends over the network is quite often not the same as S. Depending on the collective operation and the algorithm used to implement it, D may be less or more than S. For example, in AlltoAll Parallel, each rank keeps one chunk c = S/n to itself and sends D = S * (n - 1)/n over the network. On the other hand, with AllReduce Ring, each rank sends roughly twice the amount of data, D = 2 * S * (n - 1)/n ≈ 2 * S, as it is a composite operation of ReduceScatter followed by AllGather, each moving S * (n - 1)/n. The sketch after this list illustrates these payload calculations.
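Here is a minimal sketch in plain Python of the per-rank payload formulas above; it computes only the algorithmic payload D and ignores protocol headers and transport overhead.

```python
# Per-rank network payload D for two collectives, per the formulas above.
# Algorithmic estimates only: protocol headers and transport overhead ignored.

def payload_alltoall_parallel(S: int, n: int) -> float:
    """Each rank keeps one chunk c = S/n locally and sends the rest."""
    return S * (n - 1) / n

def payload_allreduce_ring(S: int, n: int) -> float:
    """ReduceScatter + AllGather, each moving S*(n-1)/n per rank (about 2*S)."""
    return 2 * S * (n - 1) / n

if __name__ == "__main__":
    S = 1 << 30  # 1 GiB of input data per rank
    for n in (8, 64, 1024):
        print(f"n={n:5d}  AlltoAll D={payload_alltoall_parallel(S, n)/2**30:.3f} GiB  "
              f"AllReduce Ring D={payload_allreduce_ring(S, n)/2**30:.3f} GiB")
```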
Data sizes used in collective operations in real AI/ML training jobs depend on multiple factors and vary both within the same job and between jobs. Since the AI cluster infrastructure has to support ever-changing jobs during its lifespan, we need to understand how the performance of the collectives varies with the data size. This is the main reason the Collectives Benchmark methodology sweeps through a range of data sizes and measures the key metrics for each value of S.
A final note on S. In AI clusters, most operations move data that physically resides in the memory attached to the NPUs, rather than in system (CPU) memory. Thus, we can establish an upper bound on S that makes practical sense to benchmark by checking the amount of memory on modern NPUs. As a point of reference, the NVIDIA H100 SXM has 80 GB of memory. Since this memory is shared between AI model weights, training data, and gradients, a practical limit for the data size in this case could be 32 GB.
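For illustration, a sweep could double S from a small starting value up to such a cap; the sketch below is an assumption about the sweep shape for this article, not any particular tool's defaults.

```python
# Illustrative data-size sweep: start small and double S up to a practical cap.
# The 1 KiB start and the 32 GiB cap are assumptions for this sketch.

def size_sweep(start_bytes: int = 1024, cap_bytes: int = 32 * 2**30):
    s = start_bytes
    while s <= cap_bytes:
        yield s
        s *= 2

print(list(size_sweep())[:5], "...")  # 1024, 2048, 4096, 8192, 16384 ...
```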
Algorithm Bandwidth
Researchers in the area of collective operations keep discovering new algorithms that significantly improve the Collective Completion Time, quite often thanks to an understanding of the underlying network topology – topology awareness. Since CCT directly depends on the Data Size S, and S is job-specific, it helps to introduce a metric that measures algorithm performance in a way that can be compared between different jobs, similar to how we compare the performance of cars by using speed rather than travel time. Such a metric is the Algorithm Bandwidth (algbw). It is defined as S over CCT, i.e., algbw = S / CCT, and measured in gigabytes per second (GB/s).
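As a quick worked example (with illustrative numbers, not measured results), the computation is just a division:

```python
# algbw = S / CCT, reported in GB/s (decimal gigabytes, 1 GB = 1e9 bytes).

def algbw_gbps(S_bytes: int, cct_seconds: float) -> float:
    return S_bytes / cct_seconds / 1e9

# Example: a 1 GB input that completes in 40 ms yields 25 GB/s.
print(algbw_gbps(1_000_000_000, 0.040))  # -> 25.0
```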
Note that even though airplanes have a higher average speed than cars, we don't use them for short-distance travel because of the waiting time. Similarly, when the Data Size is so small that it can't fully load the network, a significant portion of the CCT will be spent on starting and stopping data transmissions. In such cases, algbw is not a meaningful metric, as it will vary significantly between different Data Sizes; you might want to go back to using CCT when comparing which algorithm performs better.
For illustration purposes, Figure 3 shows a benchmark output where at the smaller data sizes the CCT stays around 7 ms but the algbw doubles with every doubling of the data size. On the other hand, at the larger sizes the algbw stabilizes toward 25 GB/s while the CCT continues to increase with the size.
Figure 3. Behavior of algbw with varying data sizes
Note: it would be a mistake to compare algbw with the network interface speed, due to the difference between the algorithm-specific payload D and the data size S.
Bus Bandwidth
While it is straightforward to see how CCT and algbw depend on the Data Size, the influence of the Collective Size n is less obvious. Depending on the algorithm, that dependency can be more or less direct. For example, with AlltoAll Parallel, increasing the number of ranks in the collective while keeping the same Data Size makes the algorithmic data chunks c = S / n smaller. Therefore, a larger fraction of bandwidth is wasted on packet headers, and the CCT increases approximately by the same fraction. In contrast, with AllReduce Ring, the more ranks in the collective, the more hops each data chunk has to traverse to complete the ring, resulting in a linear increase of the CCT. We need a metric that characterizes the performance of an AI cluster independently of the size of the ML job.
For that, we can imagine that the movements of the data chunks defined by the collective algorithm are similar to cars moving across a city, where parking garages, driveways, roads, and intersections represent memory, NICs, wires, and switches, respectively, and the cars are data chunks. The only real stretch in this comparison to city streets is that all the data chunks travel exactly the same pattern and don't get distracted. With this analogy, we can make a pretty solid guess that the time it takes for a car to reach its final destination will be largely determined by the slowest segment along its path – the bottleneck. The moment that bottleneck reaches its capacity, doubling the number of cars that have to cross it will double the time it takes for them to pass. In other words, the peak throughput of the bottleneck does not depend on the number of parking garages in the city (Collective Size n) or the number of cars (Data Size S).
Another useful property of this analogy is that while the peak bottleneck throughput does not depend on the adjacent streets and the patterns of car movement (topology and algorithm), the time it takes for all the cars to complete their trips is very much determined by the routes they take and how many of them are in the city.
In summary, as long as the cars (data chunks) have to cross the bottleneck in numbers that saturate its capacity (bandwidth), it will determine the performance of our city's infrastructure (AI cluster) – we can't move payloads faster than the bottleneck allows.
The metric that describes the bottleneck of the AI infrastructure for a Collective Operation is called Bus Bandwidth (busbw) and is measured in gigabytes per second (GB/s). We need a more formal approach than our city-streets analogy to establish a method for measuring busbw; we plan to publish a separate article dedicated to this topic. The formulas for busbw can be found in the nccl-tests documentation.
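As a hedged sketch, busbw is typically derived from algbw by an algorithm-specific correction factor; the factors below reflect our reading of the nccl-tests documentation and should be verified against the version of nccl-tests you actually run.

```python
# busbw = algbw * factor, where the factor is algorithm-specific.
# Factors follow our reading of the nccl-tests documentation; verify against
# the nccl-tests version you use.

BUSBW_FACTORS = {
    "allreduce":     lambda n: 2 * (n - 1) / n,
    "allgather":     lambda n: (n - 1) / n,
    "reducescatter": lambda n: (n - 1) / n,
    "alltoall":      lambda n: (n - 1) / n,
    "broadcast":     lambda n: 1.0,
    "reduce":        lambda n: 1.0,
}

def busbw_gbps(op: str, algbw_gbps: float, n: int) -> float:
    return algbw_gbps * BUSBW_FACTORS[op](n)

# Example: AllReduce at algbw = 25 GB/s across 8 ranks -> busbw = 43.75 GB/s.
print(busbw_gbps("allreduce", 25.0, 8))
```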
Additional Metrics
Comparisons with Ideal CCT
Some collectives benchmarking tools have no visibility into the underlying L1-L4 OSI stack and cannot tell what the network utilization was during the test or whether there is room for improvement. When creating the AI Data Center Builder software, the Keysight team incorporated insights into the lower layers to address this gap. The Keysight AI Collective Benchmarks app calculates the Ideal CCT of each individual collective algorithm and compares the measured CCT with the Ideal value. As a result, the data produced by the Keysight AI Collective Benchmarks includes a comparison to the Ideal CCT value in the form of the Ideal % metric.
Figure 4. Keysight AI Collective Benchmarks Summary with Ideal % and DCT
Distribution of Data Chunk Completion Times
Many collective algorithms demonstrate symmetry in the data chunk movements performed by each rank. This leads to an important conclusion – the best performance is observed on systems with an equal allocation of bandwidth to each data chunk, otherwise known as bandwidth fairness. Lack of fairness causes long tail latency in data movements. One way to benchmark fairness is to measure each Data chunk Completion Time (DCT) and report Min and Max, as well as P50 and P95 percentile values. In a system with bandwidth fairness, the min and max DCT values will be close. When they are not, you can tell whether you had more outliers on the fast or the slow side by checking the P50 and P95 results. These metrics are familiar to network engineers and easier to use across teams and organizations than other fairness metrics found in academic papers.
Note: reporting DCT percentiles is meaningful only when the sizes of all data chunks in the collective algorithm are equal.
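A minimal sketch of this reporting, computing Min, Max, P50, and P95 from a list of per-chunk completion times (the values below are synthetic, not measurements):

```python
# Summarize Data chunk Completion Times (DCT) to gauge bandwidth fairness.
# Meaningful only when all data chunks are the same size (see the note above).
import statistics

def dct_summary(dcts_seconds):
    """Report Min, Max, P50, and P95 over a list of DCT values in seconds."""
    cuts = statistics.quantiles(dcts_seconds, n=100)  # 99 percentile cut points
    return {
        "min": min(dcts_seconds),
        "max": max(dcts_seconds),
        "p50": cuts[49],   # 50th percentile
        "p95": cuts[94],   # 95th percentile
    }

# Synthetic example: mostly ~1 ms chunks with one slow-tail outlier.
dcts = [0.0010, 0.0011, 0.0010, 0.0012, 0.0011, 0.0010, 0.0035, 0.0011]
print(dct_summary(dcts))
```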
Conclusion
Benchmarking of Collective Operations is a foundational methodology for understanding the performance limits of distributed AI infrastructure. It is a useful tool in the search for improvements, both during the design and during the optimization of AI clusters. Across the various implementations, both open source and commercial, there is a common set of input parameters and measured metrics that these tools operate on. In this article, we provided definitions and expanded on the meaning of these parameters in an effort to help standardize the terminology.
Summary of Terms and Definitions
| Parameter | Definition | Unit / Values |
| --- | --- | --- |
| Collective Operation | Communication patterns that involve a group of processes exchanging data. Elemental blocks of network communications in scale-out AI/ML clusters that move data between the GPUs. | Broadcast, Gather, Scatter, ReduceScatter, AllGather, AllReduce, AlltoAll |
| Rank | Identifier of an endpoint exchanging messages in the collective operation. Often, a rank represents a GPU. In some cases, a GPU can have many ranks. | Integers starting with 0 |
| Collective Size (n) | Number of ranks in the collective operation. | 2 or more |
| Data Size (S) | Size of the data one rank has as input to the operation. | Bytes (B) |
| Collective Completion Time (CCT) | Time it takes for a Collective Operation to complete; useful for comparing Collective Algorithm performance on small data sizes. | Seconds (s) |
| Algorithm Bandwidth (algbw) | A metric to compare Collective Algorithm performance on large data sizes. algbw = S / CCT. | Gigabytes per second (GB/s) |
| Bus Bandwidth (busbw) | Characterizes the performance of a bottleneck in the AI infrastructure for a Collective Algorithm. The formula is algorithm-specific. | Gigabytes per second (GB/s) |
| Ideal % | Comparison of the measured CCT to the minimum theoretical value for a given algorithm, transport overhead, and network interface speed. | % |
| Data chunk Completion Time (DCT) | Time it takes to transfer a data chunk between two ranks. Measuring each DCT value for a collective operation and reporting Min, Max, P50, and P95 values helps understand bandwidth fairness in the system. | Seconds (s) |