Understanding Communication Patterns in Distributed ML: A Deep Dive into NCCL, MPI, and Gloo

March 14, 2025
Written By Rahul Suresh

Senior AI/ML and Software Leader | Startup Advisor | Inventor | Author | ex-Amazon, ex-Qualcomm

Understand key communication patterns (Broadcast, Scatter/Gather, All-Reduce, All-Gather, All-to-All) and core protocols (NCCL, MPI, Gloo) crucial for efficiently scaling distributed ML workloads, including LLMs, computer vision, and multimodal models.

In my previous article, we covered the foundations of parallel computing, exploring concepts such as SIMD, threading, and GPU kernels. Understanding these fundamentals is essential because distributed ML, especially training massive models, relies heavily on efficiently dividing workloads across multiple GPUs.

But that is only half the story: efficiently communicating and synchronizing results across all the parallel compute units is equally important!

To illustrate this concept clearly, imagine you’re training a deep learning model on a large GPU cluster. After each training iteration, every GPU computes its own gradients independently. To keep the model synchronized, we must aggregate the gradients across GPUs by summing them and averaging the result. This process, known as All-Reduce, happens thousands or millions of times during training. Without optimized communication protocols, each iteration would stall while waiting for data exchanges to complete, effectively eliminating the performance gains we obtain from parallelization!

Specialized communication protocols are crucial in distributed ML to perform tasks like gradient aggregation in an efficient and scalable way. In this article, we’ll dive into three of the most influential communication protocols: NVIDIA’s NCCL (NVIDIA Collective Communication Library), MPI (Message Passing Interface), and Gloo (developed by Meta, now open source). I’ll introduce the fundamental communication patterns, such as broadcast, scatter, reduce, and all-reduce, that underpin these protocols. I hope you’ll understand and appreciate the complexities of communication in distributed systems!

This article is my attempt to compile and simplify key knowledge from various sources, including academic papers, official NCCL/MPI/Gloo documentation, and industry blogs. It’s certainly not exhaustive compared to the vast wealth of detailed resources available online. I encourage you all to dive deeper into the articles/links I have shared to further your knowledge!

Let’s dive in!

Key Communication Patterns in Distributed Systems

Whether you are using NCCL, MPI, or Gloo, distributed ML relies on a handful of fundamental communication patterns. So before we jump into the specific protocols, let’s look at these common communication patterns. This is critical for understanding how data moves around in a distributed compute system.

1. Broadcast: Efficiently Sharing the Same Data

Think about tuning into your favorite radio station. The host speaks once, yet thousands of listeners hear the exact same message at the same time. This is essentially how Broadcasting works in distributed computing as well. It is a one-to-many communication, where one node/process (“root”) sends identical data to all other processes or nodes at the same time.

Broadcast protocol

In distributed machine learning, we regularly encounter broadcast when training large models. Before training begins, we initialize the model weights once (usually on the root GPU, commonly known as “GPU 0”) and then broadcast these parameters to all GPUs. This ensures every GPU starts training from the same model state.

Broadcasting is also useful when loading checkpoints. If every GPU loads the checkpoint independently from storage, it can be slow and wasteful. Instead, we can load the checkpoint on the root GPU once and then broadcast the loaded state efficiently to all GPUs.
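To make this concrete, here is a minimal sketch of a broadcast using PyTorch’s torch.distributed API (the process-group setup and the helper name are my own illustrative assumptions):

```python
import torch
import torch.distributed as dist

# Assumes the default process group is already initialized (e.g., launched
# with torchrun, one process per GPU, backend="nccl").
def broadcast_initial_weights(model: torch.nn.Module, root: int = 0) -> None:
    """Send the root rank's parameters to every other rank so all GPUs
    start training from the same model state."""
    for param in model.parameters():
        # src is the global rank holding the authoritative copy of the weights.
        dist.broadcast(param.data, src=root)
```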

2. Reduce and All-Reduce: Synchronizing Gradients for Consistent Model Updates

Reduce is a many-to-one operation where data from all the nodes/processes is combined (reduced) using operations such as sum, min, or max. The result is then stored on the root node (GPU 0).

Reduce protocol

All-Reduce goes a step further. It performs the reduction and then distributes the result back to all processes. In other words, All-Reduce = Reduce + Broadcast of the result.

All-Reduce protocol

As you can imagine, All-Reduce is central to modern distributed ML because of gradient synchronization. Let’s consider data-parallel training, which powers nearly all large models today. During each iteration, every GPU computes its own gradients separately. For consistent model updates, we must average these gradients and share them back with all the GPUs. We can do this efficiently with an all-reduce operation, where gradients are summed across all GPUs and the result is sent back to every GPU. Each GPU then locally divides by the total number of GPUs to obtain the averaged gradients.
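Here is a hand-rolled sketch of that gradient-averaging step with torch.distributed. In practice, DistributedDataParallel does this for you (and overlaps communication with the backward pass), so treat this as an illustration rather than production code:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Sum each parameter's gradient across all ranks, then divide by the
    world size so every GPU holds the same averaged gradient."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```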

3. Scatter and Gather: Distributing and Collecting Distinct Data

Scatter is a one-to-many pattern like broadcast. But unlike broadcast, each receiver gets a different portion of the data. For example, the root (e.g., GPU 0) starts with a large collection of data. The scatter operation sends chunk 1 to process 1, chunk 2 to process 2, and so on. After scattering, each process holds only its assigned portion of the original data.

Scatter protocol

Gather is the opposite pattern (many-to-one). In this case, the designated root process collects chunks of data from all processes and concatenates or aggregates them.

Gather protocol

When compared to Broadcast and All-Reduce, Scatter/Gather is less common in synchronous data-parallel training because each worker normally reads its own batch of data directly. Now you may wonder when to use All-Reduce versus Scatter/Gather, as they both seem to involve exchanging data across nodes. Here’s how you can clearly differentiate these communication patterns:

  • In All-Reduce, each GPU does the same operation (like computing gradients) independently on different chunks of data. For example, each GPU calculates gradients separately for its mini-batch during training. The goal of All-Reduce is to combine gradients across all GPUs, and then share the combined result back with each GPU.
  • In contrast, Scatter/Gather involves splitting up data so that each GPU receives a distinct chunk and can perform independent tasks. Consider a large evaluation dataset of 1000 images. You could use Scatter to distribute unique image subsets to each GPU: GPU 0 evaluates images 1–250, GPU 1 evaluates images 251–500, and so on. Each GPU thus performs inference separately on a distinct data subset. Finally, the results from all GPUs are collected back to one node via Gather, producing a single combined evaluation result (sketched in code after this list).
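Here is a rough sketch of that evaluation workflow with torch.distributed. The dataset shape, the placeholder “inference” step, and the assumption of a backend that supports scatter/gather (e.g., Gloo) are all illustrative:

```python
import torch
import torch.distributed as dist

def distributed_eval(images_per_rank: int = 250, root: int = 0) -> None:
    """Root scatters a distinct slice of the eval set to each rank, each rank
    runs (dummy) inference on its slice, and root gathers the results."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    recv = torch.empty(images_per_rank, 3, 224, 224)  # buffer for this rank's slice
    if rank == root:
        full = torch.randn(world_size * images_per_rank, 3, 224, 224)
        dist.scatter(recv, scatter_list=list(full.chunk(world_size)), src=root)
    else:
        dist.scatter(recv, src=root)

    local_result = recv.mean(dim=(1, 2, 3))  # stand-in for per-image inference

    # Gather every rank's results back on the root and concatenate them.
    if rank == root:
        buckets = [torch.empty_like(local_result) for _ in range(world_size)]
        dist.gather(local_result, gather_list=buckets, dst=root)
        combined = torch.cat(buckets)  # single combined evaluation result
    else:
        dist.gather(local_result, dst=root)
```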

4. All-Gather and All-to-All: Critical Patterns for Model Parallelism

You can think of All-Gather and All-to-All as extensions of Gather and Scatter to many-to-many communication. In All-Gather, every process (GPU) collects distinct data chunks from every other GPU. This is used widely in model-parallel/tensor-parallel setups for LLMs. Each GPU starts with its own piece (shard) of a tensor and uses an All-Gather operation to efficiently collect the shards from all the other GPUs. As a result, every GPU obtains the complete tensor, allowing later layers to perform calculations that require global context.

All-Gather protocol
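A minimal sketch of an All-Gather over a sharded tensor with torch.distributed (shard sizes and the concatenation dimension are illustrative assumptions):

```python
import torch
import torch.distributed as dist

def all_gather_shards(local_shard: torch.Tensor) -> torch.Tensor:
    """Every rank contributes its shard and receives all other ranks' shards,
    ending up with the full tensor."""
    world_size = dist.get_world_size()
    buckets = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(buckets, local_shard)
    return torch.cat(buckets, dim=0)  # reassemble the complete tensor
```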

In All-to-All, each process (GPU) sends unique data to every other process and, at the same time, receives unique data back from each one. This means every GPU exchanges a distinct data segment with every other GPU simultaneously. A common use case when training large LLMs across multiple GPUs is expert parallelism in Mixture-of-Experts models, where All-to-All routes tokens from every GPU to the GPUs hosting the relevant experts.

All-to-all protocol
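A sketch of All-to-All with torch.distributed, where rank i sends chunk j of its local data to rank j and receives rank j’s chunk i in return (chunk sizes are assumptions; this collective is supported by the NCCL and MPI backends):

```python
import torch
import torch.distributed as dist

def exchange_chunks(local_data: torch.Tensor) -> torch.Tensor:
    """Split local data into world_size chunks, send chunk j to rank j,
    and receive one chunk from every other rank."""
    world_size = dist.get_world_size()
    send_chunks = list(local_data.chunk(world_size))
    recv_chunks = [torch.empty_like(c) for c in send_chunks]
    dist.all_to_all(recv_chunks, send_chunks)
    return torch.cat(recv_chunks)
```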

5. More Advanced Patterns: Reduce-Scatter and Barrier

Beyond the patterns we discussed so far, there are some additional communication patterns I’ll introduce quickly for completeness:

  • Reduce-Scatter: A variation of All-Reduce that gives each GPU just a piece of the final aggregated result. This optimizes memory use and is especially useful in sharded training methods like PyTorch’s FSDP or DeepSpeed ZeRO.
  • Barrier: A synchronization pattern where processes wait until all have reached a certain point, ensuring all GPUs/processes complete one phase before starting the next. This is important for precise benchmarks, evaluations, or epoch-based training. Both patterns are sketched below.
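For completeness, here is how the two look with torch.distributed (a quick sketch; tensor sizes are illustrative):

```python
import torch
import torch.distributed as dist

def reduce_scatter_and_sync(local_grads: torch.Tensor) -> torch.Tensor:
    """Reduce-Scatter: sum gradients across ranks, but keep only this rank's
    1/world_size slice of the result (as in FSDP/ZeRO-style sharding)."""
    world_size = dist.get_world_size()
    shards = list(local_grads.chunk(world_size))
    my_shard = torch.empty_like(shards[0])
    dist.reduce_scatter(my_shard, shards, op=dist.ReduceOp.SUM)

    # Barrier: block until every rank reaches this point, e.g. before timing
    # the next phase or writing a checkpoint.
    dist.barrier()
    return my_shard
```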

Communication Protocols: NCCL, MPI, and Gloo

Implementing efficient communication between GPUs (or processors in any parallel computing system) is not a simple task, and you should never attempt to implement your own! Therefore, mastering the industry’s three primary protocols is essential:

NCCL (NVIDIA Collective Communication Library)

NVIDIA’s NCCL is designed specifically for high-performance GPU communication. It provides a set of collective communication primitives (such as the communication patterns we discussed above) tailored specifically for GPU-to-GPU data transfer.

According to NVIDIA’s official NCCL documentation,

The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes.

Leading deep learning frameworks such as Caffe2, Chainer, MxNet, PyTorch and TensorFlow have integrated NCCL to accelerate deep learning training on multi-GPU multi-node systems.

If you’re not familiar with the hardware world, don’t worry too much about these terms. Just remember that NCCL leverages specialized, high-speed connections between GPUs, both within the same machine and across a network, to optimize data transfers and reduce latency. You can learn more about these terms and concepts by following the links: PCIe, NVLink, NVIDIA’s Networking solutions.

Virtually every major deep learning framework, from PyTorch and TensorFlow to specialized libraries like Horovod and DeepSpeed, relies on NCCL under the hood for efficient multi-GPU communication. While you may not directly call NCCL APIs yourself, whenever you train models using multiple GPUs with these frameworks, NCCL is likely powering the synchronization of gradients and parameters.
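In PyTorch, for example, using NCCL is usually just a matter of naming the backend when the process group is initialized. A typical sketch, assuming one process per GPU launched via torchrun (the model here is a placeholder):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, and the rendezvous variables.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()        # placeholder for your model
ddp_model = DDP(model, device_ids=[local_rank])   # gradient sync runs over NCCL
```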

MPI (Message Passing Interface)

MPI is a general-purpose communication standard that is not tied to a particular hardware type. While NCCL targets GPU collectives specifically, MPI is a language-independent specification for parallelizing tasks across multiple processes and nodes. It runs on CPU-based clusters by default, but many MPI implementations today also support direct communication between GPUs (known as “CUDA-aware MPI”).

One reason MPI remains popular in distributed ML is its maturity and flexibility. It supports both synchronous and asynchronous communication modes as well as various network interconnects, and vendor implementations such as Open MPI, MPICH, and Intel MPI are highly tuned.

While NCCL is the default choice for most ML frameworks when running distributed ML workloads on NVIDIA GPUs, MPI would be a great choice to consider in many other scenarios:

  • Hybrid CPU-GPU Systems: MPI handles communication well when your setup includes CPUs alongside GPUs, such as CPU-heavy clusters or hybrid deployments.
  • Complex Parallel Algorithms: MPI’s versatility and flexibility allow implementing complex communication patterns or custom algorithms beyond the standard collectives (see the sketch after this list).
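As a small illustration of MPI’s general-purpose flavor, here is a CPU-side all-reduce using mpi4py, one of the common Python bindings for MPI (the array contents and launch command are illustrative):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Each process computes a local partial result (here, a dummy gradient vector).
local_grad = np.full(4, float(rank))

# Sum the vectors across all processes; every rank receives the summed result.
summed = np.empty_like(local_grad)
comm.Allreduce(local_grad, summed, op=MPI.SUM)
averaged = summed / world_size

# Launch with, e.g.: mpirun -np 4 python this_script.py
```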

Gloo Communication Library

Gloo is an open-source collective communication library developed at Facebook (now Meta) to support distributed machine learning workloads. It was created to address gaps in existing libraries: traditional MPI was ubiquitous but not initially optimized for modern deep learning hardware, and NVIDIA’s NCCL was fast but limited to GPUs. Since being open-sourced in 2017, Gloo has become a key component in distributed deep learning, often serving as the default CPU communication backend in PyTorch and other frameworks.

Gloo provides lightweight yet robust support for diverse communication patterns in distributed ML, making it a flexible alternative to NCCL and MPI in less GPU-intensive contexts. It works for CPU-only training and mixed CPU/GPU scenarios, and it serves as the fallback for GPU training when NCCL is unavailable or runs into issues.
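In PyTorch, falling back to Gloo is again just a backend choice. A minimal CPU-only sketch (assumes the script is launched with torchrun, which provides the rendezvous environment variables):

```python
import torch
import torch.distributed as dist

# Gloo operates on CPU tensors, so no GPUs are required.
dist.init_process_group(backend="gloo")

local = torch.ones(4) * dist.get_rank()
dist.all_reduce(local, op=dist.ReduceOp.SUM)  # same collective, different backend
```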

Performance Analysis

I will not perform the analysis myself here, but instead highlight findings from a recent publication by Lee and Lee (2024) titled “Collective Communication Performance Evaluation for Distributed Deep Learning Training,” which I particularly enjoyed reading.

The authors conducted an extensive evaluation comparing the performance of MPI, GLOO, and NCCL for collective communication operations in distributed deep learning. They profiled various intra-node setups, including bare-metal, Singularity, single-docker (multiple GPUs per container), and cross-docker (single GPU per container) configurations. They also benchmarked latency for key operations such as broadcast, gather, all-gather, and all-reduce, using both Linux shell-based tests and PyTorch-driven workloads.

Here are some key findings worth mentioning:

  • NCCL achieved the lowest latency for GPU-based all-reduce operations, significantly outperforming MPI and GLOO, up to 345% faster in certain configurations.
  • In highly virtualized cross-docker setups, NCCL’s latency increased notably, by up to 213% compared to single-container scenarios.
  • GLOO performed particularly well in cross-container gather operations, reducing latency by 36% relative to single-docker setups.
  • MPI consistently demonstrated reliable performance, especially suited for CPU-centric communication tasks.

Clearly, a one-size-fits-all solution doesn’t exist, as each protocol shines in particular situations. You must evaluate your distributed ML requirements and choose the solution that best suits your use case!

Conclusion: Why These Patterns Matter in Distributed ML

If you’re serious about distributed machine learning, deeply understanding these communication patterns is critical. It’s easy to focus solely on computation, but in many cases efficient communication between processes/nodes determines whether your distributed training scales smoothly or stalls completely. Patterns like broadcast, scatter/gather, and especially all-reduce, all-gather, and all-to-all power virtually every distributed training and inference workload today.

I recognize that if you are an ML engineer or Applied Scientist, you will likely not implement these communication patterns yourself. Modern ML frameworks such as PyTorch, TensorFlow, DeepSpeed, and Horovod will handle them behind the scenes using NCCL, MPI, or Gloo. But understanding these protocols and patterns thoroughly will help you diagnose performance issues, optimize your infrastructure, and design better distributed ML systems.

In upcoming articles, I’ll dive deeper into the practical tools used by the ML community today. We’ll explore popular distributed training frameworks, including PyTorch DDP and FSDP, Microsoft’s DeepSpeed, Ray, and more. We’ll look into when and why you’d choose each framework, along with the trade-offs. Later in the series, we’ll examine real-world use cases and advanced optimizations.

Stay tuned, there’s plenty more ahead! Feel free to contact me if you have any questions or would like to provide feedback!

Distributed Machine Learning Series

  • Scaling AI: The Essentials of Distributed Machine Learning: Discover how distributed machine learning powers today’s largest AI models like GPT-4 and Stable Diffusion. Learn foundational parallel computing concepts, explore real-world strategies, and understand when and how to effectively apply distributed ML techniques.
  • 3 Key Parallel Computing Concepts for Distributed ML: Understand foundational parallel computing concepts such as SIMD, multithreading, and GPU kernels that underpin distributed machine learning. Learn how mastering these fundamentals helps optimize your ML systems before scaling to multi-node setups.

About the article

Disclaimer

The views and opinions expressed in my articles are my own and do not represent those of my current or past employers, or any other affiliations.

