Distributed Training
The torch.distributed package supports distributed training across multiple processes and machines. It provides communication primitives and high-level APIs for data parallelism and model parallelism.
Initialization
init_process_group
- backend: The backend to use. Valid values include 'nccl', 'gloo', and 'mpi'.
- init_method: URL specifying how to initialize the process group. If not specified, 'env://' is used.
- world_size: Number of processes participating in the job. Required if store is specified.
- rank: Rank of the current process. Required if store is specified.
- timeout: Timeout for operations executed against the process group.
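A minimal single-process sketch of initialization. It uses a FileStore rendezvous so it runs without any network configuration; real multi-process jobs are usually launched with torchrun, which sets the environment variables consumed by init_method='env://':

```python
import os
import tempfile

import torch.distributed as dist

# Single-process sketch: a FileStore rendezvous avoids the need for
# MASTER_ADDR/MASTER_PORT. Real jobs typically use init_method="env://"
# with variables set by the launcher (e.g. torchrun).
store_path = os.path.join(tempfile.mkdtemp(), "rendezvous")
store = dist.FileStore(store_path, 1)

dist.init_process_group(backend="gloo", store=store, rank=0, world_size=1)
initialized = dist.is_initialized()  # True once the default group exists
dist.destroy_process_group()
```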
is_initialized
True if the default process group has been initialized, False otherwise.
get_rank
- group: The process group to work on. If None, the default process group is used.
The rank of the current process within the group (-1 if not part of the group).
get_world_size
- group: The process group to work on.
The world size of the process group.
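A sketch of querying rank and world size, again using a degenerate one-process group so the code is self-contained; in a real multi-process launch each process reports its own rank:

```python
import os
import tempfile

import torch.distributed as dist

# One-process group (world_size=1) just to exercise the queries.
store = dist.FileStore(os.path.join(tempfile.mkdtemp(), "rendezvous"), 1)
dist.init_process_group(backend="gloo", store=store, rank=0, world_size=1)

rank = dist.get_rank()              # 0 in this single-process group
world_size = dist.get_world_size()  # 1 here
dist.destroy_process_group()
```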
Communication Primitives
send
- tensor: Tensor to send.
- dst: Destination rank.
- group: The process group to work on.
recv
- tensor: Tensor to fill with received data.
- src: Source rank. If None, will receive from any process.
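A sketch of a point-to-point transfer between two CPU processes over the gloo backend, assuming a machine where torch.multiprocessing.spawn can start worker processes:

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def point_to_point(rank, world_size, store_path):
    store = dist.FileStore(store_path, world_size)
    dist.init_process_group("gloo", store=store, rank=rank, world_size=world_size)
    if rank == 0:
        dist.send(torch.arange(3.0), dst=1)   # blocking send to rank 1
    else:
        buf = torch.empty(3)
        dist.recv(buf, src=0)                 # blocks until rank 0's tensor arrives
        assert torch.equal(buf, torch.arange(3.0))
    dist.destroy_process_group()


def run_demo(world_size=2):
    store_path = os.path.join(tempfile.mkdtemp(), "rendezvous")
    # join=True re-raises any worker failure in the parent process.
    mp.spawn(point_to_point, args=(world_size, store_path), nprocs=world_size, join=True)
    return True


if __name__ == "__main__":
    run_demo()
```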
broadcast
- tensor: Data to be sent if src is the rank of the current process, and tensor to be used to save received data otherwise.
- src: Source rank.
- async_op: Whether this op should be an async op.
all_reduce
- tensor: Input and output of the collective. The function operates in-place.
- op: One of the values from the torch.distributed.ReduceOp enum. Specifies an operation used for element-wise reductions.
reduce
- tensor: Input and output of the collective.
- dst: Destination rank.
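A sketch of summing a value across two CPU ranks with all_reduce (gloo backend, workers started via torch.multiprocessing.spawn); each rank verifies that the in-place result is the sum over all ranks:

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def sum_across_ranks(rank, world_size, store_path):
    store = dist.FileStore(store_path, world_size)
    dist.init_process_group("gloo", store=store, rank=rank, world_size=world_size)
    t = torch.tensor([float(rank + 1)])        # rank 0 holds 1.0, rank 1 holds 2.0
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # in-place: every rank now holds 3.0
    assert t.item() == 3.0
    dist.destroy_process_group()


def run_demo(world_size=2):
    store_path = os.path.join(tempfile.mkdtemp(), "rendezvous")
    mp.spawn(sum_across_ranks, args=(world_size, store_path), nprocs=world_size, join=True)
    return True


if __name__ == "__main__":
    run_demo()
```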
all_gather
- tensor_list: Output list. It should contain correctly-sized tensors to be used for output of the collective.
- tensor: Tensor to be broadcast from the current process.
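A degenerate world_size=1 sketch of all_gather; with more ranks, the output list would collect one tensor per rank, in rank order:

```python
import os
import tempfile

import torch
import torch.distributed as dist

store = dist.FileStore(os.path.join(tempfile.mkdtemp(), "rendezvous"), 1)
dist.init_process_group(backend="gloo", store=store, rank=0, world_size=1)

local = torch.tensor([5.0, 6.0])
# One pre-allocated, correctly-sized slot per rank.
output_list = [torch.zeros(2) for _ in range(dist.get_world_size())]
dist.all_gather(output_list, local)  # fills each slot with that rank's tensor
dist.destroy_process_group()
```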
gather
- tensor: Input tensor.
- gather_list: List of appropriately-sized tensors to use for received data (required only in the destination process).
- dst: Destination rank.
scatter
- tensor: Output tensor.
- scatter_list: List of tensors to scatter (required only in the source process).
- src: Source rank.
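A single-process (world_size=1) sketch of gather and scatter on CPU with the gloo backend; in a multi-rank job, only the destination rank supplies gather_list and only the source rank supplies scatter_list:

```python
import os
import tempfile

import torch
import torch.distributed as dist

store = dist.FileStore(os.path.join(tempfile.mkdtemp(), "rendezvous"), 1)
dist.init_process_group(backend="gloo", store=store, rank=0, world_size=1)

# gather: every rank sends `sent`; the destination fills `received`.
sent = torch.tensor([1.0, 2.0])
received = [torch.zeros(2)]
dist.gather(sent, gather_list=received, dst=0)

# scatter: the source provides scatter_list; every rank receives into `out`.
out = torch.zeros(2)
dist.scatter(out, scatter_list=[torch.tensor([7.0, 8.0])], src=0)
dist.destroy_process_group()
```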
barrier
- group: The process group to work on.
- async_op: Whether this op should be an async op.
High-Level APIs
DistributedDataParallel
- module: Module to be parallelized.
- device_ids: CUDA devices for single-device modules.
- find_unused_parameters: Whether to find unused parameters. Useful for models with conditional execution.
- gradient_as_bucket_view: Whether gradients should be views into the DDP gradient buckets.
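A single-process CPU sketch of wrapping a model in DistributedDataParallel (gloo backend, world_size=1). On GPUs you would instead use the 'nccl' backend and pass device_ids=[local_rank]:

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

store = dist.FileStore(os.path.join(tempfile.mkdtemp(), "rendezvous"), 1)
dist.init_process_group(backend="gloo", store=store, rank=0, world_size=1)

model = nn.Linear(4, 2)
ddp_model = DDP(model)  # gradients are all-reduced across ranks during backward()

loss = ddp_model(torch.randn(8, 4)).sum()
loss.backward()
has_grad = model.weight.grad is not None
dist.destroy_process_group()
```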
Distributed Samplers
DistributedSampler
Restricts data loading to a subset of the dataset. Especially useful in conjunction with DistributedDataParallel.
- dataset: Dataset to be sampled.
- num_replicas: Number of processes participating in distributed training.
- rank: Rank of the current process within num_replicas.
- shuffle: If True, sampler will shuffle the indices.
Example Usage
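A minimal sketch of how two hypothetical ranks partition a dataset; no process group is required when num_replicas and rank are passed explicitly:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Ten samples split across two ranks; shuffle=False keeps the
# round-robin assignment deterministic.
dataset = TensorDataset(torch.arange(10.0))
sampler_rank0 = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
sampler_rank1 = DistributedSampler(dataset, num_replicas=2, rank=1, shuffle=False)

indices0 = list(sampler_rank0)  # indices assigned to rank 0
indices1 = list(sampler_rank1)  # indices assigned to rank 1
```

In a real training loop with shuffle=True, call sampler.set_epoch(epoch) at the start of each epoch so the shuffling order differs across epochs.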
Backends
NCCL Backend
- Recommended for NVIDIA GPUs
- Best performance for GPU-to-GPU communication
- Supports collective operations on CUDA tensors
Gloo Backend
- Supports both CPU and GPU
- Good for CPU-based distributed training
- Cross-platform support (Linux, macOS, Windows)
MPI Backend
- Requires MPI implementation (e.g., OpenMPI)
- Good for HPC environments
- Supports both CPU and GPU