PyTorch provides comprehensive support for NVIDIA CUDA-enabled GPUs, enabling massive acceleration for deep learning workloads through GPU computation.

Overview

The torch.cuda package adds support for CUDA tensor types that utilize GPUs for computation. It implements the same functions as CPU tensors but leverages NVIDIA GPUs for significantly faster numerical operations.
CUDA operations are lazily initialized: you can always import torch.cuda, and call torch.cuda.is_available() to check whether your system supports CUDA.
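Because initialization is lazy, a common follow-up is to select a device once at startup and write device-agnostic code. A minimal sketch of that standard idiom:

```python
import torch

# Fall back to the CPU when no CUDA device is present
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tensors created with device= land on the selected device
x = torch.randn(8, 8, device=device)
print(x.device)
```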

Installation & Setup

Prerequisites

To compile PyTorch with CUDA support, you need a CUDA-capable NVIDIA GPU, a compatible NVIDIA driver, a supported version of the CUDA toolkit, and, typically, cuDNN for accelerated deep learning primitives.
Refer to the cuDNN Support Matrix for version compatibility across cuDNN, CUDA, CUDA drivers, and NVIDIA hardware.

Environment Variables

# Disable CUDA support
export USE_CUDA=0

# Set custom CUDA installation path
export PATH=/usr/local/cuda-12.8/bin:$PATH

# Set ROCm installation directory (for AMD GPUs)
export ROCM_PATH=/opt/rocm

Building from Source

# Install dependencies
pip install mkl-static mkl-include

# CUDA only: Add LAPACK support for GPU
.ci/docker/common/install_magma_conda.sh 12.4

# Build PyTorch
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-'$(dirname $(which conda))/../'}:${CMAKE_PREFIX_PATH}"
python -m pip install --no-build-isolation -v -e .

Device Management

Checking CUDA Availability

import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"CUDA is available with {torch.cuda.device_count()} GPU(s)")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available")

Device Selection

# Use context manager to temporarily switch devices
with torch.cuda.device(1):
    # Operations here use GPU 1
    tensor = torch.randn(100, 100, device='cuda')
    result = tensor @ tensor.T

# Back to previous device

Device Properties

# Get device properties
props = torch.cuda.get_device_properties(0)

print(f"Device name: {props.name}")
print(f"Compute capability: {props.major}.{props.minor}")
print(f"Total memory: {props.total_memory / 1e9:.2f} GB")
print(f"Multi-processor count: {props.multi_processor_count}")

Tensor Operations

Creating CUDA Tensors

# Method 1: Specify device during creation
tensor_gpu = torch.randn(1000, 1000, device='cuda')

# Method 2: Move existing tensor to GPU
tensor_cpu = torch.randn(1000, 1000)
tensor_gpu = tensor_cpu.to('cuda')

# Method 3: Use cuda() method
tensor_gpu = tensor_cpu.cuda()

# Specify device index
tensor_gpu1 = torch.randn(100, 100, device='cuda:1')

Moving Tensors Between Devices

# Create tensor on CPU
x = torch.randn(100, 100)

# Move to GPU
x_gpu = x.to('cuda')

# Move to specific GPU
x_gpu1 = x.to('cuda:1')

# Move back to CPU
x_cpu = x_gpu.to('cpu')

# Keep device unchanged if already on target
x_safe = x.to('cuda')  # No-op if already on CUDA
Cross-device operations are not allowed. Ensure all tensors are on the same device before performing operations:
a = torch.randn(10, device='cuda:0')
b = torch.randn(10, device='cuda:1')

# This will raise an error
# c = a + b

# Correct approach
b = b.to('cuda:0')
c = a + b

Memory Management

PyTorch uses a caching memory allocator for efficient GPU memory management.
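One consequence of the caching allocator: freeing a tensor returns its block to PyTorch's cache, not to the driver, so allocated memory drops while reserved memory typically stays put. A small sketch (guarded so it only runs when a GPU is present):

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device='cuda')
    print(torch.cuda.memory_allocated())  # includes x's storage
    del x
    print(torch.cuda.memory_allocated())  # drops once x is freed
    print(torch.cuda.memory_reserved())   # freed block stays cached for reuse
```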

Memory Information

# Get current memory usage
allocated = torch.cuda.memory_allocated(0)
reserved = torch.cuda.memory_reserved(0)

print(f"Allocated: {allocated / 1e9:.2f} GB")
print(f"Reserved: {reserved / 1e9:.2f} GB")

# Get memory summary
print(torch.cuda.memory_summary(device=0))

Memory Cleanup

# Free unused cached memory
torch.cuda.empty_cache()

# Reset peak memory stats
torch.cuda.reset_peak_memory_stats(0)

# Reset accumulated (interval) memory stats
torch.cuda.reset_accumulated_memory_stats(0)

Memory Allocation

# Direct memory allocation for interop with other frameworks
ptr = torch.cuda.caching_allocator_alloc(
    size=1024 * 1024,  # 1MB
    device=0,
    stream=torch.cuda.current_stream()
)

# Free allocated memory
torch.cuda.caching_allocator_delete(ptr)

Out of Memory (OOM) Handling

try:
    # Attempt large allocation
    huge_tensor = torch.randn(100000, 100000, device='cuda')
except torch.cuda.OutOfMemoryError:
    print("Out of memory! Clearing cache...")
    torch.cuda.empty_cache()
    # Try with smaller tensor
    smaller_tensor = torch.randn(10000, 10000, device='cuda')

Streams and Synchronization

CUDA Streams

Streams allow asynchronous GPU operations for better performance.
# Create a new stream
stream = torch.cuda.Stream()

# Use stream as context manager
with torch.cuda.stream(stream):
    # Operations in this block use the specified stream
    output = model(input_tensor)

# Wait for stream to complete
stream.synchronize()

Stream Synchronization

# Create two streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with torch.cuda.stream(stream1):
    x = torch.randn(100, 100, device='cuda')
    y = x @ x.T

with torch.cuda.stream(stream2):
    z = torch.randn(100, 100, device='cuda')
    
    # Wait for stream1 to complete
    stream2.wait_stream(stream1)
    
    # Now safe to use y from stream1
    result = y + z

Events

# Create and record events for timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()

# Perform operations
output = model(input_tensor)

end_event.record()
torch.cuda.synchronize()

elapsed_time = start_event.elapsed_time(end_event)
print(f"Operation took {elapsed_time:.2f} ms")

Device Synchronization

# Wait for all operations on current device
torch.cuda.synchronize()

# Wait for specific device
torch.cuda.synchronize(device=0)

# Check if stream has completed
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    output = model(input)

if stream.query():
    print("Stream completed")
else:
    print("Stream still running")

CUDA Graphs

CUDA Graphs capture and replay sequences of operations for reduced overhead.
# Create static input/output tensors with fixed addresses
static_input = torch.randn(1000, 1000, device='cuda')
static_output = torch.empty(1000, 1000, device='cuda')

# Capture graph
g = torch.cuda.CUDAGraph()

with torch.cuda.graph(g):
    # Write into the pre-allocated output rather than rebinding the name
    torch.matmul(static_input, static_input.T, out=static_output)

# Replay graph (much faster than re-dispatching each op)
for _ in range(100):
    g.replay()
CUDA Graphs require static memory addresses. Don’t allocate new tensors inside the graph capture.
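Since addresses are fixed, new inputs are fed by copying into the captured tensors in place rather than creating fresh ones. A minimal end-to-end sketch (guarded so it only runs when a GPU is present; the side-stream warmup before capture follows the pattern recommended in PyTorch's CUDA graphs documentation):

```python
import torch

if torch.cuda.is_available():
    static_input = torch.randn(1000, 1000, device='cuda')
    static_output = torch.empty(1000, 1000, device='cuda')

    # Warm up once on a side stream before capturing
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        torch.matmul(static_input, static_input.T, out=static_output)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the operation into a graph
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        torch.matmul(static_input, static_input.T, out=static_output)

    # Feed new data by copying into the captured tensor's storage
    static_input.copy_(torch.randn(1000, 1000, device='cuda'))
    g.replay()  # static_output now reflects the new input
```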

Mixed Precision Training

Automatic Mixed Precision (AMP)

from torch.amp import autocast, GradScaler

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler('cuda')

for epoch in range(num_epochs):
    for input, target in dataloader:
        input = input.cuda()
        target = target.cuda()

        optimizer.zero_grad()

        # Automatic mixed precision
        with autocast('cuda'):
            output = model(input)
            loss = criterion(output, target)
        
        # Scale loss and backward
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

TF32 and BF16 Support

# TF32 requires Ampere GPUs (compute capability 8.0) or newer
if torch.cuda.get_device_capability(0)[0] >= 8:
    # Enable TF32 for matmul operations
    torch.backends.cuda.matmul.allow_tf32 = True
    
# Check BF16 support  
if torch.cuda.is_bf16_supported():
    print("BFloat16 is supported")
    tensor_bf16 = torch.randn(100, 100, dtype=torch.bfloat16, device='cuda')

Random Number Generation

# Set random seed for reproducibility
torch.cuda.manual_seed(42)

# Set seed for all GPUs
torch.cuda.manual_seed_all(42)

# Get/set RNG state
rng_state = torch.cuda.get_rng_state()
# ... perform operations ...
torch.cuda.set_rng_state(rng_state)  # Restore state
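Note that the device-wide torch.manual_seed() also seeds every CUDA generator, so it is usually sufficient on its own. Re-seeding with the same value reproduces the same random tensors; this sketch falls back to the CPU when no GPU is available:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(42)  # seeds the CPU and all CUDA generators
a = torch.randn(3, device=device)

torch.manual_seed(42)  # same seed -> same sequence
b = torch.randn(3, device=device)

print(torch.equal(a, b))  # True
```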

Multi-GPU Training

DataParallel

PyTorch's documentation recommends DistributedDataParallel over DataParallel for multi-GPU training, even on a single machine.

model = MyModel()
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = torch.nn.DataParallel(model)

model = model.cuda()

DistributedDataParallel

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group
dist.init_process_group(backend='nccl')

# Pin each process to a single GPU (typical single-node pattern),
# then create the model there and wrap it
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Training loop
for input, target in dataloader:
    optimizer.zero_grad()
    output = model(input.cuda())
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()

Performance Optimization

Pinned Memory

# Create pinned memory tensor for faster CPU-GPU transfer
tensor_pinned = torch.randn(1000, 1000).pin_memory()
tensor_gpu = tensor_pinned.cuda(non_blocking=True)

# Use pinned memory in DataLoader
train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True,
    num_workers=4
)

Profiling

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True
) as prof:
    model(input_tensor.cuda())

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Utilities

Device Capability Check

# Get compute capability
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# Get supported architectures
arch_list = torch.cuda.get_arch_list()
print(f"Supported architectures: {arch_list}")

NVTX Markers for Profiling

import torch.cuda.nvtx as nvtx

# Add range markers
nvtx.range_push("forward pass")
output = model(input)
nvtx.range_pop()

nvtx.range_push("backward pass")
loss.backward()
nvtx.range_pop()

Common Issues

CUDA Out of Memory

If you encounter OOM errors:
  1. Reduce batch size
  2. Use gradient accumulation
  3. Enable gradient checkpointing
  4. Clear cache with torch.cuda.empty_cache()
  5. Use mixed precision training
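Gradient accumulation (item 2) simulates a large batch by summing gradients over several small batches before each optimizer step. A minimal sketch with a stand-in model and synthetic batches (the names and sizes here are illustrative):

```python
import torch

# Hypothetical stand-ins for a real model and data loader
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(8)]

accum_steps = 4  # effective batch size = 4 batches x 4 samples = 16
optimizer.zero_grad()
for i, (input, target) in enumerate(batches):
    loss = criterion(model(input), target)
    (loss / accum_steps).backward()  # scale so gradients average correctly
    if (i + 1) % accum_steps == 0:
        optimizer.step()             # one update per accum_steps batches
        optimizer.zero_grad()
```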
Fork Compatibility

CUDA cannot be re-initialized in a forked subprocess. Use the 'spawn' start method for multiprocessing:
import torch.multiprocessing as mp
mp.set_start_method('spawn')
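A fuller sketch of the spawn pattern; the worker function and process count here are illustrative, and mp.get_context is used to avoid mutating the global start method:

```python
import torch
import torch.multiprocessing as mp

def worker(rank):
    # Safe: this process was spawned, not forked, so CUDA can initialize here
    if torch.cuda.is_available():
        device = torch.device('cuda', rank % torch.cuda.device_count())
    else:
        device = torch.device('cpu')
    print(f"worker {rank} using {device}")

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    procs = [ctx.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```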

API Reference

Key functions and classes:
  • torch.cuda.is_available() - Check CUDA availability
  • torch.cuda.device_count() - Get number of GPUs
  • torch.cuda.current_device() - Get current device index
  • torch.cuda.get_device_name() - Get device name
  • torch.cuda.get_device_properties() - Get device properties
  • torch.cuda.memory_allocated() - Get allocated memory
  • torch.cuda.memory_reserved() - Get reserved memory
  • torch.cuda.synchronize() - Wait for all kernels on a device to complete
  • torch.cuda.Stream - CUDA stream for async operations
  • torch.cuda.Event - CUDA event for synchronization
  • torch.cuda.CUDAGraph - Capture and replay operation graphs
See CUDA Semantics for more details.