PyTorch provides comprehensive support for NVIDIA CUDA-enabled GPUs, enabling massive acceleration for deep learning workloads through GPU computation.

Overview

The torch.cuda package adds support for CUDA tensor types that utilize GPUs for computation. It implements the same functions as CPU tensors but leverages NVIDIA GPUs for significantly faster numerical operations.
CUDA operations are lazily initialized: you can always import torch.cuda, and call torch.cuda.is_available() to check whether your system supports CUDA.
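Because initialization is lazy, a common follow-up is to select a device once at startup and write device-agnostic code. A minimal sketch of that standard idiom:

```python
import torch

# Fall back to the CPU when no CUDA device is present
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tensors created with device= land on the selected device
x = torch.randn(8, 8, device=device)
print(x.device)
```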

Installation & Setup

Prerequisites

To compile PyTorch with CUDA support, you need a CUDA-capable NVIDIA GPU, a compatible NVIDIA driver, a supported version of the CUDA toolkit, and, typically, cuDNN for accelerated deep learning primitives.
Refer to the cuDNN Support Matrix for version compatibility across cuDNN, CUDA, CUDA drivers, and NVIDIA hardware.

Environment Variables

# Disable CUDA support
export USE_CUDA=0

# Set custom CUDA installation path
export PATH=/usr/local/cuda-12.8/bin:$PATH

# Set ROCm installation directory (for AMD GPUs)
export ROCM_PATH=/opt/rocm

Building from Source

# Install dependencies
pip install mkl-static mkl-include

# CUDA only: Add LAPACK support for GPU
.ci/docker/common/install_magma_conda.sh 12.4

# Build PyTorch
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-'$(dirname $(which conda))/../'}:${CMAKE_PREFIX_PATH}"
python -m pip install --no-build-isolation -v -e .

Device Management

Checking CUDA Availability

import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"CUDA is available with {torch.cuda.device_count()} GPU(s)")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available")

Device Selection

# Use context manager to temporarily switch devices
with torch.cuda.device(1):
    # Operations here use GPU 1
    tensor = torch.randn(100, 100, device='cuda')
    result = tensor @ tensor.T

# Back to previous device

Device Properties

# Get device properties
props = torch.cuda.get_device_properties(0)

print(f"Device name: {props.name}")
print(f"Compute capability: {props.major}.{props.minor}")
print(f"Total memory: {props.total_memory / 1e9:.2f} GB")
print(f"Multi-processor count: {props.multi_processor_count}")

Tensor Operations

Creating CUDA Tensors

# Method 1: Specify device during creation
tensor_gpu = torch.randn(1000, 1000, device='cuda')

# Method 2: Move existing tensor to GPU
tensor_cpu = torch.randn(1000, 1000)
tensor_gpu = tensor_cpu.to('cuda')

# Method 3: Use cuda() method
tensor_gpu = tensor_cpu.cuda()

# Specify device index
tensor_gpu1 = torch.randn(100, 100, device='cuda:1')

Moving Tensors Between Devices

# Create tensor on CPU
x = torch.randn(100, 100)

# Move to GPU
x_gpu = x.to('cuda')

# Move to specific GPU
x_gpu1 = x.to('cuda:1')

# Move back to CPU
x_cpu = x_gpu.to('cpu')

# Keep device unchanged if already on target
x_safe = x.to('cuda')  # No-op if already on CUDA
Cross-device operations are not allowed. Ensure all tensors are on the same device before performing operations:
a = torch.randn(10, device='cuda:0')
b = torch.randn(10, device='cuda:1')

# This will raise an error
# c = a + b

# Correct approach
b = b.to('cuda:0')
c = a + b

Memory Management

PyTorch uses a caching memory allocator for efficient GPU memory management.
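One consequence of the caching allocator: freeing a tensor returns its block to PyTorch's cache, not to the driver, so allocated memory drops while reserved memory typically stays put. A small sketch (guarded so it only runs when a GPU is present):

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device='cuda')
    print(torch.cuda.memory_allocated())  # includes x's storage
    del x
    print(torch.cuda.memory_allocated())  # drops once x is freed
    print(torch.cuda.memory_reserved())   # freed block stays cached for reuse
```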

Memory Information

# Get current memory usage
allocated = torch.cuda.memory_allocated(0)
reserved = torch.cuda.memory_reserved(0)

print(f"Allocated: {allocated / 1e9:.2f} GB")
print(f"Reserved: {reserved / 1e9:.2f} GB")

# Get memory summary
print(torch.cuda.memory_summary(device=0))

Memory Cleanup

# Free unused cached memory
torch.cuda.empty_cache()

# Reset peak memory stats
torch.cuda.reset_peak_memory_stats(0)

# Reset accumulated (interval) memory stats
torch.cuda.reset_accumulated_memory_stats(0)

Memory Allocation

# Direct memory allocation for interop with other frameworks
ptr = torch.cuda.caching_allocator_alloc(
    size=1024 * 1024,  # 1MB
    device=0,
    stream=torch.cuda.current_stream()
)

# Free allocated memory
torch.cuda.caching_allocator_delete(ptr)

Out of Memory (OOM) Handling

try:
    # Attempt large allocation
    huge_tensor = torch.randn(100000, 100000, device='cuda')
except torch.cuda.OutOfMemoryError:
    print("Out of memory! Clearing cache...")
    torch.cuda.empty_cache()
    # Try with smaller tensor
    smaller_tensor = torch.randn(10000, 10000, device='cuda')

Streams and Synchronization

CUDA Streams

Streams allow asynchronous GPU operations for better performance.
# Create a new stream
stream = torch.cuda.Stream()

# Use stream as context manager
with torch.cuda.stream(stream):
    # Operations in this block use the specified stream
    output = model(input_tensor)

# Wait for stream to complete
stream.synchronize()

Stream Synchronization

# Create two streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with torch.cuda.stream(stream1):
    x = torch.randn(100, 100, device='cuda')
    y = x @ x.T

with torch.cuda.stream(stream2):
    z = torch.randn(100, 100, device='cuda')
    
    # Wait for stream1 to complete
    stream2.wait_stream(stream1)
    
    # Now safe to use y from stream1
    result = y + z

Events

# Create and record events for timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()

# Perform operations
output = model(input_tensor)

end_event.record()
torch.cuda.synchronize()

elapsed_time = start_event.elapsed_time(end_event)
print(f"Operation took {elapsed_time:.2f} ms")

Device Synchronization

# Wait for all operations on current device
torch.cuda.synchronize()

# Wait for specific device
torch.cuda.synchronize(device=0)

# Check if stream has completed
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    output = model(input)

if stream.query():
    print("Stream completed")
else:
    print("Stream still running")

CUDA Graphs

CUDA Graphs capture and replay sequences of operations for reduced overhead.
# Create static input/output tensors with fixed addresses
static_input = torch.randn(1000, 1000, device='cuda')
static_output = torch.empty(1000, 1000, device='cuda')

# Capture graph
g = torch.cuda.CUDAGraph()

with torch.cuda.graph(g):
    # Write into the pre-allocated output rather than rebinding the name
    torch.matmul(static_input, static_input.T, out=static_output)

# Replay graph (much faster than re-dispatching each op)
for _ in range(100):
    g.replay()
CUDA Graphs require static memory addresses. Don’t allocate new tensors inside the graph capture.
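Since addresses are fixed, new inputs are fed by copying into the captured tensors in place rather than creating fresh ones. A minimal end-to-end sketch (guarded so it only runs when a GPU is present; the side-stream warmup before capture follows the pattern recommended in PyTorch's CUDA graphs documentation):

```python
import torch

if torch.cuda.is_available():
    static_input = torch.randn(1000, 1000, device='cuda')
    static_output = torch.empty(1000, 1000, device='cuda')

    # Warm up once on a side stream before capturing
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        torch.matmul(static_input, static_input.T, out=static_output)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the operation into a graph
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        torch.matmul(static_input, static_input.T, out=static_output)

    # Feed new data by copying into the captured tensor's storage
    static_input.copy_(torch.randn(1000, 1000, device='cuda'))
    g.replay()  # static_output now reflects the new input
```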

Mixed Precision Training

Automatic Mixed Precision (AMP)

from torch.amp import autocast, GradScaler

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler('cuda')

for epoch in range(num_epochs):
    for input, target in dataloader:
        input = input.cuda()
        target = target.cuda()

        optimizer.zero_grad()

        # Automatic mixed precision
        with autocast('cuda'):
            output = model(input)
            loss = criterion(output, target)
        
        # Scale loss and backward
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

TF32 and BF16 Support

# TF32 requires Ampere GPUs (compute capability 8.0) or newer
if torch.cuda.get_device_capability(0)[0] >= 8:
    # Enable TF32 for matmul operations
    torch.backends.cuda.matmul.allow_tf32 = True
    
# Check BF16 support  
if torch.cuda.is_bf16_supported():
    print("BFloat16 is supported")
    tensor_bf16 = torch.randn(100, 100, dtype=torch.bfloat16, device='cuda')

Random Number Generation

# Set random seed for reproducibility
torch.cuda.manual_seed(42)

# Set seed for all GPUs
torch.cuda.manual_seed_all(42)

# Get/set RNG state
rng_state = torch.cuda.get_rng_state()
# ... perform operations ...
torch.cuda.set_rng_state(rng_state)  # Restore state
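Note that the device-wide torch.manual_seed() also seeds every CUDA generator, so it is usually sufficient on its own. Re-seeding with the same value reproduces the same random tensors; this sketch falls back to the CPU when no GPU is available:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(42)  # seeds the CPU and all CUDA generators
a = torch.randn(3, device=device)

torch.manual_seed(42)  # same seed -> same sequence
b = torch.randn(3, device=device)

print(torch.equal(a, b))  # True
```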

Multi-GPU Training

DataParallel

PyTorch's documentation recommends DistributedDataParallel over DataParallel for multi-GPU training, even on a single machine.

model = MyModel()
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = torch.nn.DataParallel(model)

model = model.cuda()

DistributedDataParallel

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group
dist.init_process_group(backend='nccl')

# Pin each process to a single GPU (typical single-node pattern),
# then create the model there and wrap it
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Training loop
for input, target in dataloader:
    optimizer.zero_grad()
    output = model(input.cuda())
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()

Performance Optimization

Pinned Memory

# Create pinned memory tensor for faster CPU-GPU transfer
tensor_pinned = torch.randn(1000, 1000).pin_memory()
tensor_gpu = tensor_pinned.cuda(non_blocking=True)

# Use pinned memory in DataLoader
train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True,
    num_workers=4
)

Profiling

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True
) as prof:
    model(input_tensor.cuda())

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Utilities

Device Capability Check

# Get compute capability
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# Get supported architectures
arch_list = torch.cuda.get_arch_list()
print(f"Supported architectures: {arch_list}")

NVTX Markers for Profiling

import torch.cuda.nvtx as nvtx

# Add range markers
nvtx.range_push("forward pass")
output = model(input)
nvtx.range_pop()

nvtx.range_push("backward pass")
loss.backward()
nvtx.range_pop()

Common Issues

CUDA Out of Memory

If you encounter OOM errors:
  1. Reduce batch size
  2. Use gradient accumulation
  3. Enable gradient checkpointing
  4. Clear cache with torch.cuda.empty_cache()
  5. Use mixed precision training
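Gradient accumulation (item 2) simulates a large batch by summing gradients over several small batches before each optimizer step. A minimal sketch with a stand-in model and synthetic batches (the names and sizes here are illustrative):

```python
import torch

# Hypothetical stand-ins for a real model and data loader
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(8)]

accum_steps = 4  # effective batch size = 4 batches x 4 samples = 16
optimizer.zero_grad()
for i, (input, target) in enumerate(batches):
    loss = criterion(model(input), target)
    (loss / accum_steps).backward()  # scale so gradients average correctly
    if (i + 1) % accum_steps == 0:
        optimizer.step()             # one update per accum_steps batches
        optimizer.zero_grad()
```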
Fork Compatibility

CUDA cannot be re-initialized in a forked subprocess. Use the 'spawn' start method for multiprocessing:
import torch.multiprocessing as mp
mp.set_start_method('spawn')
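A fuller sketch of the spawn pattern; the worker function and process count here are illustrative, and mp.get_context is used to avoid mutating the global start method:

```python
import torch
import torch.multiprocessing as mp

def worker(rank):
    # Safe: this process was spawned, not forked, so CUDA can initialize here
    if torch.cuda.is_available():
        device = torch.device('cuda', rank % torch.cuda.device_count())
    else:
        device = torch.device('cpu')
    print(f"worker {rank} using {device}")

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    procs = [ctx.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```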

API Reference

Key functions and classes:
  • torch.cuda.is_available() - Check CUDA availability
  • torch.cuda.device_count() - Get number of GPUs
  • torch.cuda.current_device() - Get current device index
  • torch.cuda.get_device_name() - Get device name
  • torch.cuda.get_device_properties() - Get device properties
  • torch.cuda.memory_allocated() - Get allocated memory
  • torch.cuda.memory_reserved() - Get reserved memory
  • torch.cuda.synchronize() - Wait for all kernels on a device to complete
  • torch.cuda.Stream - CUDA stream for async operations
  • torch.cuda.Event - CUDA event for synchronization
  • torch.cuda.CUDAGraph - Capture and replay operation graphs
See CUDA Semantics for more details.