Complete guide to NVIDIA CUDA GPU acceleration in PyTorch
PyTorch provides comprehensive support for NVIDIA CUDA-enabled GPUs, enabling massive acceleration for deep learning workloads through GPU computation.
The torch.cuda package adds support for CUDA tensor types that utilize GPUs for computation. It implements the same functions as CPU tensors but leverages NVIDIA GPUs for significantly faster numerical operations.
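Because CUDA tensors expose the same operations as CPU tensors, code can be written once against a device variable. A minimal sketch (the sizes are arbitrary; it falls back to CPU when no GPU is present):

```python
import torch

# Pick a device once; the same code runs on CPU when CUDA is absent
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.randn(4, 4, device=device)
b = torch.randn(4, 4, device=device)

# Identical API on both devices: matmul, reductions, indexing, ...
c = a @ b
total = c.sum()
print(total.device)  # matches `device`
```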
CUDA operations are lazily initialized: you can always import torch.cuda, then call torch.cuda.is_available() to check whether your system supports CUDA before using it.
```shell
# Disable CUDA support
export USE_CUDA=0
# Set custom CUDA installation path
export PATH=/usr/local/cuda-12.8/bin:$PATH
# Set ROCm installation directory (for AMD GPUs)
export ROCM_PATH=/opt/rocm
```
```python
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"CUDA is available with {torch.cuda.device_count()} GPU(s)")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available")
```
```python
# Use context manager to temporarily switch devices
with torch.cuda.device(1):
    # Operations here use GPU 1
    tensor = torch.randn(100, 100, device='cuda')
    result = tensor @ tensor.T
# Back to previous device
```
```python
# Create tensor on CPU
x = torch.randn(100, 100)
# Move to GPU
x_gpu = x.to('cuda')
# Move to specific GPU
x_gpu1 = x.to('cuda:1')
# Move back to CPU
x_cpu = x_gpu.to('cpu')
# Keep device unchanged if already on target
x_safe = x.to('cuda')  # No-op if already on CUDA
```
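A common device-agnostic pattern is to compute the target device once and pass it to `.to()`, so the same script runs with or without a GPU. A sketch (the `nn.Linear` module stands in for any model):

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(8, 2).to(device)  # moves parameters in place
x = torch.randn(5, 8).to(device)    # returns a copy on `device`

out = model(x)
print(out.device)  # same device as the inputs and parameters
```

Note the asymmetry: `Module.to()` moves parameters in place, while `Tensor.to()` returns a new tensor.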
Cross-device operations are not allowed. Ensure all tensors are on the same device before performing operations:
```python
a = torch.randn(10, device='cuda:0')
b = torch.randn(10, device='cuda:1')
# This will raise an error
# c = a + b
# Correct approach
b = b.to('cuda:0')
c = a + b
```
A CUDA stream is a sequence of operations that execute in issue order on the GPU; independent streams can run concurrently, allowing computation and data transfers to overlap for better performance.
```python
# Create a new stream
stream = torch.cuda.Stream()
# Use stream as context manager
with torch.cuda.stream(stream):
    # Operations in this block use the specified stream
    output = model(input_tensor)
# Wait for stream to complete
stream.synchronize()
```
```python
# Create two streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

with torch.cuda.stream(stream1):
    x = torch.randn(100, 100, device='cuda')
    y = x @ x.T

with torch.cuda.stream(stream2):
    z = torch.randn(100, 100, device='cuda')
    # Wait for stream1 to complete
    stream2.wait_stream(stream1)
    # Now safe to use y from stream1
    result = y + z
```
```python
# Create and record events for timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
# Perform operations
output = model(input_tensor)
end_event.record()

torch.cuda.synchronize()
elapsed_time = start_event.elapsed_time(end_event)
print(f"Operation took {elapsed_time:.2f} ms")
```
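Because CUDA kernels launch asynchronously, a host-side timer such as `time.perf_counter()` only measures GPU work accurately if the device is synchronized before and after the timed region. A sketch (the matrix size and warm-up count are arbitrary; on a CPU-only machine the synchronize calls are simply skipped):

```python
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1024, 1024, device=device)

# Warm-up: the first calls pay one-time kernel/cuBLAS setup costs
for _ in range(3):
    x @ x

if device.type == 'cuda':
    torch.cuda.synchronize()  # finish pending work before starting the clock
start = time.perf_counter()
y = x @ x
if device.type == 'cuda':
    torch.cuda.synchronize()  # wait for the kernel before reading the clock
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"matmul took {elapsed_ms:.2f} ms")
```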
```python
# Wait for all operations on current device
torch.cuda.synchronize()
# Wait for specific device
torch.cuda.synchronize(device=0)

# Check if stream has completed
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    output = model(input_tensor)
if stream.query():
    print("Stream completed")
else:
    print("Stream still running")
```
```python
# Set random seed for reproducibility
torch.cuda.manual_seed(42)
# Set seed for all GPUs
torch.cuda.manual_seed_all(42)

# Get/set RNG state
rng_state = torch.cuda.get_rng_state()
# ... perform operations ...
torch.cuda.set_rng_state(rng_state)  # Restore state
```
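Re-seeding before a computation makes the random sequence repeatable. A small check (uses `torch.manual_seed`, which seeds the CPU RNG and all CUDA devices, so it also works on a CPU-only machine):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def sample():
    # torch.manual_seed seeds the CPU RNG and all CUDA devices
    torch.manual_seed(42)
    return torch.randn(3, device=device)

a = sample()
b = sample()
print(torch.equal(a, b))  # True: same seed, same sequence
```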
```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group
dist.init_process_group(backend='nccl')

# Create model and move to GPU
model = MyModel().cuda()
model = DDP(model)

# Training loop
for input, target in dataloader:
    optimizer.zero_grad()  # clear gradients from the previous step
    output = model(input.cuda())
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()
```