
Overview

The torch.cuda package adds support for CUDA tensor types that utilize GPUs for computation. It implements the same functions as CPU tensors but leverages GPU acceleration. It is lazily initialized, so you can always import it and use is_available() to determine if your system supports CUDA.
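Because initialization is lazy, importing torch never touches the CUDA driver; the driver is only loaded on the first CUDA call. A quick sketch of observing this with is_available() and is_initialized():

```python
import torch

# torch.cuda is lazily initialized: importing torch does not load the
# CUDA driver, so these checks are safe on any machine.
print(torch.cuda.is_available())  # bool; True only with a usable GPU

if torch.cuda.is_available():
    _ = torch.zeros(1, device='cuda')   # first CUDA op initializes the driver
    print(torch.cuda.is_initialized())  # True after the first CUDA call
```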

Device Management

torch.cuda.is_available()

Returns a bool indicating whether CUDA is currently available.

Returns: available (bool) - True if CUDA is available on the system.
import torch

if torch.cuda.is_available():
    print("CUDA is available!")
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
torch.cuda.device_count()

Returns the number of GPUs available.

Returns: count (int) - Number of CUDA devices.
>>> torch.cuda.device_count()
4  # 4 GPUs available
torch.cuda.current_device()

Returns the index of the currently selected device.

Returns: index (int) - Index of the current CUDA device.
>>> torch.cuda.current_device()
0
torch.cuda.set_device(device)

Sets the current device.

Parameters: device (int or torch.device, required) - Selected device index.
torch.cuda.set_device(0)  # Use GPU 0
x = torch.randn(100, 100).cuda()  # Goes to GPU 0
torch.cuda.device(device)

Context manager that changes the selected device.

Parameters: device (int or torch.device, required) - Device index to select.
# Default device is 0
x = torch.randn(100, 100).cuda()

# Temporarily use GPU 1
with torch.cuda.device(1):
    y = torch.randn(100, 100).cuda()  # On GPU 1

# Back to GPU 0
z = torch.randn(100, 100).cuda()  # On GPU 0
torch.cuda.get_device_name(device=None)

Gets the name of a device.

Parameters: device (int or torch.device) - Device index. If None, uses the current device.

Returns: name (str) - Name of the CUDA device.
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 4090'
torch.cuda.get_device_capability(device=None)

Gets the compute capability of a device.

Parameters: device (int or torch.device) - Device index. If None, uses the current device.

Returns: capability (tuple[int, int]) - (major, minor) compute capability version.
>>> torch.cuda.get_device_capability(0)
(8, 9)  # Compute capability 8.9
torch.cuda.get_device_properties(device)

Gets the properties of a device.

Parameters: device (int or torch.device, required) - Device for which to return properties.

Returns: properties (_CudaDeviceProperties) - Device properties object with attributes:
  • name: Device name
  • major: Major compute capability
  • minor: Minor compute capability
  • total_memory: Total memory in bytes
  • multi_processor_count: Number of multiprocessors
>>> props = torch.cuda.get_device_properties(0)
>>> print(f"Name: {props.name}")
>>> print(f"Memory: {props.total_memory / 1e9:.2f} GB")
>>> print(f"Compute: {props.major}.{props.minor}")
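A common pattern is to enumerate every visible device and report its properties; the loop below simply runs zero times on a CPU-only machine:

```python
import torch

# Enumerate all visible CUDA devices and print their key properties.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, "
          f"{props.total_memory / 1e9:.2f} GB, "
          f"compute {props.major}.{props.minor}, "
          f"{props.multi_processor_count} multiprocessors")
```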

Memory Management

torch.cuda.memory_allocated(device=None)

Returns the current GPU memory occupied by tensors, in bytes.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: bytes (int) - Memory occupied by tensors.
>>> x = torch.randn(1000, 1000, device='cuda')
>>> torch.cuda.memory_allocated()
4000000  # ~4MB
torch.cuda.memory_reserved(device=None)

Returns the current GPU memory managed by the caching allocator, in bytes.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: bytes (int) - Memory reserved by the caching allocator.
>>> torch.cuda.memory_reserved()
33554432  # 32MB reserved
torch.cuda.max_memory_allocated(device=None)

Returns the maximum GPU memory occupied by tensors, in bytes.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: bytes (int) - Peak memory usage.
>>> torch.cuda.max_memory_allocated()
536870912  # 512MB peak
torch.cuda.reset_peak_memory_stats(device=None)

Resets the peak memory statistics.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.
# Reset stats before measuring
torch.cuda.reset_peak_memory_stats()

# Run model
output = model(input)

# Check peak memory
peak = torch.cuda.max_memory_allocated()
print(f"Peak memory: {peak / 1e9:.2f} GB")
torch.cuda.empty_cache()

Releases all unoccupied cached memory currently held by the caching allocator. This does not free memory occupied by PyTorch tensors, only the unused cached memory. Use it when you need to free memory for other processes.
# After deleting large tensors
del large_tensor
torch.cuda.empty_cache()  # Return memory to GPU
torch.cuda.memory_summary(device=None, abbreviated=False)

Returns a human-readable summary of memory allocator statistics.

Parameters:
  • device (int or torch.device) - Selected device. If None, uses the current device.
  • abbreviated (bool, default: False) - If True, returns an abbreviated summary.

Returns: summary (str) - Formatted memory statistics.
>>> print(torch.cuda.memory_summary())
|===========================================================================|
|                  PyTorch CUDA memory summary                              |
|===========================================================================|
|            CUDA OOMs: 0            |      cudaMalloc retries: 0           |
...
torch.cuda.memory_stats(device=None)

Returns a dictionary of memory allocator statistics.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: stats (dict) - Dictionary of detailed memory statistics, including:
  • allocated_bytes.all.current: Current allocated memory
  • reserved_bytes.all.current: Current reserved memory
  • active_bytes.all.current: Current active memory
  • And many more detailed metrics
>>> stats = torch.cuda.memory_stats()
>>> print(stats['allocated_bytes.all.current'])
4000000
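As a sketch, a small helper (hypothetical, not part of torch) can pull the most commonly watched counters out of memory_stats() and convert them to MiB:

```python
import torch

def gpu_mem_report(device=None):
    """Illustrative helper (not a torch API): return the main allocator
    counters in MiB, or an empty dict on a CPU-only machine."""
    if not torch.cuda.is_available():
        return {}
    stats = torch.cuda.memory_stats(device)
    keys = ('allocated_bytes.all.current',
            'reserved_bytes.all.current',
            'allocated_bytes.all.peak')
    return {k: stats[k] / 2**20 for k in keys}

print(gpu_mem_report())
```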

Stream and Event Management

torch.cuda.Stream(device=None, priority=0)

Wrapper around a CUDA stream for asynchronous operations.

Parameters:
  • device (int or torch.device) - Device on which to create the stream.
  • priority (int) - Priority of the stream. Lower numbers represent higher priorities.
import torch

# Create custom stream
stream = torch.cuda.Stream()

# Operations in stream run asynchronously
with torch.cuda.stream(stream):
    x = torch.randn(1000, 1000, device='cuda')
    y = x @ x.t()

# Wait for stream to complete
stream.synchronize()
torch.cuda.stream(stream)

Context manager that selects a given stream.

Parameters: stream (Stream, required) - Stream to use for operations.
s = torch.cuda.Stream()

# All operations in this context use stream s
with torch.cuda.stream(s):
    x = torch.randn(100, 100, device='cuda')
    y = x @ x.t()
torch.cuda.Event(enable_timing=False, blocking=False, interprocess=False)

Wrapper around a CUDA event for timing and synchronization.

Parameters:
  • enable_timing (bool, default: False) - If True, the event can be used to measure time.
  • blocking (bool, default: False) - If True, waiting on the event blocks until it has been recorded.
  • interprocess (bool, default: False) - If True, the event can be shared between processes.
# Time GPU operations
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# GPU operations
x = torch.randn(10000, 10000, device='cuda')
y = x @ x.t()
end.record()

# Wait for completion
torch.cuda.synchronize()

elapsed_time = start.elapsed_time(end)
print(f"Time: {elapsed_time:.2f} ms")
torch.cuda.synchronize(device=None)

Waits for all kernels in all streams on a device to complete.

Parameters: device (int or torch.device) - Device to synchronize. If None, uses the current device.
# Launch async operations
x = torch.randn(1000, 1000, device='cuda')
y = x @ x.t()

# Wait for completion
torch.cuda.synchronize()
print("All GPU operations complete")
torch.cuda.current_stream(device=None)

Returns the currently selected Stream for a given device.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: stream (Stream) - Current stream object.
>>> stream = torch.cuda.current_stream()
>>> print(stream)
<torch.cuda.Stream device=cuda:0 ...>
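Putting streams together: independent work can be issued on separate streams so the GPU may overlap it. A guarded sketch (a no-op on a CPU-only machine) that also waits on the default stream before reusing its tensors:

```python
import torch

# Sketch: run two independent matmuls on separate streams.
if torch.cuda.is_available():
    a = torch.randn(2048, 2048, device='cuda')
    b = torch.randn(2048, 2048, device='cuda')

    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    # Both side streams must wait for the allocations issued on the
    # default stream before consuming a and b.
    s1.wait_stream(torch.cuda.current_stream())
    s2.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(s1):
        c = a @ a
    with torch.cuda.stream(s2):
        d = b @ b

    torch.cuda.synchronize()  # wait for both streams to finish
```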

Random Number Generation

torch.cuda.manual_seed(seed)

Sets the seed for generating random numbers on the current GPU.

Parameters: seed (int, required) - The desired seed.
torch.cuda.manual_seed(42)
x = torch.randn(100, 100, device='cuda')  # Reproducible
torch.cuda.manual_seed_all(seed)

Sets the seed for generating random numbers on all GPUs.

Parameters: seed (int, required) - The desired seed.
# Set seed for all GPUs
torch.cuda.manual_seed_all(42)
torch.cuda.seed()

Sets the seed for generating random numbers to a random number for the current GPU.
torch.cuda.seed()  # Random seed
torch.cuda.seed_all()

Sets the seed for generating random numbers to a random number on all GPUs.
torch.cuda.seed_all()  # Random seed for all GPUs
torch.cuda.initial_seed()

Returns the current random seed of the current GPU.

Returns: seed (int) - Current random seed.
>>> torch.cuda.manual_seed(42)
>>> torch.cuda.initial_seed()
42
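For full reproducibility it is common to seed the CPU and all GPU generators together. A minimal sketch (the helper name is illustrative, not a torch API); note that torch.manual_seed() already seeds all CUDA generators, so the explicit call mainly documents intent:

```python
import torch

def seed_everything(seed: int) -> None:
    # Illustrative helper (not a torch API): seed CPU and all GPUs.
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

seed_everything(42)
x = torch.randn(3)
seed_everything(42)
y = torch.randn(3)
assert torch.equal(x, y)  # re-seeding reproduces the same draws
```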

CUDA Graphs

torch.cuda.CUDAGraph()

Wrapper around a CUDA graph for optimizing repeated operations. CUDA graphs capture a sequence of operations and replay them with lower overhead.
import torch

# Static input
static_input = torch.randn(1000, 1000, device='cuda')
static_output = torch.empty(1000, 1000, device='cuda')

# Capture graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output.copy_(static_input @ static_input.t())

# Replay graph (much faster)
for _ in range(100):
    g.replay()
torch.cuda.graph(cuda_graph, stream=None)

Context manager for capturing a CUDA graph.

Parameters:
  • cuda_graph (CUDAGraph, required) - Graph object to record into.
  • stream (Stream) - Stream to capture on. If None, uses the current stream.
g = torch.cuda.CUDAGraph()

with torch.cuda.graph(g):
    # Operations to capture
    y = model(x)

# Replay captured operations
g.replay()
torch.cuda.make_graphed_callables(callables, sample_args)

Accepts callables and returns graphed versions of them.

Parameters:
  • callables (callable or tuple of callables, required) - Callables to graph.
  • sample_args (tuple or tuple of tuples, required) - Sample arguments for each callable.
def model_step(x):
    return model(x).sum()

# Create graphed version
graphed_step = torch.cuda.make_graphed_callables(
    model_step,
    (torch.randn(32, 100, device='cuda'),)
)

# Use graphed version (faster)
for batch in data:
    loss = graphed_step(batch)

Capability Checks

torch.cuda.is_bf16_supported(including_emulation=True)

Returns True if the current CUDA device supports bfloat16.

Parameters: including_emulation (bool, default: True) - Whether to include emulated bfloat16 support.

Returns: supported (bool) - True if bfloat16 is supported.
>>> torch.cuda.is_bf16_supported()
True
torch.cuda.is_tf32_supported()

Returns True if the current CUDA device supports TensorFloat-32 (TF32).

Returns: supported (bool) - True if TF32 is supported.
>>> torch.cuda.is_tf32_supported()
True  # For Ampere GPUs and newer
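These checks are typically used to gate a precision policy. A sketch (the helper and its fallback order are illustrative assumptions, not a torch API):

```python
import torch

def pick_autocast_dtype() -> torch.dtype:
    # Illustrative policy: prefer bfloat16 where the device supports it,
    # otherwise float16 on GPU, otherwise float32 on CPU.
    if not torch.cuda.is_available():
        return torch.float32
    if torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

print(pick_autocast_dtype())
```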

Performance Optimization

Overlapping host-to-device copies with computation requires pinned (page-locked) host memory and non_blocking=True:

import torch

# Enable async memory copy
x_cpu = torch.randn(1000, 1000, pin_memory=True)
x_gpu = x_cpu.to('cuda', non_blocking=True)

# Continue with other work while copy happens
y = torch.randn(500, 500, device='cuda')

# Synchronize when needed
torch.cuda.synchronize()
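The same idea extends to input pipelines: DataLoader can pin host memory for you, so each batch can be moved to the GPU asynchronously. A guarded sketch that also works on a CPU-only machine:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# DataLoader pins batches in host memory when pin_memory=True, enabling
# asynchronous transfers with non_blocking=True.
ds = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(ds, batch_size=32, pin_memory=torch.cuda.is_available())

device = 'cuda' if torch.cuda.is_available() else 'cpu'
for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...
```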

Best Practices

  1. Pin memory for faster transfers:
    tensor = torch.randn(1000, 1000, pin_memory=True)
    tensor_gpu = tensor.to('cuda', non_blocking=True)
    
  2. Clear cache when needed:
    del large_tensor
    torch.cuda.empty_cache()
    
  3. Monitor memory usage:
    print(torch.cuda.memory_summary())
    
Multi-GPU Training

import torch
import torch.nn as nn

# DataParallel for simple multi-GPU
model = nn.DataParallel(model)
model = model.cuda()

# Or DistributedDataParallel for better performance
from torch.nn.parallel import DistributedDataParallel as DDP
model = DDP(model, device_ids=[local_rank])
Error Handling

import torch

try:
    if torch.cuda.is_available():
        device = torch.device('cuda')
        x = torch.randn(10000, 10000, device=device)
    else:
        raise RuntimeError("CUDA not available")
except RuntimeError as e:
    print(f"Error: {e}")
    device = torch.device('cpu')
