
Overview

The torch.cuda package adds support for CUDA tensor types that utilize GPUs for computation. It implements the same functions as CPU tensors but leverages GPU acceleration. It is lazily initialized, so you can always import it and use is_available() to determine if your system supports CUDA.
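Because initialization is lazy, importing torch never touches the CUDA driver; the driver is only loaded on the first CUDA call. A quick sketch of observing this with is_available() and is_initialized():

```python
import torch

# torch.cuda is lazily initialized: importing torch does not load the
# CUDA driver, so these checks are safe on any machine.
print(torch.cuda.is_available())  # bool; True only with a usable GPU

if torch.cuda.is_available():
    _ = torch.zeros(1, device='cuda')   # first CUDA op initializes the driver
    print(torch.cuda.is_initialized())  # True after the first CUDA call
```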

Device Management

torch.cuda.is_available()

Returns a bool indicating whether CUDA is currently available.

Returns: available (bool) - True if CUDA is available on the system.
import torch

if torch.cuda.is_available():
    print("CUDA is available!")
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
torch.cuda.device_count()

Returns the number of GPUs available.

Returns: count (int) - Number of CUDA devices.
>>> torch.cuda.device_count()
4  # 4 GPUs available
torch.cuda.current_device()

Returns the index of the currently selected device.

Returns: index (int) - Index of the current CUDA device.
>>> torch.cuda.current_device()
0
torch.cuda.set_device(device)

Sets the current device.

Parameters: device (int or torch.device, required) - Selected device index.
torch.cuda.set_device(0)  # Use GPU 0
x = torch.randn(100, 100).cuda()  # Goes to GPU 0
torch.cuda.device(device)

Context manager that changes the selected device.

Parameters: device (int or torch.device, required) - Device index to select.
# Default device is 0
x = torch.randn(100, 100).cuda()

# Temporarily use GPU 1
with torch.cuda.device(1):
    y = torch.randn(100, 100).cuda()  # On GPU 1

# Back to GPU 0
z = torch.randn(100, 100).cuda()  # On GPU 0
torch.cuda.get_device_name(device=None)

Gets the name of a device.

Parameters: device (int or torch.device) - Device index. If None, uses the current device.

Returns: name (str) - Name of the CUDA device.
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 4090'
torch.cuda.get_device_capability(device=None)

Gets the compute capability of a device.

Parameters: device (int or torch.device) - Device index. If None, uses the current device.

Returns: capability (tuple[int, int]) - (major, minor) compute capability version.
>>> torch.cuda.get_device_capability(0)
(8, 9)  # Compute capability 8.9
torch.cuda.get_device_properties(device)

Gets the properties of a device.

Parameters: device (int or torch.device, required) - Device for which to return properties.

Returns: properties (_CudaDeviceProperties) - Device properties object with attributes:
  • name: Device name
  • major: Major compute capability
  • minor: Minor compute capability
  • total_memory: Total memory in bytes
  • multi_processor_count: Number of multiprocessors
>>> props = torch.cuda.get_device_properties(0)
>>> print(f"Name: {props.name}")
>>> print(f"Memory: {props.total_memory / 1e9:.2f} GB")
>>> print(f"Compute: {props.major}.{props.minor}")
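A common pattern is to enumerate every visible device and report its properties; the loop below simply runs zero times on a CPU-only machine:

```python
import torch

# Enumerate all visible CUDA devices and print their key properties.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, "
          f"{props.total_memory / 1e9:.2f} GB, "
          f"compute {props.major}.{props.minor}, "
          f"{props.multi_processor_count} multiprocessors")
```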

Memory Management

torch.cuda.memory_allocated(device=None)

Returns the current GPU memory occupied by tensors, in bytes.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: bytes (int) - Memory occupied by tensors.
>>> x = torch.randn(1000, 1000, device='cuda')
>>> torch.cuda.memory_allocated()
4000000  # ~4MB
torch.cuda.memory_reserved(device=None)

Returns the current GPU memory managed by the caching allocator, in bytes.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: bytes (int) - Memory reserved by the caching allocator.
>>> torch.cuda.memory_reserved()
33554432  # 32MB reserved
torch.cuda.max_memory_allocated(device=None)

Returns the maximum GPU memory occupied by tensors, in bytes.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: bytes (int) - Peak memory usage.
>>> torch.cuda.max_memory_allocated()
536870912  # 512MB peak
torch.cuda.reset_peak_memory_stats(device=None)

Resets the peak memory statistics.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.
# Reset stats before measuring
torch.cuda.reset_peak_memory_stats()

# Run model
output = model(input)

# Check peak memory
peak = torch.cuda.max_memory_allocated()
print(f"Peak memory: {peak / 1e9:.2f} GB")
torch.cuda.empty_cache()

Releases all unoccupied cached memory currently held by the caching allocator. This does not free memory occupied by PyTorch tensors, only the unused cached memory. Use it when you need to free memory for other processes.
# After deleting large tensors
del large_tensor
torch.cuda.empty_cache()  # Return memory to GPU
torch.cuda.memory_summary(device=None, abbreviated=False)

Returns a human-readable summary of memory allocator statistics.

Parameters:
  • device (int or torch.device) - Selected device. If None, uses the current device.
  • abbreviated (bool, default: False) - If True, returns an abbreviated summary.

Returns: summary (str) - Formatted memory statistics.
>>> print(torch.cuda.memory_summary())
|===========================================================================|
|                  PyTorch CUDA memory summary                              |
|===========================================================================|
|            CUDA OOMs: 0            |      cudaMalloc retries: 0           |
...
torch.cuda.memory_stats(device=None)

Returns a dictionary of memory allocator statistics.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: stats (dict) - Dictionary of detailed memory statistics, including:
  • allocated_bytes.all.current: Current allocated memory
  • reserved_bytes.all.current: Current reserved memory
  • active_bytes.all.current: Current active memory
  • And many more detailed metrics
>>> stats = torch.cuda.memory_stats()
>>> print(stats['allocated_bytes.all.current'])
4000000
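As a sketch, a small helper (hypothetical, not part of torch) can pull the most commonly watched counters out of memory_stats() and convert them to MiB:

```python
import torch

def gpu_mem_report(device=None):
    """Illustrative helper (not a torch API): return the main allocator
    counters in MiB, or an empty dict on a CPU-only machine."""
    if not torch.cuda.is_available():
        return {}
    stats = torch.cuda.memory_stats(device)
    keys = ('allocated_bytes.all.current',
            'reserved_bytes.all.current',
            'allocated_bytes.all.peak')
    return {k: stats[k] / 2**20 for k in keys}

print(gpu_mem_report())
```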

Stream and Event Management

torch.cuda.Stream(device=None, priority=0)

Wrapper around a CUDA stream for asynchronous operations.

Parameters:
  • device (int or torch.device) - Device on which to create the stream.
  • priority (int) - Priority of the stream. Lower numbers represent higher priorities.
import torch

# Create custom stream
stream = torch.cuda.Stream()

# Operations in stream run asynchronously
with torch.cuda.stream(stream):
    x = torch.randn(1000, 1000, device='cuda')
    y = x @ x.t()

# Wait for stream to complete
stream.synchronize()
torch.cuda.stream(stream)

Context manager that selects a given stream.

Parameters: stream (Stream, required) - Stream to use for operations.
s = torch.cuda.Stream()

# All operations in this context use stream s
with torch.cuda.stream(s):
    x = torch.randn(100, 100, device='cuda')
    y = x @ x.t()
torch.cuda.Event(enable_timing=False, blocking=False, interprocess=False)

Wrapper around a CUDA event for timing and synchronization.

Parameters:
  • enable_timing (bool, default: False) - If True, the event can be used to measure time.
  • blocking (bool, default: False) - If True, waiting on the event blocks until it has been recorded.
  • interprocess (bool, default: False) - If True, the event can be shared between processes.
# Time GPU operations
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# GPU operations
x = torch.randn(10000, 10000, device='cuda')
y = x @ x.t()
end.record()

# Wait for completion
torch.cuda.synchronize()

elapsed_time = start.elapsed_time(end)
print(f"Time: {elapsed_time:.2f} ms")
torch.cuda.synchronize(device=None)

Waits for all kernels in all streams on a device to complete.

Parameters: device (int or torch.device) - Device to synchronize. If None, uses the current device.
# Launch async operations
x = torch.randn(1000, 1000, device='cuda')
y = x @ x.t()

# Wait for completion
torch.cuda.synchronize()
print("All GPU operations complete")
torch.cuda.current_stream(device=None)

Returns the currently selected Stream for a given device.

Parameters: device (int or torch.device) - Selected device. If None, uses the current device.

Returns: stream (Stream) - Current stream object.
>>> stream = torch.cuda.current_stream()
>>> print(stream)
<torch.cuda.Stream device=cuda:0 ...>
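Putting streams together: independent work can be issued on separate streams so the GPU may overlap it. A guarded sketch (a no-op on a CPU-only machine) that also waits on the default stream before reusing its tensors:

```python
import torch

# Sketch: run two independent matmuls on separate streams.
if torch.cuda.is_available():
    a = torch.randn(2048, 2048, device='cuda')
    b = torch.randn(2048, 2048, device='cuda')

    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    # Both side streams must wait for the allocations issued on the
    # default stream before consuming a and b.
    s1.wait_stream(torch.cuda.current_stream())
    s2.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(s1):
        c = a @ a
    with torch.cuda.stream(s2):
        d = b @ b

    torch.cuda.synchronize()  # wait for both streams to finish
```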

Random Number Generation

torch.cuda.manual_seed(seed)

Sets the seed for generating random numbers on the current GPU.

Parameters: seed (int, required) - The desired seed.
torch.cuda.manual_seed(42)
x = torch.randn(100, 100, device='cuda')  # Reproducible
torch.cuda.manual_seed_all(seed)

Sets the seed for generating random numbers on all GPUs.

Parameters: seed (int, required) - The desired seed.
# Set seed for all GPUs
torch.cuda.manual_seed_all(42)
torch.cuda.seed()

Sets the seed for generating random numbers to a random number for the current GPU.
torch.cuda.seed()  # Random seed
torch.cuda.seed_all()

Sets the seed for generating random numbers to a random number on all GPUs.
torch.cuda.seed_all()  # Random seed for all GPUs
torch.cuda.initial_seed()

Returns the current random seed of the current GPU.

Returns: seed (int) - Current random seed.
>>> torch.cuda.manual_seed(42)
>>> torch.cuda.initial_seed()
42
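For full reproducibility it is common to seed the CPU and all GPU generators together. A minimal sketch (the helper name is illustrative, not a torch API); note that torch.manual_seed() already seeds all CUDA generators, so the explicit call mainly documents intent:

```python
import torch

def seed_everything(seed: int) -> None:
    # Illustrative helper (not a torch API): seed CPU and all GPUs.
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

seed_everything(42)
x = torch.randn(3)
seed_everything(42)
y = torch.randn(3)
assert torch.equal(x, y)  # re-seeding reproduces the same draws
```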

CUDA Graphs

torch.cuda.CUDAGraph()

Wrapper around a CUDA graph for optimizing repeated operations. CUDA graphs capture a sequence of operations and replay them with lower overhead.
import torch

# Static input
static_input = torch.randn(1000, 1000, device='cuda')
static_output = torch.empty(1000, 1000, device='cuda')

# Capture graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output.copy_(static_input @ static_input.t())

# Replay graph (much faster)
for _ in range(100):
    g.replay()
torch.cuda.graph(cuda_graph, stream=None)

Context manager for capturing a CUDA graph.

Parameters:
  • cuda_graph (CUDAGraph, required) - Graph object to record into.
  • stream (Stream) - Stream to capture on. If None, uses the current stream.
g = torch.cuda.CUDAGraph()

with torch.cuda.graph(g):
    # Operations to capture
    y = model(x)

# Replay captured operations
g.replay()
torch.cuda.make_graphed_callables(callables, sample_args)

Accepts callables and returns graphed versions of them.

Parameters:
  • callables (callable or tuple of callables, required) - Callables to graph.
  • sample_args (tuple or tuple of tuples, required) - Sample arguments for each callable.
def model_step(x):
    return model(x).sum()

# Create graphed version
graphed_step = torch.cuda.make_graphed_callables(
    model_step,
    (torch.randn(32, 100, device='cuda'),)
)

# Use graphed version (faster)
for batch in data:
    loss = graphed_step(batch)

Capability Checks

torch.cuda.is_bf16_supported(including_emulation=True)

Returns True if the current CUDA device supports bfloat16.

Parameters: including_emulation (bool, default: True) - Whether to include emulated bfloat16 support.

Returns: supported (bool) - True if bfloat16 is supported.
>>> torch.cuda.is_bf16_supported()
True
torch.cuda.is_tf32_supported()

Returns True if the current CUDA device supports TensorFloat-32 (TF32).

Returns: supported (bool) - True if TF32 is supported.
>>> torch.cuda.is_tf32_supported()
True  # For Ampere GPUs and newer
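These checks are typically used to gate a precision policy. A sketch (the helper and its fallback order are illustrative assumptions, not a torch API):

```python
import torch

def pick_autocast_dtype() -> torch.dtype:
    # Illustrative policy: prefer bfloat16 where the device supports it,
    # otherwise float16 on GPU, otherwise float32 on CPU.
    if not torch.cuda.is_available():
        return torch.float32
    if torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

print(pick_autocast_dtype())
```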

Performance Optimization

Overlapping host-to-device copies with computation requires pinned (page-locked) host memory and non_blocking=True:

import torch

# Enable async memory copy
x_cpu = torch.randn(1000, 1000, pin_memory=True)
x_gpu = x_cpu.to('cuda', non_blocking=True)

# Continue with other work while copy happens
y = torch.randn(500, 500, device='cuda')

# Synchronize when needed
torch.cuda.synchronize()
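The same idea extends to input pipelines: DataLoader can pin host memory for you, so each batch can be moved to the GPU asynchronously. A guarded sketch that also works on a CPU-only machine:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# DataLoader pins batches in host memory when pin_memory=True, enabling
# asynchronous transfers with non_blocking=True.
ds = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(ds, batch_size=32, pin_memory=torch.cuda.is_available())

device = 'cuda' if torch.cuda.is_available() else 'cpu'
for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...
```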

Best Practices

  1. Pin memory for faster transfers:
    tensor = torch.randn(1000, 1000, pin_memory=True)
    tensor_gpu = tensor.to('cuda', non_blocking=True)
    
  2. Clear cache when needed:
    del large_tensor
    torch.cuda.empty_cache()
    
  3. Monitor memory usage:
    print(torch.cuda.memory_summary())
    
Multi-GPU Training

import torch
import torch.nn as nn

# DataParallel for simple multi-GPU
model = nn.DataParallel(model)
model = model.cuda()

# Or DistributedDataParallel for better performance
from torch.nn.parallel import DistributedDataParallel as DDP
model = DDP(model, device_ids=[local_rank])
Error Handling

import torch

try:
    if torch.cuda.is_available():
        device = torch.device('cuda')
        x = torch.randn(10000, 10000, device=device)
    else:
        raise RuntimeError("CUDA not available")
except RuntimeError as e:
    print(f"Error: {e}")
    device = torch.device('cpu')
