Overview

torch.autograd provides classes and functions for automatic differentiation of arbitrary scalar-valued functions. It requires minimal changes to existing code: you only need to declare the Tensors for which gradients should be computed with requires_grad=True. Autograd supports floating-point (half, float, double, bfloat16) and complex (cfloat, cdouble) tensors.

Core Functions

torch.autograd.backward
Computes the sum of gradients of given tensors with respect to graph leaves.
tensors
Tensor or Sequence[Tensor]
required
Tensors of which the derivative will be computed.
grad_tensors
Tensor or Sequence[Tensor]
The “vector” in the Jacobian-vector product, usually the gradient w.r.t. each corresponding tensor. Must be a sequence of the same length as tensors. Required when any of the tensors is non-scalar.
retain_graph
bool
If False, the graph used to compute the gradients will be freed after the backward pass. Defaults to the value of create_graph.
create_graph
bool
default:"False"
If True, graph of the derivative will be constructed for higher order derivatives.
inputs
Tensor or Sequence[Tensor]
Inputs w.r.t. which the gradient will be accumulated into .grad. If not provided, accumulates into all leaf tensors.
This function accumulates gradients in the leaves. You might need to zero .grad attributes before calling.
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x.sum()

# Compute gradients
torch.autograd.backward(y)
print(x.grad)  # tensor([1., 1., 1.])
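When the tensor is non-scalar, backward() needs grad_tensors, the “vector” in the Jacobian-vector product. A minimal sketch:

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x ** 2  # non-scalar output

# Passing a vector of ones reproduces the gradient of y.sum()
torch.autograd.backward(y, grad_tensors=torch.ones_like(y))
print(x.grad)  # tensor([2., 4., 6.])
```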

torch.autograd.grad
Computes and returns the sum of gradients of outputs with respect to inputs.
outputs
Tensor or Sequence[Tensor]
required
Outputs of the differentiated function.
inputs
Tensor or Sequence[Tensor]
required
Inputs w.r.t. which the gradient will be returned (not accumulated into .grad).
grad_outputs
Tensor or Sequence[Tensor]
The “vector” in the vector-Jacobian product. Usually gradients w.r.t. each output.
retain_graph
bool
If False, the graph will be freed. Defaults to create_graph value.
create_graph
bool
default:"False"
If True, graph of derivative will be constructed for higher order derivatives.
allow_unused
bool
If False, error if inputs not used. Defaults to materialize_grads value.
is_grads_batched
bool
default:"False"
If True, the first dimension of each tensor in grad_outputs is treated as a batch dimension, enabling vectorized Jacobian computation.
materialize_grads
bool
default:"False"
If True, set gradient for unused inputs to zero instead of None.
gradients
tuple[Tensor, ...]
Tuple of gradients, one for each input.
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x ** 2
z = y.sum()

# Compute gradients without accumulating
grads = torch.autograd.grad(z, x)
print(grads[0])  # tensor([2., 4., 6.])
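A sketch of allow_unused, for when some inputs do not participate in the output (gz below comes back as None; with materialize_grads=True it would be a tensor of zeros instead):

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
z = torch.tensor([3., 4.], requires_grad=True)
out = (x ** 2).sum()  # z never participates in out

# Without allow_unused=True this call would raise, since z is unused
gx, gz = torch.autograd.grad(out, (x, z), allow_unused=True)
print(gx)  # tensor([2., 4.])
print(gz)  # None
```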

Gradient Control


torch.no_grad
Context manager that disables gradient calculation. Disabling gradient calculation is useful for inference, when you are sure you will not call backward(). It reduces memory consumption and speeds up computation.
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)

with torch.no_grad():
    y = x ** 2  # No gradient tracking
    print(y.requires_grad)  # False
Can also be used as a decorator:
@torch.no_grad()
def inference(model, data):
    return model(data)

torch.enable_grad
Context manager that enables gradient calculation. Enables gradients in a region where they were disabled (e.g., inside a no_grad() context).
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)

with torch.no_grad():
    with torch.enable_grad():
        y = x ** 2  # Gradient tracking enabled
        print(y.requires_grad)  # True

torch.set_grad_enabled
Context manager to set gradient calculation on or off.
mode
bool
required
Flag whether to enable grad (True) or disable (False).
Useful for conditionally enabling/disabling gradients:
import torch

def forward(x, training=True):
    with torch.set_grad_enabled(training):
        y = x ** 2
        return y

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = forward(x, training=False)  # No gradients
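set_grad_enabled can also be called as a plain function to toggle the global default, rather than as a context manager:

```python
import torch

torch.set_grad_enabled(False)    # disable gradient tracking globally
x = torch.tensor([1., 2.], requires_grad=True)
y = x * 2
print(y.requires_grad)  # False

torch.set_grad_enabled(True)     # restore the default
z = x * 2
print(z.requires_grad)  # True
```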

torch.inference_mode
Context manager that disables autograd and reduces overhead. Similar to no_grad(), but faster and with more restrictions: tensors created in inference mode cannot be used with autograd afterward.
import torch

with torch.inference_mode():
    # Faster inference, but tensors can't be used
    # for autograd after leaving this context
    y = model(x)
Key differences from no_grad():
  • Lower overhead (faster)
  • Tensors created inside cannot require gradients later
  • View relationships not tracked
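A sketch of that restriction: tensors created inside inference mode cannot be turned into autograd tensors later:

```python
import torch

with torch.inference_mode():
    t = torch.ones(3)  # t is an "inference tensor"

# Reusing an inference tensor with autograd afterward raises
try:
    t.requires_grad_(True)
    failed = False
except RuntimeError as e:
    failed = True
    print("RuntimeError:", e)
```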

Debugging and Validation

torch.autograd.gradcheck
Checks gradients computed via small finite differences against analytical gradients. Inputs should typically be of double precision for the check to be reliable.
func
callable
required
A Python function that takes Tensor inputs and returns a Tensor or tuple of Tensors.
inputs
Tensor or tuple of Tensors
required
Inputs to the function.
eps
float
default:"1e-6"
Perturbation for finite differences.
atol
float
default:"1e-5"
Absolute tolerance.
rtol
float
default:"1e-3"
Relative tolerance.
raise_exception
bool
default:"True"
Whether to raise exception on failure.
success
bool
True if gradients are correct.
import torch
from torch.autograd import gradcheck

def my_func(x):
    return x ** 2

x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
test = gradcheck(my_func, x, eps=1e-6, atol=1e-4)
print(f"Gradcheck passed: {test}")

torch.autograd.gradgradcheck
Checks gradients of gradients (second derivatives) computed via small finite differences against analytical gradients.
func
callable
required
A Python function that takes Tensor inputs and returns a Tensor or tuple of Tensors.
inputs
Tensor or tuple of Tensors
required
Inputs to the function.
grad_outputs
Tensor or tuple of Tensors
Gradient outputs for computing second derivatives.
Parameters similar to gradcheck().
import torch
from torch.autograd import gradgradcheck

def my_func(x):
    return (x ** 3).sum()

x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
test = gradgradcheck(my_func, x)
print(f"Gradgradcheck passed: {test}")

torch.autograd.detect_anomaly
Context manager that enables anomaly detection for the autograd engine.
check_nan
bool
default:"True"
Whether to check for NaN in backward pass.
When enabled, the forward pass records operation tracebacks so that errors raised in backward can point to the responsible forward operation, and the backward pass raises an error if NaN gradients are produced (when check_nan=True).
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)

with torch.autograd.detect_anomaly():
    y = x ** 2
    z = y.sum()
    z.backward()
    # If a NaN gradient is produced, a detailed error points to the forward op
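A sketch of the kind of failure this catches: the backward pass below computes 0 * inf = NaN, and anomaly detection raises a RuntimeError identifying the offending operation instead of silently propagating the NaN:

```python
import torch

x = torch.tensor([0.], requires_grad=True)
caught = None
try:
    with torch.autograd.detect_anomaly():
        y = torch.sqrt(x)    # dy/dx = 1/(2*sqrt(x)) -> inf at x = 0
        z = (y * 0).sum()    # backward computes 0 * inf = nan
        z.backward()
except RuntimeError as e:
    caught = e
    print("Anomaly detected:", e)
```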

torch.autograd.set_detect_anomaly
Sets global anomaly detection mode.
mode
bool
required
If True, enables anomaly detection globally.
check_nan
bool
default:"True"
Whether to check for NaN.
torch.autograd.set_detect_anomaly(True)
# Now all backward passes check for anomalies

Custom Functions

torch.autograd.Function
Base class for creating custom autograd functions. To create a custom function, subclass Function and implement the forward() and backward() static methods.
import torch
from torch.autograd import Function

class MySquare(Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input ** 2
    
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = 2 * input * grad_output
        return grad_input

# Usage
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = MySquare.apply(x)
y.sum().backward()
print(x.grad)  # tensor([2., 4., 6.])
Methods:
forward(ctx, *args, **kwargs)
staticmethod
Performs the operation. Must be implemented by subclass.
  • ctx: Context object to save information for backward
  • *args, **kwargs: Input tensors and other arguments
  • Returns: Output tensor(s)
backward(ctx, *grad_outputs)
staticmethod
Defines gradient formula. Must be implemented by subclass.
  • ctx: Context object with saved tensors
  • *grad_outputs: Gradients w.r.t. outputs
  • Returns: Gradients w.r.t. inputs (one per input, or None)
Methods available on ctx (the context object) in custom functions:
ctx.save_for_backward(*tensors)
Saves tensors for use in the backward pass.
def forward(ctx, input, weight):
    ctx.save_for_backward(input, weight)
    return input @ weight
ctx.saved_tensors
Retrieves the saved tensors in backward.
def backward(ctx, grad_output):
    input, weight = ctx.saved_tensors
    grad_input = grad_output @ weight.t()
    grad_weight = input.t() @ grad_output
    return grad_input, grad_weight
ctx.mark_non_differentiable(*tensors)
Marks output tensors as non-differentiable.
ctx.set_materialize_grads(value)
Sets whether incoming None gradients should be materialized as zero tensors before backward is called.
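A sketch combining these ctx methods: a ReLU-like function (the name is illustrative) that also returns its mask, marked non-differentiable so no gradient flows to it:

```python
import torch
from torch.autograd import Function

class ReluWithMask(Function):
    @staticmethod
    def forward(ctx, input):
        mask = input > 0
        ctx.save_for_backward(mask)
        ctx.mark_non_differentiable(mask)  # the mask output gets no gradient
        return input.clamp(min=0), mask

    @staticmethod
    def backward(ctx, grad_output, _grad_mask):  # grad for mask is ignored
        mask, = ctx.saved_tensors
        return grad_output * mask  # gradient flows only where input > 0

x = torch.tensor([-1., 2., 3.], requires_grad=True)
out, mask = ReluWithMask.apply(x)
out.sum().backward()
print(x.grad)              # tensor([0., 1., 1.])
print(mask.requires_grad)  # False
```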

Advanced Features

Gradients accumulate in the .grad attribute by default:
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)

# First backward
y = x.sum()
y.backward()
print(x.grad)  # tensor([1., 1., 1.])

# Second backward - gradients accumulate
y = x.sum()
y.backward()
print(x.grad)  # tensor([2., 2., 2.])

# Clear gradients
x.grad.zero_()
# Or better for memory:
x.grad = None
Compute gradients of gradients:
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x ** 3

# First derivative
grad_x = torch.autograd.grad(
    outputs=y,
    inputs=x,
    grad_outputs=torch.ones_like(y),
    create_graph=True  # Keep graph for next derivative
)[0]

# Second derivative
grad_grad_x = torch.autograd.grad(
    outputs=grad_x,
    inputs=x,
    grad_outputs=torch.ones_like(grad_x)
)[0]

print(grad_x)       # tensor([3., 12., 27.])
print(grad_grad_x)  # tensor([6., 12., 18.])
Trade compute for memory by recomputing forward pass during backward:
import torch
from torch.utils.checkpoint import checkpoint

def custom_forward(x, y):
    z = x * y
    z = z ** 2
    return z

x = torch.randn(100, 100, requires_grad=True)
y = torch.randn(100, 100, requires_grad=True)

# Regular: stores all intermediate tensors
z = custom_forward(x, y)

# Checkpointed: only stores inputs, recomputes forward during backward
z = checkpoint(custom_forward, x, y, use_reentrant=False)
Compute full Jacobian or Hessian matrices:
import torch
from torch.autograd.functional import jacobian, hessian

def f(x):
    return x ** 2

x = torch.tensor([1., 2., 3.])

# Jacobian: derivative matrix
jac = jacobian(f, x)
print(jac)  # diag([2., 4., 6.])

# Hessian: second-derivative matrix (requires a scalar-valued function)
def g(x):
    return (x ** 2).sum()

hess = hessian(g, x)
print(hess)  # diag([2., 2., 2.])

Performance Tips

  1. Use inference_mode() for inference:
    with torch.inference_mode():
        outputs = model(inputs)
    
  2. Clear gradients efficiently:
    # Better memory behavior
    model.zero_grad(set_to_none=True)
    # Instead of
    model.zero_grad()
    
  3. Use gradient checkpointing for large models:
    from torch.utils.checkpoint import checkpoint
    y = checkpoint(large_module, x)
    

Common Pitfalls

  1. In-place operations can break the backward pass:
    x = torch.tensor([1., 2., 3.], requires_grad=True)
    y = torch.exp(x)  # exp's backward reuses its output
    y += 1  # RuntimeError on y.sum().backward(): a needed tensor was modified in place
    # Better: y = y + 1
    
    
  2. Detach when mixing autograd and non-autograd:
    x = torch.randn(3, requires_grad=True)
    y = x.detach().numpy()  # Convert to numpy
    
  3. Memory leaks with retain_graph:
    # Only use retain_graph when needed
    loss.backward(retain_graph=True)
    # Clear gradients after last backward
    model.zero_grad(set_to_none=True)
    
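retain_graph is only needed when backward traverses the same graph more than once, e.g. two losses sharing an intermediate result:

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
y = x ** 2
loss1 = y.sum()
loss2 = (y * 2).sum()

loss1.backward(retain_graph=True)  # keep the graph for the second pass
loss2.backward()                   # last backward frees the graph
print(x.grad)  # tensor([6., 12.])  (2x + 4x accumulated)
```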
