Overview

Autograd is PyTorch’s automatic differentiation engine that powers neural network training. The C++ autograd API augments ATen tensors with gradient tracking and reverse-mode differentiation capabilities.

Key Concepts

Computational Graph

When you perform operations on tensors with requires_grad=True, autograd records these operations to form a computational graph. Each tensor knows what operation created it, allowing gradients to flow backward through the graph.
#include <torch/torch.h>

torch::Tensor x = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;
torch::Tensor z = y * y * 3;
torch::Tensor out = z.mean();

// Graph: x -> (+2) -> y -> (*y*3) -> z -> mean() -> out

Gradient Computation

Calling .backward() on a tensor computes gradients of that tensor with respect to all tensors that have requires_grad=True in the computational graph.
out.backward();  // Compute gradients
std::cout << x.grad() << std::endl;  // Access gradient of x

Creating Differentiable Tensors

Using torch:: Namespace

Only tensors created with torch:: factory functions (not at::) support autograd by default.
#include <torch/torch.h>

// Differentiable tensors
torch::Tensor a = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor b = torch::randn({2, 2}, torch::requires_grad());

// Check gradient tracking
assert(a.requires_grad());  // true
assert(b.requires_grad());  // true

Setting requires_grad

// Enable gradient tracking on existing tensor
torch::Tensor x = torch::randn({2, 2});
x.requires_grad_(true);

// Create with TensorOptions
auto options = torch::TensorOptions()
    .dtype(torch::kFloat32)
    .requires_grad(true);
torch::Tensor y = torch::zeros({2, 2}, options);

Detaching from Graph

torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;

// Detach from computational graph
torch::Tensor y_detached = y.detach();
assert(!y_detached.requires_grad());

// No-grad block
{
  torch::NoGradGuard no_grad;
  torch::Tensor z = x * 2;  // z doesn't require grad
  assert(!z.requires_grad());
}

Computing Gradients

Basic Backward Pass

torch::Tensor x = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;
torch::Tensor z = y * y * 3;
torch::Tensor out = z.mean();

out.backward();  // Compute gradients

std::cout << x.grad() << std::endl;
// Output: gradient of out w.r.t. x

Backward with Gradient Argument

For non-scalar outputs, you must provide a gradient argument:
torch::Tensor x = torch::randn({3}, torch::requires_grad());
torch::Tensor y = x * 2;

// y is not a scalar, need gradient argument
torch::Tensor grad_output = torch::ones({3});
y.backward(grad_output);

std::cout << x.grad() << std::endl;

Using torch::autograd::backward

For more control, use the backward function directly:
#include <torch/csrc/autograd/autograd.h>  // declares torch::autograd::backward
using namespace torch::autograd;

torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;

// Compute gradients
backward({z.sum()}, {});

std::cout << "x.grad: " << x.grad() << std::endl;
std::cout << "y.grad: " << y.grad() << std::endl;

Retaining Graph

By default, the computational graph is freed after .backward(). To run backward multiple times:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = x * x;

// First backward (retain graph)
y.sum().backward(/*gradient=*/{}, /*retain_graph=*/true);
auto grad1 = x.grad().clone();

// Second backward
y.sum().backward();
auto grad2 = x.grad();

// grad2 == 2 * grad1 (gradients accumulated)

Computing Gradients with grad()

The torch::autograd::grad function returns gradients directly instead of accumulating them into each tensor's .grad() field:
#include <torch/csrc/autograd/autograd.h>  // declares torch::autograd::grad
using namespace torch::autograd;

torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;

// Compute gradients without accumulating
auto grads = grad({z}, {x, y}, {torch::ones({2, 2})});

std::cout << "dx: " << grads[0] << std::endl;  // gradient w.r.t. x
std::cout << "dy: " << grads[1] << std::endl;  // gradient w.r.t. y

// x.grad() and y.grad() are still undefined

Higher-Order Gradients

Compute gradients of gradients:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;

// First-order gradients
z.sum().backward(/*gradient=*/{}, /*retain_graph=*/true, /*create_graph=*/true);

torch::Tensor x_grad = x.grad();
torch::Tensor y_grad = y.grad();

// Second-order gradients (Hessian-vector product)
torch::Tensor grad_sum = 2 * x_grad + y_grad;
auto hessian = grad({grad_sum}, {x}, {torch::ones({2, 2})}, /*retain_graph=*/true);

std::cout << "Hessian-vector product: " << hessian[0] << std::endl;

Controlling Gradient Computation

NoGradGuard

Disable gradient tracking for a code block:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());

{
  torch::NoGradGuard no_grad;
  torch::Tensor y = x * 2;  // No gradient tracking
  assert(!y.requires_grad());
}

// Gradient tracking resumes
torch::Tensor z = x * 3;
assert(z.requires_grad());

GradMode

Control gradient mode programmatically:
// Check current mode
bool is_enabled = torch::GradMode::is_enabled();

// Disable gradients
torch::GradMode::set_enabled(false);
torch::Tensor y = x * 2;  // No gradients

// Re-enable gradients
torch::GradMode::set_enabled(true);
For better performance during inference, use InferenceMode:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());

{
  torch::InferenceMode guard;
  torch::Tensor y = x * 2;  // Faster, no autograd overhead
  // Cannot call backward() here
}
InferenceMode provides better performance than NoGradGuard for inference workloads by completely disabling autograd machinery.

Gradient Accumulation

Accumulating Gradients

By default, calling .backward() multiple times accumulates gradients:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());

for (int i = 0; i < 3; i++) {
  torch::Tensor y = x * i;
  y.sum().backward();  // a fresh graph is built each iteration, so retain_graph is unnecessary
}

// x.grad() contains accumulated gradients from all iterations

Zero Gradients

Clear accumulated gradients:
// Method 1: Zero in place
if (x.grad().defined()) {
  x.grad().zero_();
}

// Method 2: Reset to an undefined tensor (skips the fill).
// grad() returns a const reference, so reassign through mutable_grad().
x.mutable_grad() = torch::Tensor();

Custom Autograd Functions

Define custom backward behavior by subclassing torch::autograd::Function:
#include <torch/torch.h>  // brings in torch::autograd::Function

class MyMultiply : public torch::autograd::Function<MyMultiply> {
 public:
  static torch::Tensor forward(
      torch::autograd::AutogradContext* ctx,
      torch::Tensor input,
      torch::Tensor weight) {
    ctx->save_for_backward({input, weight});
    return input * weight;
  }

  static torch::autograd::tensor_list backward(
      torch::autograd::AutogradContext* ctx,
      torch::autograd::tensor_list grad_outputs) {
    auto saved = ctx->get_saved_variables();
    auto input = saved[0];
    auto weight = saved[1];
    auto grad_output = grad_outputs[0];

    torch::Tensor grad_input = grad_output * weight;
    torch::Tensor grad_weight = grad_output * input;

    return {grad_input, grad_weight};
  }
};

// Usage
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor w = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = MyMultiply::apply(x, w);
y.sum().backward();

Practical Examples

Linear Regression

#include <torch/torch.h>

int main() {
  // Data
  torch::Tensor x = torch::randn({100, 1});
  torch::Tensor y = 3 * x + 2 + torch::randn({100, 1}) * 0.1;

  // Parameters
  torch::Tensor w = torch::randn({1, 1}, torch::requires_grad());
  torch::Tensor b = torch::zeros({1}, torch::requires_grad());

  // Training loop
  float learning_rate = 0.01;
  for (int epoch = 0; epoch < 100; epoch++) {
    // Forward pass
    torch::Tensor y_pred = x.mm(w) + b;
    torch::Tensor loss = (y_pred - y).pow(2).mean();

    // Backward pass
    loss.backward();

    // Update parameters
    {
      torch::NoGradGuard no_grad;
      w -= learning_rate * w.grad();
      b -= learning_rate * b.grad();

      // Zero gradients
      w.mutable_grad().zero_();
      b.mutable_grad().zero_();
    }

    if (epoch % 10 == 0) {
      std::cout << "Epoch " << epoch << ", Loss: " << loss.item<float>() << std::endl;
    }
  }

  std::cout << "w: " << w << ", b: " << b << std::endl;
  return 0;
}

Computing Jacobian

torch::Tensor compute_jacobian(torch::Tensor x, torch::Tensor y) {
  // x: input tensor (n,)
  // y: output tensor (m,)
  int m = y.size(0);
  int n = x.size(0);
  
  torch::Tensor jacobian = torch::zeros({m, n});
  
  for (int i = 0; i < m; i++) {
    if (x.grad().defined()) {
      x.mutable_grad().zero_();
    }
    
    torch::Tensor grad_output = torch::zeros_like(y);
    grad_output[i] = 1;
    
    y.backward(grad_output, /*retain_graph=*/true);
    jacobian[i] = x.grad().clone();
  }
  
  return jacobian;
}

Best Practices

1. Use InferenceMode for Inference

When running inference, wrap code in torch::InferenceMode for better performance.
{
  torch::InferenceMode guard;
  auto output = model->forward(input);
}
2. Clear Gradients Between Iterations

Always zero gradients before backward pass to prevent accumulation.
optimizer.zero_grad();
loss.backward();
optimizer.step();
3. Use NoGradGuard for Parameter Updates

Prevent tracking gradients during parameter updates.
{
  torch::NoGradGuard no_grad;
  param -= lr * param.grad();
}
4. Be Careful with retain_graph

Only use retain_graph=true when you need multiple backward passes, as it increases memory usage.
5. Check Gradient Definition

Before accessing gradients, verify they’re defined.
if (x.grad().defined()) {
  auto grad = x.grad();
}

Common Patterns

Gradient Clipping

// Clip gradients by norm
float max_norm = 1.0;
torch::nn::utils::clip_grad_norm_(model->parameters(), max_norm);

// Clip gradients by value
for (auto& param : model->parameters()) {
  if (param.grad().defined()) {
    param.mutable_grad().clamp_(-1.0, 1.0);
  }
}

Freezing Parameters

// Freeze specific parameters
for (auto& param : model->named_parameters()) {
  if (param.key().find("conv") != std::string::npos) {
    param.value().set_requires_grad(false);
  }
}

Gradient Checkpointing

// Sketch: run the forward pass without recording the graph, so intermediate
// activations inside func are not stored. Caveat: as written, gradients cannot
// flow through func at all; full checkpointing wraps this in a custom autograd
// Function whose backward re-runs the forward to recompute the skipped activations.
torch::Tensor checkpoint_forward(
    std::function<torch::Tensor(torch::Tensor)> func,
    torch::Tensor input) {
  torch::Tensor output;
  {
    torch::NoGradGuard no_grad;
    output = func(input.detach());
  }
  return output;
}

Next Steps