Overview

Autograd is PyTorch’s automatic differentiation engine that powers neural network training. The C++ autograd API augments ATen tensors with gradient tracking and reverse-mode differentiation capabilities.

Key Concepts

Computational Graph

When you perform operations on tensors with requires_grad=True, autograd records these operations to form a computational graph. Each tensor knows what operation created it, allowing gradients to flow backward through the graph.
#include <torch/torch.h>

torch::Tensor x = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;
torch::Tensor z = y * y * 3;
torch::Tensor out = z.mean();

// Graph: x -> (+2) -> y -> (*y*3) -> z -> mean() -> out

Gradient Computation

Calling .backward() on a tensor computes gradients of that tensor with respect to all tensors that have requires_grad=True in the computational graph.
out.backward();  // Compute gradients
std::cout << x.grad() << std::endl;  // Access gradient of x

Creating Differentiable Tensors

Using torch:: Namespace

Only tensors created with torch:: factory functions (not at::) support autograd by default.
#include <torch/torch.h>

// Differentiable tensors
torch::Tensor a = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor b = torch::randn({2, 2}, torch::requires_grad());

// Check gradient tracking
assert(a.requires_grad());  // true
assert(b.requires_grad());  // true

Setting requires_grad

// Enable gradient tracking on existing tensor
torch::Tensor x = torch::randn({2, 2});
x.requires_grad_(true);

// Create with TensorOptions
auto options = torch::TensorOptions()
    .dtype(torch::kFloat32)
    .requires_grad(true);
torch::Tensor y = torch::zeros({2, 2}, options);

Detaching from Graph

torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;

// Detach from computational graph
torch::Tensor y_detached = y.detach();
assert(!y_detached.requires_grad());

// No-grad block
{
  torch::NoGradGuard no_grad;
  torch::Tensor z = x * 2;  // z doesn't require grad
  assert(!z.requires_grad());
}

Computing Gradients

Basic Backward Pass

torch::Tensor x = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;
torch::Tensor z = y * y * 3;
torch::Tensor out = z.mean();

out.backward();  // Compute gradients

std::cout << x.grad() << std::endl;
// Output: gradient of out w.r.t. x

Backward with Gradient Argument

For non-scalar outputs, you must provide a gradient argument:
torch::Tensor x = torch::randn({3}, torch::requires_grad());
torch::Tensor y = x * 2;

// y is not a scalar, need gradient argument
torch::Tensor grad_output = torch::ones({3});
y.backward(grad_output);

std::cout << x.grad() << std::endl;

Using torch::autograd::backward

For more control, use the backward function directly:
#include <torch/csrc/autograd/autograd.h>  // declares torch::autograd::backward
using namespace torch::autograd;

torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;

// Compute gradients
backward({z.sum()}, {});

std::cout << "x.grad: " << x.grad() << std::endl;
std::cout << "y.grad: " << y.grad() << std::endl;

Retaining Graph

By default, the computational graph is freed after .backward(). To run backward multiple times:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = x * x;

// First backward (retain graph)
y.sum().backward(/*gradient=*/{}, /*retain_graph=*/true);
auto grad1 = x.grad().clone();

// Second backward
y.sum().backward();
auto grad2 = x.grad();

// grad2 == 2 * grad1 (gradients accumulated)

Computing Gradients with grad()

The torch::autograd::grad function returns gradients directly instead of accumulating them into each tensor's .grad() field:
#include <torch/csrc/autograd/autograd.h>  // declares torch::autograd::grad
using namespace torch::autograd;

torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;

// Compute gradients without accumulating
auto grads = grad({z}, {x, y}, {torch::ones({2, 2})});

std::cout << "dx: " << grads[0] << std::endl;  // gradient w.r.t. x
std::cout << "dy: " << grads[1] << std::endl;  // gradient w.r.t. y

// x.grad() and y.grad() are still undefined

Higher-Order Gradients

Compute gradients of gradients:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;

// First-order gradients
z.sum().backward(/*gradient=*/{}, /*retain_graph=*/true, /*create_graph=*/true);

torch::Tensor x_grad = x.grad();
torch::Tensor y_grad = y.grad();

// Second-order gradients (Hessian-vector product)
torch::Tensor grad_sum = 2 * x_grad + y_grad;
auto hessian = grad({grad_sum}, {x}, {torch::ones({2, 2})}, /*retain_graph=*/true);

std::cout << "Hessian-vector product: " << hessian[0] << std::endl;

Controlling Gradient Computation

NoGradGuard

Disable gradient tracking for a code block:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());

{
  torch::NoGradGuard no_grad;
  torch::Tensor y = x * 2;  // No gradient tracking
  assert(!y.requires_grad());
}

// Gradient tracking resumes
torch::Tensor z = x * 3;
assert(z.requires_grad());

GradMode

Control gradient mode programmatically:
// Check current mode
bool is_enabled = torch::GradMode::is_enabled();

// Disable gradients
torch::GradMode::set_enabled(false);
torch::Tensor y = x * 2;  // No gradients

// Re-enable gradients
torch::GradMode::set_enabled(true);
For better performance during inference, use InferenceMode:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());

{
  torch::InferenceMode guard;
  torch::Tensor y = x * 2;  // Faster, no autograd overhead
  // Cannot call backward() here
}
InferenceMode provides better performance than NoGradGuard for inference workloads by completely disabling autograd machinery.

Gradient Accumulation

Accumulating Gradients

By default, calling .backward() multiple times accumulates gradients:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());

for (int i = 0; i < 3; i++) {
  torch::Tensor y = x * i;
  y.sum().backward();  // a fresh graph is built each iteration, so retain_graph is unnecessary
}

// x.grad() contains accumulated gradients from all iterations

Zero Gradients

Clear accumulated gradients:
// Method 1: Zero in place
if (x.grad().defined()) {
  x.grad().zero_();
}

// Method 2: Reset to an undefined tensor (skips the fill).
// grad() returns a const reference, so reassign through mutable_grad().
x.mutable_grad() = torch::Tensor();

Custom Autograd Functions

Define custom backward behavior by subclassing torch::autograd::Function:
#include <torch/torch.h>  // brings in torch::autograd::Function

class MyMultiply : public torch::autograd::Function<MyMultiply> {
 public:
  static torch::Tensor forward(
      torch::autograd::AutogradContext* ctx,
      torch::Tensor input,
      torch::Tensor weight) {
    ctx->save_for_backward({input, weight});
    return input * weight;
  }

  static torch::autograd::tensor_list backward(
      torch::autograd::AutogradContext* ctx,
      torch::autograd::tensor_list grad_outputs) {
    auto saved = ctx->get_saved_variables();
    auto input = saved[0];
    auto weight = saved[1];
    auto grad_output = grad_outputs[0];

    torch::Tensor grad_input = grad_output * weight;
    torch::Tensor grad_weight = grad_output * input;

    return {grad_input, grad_weight};
  }
};

// Usage
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor w = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = MyMultiply::apply(x, w);
y.sum().backward();

Practical Examples

Linear Regression

#include <torch/torch.h>

int main() {
  // Data
  torch::Tensor x = torch::randn({100, 1});
  torch::Tensor y = 3 * x + 2 + torch::randn({100, 1}) * 0.1;

  // Parameters
  torch::Tensor w = torch::randn({1, 1}, torch::requires_grad());
  torch::Tensor b = torch::zeros({1}, torch::requires_grad());

  // Training loop
  float learning_rate = 0.01;
  for (int epoch = 0; epoch < 100; epoch++) {
    // Forward pass
    torch::Tensor y_pred = x.mm(w) + b;
    torch::Tensor loss = (y_pred - y).pow(2).mean();

    // Backward pass
    loss.backward();

    // Update parameters
    {
      torch::NoGradGuard no_grad;
      w -= learning_rate * w.grad();
      b -= learning_rate * b.grad();

      // Zero gradients
      w.mutable_grad().zero_();
      b.mutable_grad().zero_();
    }

    if (epoch % 10 == 0) {
      std::cout << "Epoch " << epoch << ", Loss: " << loss.item<float>() << std::endl;
    }
  }

  std::cout << "w: " << w << ", b: " << b << std::endl;
  return 0;
}

Computing Jacobian

torch::Tensor compute_jacobian(torch::Tensor x, torch::Tensor y) {
  // x: input tensor (n,)
  // y: output tensor (m,)
  int m = y.size(0);
  int n = x.size(0);
  
  torch::Tensor jacobian = torch::zeros({m, n});
  
  for (int i = 0; i < m; i++) {
    if (x.grad().defined()) {
      x.mutable_grad().zero_();
    }
    
    torch::Tensor grad_output = torch::zeros_like(y);
    grad_output[i] = 1;
    
    y.backward(grad_output, /*retain_graph=*/true);
    jacobian[i] = x.grad().clone();
  }
  
  return jacobian;
}

Best Practices

1. Use InferenceMode for Inference

When running inference, wrap code in torch::InferenceMode for better performance.
{
  torch::InferenceMode guard;
  auto output = model->forward(input);
}
2. Clear Gradients Between Iterations

Always zero gradients before backward pass to prevent accumulation.
optimizer.zero_grad();
loss.backward();
optimizer.step();
3. Use NoGradGuard for Parameter Updates

Prevent tracking gradients during parameter updates.
{
  torch::NoGradGuard no_grad;
  param -= lr * param.grad();
}
4. Be Careful with retain_graph

Only use retain_graph=true when you need multiple backward passes, as it increases memory usage.
5. Check Gradient Definition

Before accessing gradients, verify they’re defined.
if (x.grad().defined()) {
  auto grad = x.grad();
}

Common Patterns

Gradient Clipping

// Clip gradients by norm
float max_norm = 1.0;
torch::nn::utils::clip_grad_norm_(model->parameters(), max_norm);

// Clip gradients by value
for (auto& param : model->parameters()) {
  if (param.grad().defined()) {
    param.mutable_grad().clamp_(-1.0, 1.0);
  }
}

Freezing Parameters

// Freeze specific parameters
for (auto& param : model->named_parameters()) {
  if (param.key().find("conv") != std::string::npos) {
    param.value().set_requires_grad(false);
  }
}

Gradient Checkpointing

// Sketch: run the forward pass without recording the graph, so intermediate
// activations inside func are not stored. Caveat: as written, gradients cannot
// flow through func at all; full checkpointing wraps this in a custom autograd
// Function whose backward re-runs the forward to recompute the skipped activations.
torch::Tensor checkpoint_forward(
    std::function<torch::Tensor(torch::Tensor)> func,
    torch::Tensor input) {
  torch::Tensor output;
  {
    torch::NoGradGuard no_grad;
    output = func(input.detach());
  }
  return output;
}

Next Steps