Overview
Autograd is PyTorch’s automatic differentiation engine that powers neural network training. The C++ autograd API augments ATen tensors with gradient tracking and reverse-mode differentiation capabilities.
Key Concepts
Computational Graph
When you perform operations on tensors with requires_grad=True, autograd records these operations to form a computational graph. Each tensor knows what operation created it, allowing gradients to flow backward through the graph.
#include <torch/torch.h>
torch::Tensor x = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;
torch::Tensor z = y * y * 3;
torch::Tensor out = z.mean();
// Graph: x -> (+2) -> y -> (*y*3) -> z -> mean() -> out
Gradient Computation
Calling .backward() on a tensor computes gradients of that tensor with respect to all tensors that have requires_grad=True in the computational graph.
out.backward(); // Compute gradients
std::cout << x.grad() << std::endl; // Access gradient of x
Creating Differentiable Tensors
Using torch:: Namespace
Only tensors created with torch:: factory functions (not at::) support autograd by default.
#include <torch/torch.h>
// Differentiable tensors
torch::Tensor a = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor b = torch::randn({2, 2}, torch::requires_grad());
// Check gradient tracking
assert(a.requires_grad()); // true
assert(b.requires_grad()); // true
Setting requires_grad
// Enable gradient tracking on existing tensor
torch::Tensor x = torch::randn({2, 2});
x.requires_grad_(true);
// Create with TensorOptions
auto options = torch::TensorOptions()
.dtype(torch::kFloat32)
.requires_grad(true);
torch::Tensor y = torch::zeros({2, 2}, options);
Detaching from Graph
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;
// Detach from computational graph
torch::Tensor y_detached = y.detach();
assert(!y_detached.requires_grad());
// No-grad block
{
torch::NoGradGuard no_grad;
torch::Tensor z = x * 2; // z doesn't require grad
assert(!z.requires_grad());
}
Computing Gradients
Basic Backward Pass
torch::Tensor x = torch::ones({2, 2}, torch::requires_grad());
torch::Tensor y = x + 2;
torch::Tensor z = y * y * 3;
torch::Tensor out = z.mean();
out.backward(); // Compute gradients
std::cout << x.grad() << std::endl;
// Output: gradient of out w.r.t. x
Backward with Gradient Argument
For non-scalar outputs, you must provide a gradient argument:
torch::Tensor x = torch::randn({3}, torch::requires_grad());
torch::Tensor y = x * 2;
// y is not a scalar, need gradient argument
torch::Tensor grad_output = torch::ones({3});
y.backward(grad_output);
std::cout << x.grad() << std::endl;
Using torch::autograd::backward
For more control, use the backward function directly:
#include <torch/torch.h> // torch::autograd::backward is part of the public API
using namespace torch::autograd;
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;
// Compute gradients
backward({z.sum()}, {});
std::cout << "x.grad: " << x.grad() << std::endl;
std::cout << "y.grad: " << y.grad() << std::endl;
Retaining Graph
By default, the computational graph is freed after .backward(). To run backward multiple times:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = x * x;
// First backward (retain graph)
y.sum().backward(/*gradient=*/{}, /*retain_graph=*/true);
auto grad1 = x.grad().clone();
// Second backward
y.sum().backward();
auto grad2 = x.grad();
// grad2 == 2 * grad1 (gradients accumulated)
Computing Gradients with grad()
The grad function computes and returns gradients without accumulating them:
#include <torch/torch.h> // torch::autograd::grad is part of the public API
using namespace torch::autograd;
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;
// Compute gradients without accumulating
auto grads = grad({z}, {x, y}, {torch::ones({2, 2})});
std::cout << "dx: " << grads[0] << std::endl; // gradient w.r.t. x
std::cout << "dy: " << grads[1] << std::endl; // gradient w.r.t. y
// x.grad() and y.grad() are still undefined
Higher-Order Gradients
Compute gradients of gradients:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = torch::randn({2, 2}, torch::requires_grad());
auto z = x + 2 * y + x * y;
// First-order gradients
z.sum().backward(/*gradient=*/{}, /*retain_graph=*/true, /*create_graph=*/true);
torch::Tensor x_grad = x.grad();
torch::Tensor y_grad = y.grad();
// Second-order gradients (Hessian-vector product)
torch::Tensor grad_sum = 2 * x_grad + y_grad;
auto hessian = grad({grad_sum}, {x}, {torch::ones({2, 2})}, /*retain_graph=*/true);
std::cout << "Hessian-vector product: " << hessian[0] << std::endl;
Controlling Gradient Computation
NoGradGuard
Disable gradient tracking for a code block:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
{
torch::NoGradGuard no_grad;
torch::Tensor y = x * 2; // No gradient tracking
assert(!y.requires_grad());
}
// Gradient tracking resumes
torch::Tensor z = x * 3;
assert(z.requires_grad());
GradMode
Control gradient mode programmatically:
// Check current mode
bool is_enabled = torch::GradMode::is_enabled();
// Disable gradients
torch::GradMode::set_enabled(false);
torch::Tensor y = x * 2; // No gradients
// Re-enable gradients
torch::GradMode::set_enabled(true);
InferenceMode (Recommended for Inference)
For better performance during inference, use InferenceMode:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
{
torch::InferenceMode guard;
torch::Tensor y = x * 2; // Faster, no autograd overhead
// Cannot call backward() here
}
InferenceMode provides better performance than NoGradGuard for inference workloads by completely disabling autograd machinery.
Gradient Accumulation
Accumulating Gradients
By default, calling .backward() multiple times accumulates gradients:
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
for (int i = 0; i < 3; i++) {
torch::Tensor y = x * i;
y.sum().backward(); // a fresh graph is built each iteration, so retain_graph is not needed
}
// x.grad() contains accumulated gradients from all iterations
Zero Gradients
Clear accumulated gradients:
// Method 1: Zero the gradient in place
if (x.grad().defined()) {
x.mutable_grad().zero_();
}
// Method 2: Reset to an undefined tensor (frees the gradient's memory)
// Note: assigning through grad() has no effect; use mutable_grad()
x.mutable_grad() = torch::Tensor();
Custom Autograd Functions
Define custom backward behavior by subclassing torch::autograd::Function:
#include <torch/torch.h> // torch::autograd::Function is part of the public API
class MyMultiply : public torch::autograd::Function<MyMultiply> {
public:
static torch::Tensor forward(
torch::autograd::AutogradContext* ctx,
torch::Tensor input,
torch::Tensor weight) {
ctx->save_for_backward({input, weight});
return input * weight;
}
static torch::autograd::tensor_list backward(
torch::autograd::AutogradContext* ctx,
torch::autograd::tensor_list grad_outputs) {
auto saved = ctx->get_saved_variables();
auto input = saved[0];
auto weight = saved[1];
auto grad_output = grad_outputs[0];
torch::Tensor grad_input = grad_output * weight;
torch::Tensor grad_weight = grad_output * input;
return {grad_input, grad_weight};
}
};
// Usage
torch::Tensor x = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor w = torch::randn({2, 2}, torch::requires_grad());
torch::Tensor y = MyMultiply::apply(x, w);
y.sum().backward();
Practical Examples
Linear Regression
#include <torch/torch.h>
int main() {
// Data
torch::Tensor x = torch::randn({100, 1});
torch::Tensor y = 3 * x + 2 + torch::randn({100, 1}) * 0.1;
// Parameters
torch::Tensor w = torch::randn({1, 1}, torch::requires_grad());
torch::Tensor b = torch::zeros({1}, torch::requires_grad());
// Training loop
float learning_rate = 0.01;
for (int epoch = 0; epoch < 100; epoch++) {
// Forward pass
torch::Tensor y_pred = x.mm(w) + b;
torch::Tensor loss = (y_pred - y).pow(2).mean();
// Backward pass
loss.backward();
// Update parameters
{
torch::NoGradGuard no_grad;
w -= learning_rate * w.grad();
b -= learning_rate * b.grad();
// Zero gradients
w.mutable_grad().zero_();
b.mutable_grad().zero_();
}
if (epoch % 10 == 0) {
std::cout << "Epoch " << epoch << ", Loss: " << loss.item<float>() << std::endl;
}
}
std::cout << "w: " << w << ", b: " << b << std::endl;
return 0;
}
Computing Jacobian
torch::Tensor compute_jacobian(torch::Tensor x, torch::Tensor y) {
// x: input tensor (n,); must have requires_grad=true
// y: output tensor (m,), computed from x with its graph still alive
int m = y.size(0);
int n = x.size(0);
torch::Tensor jacobian = torch::zeros({m, n});
for (int i = 0; i < m; i++) {
if (x.grad().defined()) {
x.mutable_grad().zero_();
}
torch::Tensor grad_output = torch::zeros_like(y);
grad_output[i] = 1;
y.backward(grad_output, /*retain_graph=*/true);
jacobian[i] = x.grad().clone();
}
return jacobian;
}
Best Practices
Use InferenceMode for Inference
When running inference, wrap code in torch::InferenceMode for better performance.
{
torch::InferenceMode guard;
auto output = model->forward(input);
}
Clear Gradients Between Iterations
Always zero gradients before the backward pass to prevent unintended accumulation.
optimizer.zero_grad();
loss.backward();
optimizer.step();
Use NoGradGuard for Parameter Updates
Prevent tracking gradients during parameter updates.
{
torch::NoGradGuard no_grad;
param -= lr * param.grad();
}
Be Careful with retain_graph
Only use retain_graph=true when you need multiple backward passes, as it increases memory usage.
Check Gradient Definition
Before accessing gradients, verify they're defined.
if (x.grad().defined()) {
auto grad = x.grad();
}
Common Patterns
Gradient Clipping
// Clip gradients by norm
float max_norm = 1.0;
torch::nn::utils::clip_grad_norm_(model->parameters(), max_norm);
// Clip gradients by value
for (auto& param : model->parameters()) {
if (param.grad().defined()) {
param.mutable_grad().clamp_(-1.0, 1.0);
}
}
Freezing Parameters
// Freeze specific parameters
for (auto& param : model->named_parameters()) {
if (param.key().find("conv") != std::string::npos) {
param.value().set_requires_grad(false);
}
}
Gradient Checkpointing
// Simplified sketch: run a forward segment without building a graph, so
// intermediate activations are not stored. Full gradient checkpointing
// additionally re-runs this forward inside a custom Function's backward()
// to recompute the activations when gradients are needed.
torch::Tensor checkpoint_forward(
std::function<torch::Tensor(torch::Tensor)> func,
torch::Tensor input) {
torch::Tensor output;
{
torch::NoGradGuard no_grad;
output = func(input.detach());
}
return output;
}
Next Steps