Overview

Optimizers update model parameters based on computed gradients. This guide covers PyTorch’s built-in optimizers and learning rate scheduling strategies.

Common Optimizers

SGD (Stochastic Gradient Descent)

import torch.optim as optim

# Basic SGD
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0,
    weight_decay=0,
    nesterov=False
)
Parameters:
  • lr (float): Learning rate (default: 1e-3)
  • momentum (float): Momentum factor (default: 0)
  • weight_decay (float): L2 penalty coefficient (default: 0)
  • dampening (float): Dampening for momentum (default: 0)
  • nesterov (bool): Enable Nesterov momentum (default: False)
# SGD with momentum (recommended)
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True
)
SGD with momentum often achieves better generalization than adaptive optimizers, especially for computer vision tasks. Use lr=0.1 with momentum=0.9 as a starting point.
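
For reference, a minimal sketch of the update PyTorch applies with momentum (its convention with dampening=0 and nesterov=False; the tensors here are illustrative, not a real model):
import torch

lr, momentum = 0.1, 0.9
p = torch.randn(10)          # a parameter
grad = torch.randn(10)       # its gradient
buf = torch.zeros(10)        # momentum buffer

buf = momentum * buf + grad  # accumulate velocity
p = p - lr * buf             # take the step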

Adam

optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
    amsgrad=False
)
Parameters:
  • lr (float or Tensor): Learning rate (default: 1e-3)
  • betas (tuple[float, float]): Coefficients for computing running averages (default: (0.9, 0.999))
  • eps (float): Term added for numerical stability (default: 1e-8)
  • weight_decay (float): L2 penalty (default: 0)
  • amsgrad (bool): Use AMSGrad variant (default: False)
  • foreach (bool): Use multi-tensor implementation (default: None)
  • maximize (bool): Maximize the objective with respect to the params instead of minimizing (default: False)
  • capturable (bool): Safe for CUDA graphs (default: False)
  • differentiable (bool): Autograd through optimizer step (default: False)
  • fused (bool): Use fused kernel (default: None)
# Typical Adam configuration
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=0
)

AdamW

AdamW decouples weight decay from the gradient update:
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-2,  # Note: higher default than Adam
    amsgrad=False
)
Parameters:
  • lr (float or Tensor): Learning rate (default: 1e-3)
  • betas (tuple[float, float]): Coefficients for running averages (default: (0.9, 0.999))
  • eps (float): Numerical stability term (default: 1e-8)
  • weight_decay (float): Decoupled weight decay (default: 1e-2)
  • amsgrad (bool): Use AMSGrad variant (default: False)
  • maximize (bool): Maximize instead of minimize (default: False)
  • foreach (bool): Multi-tensor implementation (default: None)
  • capturable (bool): CUDA graph safe (default: False)
  • differentiable (bool): Enable autograd through step (default: False)
  • fused (bool): Use fused kernel (default: None)
# Better weight decay regularization
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2
)
Prefer AdamW over Adam when using weight decay. Adam’s weight_decay folds an L2 penalty into the gradient, where it gets rescaled by the adaptive moments; AdamW applies the decay directly to the weights, decoupled from the gradient update.
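
A minimal sketch of the difference on a bare tensor rather than a real model (lr, wd, p, and grad are illustrative names; momentum and bias correction are omitted):
import torch

lr, wd = 1e-3, 1e-2
p = torch.randn(10)       # a parameter
grad = torch.randn(10)    # its gradient

# Adam with weight_decay: the L2 term is folded into the gradient,
# so the adaptive second-moment estimate rescales it too
grad_with_l2 = grad + wd * p

# AdamW: the decay shrinks the weights directly,
# independent of the adaptive rescaling of the gradient
p_decayed = p - lr * wd * p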

Other Optimizers

optimizer = optim.RMSprop(
    model.parameters(),
    lr=1e-2,
    alpha=0.99,
    eps=1e-8,
    weight_decay=0,
    momentum=0
)

Optimizer Comparison

| Optimizer | Best For | Learning Rate | Pros | Cons |
|---|---|---|---|---|
| SGD + Momentum | Computer vision, proven architectures | 0.1 - 0.01 | Best generalization, stable | Requires tuning, slower convergence |
| Adam | Quick prototyping, NLP | 1e-3 - 1e-4 | Fast convergence, adaptive | Can overfit, worse generalization |
| AdamW | Transformers, modern architectures | 1e-3 - 1e-4 | Proper weight decay, fast | May need tuning |
| RMSprop | RNNs, non-stationary objectives | 1e-2 - 1e-3 | Handles non-stationary objectives well | Less common now |

Learning Rate Schedulers

StepLR

Decay learning rate by gamma every step_size epochs:
from torch.optim.lr_scheduler import StepLR

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
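
# lr = 0.1    for epochs 0-29
# lr = 0.01   for epochs 30-59
# lr = 0.001  for epochs 60-89, and so on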

for epoch in range(100):
    train_epoch()
    validate()
    scheduler.step()  # Update learning rate

MultiStepLR

Decay learning rate at specific milestones:
from torch.optim.lr_scheduler import MultiStepLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

# lr = 0.1     for epochs 0-29
# lr = 0.01    for epochs 30-79  
# lr = 0.001   for epochs 80+

CosineAnnealingLR

Anneal the learning rate from its initial value down to eta_min over T_max epochs:
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    train_epoch()
    scheduler.step()
Cosine annealing is popular for training from scratch. It smoothly reduces the learning rate following a cosine curve.
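
The closed form behind the schedule, as a hand computation rather than the library call (this mirrors the formula in the PyTorch docs; eta_max is the initial lr):
import math

def cosine_lr(t, eta_max=0.1, eta_min=1e-5, T_max=100):
    # eta_t = eta_min + (eta_max - eta_min) * (1 + cos(pi * t / T_max)) / 2
    return eta_min + (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_max)) / 2

print(cosine_lr(0))    # 0.1: starts at the initial lr
print(cosine_lr(50))   # ~0.05: halfway down the cosine
print(cosine_lr(100))  # 1e-5: ends at eta_min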

ReduceLROnPlateau

Reduce learning rate when a metric stops improving:
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',           # 'min' for loss, 'max' for accuracy
    factor=0.1,           # Multiply lr by 0.1
    patience=10,          # Epochs with no improvement before reducing
    min_lr=1e-6
)

for epoch in range(100):
    train_epoch()
    val_loss = validate()
    scheduler.step(val_loss)  # Pass metric to scheduler

OneCycleLR

One cycle learning rate policy (super-convergence):
from torch.optim.lr_scheduler import OneCycleLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,
    epochs=100,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,        # Fraction of the cycle spent increasing the lr
    anneal_strategy='cos'
)

for epoch in range(100):
    for batch in train_loader:
        train_batch(batch)
        scheduler.step()  # Step after each batch!
With OneCycleLR, call scheduler.step() after each batch, not after each epoch.

ExponentialLR

from torch.optim.lr_scheduler import ExponentialLR

optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = ExponentialLR(optimizer, gamma=0.95)

# lr *= 0.95 every epoch

CosineAnnealingWarmRestarts

from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,      # First restart after 10 epochs
    T_mult=2,    # Double the restart interval each time
    eta_min=1e-5
)
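
With these settings, restarts land at epochs 10, 30, and 70. The PyTorch docs step this scheduler with a fractional epoch so the schedule advances within an epoch; a sketch (train_batch and train_loader are the same placeholders as in the earlier examples):
iters = len(train_loader)
for epoch in range(100):
    for i, batch in enumerate(train_loader):
        train_batch(batch)
        scheduler.step(epoch + i / iters)  # fractional epoch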

Custom Learning Rate Schedules

Warmup + Cosine Decay

import math
from torch.optim.lr_scheduler import LambdaLR

def get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps,
    num_training_steps,
    num_cycles=0.5
):
    def lr_lambda(current_step):
        # Warmup
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        
        # Cosine decay
        progress = float(current_step - num_warmup_steps) / float(
            max(1, num_training_steps - num_warmup_steps)
        )
        return max(
            0.0,
            0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress))
        )
    
    return LambdaLR(optimizer, lr_lambda)

# Usage
optimizer = optim.AdamW(model.parameters(), lr=5e-5)
num_epochs = 10
num_training_steps = num_epochs * len(train_loader)
num_warmup_steps = num_training_steps // 10  # 10% warmup

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps,
    num_training_steps
)

for epoch in range(num_epochs):
    for batch in train_loader:
        train_batch(batch)
        scheduler.step()

Linear Warmup + Linear Decay

from torch.optim.lr_scheduler import LambdaLR

def get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps,
    num_training_steps
):
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(
            0.0,
            float(num_training_steps - current_step) / float(
                max(1, num_training_steps - num_warmup_steps)
            ),
        )
    
    return LambdaLR(optimizer, lr_lambda)

Per-Parameter Options

Different learning rates for different layers:
# Different lr for backbone and head
optimizer = optim.SGD([
    {'params': model.backbone.parameters(), 'lr': 1e-3},
    {'params': model.head.parameters(), 'lr': 1e-2}
], momentum=0.9)

# Or with weight decay only on certain parameters
optimizer = optim.AdamW([
    {'params': model.conv_layers.parameters(), 'weight_decay': 1e-4},
    {'params': model.bn_layers.parameters(), 'weight_decay': 0}  # No decay on BN
], lr=1e-3)
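
Real models rarely expose tidy conv_layers/bn_layers attributes, so a common pattern is to split named_parameters() by shape and name. A sketch under the usual convention that biases and 1-D norm weights get no decay (the convention, not a PyTorch API):
decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # Biases and norm weights/offsets are 1-D; skip decay on them
    if param.ndim == 1 or name.endswith('.bias'):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = optim.AdamW([
    {'params': decay, 'weight_decay': 1e-2},
    {'params': no_decay, 'weight_decay': 0.0},
], lr=1e-3)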

Learning Rate Finder

import matplotlib.pyplot as plt

def find_lr(
    model,
    train_loader,
    optimizer,
    criterion,
    device,
    start_lr=1e-7,
    end_lr=10,
    num_iter=100
):
    """Learning rate range test."""
    model.train()
    
    # Calculate lr multiplication factor
    lr_mult = (end_lr / start_lr) ** (1 / num_iter)
    
    # Set starting lr
    for param_group in optimizer.param_groups:
        param_group['lr'] = start_lr
    
    lrs = []
    losses = []
    best_loss = float('inf')
    
    iterator = iter(train_loader)
    
    for iteration in range(num_iter):
        try:
            inputs, targets = next(iterator)
        except StopIteration:
            iterator = iter(train_loader)
            inputs, targets = next(iterator)
        
        inputs, targets = inputs.to(device), targets.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        # Record
        lr = optimizer.param_groups[0]['lr']
        lrs.append(lr)
        losses.append(loss.item())
        
        # Update best loss
        if loss.item() < best_loss:
            best_loss = loss.item()
        
        # Stop if loss explodes
        if loss.item() > 4 * best_loss:
            break
        
        # Increase lr
        for param_group in optimizer.param_groups:
            param_group['lr'] *= lr_mult
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    plt.grid(True)
    plt.show()
    
    return lrs, losses

# Usage
model = YourModel().to(device)
optimizer = optim.SGD(model.parameters(), lr=1e-7)
criterion = nn.CrossEntropyLoss()

lrs, losses = find_lr(model, train_loader, optimizer, criterion, device)
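
The range test takes real optimization steps, so snapshot and restore the model and optimizer around it if you plan to keep training; a minimal sketch:
import copy

model_state = copy.deepcopy(model.state_dict())
optim_state = copy.deepcopy(optimizer.state_dict())

lrs, losses = find_lr(model, train_loader, optimizer, criterion, device)

# Throw away the finder's updates before real training
model.load_state_dict(model_state)
optimizer.load_state_dict(optim_state)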

Complete Training Example

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# Model and optimizer setup
model = YourModel().cuda()
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=1e-2
)

# Scheduler
scheduler = CosineAnnealingLR(
    optimizer,
    T_max=100,
    eta_min=1e-6
)

criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    model.train()
    
    for inputs, targets in train_loader:
        inputs, targets = inputs.cuda(), targets.cuda()
        
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    
    # Validation
    model.eval()
    val_loss = validate(model, val_loader, criterion)
    
    # Step scheduler
    scheduler.step()
    
    # Print stats
    current_lr = optimizer.param_groups[0]['lr']
    print(f'Epoch {epoch+1}: LR={current_lr:.6f}, Val Loss={val_loss:.4f}')

Best Practices

  • SGD + Momentum: Best for CNNs when training from scratch (ResNet, EfficientNet)
  • Adam: Good default for quick experimentation
  • AdamW: Best for Transformers and modern architectures (BERT, ViT)
  • RMSprop: Consider for RNNs and non-stationary problems
  • SGD: Start with 0.1, use schedulers to decay
  • Adam/AdamW: Start with 1e-3 or 1e-4
  • Fine-tuning: Use 10-100x smaller learning rate than training from scratch
  • Large batch sizes: Scale the learning rate linearly with batch size (e.g., doubling the batch size → double the lr)
  • Use OneCycleLR for fast training (great results in fewer epochs)
  • Use CosineAnnealingLR for training from scratch
  • Use ReduceLROnPlateau when you’re unsure about the schedule
  • Add warmup for large learning rates or large batch sizes
  • Always monitor the learning rate during training (see the logging sketch after this list)
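
A minimal logging sketch for the monitoring bullet above (get_last_lr() is available on the standard scheduler classes and returns one value per param group; train_epoch and num_epochs are placeholders as in the earlier examples):
for epoch in range(num_epochs):
    train_epoch()
    scheduler.step()
    print(f'epoch {epoch}: lr={scheduler.get_last_lr()[0]:.2e}')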

Next Steps