Quantization
The torch.quantization module provides tools to convert floating-point models to quantized versions with reduced precision (INT8) for faster inference and smaller model sizes.
Quantization Modes
PyTorch supports three types of quantization:
- Dynamic Quantization - Weights are quantized ahead of time; activations are quantized dynamically at runtime
- Static Quantization - Weights and activations are quantized based on observed data distributions
- Quantization-Aware Training (QAT) - Quantization is simulated during training for better accuracy
Core Functions
quantize
- model - Float model to be quantized.
- run_fn - Function to run the model on sample data for calibration.
- run_args - Arguments for the calibration function.
- inplace - Whether to modify the model in-place.
Returns: the quantized model.
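A minimal sketch of the one-call flow, assuming (per the eager-mode API) that run_args is unpacked as positional arguments to the calibration function; the model and calibration function here are illustrative:

```python
import torch
import torch.nn as nn

# Calibration function: quantize() invokes it as run_fn(model, *run_args).
def calibrate(model, batches):
    with torch.no_grad():
        for x in batches:
            model(x)

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU()).eval()
model.qconfig = torch.quantization.default_qconfig  # qconfig must be set first

batches = [torch.randn(4, 16) for _ in range(3)]
qmodel = torch.quantization.quantize(model, calibrate, [batches])
```

Internally this is equivalent to prepare, running the calibration function, then convert.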
quantize_dynamic
- model - Float model to be quantized.
- qconfig_spec - Either a set of module types or a dict mapping module types to QConfig.
- dtype - Quantized data type for weights. Options: torch.qint8, torch.float16.
- inplace - Whether to modify the model in-place.
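For instance, dynamically quantizing the Linear layers of a small model (a minimal sketch; the model is illustrative):

```python
import torch
import torch.nn as nn

float_model = nn.Sequential(nn.Linear(16, 8), nn.ReLU())

# Quantize Linear weights to INT8; activations are quantized on the fly at runtime.
qmodel = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

out = qmodel(torch.randn(2, 16))  # inference still takes ordinary float tensors
```

No calibration data is needed, which is what makes this the easiest mode to apply.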
quantize_qat
- model - Float model to be trained with quantization awareness.
- run_fn - Function to train the model.
- run_args - Arguments for the training function.
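A sketch of the QAT flow, assuming run_args is unpacked as positional arguments to the training function; the tiny training loop and model are illustrative, not a recommended recipe:

```python
import torch
import torch.nn as nn

# Illustrative training loop; quantize_qat() invokes it as run_fn(model, *run_args).
def train_fn(model, batches):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for x, y in batches:
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()

model = nn.Sequential(nn.Linear(16, 8)).train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

batches = [(torch.randn(4, 16), torch.randn(4, 8)) for _ in range(3)]
qmodel = torch.quantization.quantize_qat(model, train_fn, [batches])
```

During training, fake-quantization modules simulate INT8 rounding so the weights adapt to it; convert then produces the real quantized model.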
Preparation Functions
prepare
- model - Float model to be prepared.
- inplace - Whether to modify the model in-place.
prepare_qat
- model - Float model to be prepared for QAT.
convert
- module - Prepared model to be converted.
- inplace - Whether to modify the model in-place.
- remove_qconfig - Whether to remove qconfig attributes after conversion.
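The prepare → calibrate → convert flow these functions implement can be sketched as (the model is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 8)).eval()
model.qconfig = torch.quantization.default_qconfig

prepared = torch.quantization.prepare(model)      # inserts observers
for _ in range(3):                                # calibration: run representative data
    prepared(torch.randn(4, 16))
quantized = torch.quantization.convert(prepared)  # swaps in INT8 modules
```

Using the separate functions instead of quantize() gives you control over what happens between each stage.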
QConfig
QConfig
- activation - Observer class (or factory created with .with_args()) for activations.
- weight - Observer class (or factory) for weights.
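A sketch of building a custom QConfig; note that observers are passed as factories, not instances:

```python
import torch
from torch.quantization import QConfig, MinMaxObserver

# Custom QConfig: unsigned INT8 activations, symmetric signed INT8 weights.
my_qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8),
    weight=MinMaxObserver.with_args(dtype=torch.qint8,
                                    qscheme=torch.per_tensor_symmetric),
)

# Pre-defined alternatives include torch.quantization.default_qconfig
# and torch.quantization.get_default_qconfig('fbgemm').
```

Assign the result to model.qconfig before calling prepare().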
Pre-defined QConfigs
Observers
MinMaxObserver
- dtype - Quantized data type.
- qscheme - Quantization scheme to use (e.g., torch.per_tensor_affine, torch.per_tensor_symmetric).
HistogramObserver
- bins - Number of histogram bins.
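Observers can also be exercised standalone, which helps build intuition for what prepare() inserts into the model; a minimal sketch:

```python
import torch
from torch.quantization import MinMaxObserver

obs = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
obs(torch.randn(100))                        # record running min/max of the data
scale, zero_point = obs.calculate_qparams()  # derive quantization parameters
```

These scale/zero-point values are what convert() bakes into the quantized modules.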
Fake Quantization
FakeQuantize
Quantization Stubs
QuantStub
DeQuantStub
Fusion
fuse_modules
- model - Model containing the modules to fuse.
- modules_to_fuse - List of lists of module names to fuse, e.g., ['conv', 'bn', 'relu'].
- inplace - Whether to modify the model in-place.
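A sketch of fusing a Conv-BN-ReLU sequence (the module names are illustrative; Conv-BN fusion requires eval mode):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

m = Block().eval()  # fusing BN into Conv folds its statistics, so eval mode is required
fused = torch.quantization.fuse_modules(m, [['conv', 'bn', 'relu']])
# The fused op replaces conv; bn and relu are replaced with nn.Identity.
```

Fusion should happen before prepare(), so observers see the fused op as a single unit.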
Example Usage
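A minimal end-to-end static quantization sketch, assuming an x86 machine with the fbgemm backend; the network itself is illustrative:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> quantized
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # quantized -> float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = Net().eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # x86 backend
torch.quantization.fuse_modules(model, [['conv', 'relu']], inplace=True)

prepared = torch.quantization.prepare(model)
prepared(torch.randn(1, 3, 32, 32))           # calibration pass
quantized = torch.quantization.convert(prepared)

out = quantized(torch.randn(1, 3, 32, 32))    # INT8 inference
```

On ARM targets the same flow applies with the 'qnnpack' backend in place of 'fbgemm'.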
Best Practices
Choosing Quantization Type
- Dynamic Quantization: Use for models with large matrix multiplications (e.g., LSTMs, Transformers). Easy to apply, minimal accuracy loss.
- Static Quantization: Best for CNNs and models where activation distributions are consistent. Requires calibration data.
- QAT: Use when static quantization shows accuracy degradation. Provides best accuracy but requires retraining.
Performance Tips
- Fuse Conv-BN-ReLU sequences before quantization for better performance
- Use per-channel quantization for weights when possible
- Start with dynamic quantization, move to static/QAT only if needed
- Test on target hardware as quantized operations may have different performance characteristics
Debugging Quantization
- Compare outputs between float and quantized models on sample data
- Use torch.quantization.get_observer_dict() to inspect observer statistics
- Gradually quantize layers to identify problematic operations
- Consider using higher precision (e.g., fp16) for sensitive layers
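The first tip above can be sketched as follows (the model is illustrative; quantize_dynamic copies the model by default, so the float original stays intact for comparison):

```python
import torch
import torch.nn as nn

float_model = nn.Sequential(nn.Linear(16, 8)).eval()
qmodel = torch.quantization.quantize_dynamic(float_model, {nn.Linear})

# Run the same input through both models and measure the worst-case deviation.
x = torch.randn(4, 16)
max_err = (float_model(x) - qmodel(x)).abs().max().item()
```

A max error far above the quantization step size for a layer is a signal to keep that layer in higher precision.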