More Efficient Inference Through Model Quantization
Leverage lower precision representations of ML models to speed up inference on CPUs.
In part one of this two-part series on more efficient model inference on CPUs, I compared several methods that optimize models by converting them into compute graphs, utilizing more efficient operations, and fusing multiple model layers into a single one. None of these methods change the parameters of the model. Therefore, they do not affect the model’s performance metrics like its accuracy. However, if one is willing to tolerate some performance degradation, then other optimizations to speed up inference exist. In this article, I will take a closer look at one such technique, namely the quantization of the model's parameters to 8 bits.
Optimization Strategies Overview
Quantization is the best known of several approaches that change the model's parameters to enable faster inference, but it is only one of them. Another popular approach is model pruning. As the name suggests, model pruning removes entire parameters, often the ones with the lowest magnitude. It can be applied during training by iteratively removing parameters, or after training by removing a set number of parameters with subsequent fine-tuning to compensate for the loss in quality induced by the pruning. Even more complex methods, which use second-order derivatives in their pruning criterion, perform well post-training without the need for fine-tuning, but come with the significant cost of iterating to find the next parameter to remove. All of these pruning methods have in common that they result in substantially smaller models that run faster with only a small loss in performance.

Another popular approach is model distillation, which trains a smaller model on the softmax outputs of a larger (ensemble) teacher model. The premise is that the probability distribution of the teacher's outputs contains valuable information that helps the distilled model generalize better than training on just the labeled data with its bias toward the correct answer. Hence, one usually chooses a high (>1) temperature for the softmax to get more variability in the teacher's output. However, this approach requires training a completely new, albeit smaller, model from scratch.
There are even more methods, such as low-rank approximation of linear layers and combinations of two or more of these methods. However, I have decided to focus entirely on post-training quantization in my experiments, as it is an ad hoc method that is applicable to many model architectures.
Quantization
Quantization is the process of turning floating-point numbers into integer representations with fewer bits, resulting in both a smaller memory footprint and faster operations. However, this process is lossy and requires careful calibration.

A Quantization Primer
A simple yet powerful quantization algorithm is *uniform affine quantization*. It converts a floating-point number x into an integer p:

p = clamp(round(x / S) + Z, 0, 255)
In the above equation, S is the scale and Z is the zero point. The scale essentially determines the resolution of the quantization. For example, if the scale is set at 40, then floating-point numbers in the range [x − 20, x + 20] will be converted to the same quantized value p. Hence, careful selection of the scale is crucial to the quantization performance. If you know that most of your inputs are in the range set by [min, max], then you can set the scale according to this formula for 8-bit quantization:

S = (max − min) / (2⁸ − 1) = (max − min) / 255
The second parameter, the zero point, has the purpose of mapping zero to a quantized number, which is useful to, for example, maintain sparsity or pad your data with zeros for sequential data like text input. If your numbers are again in the above range, then Z can be calculated as follows:

Z = round(−min / S)
For example, for a range of [−2, 3], S is 0.0196 and Z is 102. Using the formula for affine quantization, −1 gets quantized to 51. Converting numbers outside this range is not possible: all values below −2 are clamped to 0 and all values above 3 are clamped to 255.
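The worked example above can be reproduced in a few lines of Python. This is a minimal sketch of uniform affine quantization; the function names are my own, not a library API.

```python
def affine_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Uniform affine quantization: map a float to an 8-bit integer."""
    p = round(x / scale) + zero_point
    return max(qmin, min(qmax, p))  # values outside the range get clamped

def affine_dequantize(p, scale, zero_point):
    """Approximate reconstruction of the original float."""
    return scale * (p - zero_point)

# Calibration range [min, max] = [-2, 3], as in the example above.
lo, hi = -2.0, 3.0
scale = (hi - lo) / 255          # ~0.0196
zero_point = round(-lo / scale)  # 102

print(affine_quantize(-1.0, scale, zero_point))  # 51
print(affine_quantize(-5.0, scale, zero_point))  # clamped to 0
print(affine_quantize(10.0, scale, zero_point))  # clamped to 255
```

Dequantizing 51 with the same parameters gives back approximately −1.0, which illustrates that the round trip is lossy only up to the resolution set by the scale.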
Quantization Challenges
Choosing a good quantization method for floating-point numbers is only the first of many challenges. Another one is multiplying two 8-bit numbers, which results in a 16-bit number, or the even more complex scenario of matrix multiplication, which needs accumulation into 32-bit numbers. The results of such operations need to be scaled back down to a new 8-bit representation by calculating new scales and zero points. Furthermore, to keep most operations of the matrix multiplication between 8-bit numbers, additional tricks need to be applied, like rewriting the equations. Fortunately, frameworks like PyTorch take care of these implementation details.
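The accumulation and rescaling steps can be sketched on a single dot product. This is a simplified illustration, not how a real kernel is written: it uses symmetric quantization (zero point 0) to keep the rescaling arithmetic readable, and the output scale is an assumption chosen by hand.

```python
# Sketch: dot product of two int8-quantized vectors with 32-bit
# accumulation, then requantization of the result back to 8 bits.

def quantize_sym(values, scale, qmin=-128, qmax=127):
    """Symmetric quantization: zero point fixed at 0."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

a = [0.5, -1.2, 0.9]
b = [1.1, 0.3, -0.7]
scale_a, scale_b = 0.01, 0.01

qa = quantize_sym(a, scale_a)
qb = quantize_sym(b, scale_b)

# Each 8-bit x 8-bit product needs up to 16 bits; summing many of
# them requires a 32-bit accumulator.
acc = sum(x * y for x, y in zip(qa, qb))  # an int32 in a real kernel

# The accumulator lives on the scale scale_a * scale_b. Requantize it
# to a fresh 8-bit output scale (chosen here by hand for illustration).
scale_out = 0.02
q_out = max(-128, min(127, round(acc * scale_a * scale_b / scale_out)))

print(acc, q_out)  # -4400 -22
```

Dequantizing the result, −22 × 0.02 = −0.44, matches the exact floating-point dot product of the two input vectors in this example.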
Static vs Dynamic Quantization
PyTorch offers two ways of quantizing a model: dynamic and static quantization. With the former, the weights are quantized ahead of time and stored as such. However, the normalization layers and activations remain in floating point and are calculated using floating-point operations. In other words, only the compute-intensive operations like the linear layers are quantized. With dynamic quantization, the inputs of a quantized linear layer are quantized on the fly by calculating a suitable scale and zero point. The outputs of this linear layer are 32-bit numbers, the result of many accumulated 8-bit multiplications. Hence, they need to be converted back to floating-point numbers before being passed to the next activation function. Overall, this approach works well for models whose computational bottleneck is large linear blocks, like transformers.
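In PyTorch, dynamic quantization is essentially a one-liner. A minimal sketch, assuming a CPU build of PyTorch with a quantized backend available; the toy model here stands in for a transformer's large linear blocks.

```python
import torch
import torch.nn as nn

# A toy model standing in for a network dominated by linear layers.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8))
model.eval()

# Quantize only the Linear layers' weights to int8. Activations stay
# in floating point and are quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 64)
with torch.no_grad():
    out = qmodel(x)
print(out.shape)  # torch.Size([4, 8])
```

The returned model is a drop-in replacement: inputs and outputs are still regular floating-point tensors, so no other code needs to change.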
In static quantization, by contrast, the activations are quantized as well, and the layers are fused together to reduce the number of such downcasts. For example, the sequence Conv → BatchNorm → ReLU is fused into a single ConvReLU layer. However, quantizing these activations requires observing them to find good values for the scale and zero point. Hence, it demands an additional post-training calibration pass with a calibration set and observers inserted at the activation layers that gather the statistics to calculate these values.

Static quantization has traditionally not worked well with transformers, because observations have shown that there are occasional outlier activations with high values, which make it hard to select a scale with sufficient resolution. As a result, quantizing the activations leads to performance degradation. However, recent approaches like SmoothQuant have managed to overcome these issues and allow the activations to be quantized without a high loss in accuracy.
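What an observer does during calibration can be sketched in plain Python. This is a simplified stand-in for the observers PyTorch inserts during static quantization, with my own class and method names; it tracks the running min/max of the activations it sees and derives the quantization parameters afterwards.

```python
class MinMaxObserver:
    """Tracks the running min/max of observed activations so that a
    scale and zero point can be derived after calibration."""

    def __init__(self):
        self.min = float("inf")
        self.max = float("-inf")

    def observe(self, batch):
        self.min = min(self.min, min(batch))
        self.max = max(self.max, max(batch))

    def qparams(self, qmin=0, qmax=255):
        scale = (self.max - self.min) / (qmax - qmin)
        zero_point = round(-self.min / scale)
        return scale, zero_point

obs = MinMaxObserver()
# "Calibration set": a few batches of representative activations.
for batch in [[-0.5, 0.1, 2.0], [-2.0, 0.7], [1.0, 3.0]]:
    obs.observe(batch)

scale, zero_point = obs.qparams()
print(scale, zero_point)  # ~0.0196, 102
```

This also shows why outlier activations are problematic: a single extreme value stretches the observed range, inflates the scale, and thereby coarsens the resolution for all the ordinary values.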
Post Training Quantization vs. Quantization Aware Training
The methods introduced in the previous section were all applied after training and work well for quantization down to 8 bits. For quantizing a model to 4 bits or fewer, other approaches are needed. One such approach is Quantization-Aware Training (QAT), which applies fake quantization in the forward pass during training. Fake quantization performs the affine transformation, but its outputs are not converted to integers. Instead, they remain floating-point numbers; e.g., if the quantization would result in 17, then the fake quantization returns 17.0. However, such functions are not differentiable. Therefore, tricks like using a Straight-Through Estimator for the gradients during the backward pass have to be applied.
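A minimal sketch of the quantize-then-dequantize form of fake quantization that is commonly used in QAT (the function name is my own). The forward pass snaps each value to its nearest representable quantization level while staying in floating point; during training, the Straight-Through Estimator treats the gradient of the rounding step as 1 inside the representable range.

```python
def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize, then immediately dequantize, staying in floating point.
    The rounding step has zero gradient almost everywhere, so QAT
    frameworks back-propagate through it with a straight-through
    estimator: d(fake_quantize)/dx is approximated as 1 inside the
    representable range."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return scale * (q - zero_point)

scale, zero_point = 5 / 255, 102  # the [-2, 3] range from the primer

print(fake_quantize(-1.0, scale, zero_point))   # ~-1.0 (a representable level)
print(fake_quantize(-0.99, scale, zero_point))  # snapped to the nearest level
print(fake_quantize(10.0, scale, zero_point))   # clamped to ~3.0
```

Training against these snapped values lets the model learn weights that are robust to the rounding error it will encounter after real quantization.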
Recent approaches like AdaQuant achieve better post-training quantization performance by quantizing the model layer by layer, selecting the optimal quantization parameters for each layer, and recalibrating the batch normalization layers after the quantization.
Experiments
I have tested these quantization concepts on the BERT and ResNet models from my previous article. For both models, I had to add a QuantStub and a DeQuantStub at the beginning and end of the model to quantize and dequantize the inputs and outputs. As I tried static quantization for both models, I also had to convert all arithmetic operators to their quantized-equivalent FloatFunctional operators. This was necessary because I was using the eager quantization mode. PyTorch is currently in the process of moving all architecture optimization work to Torch AO, which offers graph-based quantization that makes such manual changes redundant. Furthermore, for the ResNet model, I tried fusing all the convolutional, batch normalization, and ReLU layers. However, this resulted in a significant accuracy loss. Hence, I only fused the first two for all but the first layer.
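The eager-mode workflow can be sketched on a toy model. This is a minimal illustration of the ingredients named above, not my actual experiment code: a QuantStub/DeQuantStub pair at the boundaries, a FloatFunctional replacing a bare `+`, layer fusing, calibration with observers, and the final conversion. The tiny network and all its dimensions are made up for the example.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyConvNet(nn.Module):
    """A minimal stand-in for a ResNet-style block, prepared for
    eager-mode static quantization."""

    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # quantizes the float input
        self.conv = nn.Conv2d(3, 3, 3, padding=1)
        self.bn = nn.BatchNorm2d(3)
        self.relu = nn.ReLU()
        # FloatFunctional replaces the bare `+` of the residual add so
        # that it can operate on quantized tensors after conversion.
        self.skip_add = torch.ao.nn.quantized.FloatFunctional()
        self.dequant = tq.DeQuantStub()  # dequantizes the output

    def forward(self, x):
        x = self.quant(x)
        y = self.relu(self.bn(self.conv(x)))
        x = self.skip_add.add(x, y)
        return self.dequant(x)

model = TinyConvNet().eval()

# Fuse Conv + BatchNorm + ReLU into a single module before quantization.
model = tq.fuse_modules(model, [["conv", "bn", "relu"]])

# Pick whichever quantized CPU backend this build of PyTorch supports.
engine = ("fbgemm" if "fbgemm" in torch.backends.quantized.supported_engines
          else "qnnpack")
torch.backends.quantized.engine = engine
model.qconfig = tq.get_default_qconfig(engine)

# Insert observers, run a few calibration batches, then convert to int8.
prepared = tq.prepare(model)
with torch.no_grad():
    for _ in range(4):
        prepared(torch.randn(2, 3, 8, 8))
quantized = tq.convert(prepared)

with torch.no_grad():
    out = quantized(torch.randn(2, 3, 8, 8))
print(out.shape)  # torch.Size([2, 3, 8, 8])
```

The amount of manual model surgery in this sketch is exactly the overhead that the graph-based quantization in Torch AO is meant to remove.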
BERT
For BERT, I am just showing the results for dynamic quantization, as static quantization did not work. In the plot below, we can see that it resulted in about a 20% speed-up across all batch sizes.
These speed-ups are much smaller than expected, and the quantized model's predictions were not much more accurate than random guesses. The likely cause of both is a mistake I made in the eager conversion of BERT to 8-bit integers, as other work has shown more than a 3x speed-up.
ResNet
For ResNet, I have results for the static quantization with and without layer fusing. The speed-ups are larger than for BERT, with 39% and 34%, respectively, for a batch size of one.
Again, the speed-ups are smaller than expected and in no way proportional to the effort it took to set up the quantization of these models. At least for the ResNet model, quantization did not lead to a performance degradation.
Conclusion
PyTorch supports the quantization of models. However, using its eager implementation involves manually adapting the model for, in the worst case, minor performance gains. My personal recommendation is to first optimize the model's runtime as described in part one of this series. Only if the model still does not run fast enough with those optimizations would I consider quantizing it. Fortunately, quantization can be applied on top of them. The plot below shows, for example, how applying quantization on top of the IPEX optimization leads to extra performance gains.
If you are interested in applying these optimizations yourself, then take a look at my GitHub repository, which contains the code I used for my experiments: ML-Inference-Experiments.
More Efficient Inference Through Model Quantization © 2026 by Jeffrey Wigger is licensed under CC BY 4.0