More Efficient Inference on CPUs
A comprehensive comparison of several methods to speed up inference of ML models on CPUs.
Machine learning is an expensive endeavor, with costs accruing during the entire life cycle of a model. First, there is the cost-intensive training process, which involves not only the final training run but also iterating on the model architecture, the data pipeline, and the hyperparameters. Second, once trained, the model is deployed to production and used for inference workloads. Unlike the cost of training, which is a one-time expenditure, this second stage incurs ongoing costs for the entirety of the model's post-training lifecycle. Hence, it is essential to make model inference as resource-efficient as possible. One such optimization is to run smaller models on CPUs instead of on much more expensive GPUs. In this article, I will examine several solutions that promise to make running ML models on CPUs more feasible.
Why Compute Graphs Matter
Early machine learning frameworks, such as TensorFlow, had users specify their models as graphs. These graphs were subsequently compiled and optimized into code that executed independently of the Python code that specified them. Consequently, it was challenging to debug and iterate on the model architecture. PyTorch chose a different approach: its models are executed eagerly, i.e., exactly as they are expressed in Python. They run line by line like a regular Python program, which allows data scientists to debug their models easily by inspecting them after each line of code. It is therefore not surprising that PyTorch, with its eager mode, became the most widely used machine learning framework, as it facilitated faster iteration.
A significant drawback of eager models is that they are more difficult to optimize for both training and inference. However, PyTorch came up with ways of turning eager models into compute graphs that can be optimized, for example, by fusing several operations into a single one. One such approach is JIT-compiling the model through tracing. Unfortunately, tracing cannot capture the model's control flow, such as if statements that depend on the outputs of intermediate layers. Alternatively, one can rewrite the model in TorchScript, a subset of Python that is just-in-time compilable and can be converted into a graph representation of the model. Both solutions have their drawbacks, as rewriting complex models for TorchScript or tracing is time-consuming and prone to bugs. With PyTorch 2.0, a third solution, torch compile, was introduced, which uses TorchDynamo to capture the compute graph by hooking into CPython's bytecode interpretation. It is not perfect, but it is easier to use than the previous solutions, and it is the first optimization we take a look at.
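To illustrate the difference, here is a minimal sketch with a hypothetical toy module (not one of the models used in the experiments) that shows why tracing struggles with data-dependent control flow, while torch compile simply falls back to eager execution for the parts it cannot capture:

```python
import torch
import torch.nn as nn

class GatedModel(nn.Module):
    # Hypothetical toy model with data-dependent control flow.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 16)
        self.fc2 = nn.Linear(16, 16)

    def forward(self, x):
        h = self.fc1(x)
        # This branch depends on an intermediate result; tracing will
        # only record the path taken for the example input.
        if h.mean() > 0:
            return self.fc2(h)
        return h

model = GatedModel().eval()
example = torch.randn(2, 16)

# Tracing bakes in the branch taken for `example` (and warns that the
# trace may not generalize to other inputs).
traced = torch.jit.trace(model, example)

# torch.compile uses TorchDynamo to capture graphs from the bytecode;
# anything it cannot capture (like this branch) falls back to eager Python.
compiled = torch.compile(model)

with torch.inference_mode():
    print(traced(example).shape, compiled(example).shape)
```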
Experimental Set-Up
The different inference solutions for CPUs were compared using two models: a pretrained ResNet-101 and a BERT model for sequence classification, which was fine-tuned according to this Hugging Face tutorial. The models were chosen because they cover the convolutional and transformer architectures; both are widely used and should therefore serve as representative examples. For all experiments, the inference speed of the ResNet model was measured on 3200 samples from the Imagenette dataset, and the performance of the BERT model was evaluated on 1000 samples of the IMDB dataset, which was also used to adapt the BERT model to the sequence classification task.
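The timing itself boils down to a simple loop over pre-batched inputs. The helper below is a hypothetical sketch of that approach (written for image tensors), not the exact code from my repository:

```python
import time
import torch

def benchmark(model, batches):
    """Hypothetical timing helper: runs the model over pre-built batches
    and returns the total wall-clock inference time in seconds."""
    model.eval()
    with torch.inference_mode():
        # Warm-up pass so that one-time compilation overhead is not measured.
        model(batches[0])
        start = time.perf_counter()
        for batch in batches:
            model(batch)
        return time.perf_counter() - start

# Usage sketch: `batches` would be e.g. the 3200 Imagenette images grouped
# into tensors of shape (batch_size, 3, 224, 224).
```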
Torch Compile and Export
With torch compile and torch export, PyTorch offers two convenient ways to optimize a model. Both rely on TorchDynamo to capture the model's compute graph, but while torch compile will run the parts it could not capture as eager Python code, torch export requires the full graph to be captured. Furthermore, torch export fully represents the model with PyTorch's efficient ATen operations. In my experiments, I was not able to export the BERT model, even after specifying its dynamic shapes, which are required for the exported model to work with different input batch sizes. Fortunately, these problems did not occur for the ResNet model, which I was able to export without any issues.
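For reference, here is roughly what both paths look like for the ResNet model; the dynamic batch dimension shown for torch export is the kind of specification I also attempted, without success, for BERT. Exact bounds and argument names may need adjusting for your PyTorch version:

```python
import torch
from torch.export import Dim, export
import torchvision.models as models

model = models.resnet101(weights="IMAGENET1K_V1").eval()
example = torch.randn(4, 3, 224, 224)

# torch compile: graphs TorchDynamo cannot capture fall back to eager Python.
compiled = torch.compile(model)

# torch export: the full graph must be captured; marking the batch
# dimension as dynamic lets the exported program accept other batch sizes.
batch = Dim("batch")
exported = export(model, (example,), dynamic_shapes={"x": {0: batch}})
exported_module = exported.module()

with torch.inference_mode():
    out_compiled = compiled(example)
    out_exported = exported_module(torch.randn(8, 3, 224, 224))
```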
For ResNet, both the compiled and exported models outperformed the non-optimized model for a batch size of one. However, only the exported model also did so for larger batch sizes. Nonetheless, all three variants benefited from higher batch sizes, likely due to more efficient usage of caches for the convolution operations. It is also important to note that torch compile was principally designed to speed up training, not inference, so these results are not surprising.
As mentioned above, I was only able to compile BERT, not export it. For BERT, compilation did not lead to any speed-up, even for a batch size of one. Unlike for ResNet, there is also no benefit in using larger batch sizes on CPUs. On the contrary, the runtimes increased significantly, which is likely due to shorter sentences getting padded to match the longest sentence of the batch. Thus, we may conclude that batching by itself is not suited to speeding up ML workloads on CPUs, unlike on GPUs, where data transfers often dominate compute time.
For both models, compilation had no or negligible impact on their accuracy. Hence, the speed-ups for ResNet come without drawbacks in model performance.
OpenVINO
OpenVINO is a toolkit developed by Intel that optimizes models for its CPU and GPU architectures. Models can either be provided in a framework-agnostic format like ONNX or converted from popular ML frameworks such as PyTorch, TensorFlow, or JAX. For PyTorch, the conversion can be done with OpenVINO's conversion function or by using the OpenVINO backend of torch compile, as sketched below.
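Assuming the openvino package is installed, both conversion routes look roughly like this (details may vary between OpenVINO releases):

```python
import torch
import torchvision.models as models
import openvino as ov
import openvino.torch  # registers the "openvino" backend for torch.compile

model = models.resnet101(weights="IMAGENET1K_V1").eval()
example = torch.randn(1, 3, 224, 224)

# Route 1: convert the PyTorch model with the OpenVINO SDK and compile
# it for the CPU device.
ov_model = ov.convert_model(model, example_input=example)
compiled_ov = ov.compile_model(ov_model, "CPU")
result = compiled_ov(example.numpy())[0]

# Route 2: keep the PyTorch API and let torch compile delegate the
# captured graph to OpenVINO.
compiled_torch = torch.compile(model, backend="openvino")
with torch.inference_mode():
    out = compiled_torch(example)
```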
For ResNet, both the model optimized with the OpenVINO SDK directly and with the OpenVINO backend for torch compile outperformed the baseline. The former provided a more than 100% speed-up for a batch size of one and more than 60% for larger batch sizes. The speed-ups of the compiled version are less impressive, ranging from 17 to 36 percent.
For BERT, I was not able to export the model with the OpenVINO SDK. I could only compile it with the OpenVINO PyTorch backend, for which I observed a modest 8% performance boost for a batch size of one and no performance gains for larger batch sizes.
Neither the optimization with the OpenVINO SDK nor the torch compilation had any impact on the accuracy of the models. Therefore, we have another performance-neutral optimization strategy.
IPEX
The Intel Extension for PyTorch, IPEX, is another Intel open-source project that optimizes PyTorch models by taking advantage of specialized processor instructions for vectorization and other machine-learning-oriented features. These modifications are applied through the IPEX backend in torch compile. In addition, it is possible to apply further performance improvements, such as layer fusions and custom optimized layers, with the optimize function provided by the IPEX SDK. For my experiments, only the default optimizations were applied, roughly as sketched below; the results are shown in the following plots.
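Assuming intel_extension_for_pytorch is installed, applying the default optimizations is a one-liner for each route:

```python
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex  # also registers the "ipex" backend

model = models.resnet101(weights="IMAGENET1K_V1").eval()
example = torch.randn(1, 3, 224, 224)

# Route 1: default IPEX optimizations applied via the SDK.
optimized = ipex.optimize(model)

# Route 2: route the graph captured by torch compile through the IPEX backend.
compiled = torch.compile(model, backend="ipex")

with torch.inference_mode():
    out_optimized = optimized(example)
    out_compiled = compiled(example)
```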
Once again, everything works well for ResNet, leading to clear gains over the baseline for all batch sizes. The performance improvements increase with the batch size and reach 42% for a batch size of 8.
Unfortunately, as in the previous experiments, this optimization also has no impact on the performance of the BERT model. The runtimes remain almost identical to the baseline for all batch sizes.
ONNX
The creation of ONNX models is supported by most major ML frameworks, such as PyTorch, TensorFlow, and Scikit-Learn. However, it is not always possible to export a model built with these frameworks to ONNX, as this article for Scikit-Learn shows. If an export is not possible, one can fall back on writing custom operators, provided the runtime used supports them. This, together with the difficulty of specifying the model's dynamic tensors, i.e., tensors whose dimensions may vary, such as the batch size of the input, can make ONNX inconvenient to use for some models.
For my experiments, I was able to export the ResNet model via the TorchDynamo backend for ONNX. Yet, the same approach did not work for the BERT model; instead, I had to fall back on Hugging Face's Optimum framework to get an ONNX version of it. These are further challenges that make ONNX difficult to adopt as a standard in a corporate setting. Nonetheless, if one would like to do so, data science teams would have to guarantee that their models are convertible, either by limiting themselves to supported model types or by writing custom operators.
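A hedged sketch of the two export routes follows; the BERT checkpoint path is a placeholder for the fine-tuned model, and older PyTorch versions expose the Dynamo-based exporter as torch.onnx.dynamo_export instead of the dynamo flag:

```python
import torch
import torchvision.models as models
from optimum.onnxruntime import ORTModelForSequenceClassification

# ResNet: export through the TorchDynamo-based ONNX exporter.
resnet = models.resnet101(weights="IMAGENET1K_V1").eval()
example = torch.randn(1, 3, 224, 224)
torch.onnx.export(resnet, (example,), "resnet101.onnx", dynamo=True)

# BERT: fall back on Hugging Face's Optimum, which handles the ONNX export.
# "path/to/finetuned-bert" is a placeholder for the fine-tuned checkpoint.
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-bert", export=True
)
ort_model.save_pretrained("bert-onnx")
```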
As ONNX is just a format for storing the model, a dedicated runtime is needed to run it. A popular choice, which I used for my experiments, is Microsoft's ONNX Runtime. It is capable of running ONNX models on various CPU, GPU, and ML-optimized hardware platforms.
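Running the exported model with ONNX Runtime then only takes a few lines; the input name is read from the session rather than hard-coded, since it depends on the exporter:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("resnet101.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name  # name assigned by the exporter
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
logits = session.run(None, {input_name: batch})[0]
```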
The performance gain for ResNet is quite interesting and differs from what we have seen so far. For a batch size of one, the model run through the ONNX Runtime is almost as fast as the OpenVINO model. However, unlike with OpenVINO, there is almost no additional performance increase for larger batch sizes.
The BERT results are also intriguing. We see a small performance gain over the baseline for a batch size of one, but for larger batch sizes the ONNX Runtime falls behind completely, making inference much slower than the baseline. As a small bonus, I have also included the results for the BERT ONNX model run with OpenVINO. For BERT, it is the only set-up with significant speed-ups for a batch size of one and the only method that provides small gains for larger batch sizes.
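For completeness, feeding an ONNX file to OpenVINO directly can be sketched as follows; the file path and input names are placeholders that depend on the exported graph (some exports also require token_type_ids):

```python
import numpy as np
import openvino as ov

core = ov.Core()
# OpenVINO reads ONNX files directly; no separate conversion step is needed.
model = core.read_model("bert-onnx/model.onnx")
compiled = core.compile_model(model, "CPU")

# Typical inputs of a BERT sequence-classification export.
inputs = {
    "input_ids": np.random.randint(0, 30522, (1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}
logits = compiled(inputs)[0]
```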
The experiments show that the ONNX Runtime's primary strength lies in inference with a batch size of one. For larger batch sizes, either running the model without optimization or using methods like OpenVINO performs significantly better.
Conclusion
Optimizing models for inference on CPUs is not as straightforward as one would hope. While there are many methods, some of which were presented in this article, any of them may be difficult or impossible to configure for your model. Many require a good understanding of the model, for example to specify its dynamic shapes. In conclusion, OpenVINO was the most performant solution for both models, especially when employed in conjunction with an ONNX model. However, OpenVINO is built for Intel CPUs; if you run your models on anything else, your best bet might be ONNX, as long as you can keep your batch size at one. Lastly, to end on a high note, none of the methods had more than an insignificant impact on model accuracy.
If you are interested in applying these optimizations yourself, then take a look at my GitHub repository, which contains the code I used for my experiments: ML-Inference-Experiments.
More Efficient Inference on CPUs © 2025 by Jeffrey Wigger is licensed under CC BY 4.0