EXP — 02 · Inference Engine · PyTorch · C++ · INT8 · SIMD
PyTorch to C++ inference:
building a neural net
from scratch
Train an MNIST classifier in PyTorch, export raw weights, then reconstruct the entire forward pass in bare-metal C++ — float32, int8, batch tiling, and SIMD acceleration.
97.36%
MNIST accuracy
Matched float → int8
3 layers
Network depth
784 → 128 → 64 → 10
float + int8
Precision modes
Both implemented
10–20×
SIMD speedup
Over naive int8
Motivation
Why rebuild inference in C++?
PyTorch is an exceptional training framework, but it brings enormous runtime overhead: Python interpreter, autograd graph, operator dispatch layers, memory allocators. For a tiny network running on a Raspberry Pi 5, this overhead dominates. The question this experiment asks is deceptively simple: if you already have the trained weights, how fast can you make inference if you own every line of code?
This experiment also serves as a natural continuation of EXP-01. We already know how to optimize matrix multiplication on this hardware. Now we need to wire that knowledge into a working inference engine — with real weights, real activations, and a real accuracy number to defend.
"A neural network is, at its core, a sequence of matrix multiplications and elementwise nonlinearities. Strip away the framework and that's exactly what you're left with."
01
The network — intentionally simple
Architecture design
The network is deliberately minimal. Three fully-connected layers with ReLU activations — no convolutions, no dropout, no batch normalization. The goal isn't state-of-the-art accuracy. The goal is a network simple enough to reimplement every operation from scratch in C++, while still being non-trivial to run efficiently.
Network architecture — forward pass
The network is trained in PyTorch on the full MNIST training set (60,000 images) and evaluated on the test set (10,000 images). Training converges quickly — this is not a hard problem for a network of this size. The purpose of training is purely to obtain a set of weights that we can then transplant into our C++ engine.
784Input dimsflattened 28×28
128 → 64Hidden layersReLU activated
10Output classesdigits 0–9
97.36%Test accuracyPyTorch baseline
02
Exporting weights from PyTorch
After training, we dump the weights and biases for each layer as raw binary files — plain floating-point values, no framework metadata, no computation graph. Each layer produces two files: a weight matrix and a bias vector.
import torch
import numpy as np
model.eval()
for name, param in model.named_parameters():
arr = param.detach().cpu().numpy()
arr.astype(np.float32).tofile(f"{name}.bin")
The weight files are the only artifact from PyTorch that crosses into the C++ world. Everything else — layer construction, activation functions, the forward pass itself — is reimplemented independently. This is intentional: we want to verify that the C++ engine produces identical results to PyTorch given the same weights.
03
The C++ inference engine — modular by design
Rather than writing a monolithic forward function, the engine is structured to mirror how frameworks like PyTorch actually work internally: separate classes for layers and activations, with operator overloading to compose them cleanly.
01
Constructor — dimensions + weight loading
The Layer takes input_dim, output_dim, and scaling statistics. load_params() dynamically analyzes parameter limits to compute quantization scales, mapping continuous weights directly into uniform 8-bit integer buffers.
02
matmul() and gemm() — structural comparison
matmul() maps single linear transformations serially. gemm() incorporates 6-nested blocking loops to orchestrate highly localized matrix multiplication tiles across dynamic batches.
03
add_bias() — precision scaling boundaries
Bias compensation is executed explicitly prior to writing back the final tensor blocks, preserving scale matching thresholds across different quantization domains.
04
operator() — localized lifecycle profiling
Calling layer(x) invokes core computing blocks wrapped directly within our ARM hardware profiler context, validating edge latencies safely during execution.
class Layer {
public:
Layer(int input_dim, int output_dim, string weights_filepath, string bias_filepath, float in_scale, float out_max_val)
: in_dim(input_dim), out_dim(output_dim), input_scale(in_scale) {
weights = load_params(weights_filepath, out_dim * in_dim, weight_scale);
bias = load_params(bias_filepath, out_dim, bias_scale);
output_scale = out_max_val / 127.0f;
out.resize(output_dim);
}
vector<int8_t> load_params(const string& filepath, size_t num_elements, float& scale_out) {
vector<float> data(num_elements);
vector<int8_t> data_i(num_elements);
ifstream file(filepath, ios::binary);
file.read(reinterpret_cast<char*>(data.data()), num_elements * sizeof(float));
float max_val = 0.0f;
for (float val : data) max_val = max(max_val, fabsf(val));
scale_out = max_val / 127.0f;
for (size_t i = 0; i < num_elements; i++) {
data_i[i] = (int8_t)round(data[i] / scale_out);
}
return data_i;
}
vector<int8_t> operator() (vector<int8_t>& x) {
assert(x.size() == in_dim && "Input size mismatch with layer");
profile.start();
gemm_neon(x);
profile.stop_and_print();
return out;
}
private:
int in_dim, out_dim;
vector<int8_t> weights, bias, out;
float input_scale, weight_scale, bias_scale, output_scale;
ARMCacheProfiler profile;
};
04
Float model — verifying correctness
The float inference engine is the ground truth. Before optimizing anything, we need to confirm that our C++ forward pass produces the same predictions as PyTorch on every test image. This is non-trivial: floating-point arithmetic is not exactly associative, and differences in operation order can produce subtly different results.
The matmul at this stage is the cache-friendly loop-reordered formulation from EXP-01: i → k → j order, accessing both input and weight matrices sequentially. For batch size 1, this is sufficient — each input uses the weight matrix exactly once, so cache reuse within a single inference call is good.
97.36%Float C++ accuracyMatches PyTorch exactly
float32Weight precision4 bytes per param
101,770Total parameters784×128 + 128×64 + 64×10
~400KBWeight footprintFits comfortably in L2
Achieving identical accuracy to PyTorch confirms the weight export, memory layout assumptions, and matmul correctness are all correct. Any deviation here would be a bug, not a tradeoff.
05
INT8 model — quantizing weights
The float model works. Now we go smaller. INT8 quantization replaces each 32-bit float weight with an 8-bit integer, cutting weight memory by 4× and — crucially — enabling integer SIMD instructions that can process more elements per cycle than their float equivalents.
The quantization scheme is simple: for each weight matrix, compute the scale factor as max(|W|) / 127, then quantize by rounding W / scale to int8. During inference, multiply the int8 inputs by int8 weights and accumulate into int32 accumulators, then rescale back to float before the activation.
void Layer::matmul(vector<int8_t>& X) {
for(int oi = 0; oi < out_dim; oi++) {
int32_t acc = 0;
int8_t* w_ptr = weights.data() + (oi * in_dim);
for(int ii = 0; ii < in_dim; ii++) {
acc += (int32_t)X[ii] * (int32_t)w_ptr[ii];
}
float val = (acc * input_scale * weight_scale) + (bias[oi] * bias_scale);
out[oi] = (int8_t)clamp((int)round(val / output_scale), -127, 127);
}
}
97.36%INT8 accuracyNo degradation
int8Weight precision1 byte per param
4×Memory reductionvs float32
~100KBINT8 weight footprintFits in L1+L2
For this particular network and dataset, INT8 quantization produces no accuracy loss. The weight values are well-distributed and the network is not operating near any precision boundaries. This is not always the case — larger networks or harder tasks often require careful quantization-aware training — but for MNIST it is essentially free.
06
Batch inference and the cache problem resurfaces
The single-input (batch size = 1) engine works well. The weight matrix is used once per inference call, and the cache-friendly access pattern keeps it warm throughout. But when we need to process many inputs — evaluating the full 10,000-image test set, for example — running them one at a time is wasteful.
The naive batched approach just loops the single-input engine: for each image, run the full forward pass. This works correctly, but it has a serious cache problem that EXP-01 taught us to recognize immediately.
Weight reuse problem in naive batch inference
Input 1 pass
weights loaded ✓
Input 2 pass
weights evicted → reload
Input N pass
weights evicted → reload
The problem: after processing input 1 through all three layers, the weight arrays have been streamed through the cache and the earliest-accessed elements are likely evicted. When input 2 arrives, the weights must be reloaded from DRAM. With a batch of 1000 images, this happens 1000 times.
The insight from EXP-01 applies directly: if weights are going to be reused across inputs, they should stay in cache across those inputs. This is exactly the tiling argument — the same logic that makes matrix tiling effective for square GEMM applies here to the batch × weights multiply.
07
Tiled batch matmul — keeping weights cache-resident
The fix is a batch-aware tiled matmul. Instead of processing one input at a time through the full weight matrix, we process a tile of inputs simultaneously — using the weight tile for BS inputs before evicting it. This is the same 6-loop blocking structure from EXP-01, adapted for the asymmetric case where the weight matrix is fixed and inputs vary.
void Layer::gemm(vector<int8_t>& X) {
int M = X.size() / in_dim, K = in_dim, N = out_dim;
int BS = 8;
int M_rounded = (M / BS) * BS, N_rounded = (N / BS) * BS, K_rounded = (K / BS) * BS;
for (int mm = 0; mm < M_rounded; mm += BS) {
for (int nn = 0; nn < N_rounded; nn += BS) {
int32_t accum[BS][BS] = {0};
for (int kk = 0; kk < K_rounded; kk += BS) {
for (int m = 0 ; m < BS; m++) {
for (int n = 0; n < BS; n++) {
for (int k = kk; k < kk+BS; k++) {
accum[m][n] += (int32_t)X[(m + mm) * K + k] * (int32_t)weights[(nn + n) * K + k];
}
}
}
}
The key observation: within a single tile iteration, the weight sub-block W[ko:ko+TILE_K][no:no+TILE_N] is reused BS times — once for each input in the batch tile. As long as BS × TILE_K × TILE_N fits in L1 or L2 alongside the input tile, the weights stay cache-resident throughout. No redundant DRAM loads.
This resolves the memory-bound constraint for batch inference. The improvement is most pronounced for large batches, where the naive approach's weight reloading cost compounds badly.
08
SIMD vectorization — processing 16 elements at once
With the cache problem solved, compute throughput is the next bottleneck. The INT8 path enables a particularly powerful optimization: ARM Neon's vdotq_s32 intrinsic computes the dot product of groups of 4 bytes simultaneously and accumulates them straight into 32-bit registers.
For the INT8 batch matmul inner loop, we vectorize across the K dimension — loading 16 weight bytes and 16 input bytes simultaneously. This maps perfectly onto 128-bit wide NEON registers.
for (int mm = 0; mm < M_rounded; mm += BS) {
for (int nn = 0; nn < N_rounded; nn += BS) {
int32x4_t accum[BS][BS];
for (int m = 0; m < BS; m++) {
for (int n = 0; n < BS; n++) accum[m][n] = vdupq_n_s32(0);
}
for (int kk = 0; kk < K_rounded; kk += 16) {
for (int m = 0; m < BS; m++) {
int8x16_t x_vec = vld1q_s8(&X[(mm + m) * K + kk]);
for (int n = 0; n < BS; n++) {
int8x16_t w_vec = vld1q_s8(&weights[(nn + n) * K + kk]);
accum[m][n] = vdotq_s32(accum[m][n], x_vec, w_vec);
}
}
}
16Elements / cycleINT8 SIMD width
10–20×Speedup vs naiveAvg across batch sizes
int32Accumulator typePrevents overflow
NeonISA extensionCortex-A76 native
The 10–20× speedup range reflects variation across batch sizes and layer dimensions. Larger layers (784×128) benefit more than smaller ones (64×10) because the vectorization overhead amortizes over more work. The speedup is also sensitive to alignment — ensuring weight arrays are 16-byte aligned eliminates unaligned load penalties.
The combination of tiled batch matmul (eliminating redundant weight loads) and SIMD vectorization (processing 16 elements per instruction) addresses both the memory-bound and compute-bound limitations simultaneously. Neither alone gets the full benefit.
09
End-to-end results — the full picture
| Engine variant | Precision | Batch support | Matmul strategy | Accuracy | Relative latency |
| Float BS=1 | float32 | Single input | Loop reorder | 97.36% | 1.0× (baseline) |
| INT8 BS=1 | int8 | Single input | Cache-friendly | 97.36% | ~0.8× |
| INT8 batched naive | int8 | N inputs | Serial loops | 97.36% | ~0.6× per input |
| INT8 tiled batch | int8 | N inputs | Tiled (cache-aware) | 97.36% | ~0.3× per input |
| INT8 tiled + SIMD | int8 | N inputs | Tiled + Neon | 97.36% | ~0.05–0.1× per input |
Accuracy is preserved at 97.36% across all variants. Every optimization is a pure latency improvement with no accuracy cost — a consequence of the INT8 quantization working well for this task and the tiling/SIMD changes being mathematically equivalent to the unoptimized code.
Conclusions
What this experiment teaches us
Lesson 01
Frameworks are layers of abstraction over simple math
The entire forward pass of this network is three matrix multiplications and two elementwise max operations. Stripping the framework reveals how little is actually happening — and how much room there is to optimize it.
Lesson 02
The cache problem compounds across batch dimensions
A single-input engine that looks cache-friendly can be cache-hostile at batch scale. The batch dimension changes which data is reused and how — a separate analysis is needed for every access pattern change.
Lesson 03
INT8 quantization is often free for small networks
For MNIST, quantization produced zero accuracy loss. The weight distributions were well-behaved and the task not precision-sensitive. This doesn't generalize to all tasks, but it's a strong starting point assumption.
Lesson 04
SIMD and tiling are multiplicative, not additive
Tiling alone resolved the memory bottleneck. SIMD alone improved throughput. Together they address separate constraints — memory bandwidth and compute throughput — and their benefits compound rather than overlap.
The 10–20× SIMD improvement is significant, but the most important result is architectural: a fully working inference engine that owns every line of code, verifiably matches PyTorch's accuracy, and runs on hardware with no OS, no runtime, and no framework overhead.