EXP — 01 · Systems Performance · Raspberry Pi 5 · ARM Neon SIMD

Taming the cache:
a journey through
GEMM optimization

From 45% cache miss catastrophe to SIMD-accelerated efficiency — every decision documented.

98.7s Naive latency Baseline — starting point

2.81s Best latency 35× faster

45.3% Cache miss: start Nearly half all accesses

1.20% Cache miss: tiled 38× improvement

Introduction

Why GEMM exposes everything wrong with your memory access pattern

General Matrix Multiplication (GEMM) — computing C = A × B — is among the most fundamental operations in computing. It drives deep learning inference, scientific simulation, graphics pipelines, and signal processing. But despite its apparent simplicity, a naive implementation on modern hardware can be devastatingly slow.

The culprit is the memory hierarchy. Modern processors are fast; their caches are small; DRAM is slow. A naive GEMM thrashes the cache spectacularly. This post documents exactly what happens when you take that naive code and systematically fix it — measured on a Raspberry Pi 5 with ARM Cortex-A76 cores, using hardware performance counters and Neon SIMD intrinsics.

"The memory wall is real. Every algorithm you write is negotiating with the cache — whether you know it or not."

We explored five distinct approaches: naive, loop-reordered, scalar tiling at four block sizes, and SIMD-accelerated tiling at the same four block sizes. The results tell a clear, quantitative story.

Naive GEMM — a cache demolition derby

The starting point

The straightforward three-loop formulation iterates i over rows of A, j over columns of B, and k over the shared dimension. The innermost access pattern for B is B[k][j] — stepping through an entire column, one element per row. In row-major memory layout, this is the worst possible access: every element is a full row-stride apart, virtually guaranteed to be a cache miss.

// Naive: i → j → k (column-major access on B)
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j]; // ← cache hostile!

98.7sLatencyBaseline

45.3%Cache Miss RateNearly 1 in 2 accesses

0.218IPCCPU starved for data

9298MCache MissesBillions of misses

With a 45% cache miss rate, the processor spends most of its time waiting on DRAM. The IPC of 0.218 — barely a fifth of an instruction per cycle — tells the story: the CPU is almost entirely stalled. At 45% miss rate, the Cortex-A76's out-of-order engine has almost nothing to speculatively execute. The memory bus is the bottleneck, not the arithmetic units.

Loop reordering — the 29× fix that took one minute

The fix is almost offensively simple: swap the j and k loops. This changes the innermost access to B[k][j] traversed row-by-row — sequential memory — instead of column-by-column. The cache prefetcher can now do its job.

// Reordered: i → k → j (row-major access on B)
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      C[i][j] += A[i][k] * B[k][j]; // ← now cache friendly!

3.40sLatency29× faster than naive

4.03%Cache Miss RateFrom 45.3% — 11× lower

1.192IPC5.5× more efficient

173MCache Misses54× fewer misses

This single change — reordering two lines — produced a 29× speedup and an 11× improvement in cache hit rate. It's the clearest demonstration that algorithmic memory access patterns matter more than microarchitectural tricks. But 4% miss rate still leaves room for improvement.

Cache-aware tiling — blocking for the L1/L2 hierarchy

Loop reordering helped considerably, but the update C[i][j] still scatters writes across a large matrix. Tiling (blocking) constrains the working set to fit within the cache at any moment — the core idea is to process small B×B sub-matrices at a time, so all three matrices stay hot in cache during computation of that tile.

// Tiled: 6 nested loops — tiles then elements
for (ii = 0; ii < N; ii += BS)
  for (kk = 0; kk < N; kk += BS)
    for (jj = 0; jj < N; jj += BS)
      for (i = ii; i < ii+BS; i++)
        for (k = kk; k < kk+BS; k++)
          for (j = jj; j < jj+BS; j++)
            C[i][j] += A[i][k] * B[k][j];

We tested four block sizes (BS = 4, 8, 16, 32) to understand how different tile sizes interact with the L1 and L2 cache sizes on the Cortex-A76.

Scalar tiling — latency by block size (lower is better)

BS = 4

8.27s · 1.2% miss

BS = 8

5.20s · 1.9% miss

BS = 16

4.30s · 8.6% miss

BS = 32

2.82s · 8.5% miss

The pattern is striking: smaller block sizes (BS=4, BS=8) achieve dramatically lower cache miss rates (1.2%–1.9%) because the tile fits comfortably in L1. Larger blocks (BS=16, BS=32) see miss rates creep back to 8–9% as tiles spill into L2, yet still run faster — because arithmetic intensity improvement outweighs the cache penalty. BS=32 at 2.82s represents a 35× speedup over naive.

At BS=4, the cache miss rate drops to 1.2% — 38× better than naive. The working set (3 tiles of 4×4 floats = 192 bytes) fits entirely in L1 with room to spare.

NEON SIMD vectorization — exploiting the Cortex-A76's vector engine

Tiling improved cache behavior. But each multiply-accumulate still operated on single floats. The Raspberry Pi 5's Cortex-A76 cores have 128-bit Neon SIMD units capable of processing 4 single-precision floats simultaneously. Adding SIMD to our tiled loop should multiply throughput — if we can keep the units fed.

// Neon SIMD inner loop (BS=8 example)
float32x4_t va = vdupq_n_f32(A[i][k]);
float32x4_t vb = vld1q_f32(&B[k][j]);
float32x4_t vc = vld1q_f32(&C[i][j]);
vc = vfmaq_f32(vc, va, vb); // fused multiply-add × 4
vst1q_f32(&C[i][j], vc);

Method	Latency (s)	IPC	Miss Rate	vs Naive
Naive	98.73	0.218	45.25%	—
Loop Reorder	3.40	1.192	4.03%	29× faster
Scalar BS=4	8.27	3.489	1.20%	12× faster
Scalar BS=32	2.82	2.972	8.50%	35× faster
SIMD BS=4	5.06	3.489	1.55%	20× faster
SIMD BS=16	3.92	2.325	8.66%	25× faster

SIMD brings consistent latency improvements across BS=4, 8, and 16. Interestingly, at BS=32 the SIMD version is marginally slower — likely due to increased register pressure and slightly worse cache behavior from SIMD loads that span cache line boundaries at that tile size.

Conclusions

What the data actually teaches us

Lesson 01

Memory access patterns are the first optimization frontier

A single loop reorder delivered a 29× speedup. No compiler flag, no hardware trick, no clever math. Just fixing the access order. This is the most dramatic result in the entire experiment.

Lesson 02

Cache miss rate and latency are not the same thing

BS=4 scalar achieves the best miss rate (1.20%) but is slower than BS=32 scalar (2.82s). Larger tiles amortize loop overhead even as they stress cache more. Optimizing for miss rate alone is incomplete.

Lesson 03

SIMD is powerful but not unconditional

SIMD helps at BS=4/8/16 but is counterproductive at BS=32. Vector loads and register pressure interact with tile granularity — block size and SIMD width must be tuned together.

Lesson 04

IPC is a signal, not a goal

SIMD's wide operations make each "instruction" count more work, so IPC comparisons across scalar/SIMD are apples-and-oranges. Always look at wall-clock latency.

The journey from 98.7 seconds to under 3 is almost entirely a story of feeding the cache correctly. The single largest gain came from a two-line fix that made the memory access pattern sequential. Modern hardware is extraordinary — but only if you let it work with the cache, not against it.

EXP — 02 · Inference Engine · PyTorch · C++ · INT8 · SIMD

PyTorch to C++ inference:
building a neural net
from scratch

Train an MNIST classifier in PyTorch, export raw weights, then reconstruct the entire forward pass in bare-metal C++ — float32, int8, batch tiling, and SIMD acceleration.

97.36% MNIST accuracy Matched float → int8

3 layers Network depth 784 → 128 → 64 → 10

float + int8 Precision modes Both implemented

10–20× SIMD speedup Over naive int8

Motivation

Why rebuild inference in C++?

PyTorch is an exceptional training framework, but it brings enormous runtime overhead: Python interpreter, autograd graph, operator dispatch layers, memory allocators. For a tiny network running on a Raspberry Pi 5, this overhead dominates. The question this experiment asks is deceptively simple: if you already have the trained weights, how fast can you make inference if you own every line of code?

This experiment also serves as a natural continuation of EXP-01. We already know how to optimize matrix multiplication on this hardware. Now we need to wire that knowledge into a working inference engine — with real weights, real activations, and a real accuracy number to defend.

"A neural network is, at its core, a sequence of matrix multiplications and elementwise nonlinearities. Strip away the framework and that's exactly what you're left with."

The network — intentionally simple

Architecture design

The network is deliberately minimal. Three fully-connected layers with ReLU activations — no convolutions, no dropout, no batch normalization. The goal isn't state-of-the-art accuracy. The goal is a network simple enough to reimplement every operation from scratch in C++, while still being non-trivial to run efficiently.

Network architecture — forward pass

Input
784

28×28
pixels

→

784 × 128

Linear

→

ReLU

→

128 × 64

Linear

→

ReLU

→

64 × 10

Linear

→

Logits
10

argmax

The network is trained in PyTorch on the full MNIST training set (60,000 images) and evaluated on the test set (10,000 images). Training converges quickly — this is not a hard problem for a network of this size. The purpose of training is purely to obtain a set of weights that we can then transplant into our C++ engine.

784Input dimsflattened 28×28

128 → 64Hidden layersReLU activated

10Output classesdigits 0–9

97.36%Test accuracyPyTorch baseline

Exporting weights from PyTorch

After training, we dump the weights and biases for each layer as raw binary files — plain floating-point values, no framework metadata, no computation graph. Each layer produces two files: a weight matrix and a bias vector.

# PyTorch export — strip the network to raw floats
import torch
import numpy as np

model.eval()
for name, param in model.named_parameters():
    arr = param.detach().cpu().numpy()
    arr.astype(np.float32).tofile(f"{name}.bin")
    # Produces: fc1.weight.bin, fc1.bias.bin,
    #           fc2.weight.bin, fc2.bias.bin,
    #           fc3.weight.bin, fc3.bias.bin

The weight files are the only artifact from PyTorch that crosses into the C++ world. Everything else — layer construction, activation functions, the forward pass itself — is reimplemented independently. This is intentional: we want to verify that the C++ engine produces identical results to PyTorch given the same weights.

The C++ inference engine — modular by design

Rather than writing a monolithic forward function, the engine is structured to mirror how frameworks like PyTorch actually work internally: separate classes for layers and activations, with operator overloading to compose them cleanly.

Constructor — dimensions + weight loading

The Layer takes input_dim, output_dim, and scaling statistics. load_params() dynamically analyzes parameter limits to compute quantization scales, mapping continuous weights directly into uniform 8-bit integer buffers.

matmul() and gemm() — structural comparison

matmul() maps single linear transformations serially. gemm() incorporates 6-nested blocking loops to orchestrate highly localized matrix multiplication tiles across dynamic batches.

add_bias() — precision scaling boundaries

Bias compensation is executed explicitly prior to writing back the final tensor blocks, preserving scale matching thresholds across different quantization domains.

operator() — localized lifecycle profiling

Calling layer(x) invokes core computing blocks wrapped directly within our ARM hardware profiler context, validating edge latencies safely during execution.

// Modular Layer Architecture with Fixed-Width Quantization Scales
class Layer {
public:
    Layer(int input_dim, int output_dim, string weights_filepath, string bias_filepath, float in_scale, float out_max_val) 
        : in_dim(input_dim), out_dim(output_dim), input_scale(in_scale) {
        
        weights = load_params(weights_filepath, out_dim * in_dim, weight_scale);
        bias = load_params(bias_filepath, out_dim, bias_scale);
        output_scale = out_max_val / 127.0f;
        out.resize(output_dim);
    }

    vector<int8_t> load_params(const string& filepath, size_t num_elements, float& scale_out) {
        vector<float> data(num_elements);
        vector<int8_t> data_i(num_elements);
        ifstream file(filepath, ios::binary);
        file.read(reinterpret_cast<char*>(data.data()), num_elements * sizeof(float));

        float max_val = 0.0f;
        for (float val : data) max_val = max(max_val, fabsf(val));
        scale_out = max_val / 127.0f;

        for (size_t i = 0; i < num_elements; i++) {
            data_i[i] = (int8_t)round(data[i] / scale_out);
        }
        return data_i;
    }

    vector<int8_t> operator() (vector<int8_t>& x) {
        assert(x.size() == in_dim && "Input size mismatch with layer");
        profile.start();
        gemm_neon(x);
        profile.stop_and_print();
        return out;
    }

private:
    int in_dim, out_dim;
    vector<int8_t> weights, bias, out;
    float input_scale, weight_scale, bias_scale, output_scale;
    ARMCacheProfiler profile;
};

Float model — verifying correctness

The float inference engine is the ground truth. Before optimizing anything, we need to confirm that our C++ forward pass produces the same predictions as PyTorch on every test image. This is non-trivial: floating-point arithmetic is not exactly associative, and differences in operation order can produce subtly different results.

The matmul at this stage is the cache-friendly loop-reordered formulation from EXP-01: i → k → j order, accessing both input and weight matrices sequentially. For batch size 1, this is sufficient — each input uses the weight matrix exactly once, so cache reuse within a single inference call is good.

97.36%Float C++ accuracyMatches PyTorch exactly

float32Weight precision4 bytes per param

101,770Total parameters784×128 + 128×64 + 64×10

~400KBWeight footprintFits comfortably in L2

Achieving identical accuracy to PyTorch confirms the weight export, memory layout assumptions, and matmul correctness are all correct. Any deviation here would be a bug, not a tradeoff.

INT8 model — quantizing weights

The float model works. Now we go smaller. INT8 quantization replaces each 32-bit float weight with an 8-bit integer, cutting weight memory by 4× and — crucially — enabling integer SIMD instructions that can process more elements per cycle than their float equivalents.

The quantization scheme is simple: for each weight matrix, compute the scale factor as max(|W|) / 127, then quantize by rounding W / scale to int8. During inference, multiply the int8 inputs by int8 weights and accumulate into int32 accumulators, then rescale back to float before the activation.

// Quantized Sequential Reduction Logic Inside C++ Core Engine
void Layer::matmul(vector<int8_t>& X) {
    for(int oi = 0; oi < out_dim; oi++) {
        int32_t acc = 0;
        int8_t* w_ptr = weights.data() + (oi * in_dim);
        for(int ii = 0; ii < in_dim; ii++) {
            acc += (int32_t)X[ii] * (int32_t)w_ptr[ii];
        }
        
        // Dequantize and compensate bias values linearly
        float val = (acc * input_scale * weight_scale) + (bias[oi] * bias_scale);
        
        // Clamp structural boundaries safely back to 8-bit limits
        out[oi] = (int8_t)clamp((int)round(val / output_scale), -127, 127);
    }
}

97.36%INT8 accuracyNo degradation

int8Weight precision1 byte per param

4×Memory reductionvs float32

~100KBINT8 weight footprintFits in L1+L2

For this particular network and dataset, INT8 quantization produces no accuracy loss. The weight values are well-distributed and the network is not operating near any precision boundaries. This is not always the case — larger networks or harder tasks often require careful quantization-aware training — but for MNIST it is essentially free.

Batch inference and the cache problem resurfaces

The single-input (batch size = 1) engine works well. The weight matrix is used once per inference call, and the cache-friendly access pattern keeps it warm throughout. But when we need to process many inputs — evaluating the full 10,000-image test set, for example — running them one at a time is wasteful.

The naive batched approach just loops the single-input engine: for each image, run the full forward pass. This works correctly, but it has a serious cache problem that EXP-01 taught us to recognize immediately.

Weight reuse problem in naive batch inference

Input 1 pass

weights loaded ✓

Input 2 pass

weights evicted → reload

Input N pass

weights evicted → reload

The problem: after processing input 1 through all three layers, the weight arrays have been streamed through the cache and the earliest-accessed elements are likely evicted. When input 2 arrives, the weights must be reloaded from DRAM. With a batch of 1000 images, this happens 1000 times.

The insight from EXP-01 applies directly: if weights are going to be reused across inputs, they should stay in cache across those inputs. This is exactly the tiling argument — the same logic that makes matrix tiling effective for square GEMM applies here to the batch × weights multiply.

Tiled batch matmul — keeping weights cache-resident

The fix is a batch-aware tiled matmul. Instead of processing one input at a time through the full weight matrix, we process a tile of inputs simultaneously — using the weight tile for BS inputs before evicting it. This is the same 6-loop blocking structure from EXP-01, adapted for the asymmetric case where the weight matrix is fixed and inputs vary.

// 6-Nested Cache-Aware Loop Tiling Core Implementation
void Layer::gemm(vector<int8_t>& X) {
    int M = X.size() / in_dim, K = in_dim, N = out_dim;
    int BS = 8; // Block Size targeting local cache boundary thresholds
    int M_rounded = (M / BS) * BS, N_rounded = (N / BS) * BS, K_rounded = (K / BS) * BS;

    for (int mm = 0; mm < M_rounded; mm += BS) {
        for (int nn = 0; nn < N_rounded; nn += BS) {
            int32_t accum[BS][BS] = {0};
            for (int kk = 0; kk < K_rounded; kk += BS) {
                for (int m = 0 ; m < BS; m++) {
                    for (int n = 0; n < BS; n++) {
                        for (int k = kk; k < kk+BS; k++) {
                            accum[m][n] += (int32_t)X[(m + mm) * K + k] * (int32_t)weights[(nn + n) * K + k];
                        }
                    }
                }
            }
            // Post-processing, bias inclusion, and quantization steps follow below...

The key observation: within a single tile iteration, the weight sub-block W[ko:ko+TILE_K][no:no+TILE_N] is reused BS times — once for each input in the batch tile. As long as BS × TILE_K × TILE_N fits in L1 or L2 alongside the input tile, the weights stay cache-resident throughout. No redundant DRAM loads.

This resolves the memory-bound constraint for batch inference. The improvement is most pronounced for large batches, where the naive approach's weight reloading cost compounds badly.

SIMD vectorization — processing 16 elements at once

With the cache problem solved, compute throughput is the next bottleneck. The INT8 path enables a particularly powerful optimization: ARM Neon's vdotq_s32 intrinsic computes the dot product of groups of 4 bytes simultaneously and accumulates them straight into 32-bit registers.

For the INT8 batch matmul inner loop, we vectorize across the K dimension — loading 16 weight bytes and 16 input bytes simultaneously. This maps perfectly onto 128-bit wide NEON registers.

// Vectorized Fast Path Loop with ARMv8.2-A Quad-Lane Dot Product Hardware Accelerators
for (int mm = 0; mm < M_rounded; mm += BS) {
    for (int nn = 0; nn < N_rounded; nn += BS) {
        int32x4_t accum[BS][BS];
        for (int m = 0; m < BS; m++) {
            for (int n = 0; n < BS; n++) accum[m][n] = vdupq_n_s32(0);
        }

        for (int kk = 0; kk < K_rounded; kk += 16) {
            for (int m = 0; m < BS; m++) {
                int8x16_t x_vec = vld1q_s8(&X[(mm + m) * K + kk]); // Stream 16 input bytes
                for (int n = 0; n < BS; n++) {
                    int8x16_t w_vec = vld1q_s8(&weights[(nn + n) * K + kk]); // Stream 16 weight bytes
                    // Execute 4 parallel multiply-accumulations within single clock cycles
                    accum[m][n] = vdotq_s32(accum[m][n], x_vec, w_vec);
                }
            }
        }
        // Horizontal register reductions and residual handling parameters continue below...

16Elements / cycleINT8 SIMD width

10–20×Speedup vs naiveAvg across batch sizes

int32Accumulator typePrevents overflow

NeonISA extensionCortex-A76 native

The 10–20× speedup range reflects variation across batch sizes and layer dimensions. Larger layers (784×128) benefit more than smaller ones (64×10) because the vectorization overhead amortizes over more work. The speedup is also sensitive to alignment — ensuring weight arrays are 16-byte aligned eliminates unaligned load penalties.

The combination of tiled batch matmul (eliminating redundant weight loads) and SIMD vectorization (processing 16 elements per instruction) addresses both the memory-bound and compute-bound limitations simultaneously. Neither alone gets the full benefit.

End-to-end results — the full picture

Engine variant	Precision	Batch support	Matmul strategy	Accuracy	Relative latency
Float BS=1	float32	Single input	Loop reorder	97.36%	1.0× (baseline)
INT8 BS=1	int8	Single input	Cache-friendly	97.36%	~0.8×
INT8 batched naive	int8	N inputs	Serial loops	97.36%	~0.6× per input
INT8 tiled batch	int8	N inputs	Tiled (cache-aware)	97.36%	~0.3× per input
INT8 tiled + SIMD	int8	N inputs	Tiled + Neon	97.36%	~0.05–0.1× per input

Accuracy is preserved at 97.36% across all variants. Every optimization is a pure latency improvement with no accuracy cost — a consequence of the INT8 quantization working well for this task and the tiling/SIMD changes being mathematically equivalent to the unoptimized code.

Conclusions

What this experiment teaches us

Lesson 01

Frameworks are layers of abstraction over simple math

The entire forward pass of this network is three matrix multiplications and two elementwise max operations. Stripping the framework reveals how little is actually happening — and how much room there is to optimize it.

Lesson 02

The cache problem compounds across batch dimensions

A single-input engine that looks cache-friendly can be cache-hostile at batch scale. The batch dimension changes which data is reused and how — a separate analysis is needed for every access pattern change.

Lesson 03

INT8 quantization is often free for small networks

For MNIST, quantization produced zero accuracy loss. The weight distributions were well-behaved and the task not precision-sensitive. This doesn't generalize to all tasks, but it's a strong starting point assumption.

Lesson 04

SIMD and tiling are multiplicative, not additive

Tiling alone resolved the memory bottleneck. SIMD alone improved throughput. Together they address separate constraints — memory bandwidth and compute throughput — and their benefits compound rather than overlap.

The 10–20× SIMD improvement is significant, but the most important result is architectural: a fully working inference engine that owns every line of code, verifiably matches PyTorch's accuracy, and runs on hardware with no OS, no runtime, and no framework overhead.

Taming the cache:a journey throughGEMM optimization

Why GEMM exposes everything wrong with your memory access pattern

Naive GEMM — a cache demolition derby

Loop reordering — the 29× fix that took one minute

Cache-aware tiling — blocking for the L1/L2 hierarchy

NEON SIMD vectorization — exploiting the Cortex-A76's vector engine

What the data actually teaches us

PyTorch to C++ inference:building a neural netfrom scratch

Why rebuild inference in C++?

The network — intentionally simple

Exporting weights from PyTorch

The C++ inference engine — modular by design

Float model — verifying correctness

INT8 model — quantizing weights

Batch inference and the cache problem resurfaces

Tiled batch matmul — keeping weights cache-resident

SIMD vectorization — processing 16 elements at once

End-to-end results — the full picture

What this experiment teaches us

Taming the cache:
a journey through
GEMM optimization

PyTorch to C++ inference:
building a neural net
from scratch