Routing-SNN · Kernel Benchmarks

Spiking neurons fire sparsely: often only 5–30% activate per timestep. A sparse kernel can skip the multiplications by zero and beat dense GPU matmul, but only below a firing-rate threshold, and only when the workload is large enough to be limited by GPU compute rather than by the launch-overhead floor. This page shows which kernel wins, at what model scale, and by how much. Every speedup number is flagged as measured either vs eager PyTorch (an inflated baseline, not a meaningful bar) or vs torch.compile dense fp16 (the honest comparison that matters for real deployment).
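The threshold behavior can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the model behind the benchmarks: it assumes sparse time is a fixed overhead plus ρ times the dense compute time. The names `t_dense_us` and `t_overhead_us` and all timings are hypothetical.

```python
def sparse_speedup(rho, t_dense_us, t_overhead_us):
    """Toy model: speedup of a sparse kernel over dense at firing rate rho.

    Assumes t_sparse = fixed overhead + rho * dense compute time.
    """
    t_sparse_us = t_overhead_us + rho * t_dense_us
    return t_dense_us / t_sparse_us


def breakeven_rho(t_dense_us, t_overhead_us):
    """Firing rate at which speedup equals 1 under the toy model.

    A result <= 0 means the overhead alone exceeds the dense time,
    so the sparse kernel cannot win at any firing rate.
    """
    return 1.0 - t_overhead_us / t_dense_us


# Hypothetical timings, for illustration only.
large_model = breakeven_rho(t_dense_us=1000.0, t_overhead_us=100.0)
small_model = breakeven_rho(t_dense_us=20.0, t_overhead_us=25.0)
print(f"large model break-even rho: {large_model:.2f}")
print(f"small model break-even rho: {small_model:.2f} (never wins)")
```

Under these assumptions the large model wins anywhere below its break-even ρ, while the small model sits on the overhead floor and never wins, whatever the sparsity.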

Kernel families — what each is trying to do

Eight kernel lineages. Routed ones need the trained router (next_table/route0) at inference; general ones run on any trained SNN. Concept = the core idea each explores.

family	runs on	concept	headline result

How speedup scales with model size — pick the axis

Each point is one benchmark run. Y = speedup vs dense (log). X = selected scale parameter (log). The dashed line at 1× is break-even: above it the kernel wins; below it, dense wins. The shaded zone above the line is the win region. Color = kernel (one color per kernel, shared across all three charts). Symbol = scope: circle = general (non-routed, any SNN), square = routed (uses the trained router / next_table at inference), diamond = baseline. Click a legend entry to isolate a kernel.

Speedup scaling — does the advantage grow with size?

Each line is one kernel across network sizes (x-axis = the scale parameter selected above, shared with the scatter). A rising line means the kernel’s speedup grows with scale, i.e. its compute cost grows more slowly than dense; that is the real prize. Flat means a fixed bandwidth-bound ceiling. Falling means the advantage shrinks. Honest (vs torch.compile) baselines only by default.
Real findings: V3-LUT-parallel rises 9×→26× (W=256→W=4096), so its advantage grows with scale; V3-LUT-scalar falls 3.9×→1.1×, so its advantage shrinks with scale; V7.1-nonrouted is flat at ~3× (1/ρ bandwidth ceiling). Note: W=704 (SHD shape, different depth) is omitted for a clean size-scaling view.
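One way to quantify "rising" vs "falling" is the slope of log(speedup) against log(W) between two endpoints. The sketch below uses only the endpoint values quoted in the findings. It assumes the V3-LUT-scalar figures span the same W=256→W=4096 range as V3-LUT-parallel, which the text does not state explicitly. `scaling_exponent` is a hypothetical helper name.

```python
import math


def scaling_exponent(w_lo, w_hi, s_lo, s_hi):
    """Slope of log(speedup) vs log(W) between two endpoints.

    > 0: speedup grows with width; ~0: flat ceiling; < 0: advantage shrinks.
    """
    return math.log(s_hi / s_lo) / math.log(w_hi / w_lo)


# Endpoints as quoted in the findings (W range for the scalar kernel is assumed).
parallel = scaling_exponent(256, 4096, 9.0, 26.0)
scalar = scaling_exponent(256, 4096, 3.9, 1.1)
print(f"V3-LUT-parallel exponent: {parallel:+.2f}")
print(f"V3-LUT-scalar exponent:   {scalar:+.2f}")
```

With these endpoints the parallel kernel's speedup scales roughly as W^0.38 and the scalar kernel's as W^-0.46. A two-point slope ignores any curvature between the endpoints.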

Firing-rate crossover — where sparse beats dense as activity drops

These are real ρ sweep scans: each line holds model size fixed and varies firing rate. X = firing rate ρ (log), Y = speedup vs baseline (log). A vertical tick marks each kernel’s crossover ρ where it equals dense. Above the dashed line the sparse kernel wins. This is the only chart where ρ is the meaningful axis.
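A crossover ρ can be estimated from a sweep by interpolating in log-log space between the two points that bracket break-even. This is a minimal sketch, not necessarily how the ticks on this chart are computed. `crossover_rho` is a hypothetical helper, and the sweep values are synthetic rather than measured.

```python
import math


def crossover_rho(rhos, speedups):
    """Return the firing rate where speedup crosses 1.0, or None.

    Interpolates linearly in log-log space between the two adjacent
    sweep points that bracket break-even. `rhos` must be sorted
    ascending and all speedups must be positive.
    """
    points = list(zip(rhos, speedups))
    for (r0, s0), (r1, s1) in zip(points, points[1:]):
        brackets = (s0 - 1.0) * (s1 - 1.0) <= 0
        if brackets and s0 != s1:
            # log(1.0) == 0, so solve log(s0) + t * (log(s1) - log(s0)) = 0.
            t = -math.log(s0) / (math.log(s1) - math.log(s0))
            return math.exp(math.log(r0) + t * (math.log(r1) - math.log(r0)))
    return None


# Synthetic sweep (not measured data): speedup = 0.3 / rho.
rhos = [0.05, 0.1, 0.2, 0.4, 0.8]
speedups = [0.3 / r for r in rhos]
rho_star = crossover_rho(rhos, speedups)
print(f"crossover rho: {rho_star:.3f}")
```

Log-log interpolation matches the chart's axes and is exact when speedup follows a power law in ρ, as in the synthetic data here.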

Best speedup each kernel reaches — mind the baseline

⚠ Baseline honesty warning: Speedups are measured vs different baselines. V2/V3/V4 LUT kernels show 40–176× but those are vs eager (uncompiled) PyTorch — an inflated bar that torch.compile easily erases. The meaningful comparisons are V7/V8 vs torch.compile dense fp16: e.g. V8.3 at 16.5×, V7.1 non-routed at 3.4×. Eager-baseline bars below are muted and tagged ⚠.
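To see why an eager-baseline number is inflated, an eager speedup can be rebased onto the compiled baseline by dividing by how much faster torch.compile is than eager on the same shape. The sketch below assumes that ratio has been measured separately. The value 20× is a made-up placeholder, not a measurement from this page, and `rebase_to_compiled` is a hypothetical helper.

```python
def rebase_to_compiled(speedup_vs_eager, compile_gain_over_eager):
    """Convert a speedup vs eager PyTorch into a speedup vs torch.compile.

    speedup_vs_eager        = t_eager / t_kernel
    compile_gain_over_eager = t_eager / t_compiled
    result                  = t_compiled / t_kernel
    """
    if compile_gain_over_eager <= 0:
        raise ValueError("compile_gain_over_eager must be positive")
    return speedup_vs_eager / compile_gain_over_eager


# 176x is the top eager figure quoted above; 20x is a HYPOTHETICAL compile gain.
honest = rebase_to_compiled(176.0, 20.0)
print(f"speedup vs torch.compile: {honest:.1f}x")
```

The rebased number is only as good as the measured compile gain, which varies with shape, dtype, and batch size. A direct measurement against the compiled baseline is preferable where available.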

All measurements

One row per benchmark point. The scope column shows whether a kernel runs on any trained SNN (general — non-routed: dense, static-pruned incl. V8 block-sparse, or spike-pool) or requires the trained router (routed — V2/V3/V5/V7 read next_table / route0 at inference). Click column headers to sort; use the scope chips above to filter. Honest-only mode (default) hides eager-baseline rows so the speedup column stays meaningful.
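The filtering described here amounts to two predicates over the rows. The sketch below assumes each row is a dict with `kernel`, `scope`, `baseline`, and `speedup` keys and that baselines are labeled `"torch.compile"` or `"eager"`. Those key names and labels are assumptions, and the eager row is a hypothetical placeholder rather than a real measurement.

```python
def filter_rows(rows, scope=None, honest_only=True):
    """Filter benchmark rows by scope and, by default, by honest baseline.

    scope: "general", "routed", or None for no scope filter.
    honest_only: drop rows whose baseline is not torch.compile.
    """
    kept = []
    for row in rows:
        if honest_only and row["baseline"] != "torch.compile":
            continue
        if scope is not None and row["scope"] != scope:
            continue
        kept.append(row)
    return kept


rows = [
    # Speedups for V8.3 and V7.1 non-routed are the ones quoted on this page.
    {"kernel": "V8.3", "scope": "general", "baseline": "torch.compile", "speedup": 16.5},
    {"kernel": "V7.1 non-routed", "scope": "general", "baseline": "torch.compile", "speedup": 3.4},
    # Hypothetical placeholder row standing in for an eager-baseline LUT result.
    {"kernel": "example_lut", "scope": "routed", "baseline": "eager", "speedup": 40.0},
]

for row in filter_rows(rows):
    print(row["kernel"], row["speedup"])
```

Keeping honest-only as the default mirrors the page's behavior: an eager-baseline row has to be requested explicitly before it can appear next to a torch.compile number.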

kernel	family	scope	config	ρ firing	metric	value	speedup vs dense	baseline	source

Memory footprint vs model width W

Each line traces one kernel’s memory fraction vs baseline as model width W grows (log scale). V5 spike-pool kernels use 17–21% of V3’s GPU memory and the gap widens with W. V3’s LUT tables become infeasible above W≈2300/D=16 on 16 GB; V5 stays viable. Baseline (1.0) = cuda_v3_lut_parallel.

Where V3 spends its time (internal profiling)

Phase breakdown of cuda_v3_lut_parallel at SHD W=704, D=4. This is not a speedup chart — it’s a time budget showing where to target optimization. Biggest cost: rec_lut_walk (recurrent LUT traversal, ~11%), plus sync overhead. The full forward pass is ~822 μs; xproj_sel pre-pass alone is ~90 μs (11%).