Spiking neurons fire sparsely — often only 5–30% activate per timestep. A sparse kernel can skip the zero multiplications and beat dense GPU matmul, but only below a firing-rate threshold and only when the arithmetic actually hits the GPU’s compute bottleneck rather than its launch-overhead floor. This page shows which kernel wins, at what model scale, and by how much — and clearly flags whether any speedup number is vs eager PyTorch (inflated, not a meaningful bar) or vs torch.compile dense fp16 (the honest comparison that matters for real deployment).
Eight kernel lineages. Routed ones need the trained router (next_table/route0) at inference; general ones run on any trained SNN. Concept = the core idea each explores.
| family | runs on | concept | headline result |
|---|
Each point is one benchmark run. Y = speedup vs dense (log). X = selected scale parameter (log). The dashed line at 1× is break-even. Above it the kernel wins; below it, dense wins. The shaded green zone is the win region. Color = kernel (one color per kernel, shared across all three charts above). Symbol = scope: circle = general (non-routed, any SNN), square = routed (uses the trained router / next_table at inference), diamond = baseline. Click a legend entry to isolate a kernel.
Each line is one kernel across network sizes (x-axis = selected scale parameter above, shared with the scatter).
A rising line means the kernel’s speedup grows with scale —
its compute complexity diverges from dense — the real prize.
Flat means a fixed bandwidth-bound ceiling.
Falling means the advantage shrinks.
Honest (vs torch.compile) baselines only by default.
Real findings: V3-LUT-parallel rises 9×→26× (W256→W4096) — super-linear;
V3-LUT-scalar falls 3.9×→1.1× — advantage shrinks with scale;
V7.1-nonrouted is flat ~3× (1/ρ bandwidth ceiling).
Note: W=704 (SHD shape, different depth) omitted for a clean size-scaling view.
These are real ρ sweep scans: each line holds model size fixed and varies firing rate. X = firing rate ρ (log), Y = speedup vs baseline (log). A vertical tick marks each kernel’s crossover ρ where it equals dense. Above the dashed line the sparse kernel wins. This is the only chart where ρ is the meaningful axis.
torch.compile easily erases.
The meaningful comparisons are V7/V8 vs
torch.compile dense fp16:
e.g. V8.3 at 16.5×, V7.1 non-routed at 3.4×.
Eager-baseline bars below are muted and tagged ⚠.
One row per benchmark point. The scope column shows whether a kernel runs on any trained SNN (general — non-routed: dense, static-pruned incl. V8 block-sparse, or spike-pool) or requires the trained router (routed — V2/V3/V5/V7 read next_table / route0 at inference). Click column headers to sort; use the scope chips above to filter. Honest-only mode (default) hides eager-baseline rows so the speedup column stays meaningful.
| kernel | family | scope | config | ρ firing | metric | value | speedup vs dense | baseline | source |
|---|
Each line traces one kernel’s memory fraction vs baseline as model width W grows (log scale). V5 spike-pool kernels use 17–21% of V3’s GPU memory and the gap widens with W. V3’s LUT tables become infeasible above W≈2300/D=16 on 16 GB; V5 stays viable. Baseline (1.0) = cuda_v3_lut_parallel.
Phase breakdown of cuda_v3_lut_parallel at SHD W=704, D=4.
This is not a speedup chart — it’s a time budget showing where to target
optimization. Biggest cost: rec_lut_walk (recurrent LUT traversal, ~11%),
plus sync overhead. The full forward pass is ~822 μs; xproj_sel pre-pass alone is ~90 μs (11%).