Routing-SNN · Routed SSM

A1 · Head-to-head: stateful vs the adversarial baselines (MQAR recall, matched compute & params)

The core test: at the same compute and parameters, does choosing blocks from the recurrent state beat choosing them another way (or not routing at all)?

Multi-Query Associative Recall: store key→value pairs, recall them later. The four arms share the same blocks, same k/B (50% active), same seed; only the router rule differs. stateful reads h_t; stateless reads the token (MoE-Mamba analog); random-k routes randomly; dense runs all blocks.
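The four router rules can be sketched as below. This is a minimal illustration, not the project's router: `select_blocks`, `W_state` and `W_token` are hypothetical names, and a plain linear score followed by top-k is an assumption about how "reads h_t" and "reads the token" turn into a block choice.

```python
import random

def top_k(scores, k):
    """Indices of the k largest scores (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])

def matvec(W, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def select_blocks(arm, k, B, h_t=None, x_t=None,
                  W_state=None, W_token=None, rng=None):
    """Return the block indices that run at this step for one router arm."""
    if arm == "dense":
        return list(range(B))                      # no routing: every block runs
    if arm == "random-k":
        return sorted((rng or random).sample(range(B), k))
    if arm == "stateful":
        return top_k(matvec(W_state, h_t), k)      # scores from the recurrent state
    if arm == "stateless":
        return top_k(matvec(W_token, x_t), k)      # scores from the current token only
    raise ValueError(f"unknown arm: {arm}")

# Toy usage with B=4 blocks, k=2 active (50%), identity score weights.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
h_t = [0.1, 0.9, 0.5, -0.2]
x_t = [1.0, 0.0, 0.0, 0.3]
print(select_blocks("stateful", 2, 4, h_t=h_t, W_state=I4))   # -> [1, 2]
print(select_blocks("stateless", 2, 4, x_t=x_t, W_token=I4))  # -> [0, 3]
```

Only the input to the score changes between arms, which is what keeps the comparison matched on compute and parameters.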

accuracy	router arm	k / B	active	params	read

A1b · Training curves — does it actually learn, and when does recall “click”?

Accuracy and loss at every epoch. MQAR recall is acquired in a sudden phase transition, not gradually — so the curve, not the final number, tells you whether (and how fast) a config solves the task. A run whose loss keeps dropping while accuracy stays flat hasn’t transitioned yet; a late or absent transition flags a harder task or a degraded setup.
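One way to read a curve for its transition, as a rough sketch: find the first epoch that reaches a target accuracy and the largest single-epoch jump. The helper names, the 0.9 threshold and the example curve are all illustrative assumptions, not values from the runs.

```python
def transition_epoch(acc, threshold=0.9):
    """First epoch index whose accuracy reaches the threshold, else None."""
    for epoch, a in enumerate(acc):
        if a >= threshold:
            return epoch
    return None

def largest_jump(acc):
    """(epoch, size) of the largest one-epoch accuracy increase, else None."""
    if len(acc) < 2:
        return None
    size, epoch = max((acc[i] - acc[i - 1], i) for i in range(1, len(acc)))
    return epoch, size

# Synthetic curve: flat near chance, then a sudden rise.
curve = [0.12, 0.13, 0.13, 0.14, 0.71, 0.95, 0.97]
print(transition_epoch(curve))   # -> 5
print(largest_jump(curve)[0])    # -> 4

# A run that has not transitioned yet never reaches the threshold.
print(transition_epoch([0.12, 0.13, 0.12]))  # -> None
```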

Each line is one run; colour identifies the run. Filter by router / difficulty / seed / variant to focus the view, and choose how seeds combine (mean ± band / median ± band / per seed). Shown by default: kv8, the recommended recipe mag25a95e80 vs dense (the baseline, always drawn), as mean ± band; widen to other difficulties, variants, and seeds with the dropdowns. (Default config = strongest found so far in the annealing study, not claimed optimal.)
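The seed-combination modes can be sketched as follows. The band definitions are assumptions (the page does not state them): here the mean band is ±1 population standard deviation and the median band is the min-max range. `combine_seeds` is a hypothetical helper name.

```python
from statistics import mean, median, pstdev

def combine_seeds(curves, how="mean"):
    """Combine equal-length per-seed curves epoch by epoch.

    Returns a list of (center, low, high) tuples, one per epoch.
    """
    combined = []
    for values in zip(*curves):
        if how == "mean":
            center, spread = mean(values), pstdev(values)
            combined.append((center, center - spread, center + spread))
        elif how == "median":
            combined.append((median(values), min(values), max(values)))
        else:
            raise ValueError(f"unknown mode: {how}")
    return combined

# Two synthetic seeds, two epochs each.
seeds = [[0.1, 0.5], [0.3, 0.9]]
mean_band = combine_seeds(seeds, "mean")
median_band = combine_seeds(seeds, "median")
```

The "per seed" mode needs no combination: each curve is drawn as is.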

Routing soft→blocky anneal study (2026-06-03) — strongest stateful-routing training config found so far on MQAR kv8/v256 (not claimed optimal): linear “magnet” anneal, anneal_frac=0.25, soft-floor alpha_max=0.95, route_select_ema=0.8, fp32, fused kernel — the curve labelled mag25a95e80. Across 7 seeds it lifts mean final accuracy 0.825→0.942 and beats dense on 5/5 seeds with a baseline, fixing both the “never-transitions” and “late-collapse” failures of the no-EMA recipe (mag25a95). Window / long-anneal / router-freeze / router-lr variants were tried and rejected. Full method: docs/routed_ssm_annealing_algorithm.md; evidence: docs/annealing_research_log.md.
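The hyperparameter names suggest a schedule roughly like the sketch below. This is only one possible reading, written from the names alone: the authoritative description is docs/routed_ssm_annealing_algorithm.md. Assumed here: the hard-routing weight ramps linearly from 0 to alpha_max over the first anneal_frac of training (leaving a soft floor of 1 - alpha_max), and route_select_ema is the decay of an exponential moving average over router scores. All function names are hypothetical.

```python
def alpha_at(step, total_steps, anneal_frac=0.25, alpha_max=0.95):
    """Assumed linear ramp of the hard-routing weight, capped at alpha_max."""
    ramp_steps = max(1, int(anneal_frac * total_steps))
    return alpha_max * min(1.0, step / ramp_steps)

def ema_update(ema, scores, decay=0.8):
    """Assumed EMA over router scores; decay=0.8 keeps 80% of the old value."""
    return [decay * e + (1.0 - decay) * s for e, s in zip(ema, scores)]

def blend_gates(soft, hard, alpha):
    """Mix soft gates with a hard (0/1) selection; alpha=0 is fully soft."""
    return [(1.0 - alpha) * s + alpha * h for s, h in zip(soft, hard)]

# Toy usage: halfway through the ramp of a 1000-step run.
alpha = alpha_at(125, 1000)                       # 0.475 under these assumptions
gates = blend_gates([0.6, 0.4], [1.0, 0.0], alpha)
```

Under this reading the gates never become fully hard, since alpha stops at 0.95.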

Filters: router, difficulty, seed, variant, seeds (how seeds combine).

Panels: test accuracy vs epoch; train loss (CE) vs epoch (log y); test perplexity vs epoch; router active fraction vs epoch.

A2 · Recall capacity: does routed accuracy climb as more blocks become active (k1→k6)?

Here we let capacity vary: as more blocks are switched on, recall should have more room to climb.

The full MQAR sweep across block type (ssm / mamba / elm), router, and k/B. Unlike the matched A1 panel, these runs vary capacity. Read the table along the recall-capacity axis: more active blocks means more recall headroom.

best acc	router	block	k / B	router net	d / N	last	active	epochs

B1 · Training kernels — making the science runs cheap

Everything that makes a training epoch fast: the GLA chunked-matmul kernel we ship for routed training (B1a), the exact parallel-scan it falls back to on long/narrow shapes (B1b), and the three-kernel bake-off + ncu profiles that say which wins where (B1c). Every × is fwd+bwd vs torch.compile on identical math (graph-vs-graph), parity-verified first.

B2 · Inference kernels — making generation cheap by skipping the blocks routing didn’t pick

Decode one token at a time (B2a) and prefill a whole prompt (B2b), routed vs dense; then the hand-tuned top-1 CUDA kernel and the compressed router that lets it scale to many blocks (B2c). Every × is vs torch.compile dense.

B2a · Decode (batch-1 streaming): routed gather vs torch.compile dense

At single-token generation, running only the active blocks beats running all of them — once the model is wide enough.

Single-step inference. Routing computes only the k active blocks. The dense diagonal-SSM baseline is memory-bound (not a tensor-core GEMM), so block-sparsity genuinely wins above a width threshold. Lead = × vs torch.compile dense; ideal B/k is the FLOP ceiling.
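The two headline columns relate as in this small sketch. The timings in the usage lines are placeholders, not measurements, and `lead_and_ceiling` is a hypothetical helper.

```python
def lead_and_ceiling(routed_ms, dense_ms, B, k):
    """Return (lead, ceiling, fraction of ceiling reached).

    lead    = measured speedup of routed over dense
    ceiling = B / k, the FLOP-only bound when k of B blocks run
    """
    lead = dense_ms / routed_ms
    ceiling = B / k
    return lead, ceiling, lead / ceiling

# Placeholder timings: 1.5 ms dense, 0.5 ms routed, k=2 of B=8 blocks.
lead, ceiling, reached = lead_and_ceiling(0.5, 1.5, B=8, k=2)
print(lead, ceiling, reached)   # -> 3.0 4.0 0.75
```

A lead well under B/k is expected when the step is memory-bound or launch-bound rather than FLOP-bound.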

× vs dense	config	best k	ideal B/k	routed ms	dense ms

B2b · Routed prefill: scan k of K blocks vs dense scan

Processing a whole prompt at once: scanning only the active blocks vs scanning all of them.

× vs dense	config	best k	ideal K/k	per-k

The speedup plateaus at about 7× (launch/overhead-bound below compute saturation) rather than scaling linearly with K/k.
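A fixed per-call cost is enough to produce a plateau of this kind. The toy model below is an illustration only: the cost split and the numbers are hypothetical and not fitted to the measurements.

```python
def modeled_speedup(K, k, per_block, overhead):
    """Speedup of scanning k of K blocks when each call also pays a fixed overhead.

    dense time  = overhead + K * per_block
    routed time = overhead + k * per_block
    """
    return (overhead + K * per_block) / (overhead + k * per_block)

# Hypothetical costs: 1 unit per block, 8 units of fixed overhead, K=64 blocks.
for k in (32, 8, 1):
    print(k, 64 / k, modeled_speedup(64, k, per_block=1.0, overhead=8.0))
# k=32: ideal 2.0,  modeled 1.8
# k=8:  ideal 8.0,  modeled 4.5
# k=1:  ideal 64.0, modeled 8.0
```

In this model the speedup can never exceed (overhead + K·per_block) / overhead, however small k becomes, so it flattens while K/k keeps growing.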

B2c · Hand-tuned top-1 kernel + compressed router — vs torch.compile dense

Computing only the single active block per token, in one fused kernel — and making the router cheap so it scales to many blocks.

All × vs torch.compile dense, parity-exact, fair graph-vs-graph. Full write-up: docs/kernel_acceleration_research_log.md. Curated relative speedups (a graph-timing artifact inflated some raw absolute-ms in logs; the × figures shown are the vetted same-harness values).

Shippable: the persistent top-1 kernel runs 1.9–3.95× faster than torch.compile dense at the trained shape (K=8), parity-exact (max|Δ| 8e-6). The ceiling is the stateful router, which reads all K·P·N state (ncu: ~80% of the kernel, 88.8% memory-bound), so the kernel loses at large K. A compressed router (per-block summary, input 31× smaller) flips it back to a win and runs up to K=128.

× vs torch.compile dense — by batch (K=8)

× vs dense — the K crossover (full router)

Persistent top-1 kernel vs torch.compile dense (full-seq MQAR, T=64)

After coalescing W_router + a warp-shuffle router reduction (+1.4–1.6× over the naive version).

× vs dense	batch	hand µs	dense µs
3.56×	32	0.58	2.04
3.95×	128	0.62	2.43
1.89×	512	1.92	3.63
2.69×	2048	6.69	17.99

Scale with K (full router) — top-1 LOSES at large K

P=D/K. The full router's all-state read grows ∝K² (faster than dense block work ∝K), so adding blocks makes top-1 worse — the reason a compressed router is needed.

× vs dense	config	D	K	router/block
1.45×	real	128	8	7.3
2.24×	D×4	512	8	3.8
1.98×	D×8	1024	8	2.0
0.80×	K=16	512	16	15.3
0.37×	K=32	1024	32	31.5

Compressed router (V3-style) — the lever for large K

The router reads a per-block scalar summary instead of the full state, so its input shrinks from 4224 to 136 (31×). The global-state kernel runs for K=8..128. This is a speed result only (random weights); recall needs a retrain, for which the gated router_pool code is ready.

× vs full-router	K	router_in full→comp	runs?
2.02×	8	4224→136	yes (clean)
✓	16	4352→272	yes (full-router collapses)
✓	32	16896→544	yes (global-state)
✓	64	33792→1088	yes
✓	128	67584→2176	yes
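The K=8 row can be reproduced by a back-of-envelope size count. This is an inference from the numbers, not taken from the router code. It assumes the full router sees the flattened state (K·P·N = D·N values) plus D further inputs, that the compressed router sees one scalar per block plus the same D inputs, and that D=128, N=32 as quoted elsewhere on this page.

```python
def router_input_sizes(D, N, K):
    """Assumed router input widths (full, compressed).

    full       = D*N flattened state values + D extra inputs (assumption)
    compressed = K per-block scalars        + D extra inputs (assumption)
    """
    full = D * N + D
    compressed = K + D
    return full, compressed

full, comp = router_input_sizes(D=128, N=32, K=8)
print(full, comp, round(full / comp, 1))   # -> 4224 136 31.1
```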

Temporal fusion (arXiv:2408.00280) on our recurrence h=a·h+b: fused 1-kernel vs serial step-loop = 5.1–5.4× (parity 1.8e-7, stable across T=64..2048) — the technique underneath, and its backward fusion is the path to fast routed training. Memory: topk1_kernel_verdict, topk_descent_win.
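The recurrence h = a·h + b can be computed either step by step or as a scan, because each step is an affine map and composing affine maps is associative. The sketch below shows that standard identity in plain Python. It is not the fused kernel described above, only a small reference one can test against, and the function names are hypothetical.

```python
def serial_scan(a, b, h0=0.0):
    """Step loop: h_t = a_t * h_{t-1} + b_t, returning every h_t."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def combine(left, right):
    """Compose two affine steps h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def parallel_scan(a, b, h0=0.0):
    """Hillis-Steele inclusive scan over affine maps.

    Takes about log2(T) rounds; within a round every position is independent,
    which is what a parallel implementation exploits.
    """
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
                 for i in range(len(elems))]
        offset *= 2
    # elems[t] now holds the composed map from h0 to h_t.
    return [A * h0 + B for A, B in elems]

a = [0.5, 0.9, 0.1, 0.7, 0.3]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
ref = serial_scan(a, b, h0=0.25)
par = parallel_scan(a, b, h0=0.25)
print(max(abs(x - y) for x, y in zip(ref, par)))  # small floating-point difference
```

The two orderings differ only in floating-point rounding, which is why a parity check with a tolerance is the natural test.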

B3 · Router-mode × kernel-form speed grid (in progress — partial data)

Which router can exploit which kernel form. This sweep is still running: only the dense rows and a few compiled rows are populated so far (the eager baselines have not landed, so ×-vs-eager is not yet computable). Shown for transparency; treat numbers as preliminary. Lead = raw step ms until the baseline run completes.

step ms	router	kernel	L	tok/s	peak MB	note

Config: d_model=128, d_state=32, B=8, batch=16. Rows with no step_ms (run not finished) are shown as N/A.

C1 · MoE-SSM language model (WikiText-103): state-routing vs token-routing vs dense

Honest negative result: at this small scale, the recall advantage does not yet show up as better language modeling.

Small selective-SSM LM, GPT2 tokenizer, tied embeddings. Experts = SSM blocks. state-routed (ours) vs token-routed (BlackMamba analog) vs dense. Lower perplexity is better. Pure-PyTorch backend (no mamba_ssm kernel on Windows). anneal = soft→hard route schedule.

test ppl	arm	anneal	val ppl	active	tok/s	VRAM MB	params

Lower ppl = better, so this table sorts ascending (best/lowest on top).
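For reference, perplexity is the exponential of the mean per-token cross-entropy (in nats), so lower loss means lower perplexity. A minimal sketch with made-up inputs, using a hypothetical `perplexity` helper:

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns probability 1/4 to every target token has perplexity 4.
uniform_quarter = [math.log(4.0)] * 10

# Sorting arms ascending by perplexity puts the best one first (made-up values).
arms = {"arm_a": 31.0, "arm_b": 29.5, "arm_c": 30.2}
ranked = sorted(arms, key=arms.get)
```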

Cue-switch: can the router switch the active sub-network mid-sequence?

Toy diagnostic — confirms the router can switch the active sub-network mid-sequence; not a headline result.

A transient cue sets the active context; later steps must apply it. Solving it needs holding context in the recurrent state and routing to the matching block — a memory-switch sanity check (a different axis from the A1 recall result above). The control that matters: an MLP router does not rescue stateless — the win is the state, not router capacity. chance = 0.125.

accuracy	router	block	k / B	router net	active	params

“active” = fraction of blocks computed per step. A range is shown when there is more than one seed.

Routed Diagonal-SSM · conditional computation on a state-space substrate

A · The result — does routing on the recurrent state actually help?

A1 · Head-to-head: stateful vs the adversarial baselines (MQAR recall, matched compute & params)

A1b · Training curves — does it actually learn, and when does recall “click”?

A2 · Recall capacity: does routed accuracy climb as more blocks become active (k1→k6)?

B · The kernels — can we train it and run it fast? (every × is vs torch.compile)

B1 · Training kernels — making the science runs cheap

B1a · K2 GLA training speed — the “accurate AND fast” kernel

Fused fwd+bwd vs torch.compile, by sequence length

Real MQAR training epoch — GLA-triton vs GLA-torch.compile

Routed end-to-end: K2 GLA vs K1 scan (parity-green)

Routed dispatch speed vs dense — full E2E training step (paired-interleaved)

B1b · K1 parallel-scan training kernel vs torch.compile

B1c · Kernel bake-off (reference) — K1 / K2 / K3, total ms vs sequence length

ncu SOL profiles — what limits each kernel?

K1 scan: autotuning verdict (vs orig & torch.compile)