A learned router activates only k of B blocks per timestep, choosing them from the accumulated recurrent state rather than the current token, so the model can switch the active sub-network mid-sequence as context changes. This is the SNN routing idea ported to a continuous diagonal-SSM substrate, which has no surrogate-gradient boundary. The previously unpublished 2×2 ablation crosses router-state (stateful | stateless) with block-state (ssm | ff); stateless token-wise routing corresponds to the MoE-Mamba / BlackMamba prior art. Results are from a single seed per config unless noted.
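To make the stateful versus stateless distinction concrete, here is a minimal NumPy sketch of top-k block routing over a diagonal SSM. It is an illustration under stated assumptions, not the implementation behind the results here: the function name `run_routed_ssm`, the linear router, the choice to freeze the state of inactive blocks, and summing active-block states as the output are all hypothetical choices.

```python
import numpy as np

def run_routed_ssm(x, decay, w_in, w_route, k, stateful=True):
    """Run B diagonal-SSM blocks with top-k routing (illustrative sketch).

    x:       (T, D) input sequence
    decay:   (B, N) per-block diagonal decay, values in (0, 1)
    w_in:    (B, D, N) per-block input projection
    w_route: (B*N, B) if stateful, else (D, B)  # assumed linear router
    """
    T, _ = x.shape
    B, N = decay.shape
    h = np.zeros((B, N))                      # recurrent state of every block
    outputs = np.zeros((T, N))
    active_log = np.zeros((T, B), dtype=bool)
    for t in range(T):
        # Stateful routing reads the accumulated state; stateless reads the token.
        router_in = h.reshape(-1) if stateful else x[t]
        scores = router_in @ w_route          # one score per block, shape (B,)
        top = np.argsort(-scores, kind="stable")[:k]
        mask = np.zeros(B, dtype=bool)
        mask[top] = True
        # Diagonal SSM update: h <- decay * h + x_t W_in, elementwise per block.
        h_new = decay * h + np.einsum("d,bdn->bn", x[t], w_in)
        # Assumption: inactive blocks keep their previous state unchanged.
        h = np.where(mask[:, None], h_new, h)
        outputs[t] = h[mask].sum(axis=0)      # assumption: sum the active states
        active_log[t] = mask
    return outputs, active_log
```

With, say, B = 8 and k = 1 (illustrative values), exactly one block updates per step, so the active fraction is 1/8. Passing `stateful=False` with a `(D, B)` router matrix gives the token-wise routing baseline.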
Path-memory task: a transient cue sets the active context, and later steps must apply it (label = symbol + remembered context). Solving it requires holding the context in recurrent state and routing to the matching block. stateful-ssm matches dense at 12.5% active; stateless-ssm (the MoE-Mamba analog) collapses; an MLP router (matching the SNN) takes stateful to ~99.9%. Key control: the MLP router does not rescue stateless, so the win comes from the state, not from router capacity.
| router | block | k / B | router net | acc (mean) | range | seeds | active | params |
|---|---|---|---|---|---|---|---|---|
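For readers who want to see the shape of the path-memory task, the following is a hedged sketch of a sequence generator. The function name, the token layout (cue ids placed after symbol ids), the cue probability, and the modular sum used for "symbol + remembered context" are assumptions made for illustration; the actual task definition may differ.

```python
import numpy as np

def make_path_memory_sequence(T, num_symbols, num_contexts, cue_prob, rng):
    """Generate one sequence: cue tokens set a context, symbol tokens use it."""
    tokens = np.zeros(T, dtype=np.int64)
    labels = np.full(T, -1, dtype=np.int64)   # -1 marks steps with no target
    context = 0
    for t in range(T):
        if t == 0 or rng.random() < cue_prob:
            # Transient cue: pick a new context and emit its cue token.
            context = int(rng.integers(num_contexts))
            tokens[t] = num_symbols + context  # assumed id layout for cues
        else:
            # Ordinary step: the label depends on the remembered context,
            # which is not visible in the current token.
            symbol = int(rng.integers(num_symbols))
            tokens[t] = symbol
            labels[t] = (symbol + context) % num_symbols  # assumed combination
    return tokens, labels

rng = np.random.default_rng(0)
tokens, labels = make_path_memory_sequence(
    T=32, num_symbols=8, num_contexts=4, cue_prob=0.1, rng=rng
)
```

Because the label on a symbol step depends on a cue seen earlier, a router that reads only the current token has no way to pick the context-matched block, which is the failure mode the stateless rows are meant to expose.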
Multi-Query Associative Recall (MQAR, from Zoology). The task needs input-dependent dynamics, so the blocks are selective (Mamba-style decay). This probes a different axis than cue-switch: recall capacity rather than memory-switching. Question: does routed-stateful approach dense as more blocks become active (k1→k6)?
| router | block | k / B | router net | d / N | best acc | last | active | epochs | status |
|---|---|---|---|---|---|---|---|---|---|
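As a reminder of what an associative-recall example looks like, here is a simplified generator sketch. It is not the Zoology data pipeline: the function name, the split of the vocabulary into key and value halves, and placing all queries after the key-value context are assumptions chosen to keep the example short.

```python
import numpy as np

def make_mqar_example(num_pairs, num_queries, vocab_size, rng):
    """Build one recall example: key-value pairs, then queried keys.

    Requires num_pairs <= vocab_size // 2 and num_queries <= num_pairs.
    """
    half = vocab_size // 2
    # Assumption: keys use ids in [0, half), values use ids in [half, vocab_size).
    keys = rng.choice(half, size=num_pairs, replace=False)
    values = rng.integers(half, vocab_size, size=num_pairs)
    # Interleave into k1 v1 k2 v2 ...
    context = np.stack([keys, values], axis=1).reshape(-1)
    # Query a random subset of the stored keys.
    picked = rng.choice(num_pairs, size=num_queries, replace=False)
    inputs = np.concatenate([context, keys[picked]])
    # Targets exist only at query positions; -1 marks positions to ignore.
    targets = np.full(inputs.shape, -1, dtype=np.int64)
    targets[len(context):] = values[picked]
    return inputs, targets

rng = np.random.default_rng(0)
inputs, targets = make_mqar_example(num_pairs=4, num_queries=3, vocab_size=32, rng=rng)
```

Answering several queries per sequence means the model has to keep all stored pairs retrievable at once, which is why this task stresses recall capacity rather than switching between contexts.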
| config | dense ms | best routed ms | best k | speedup | ideal B/k |
|---|---|---|---|---|---|