
WikiText-103 LM — Routed SSM Training Dashboard

8 k train / 1 024 test seqs, seq_len=128 d_model=512 · N=8 blocks · d_state=32 · 10 epochs
Scope note: Results use an 8 000-sequence subset (128 tokens each) of WikiText-103, not the full corpus (over 100 M tokens). Perplexity figures are not comparable to published full-corpus numbers. The purpose is relative comparison across routing strategies on a tractable pilot run.
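To put the scope note in numbers, the following sketch works out the size of the pilot subset. It uses only the figures quoted above; the 100 M figure is the stated lower bound on the full corpus, so the resulting fraction is an upper bound.

```python
# Figures taken from the scope note above.
n_train_seqs = 8_000
seq_len = 128

# Total training tokens in the pilot subset.
subset_tokens = n_train_seqs * seq_len

# The full corpus is described as "over 100 M tokens", so this is a lower bound
# on its size and the fraction below is an upper bound.
full_corpus_lower_bound = 100_000_000
max_fraction = subset_tokens / full_corpus_lower_bound

print(f"{subset_tokens:,} training tokens, at most {max_fraction:.2%} of the full corpus")
```

The subset is roughly one million tokens, about 1% of the full corpus at most, which is why absolute perplexity here sits far above published numbers.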

Lower PPL = better. Headline metric is best_val_ppl — the lowest test-set perplexity seen across any epoch. Each row is one training run; colours group runs by family (dense / routed / top1-ablation / control).
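As a rough illustration of how the two perplexity columns relate, here is a minimal sketch. It assumes perplexity is computed as the exponential of the mean per-token cross-entropy in nats (the standard definition) and that final_ppl is the last epoch's value; neither detail is confirmed by this page. The function names and the loss values are hypothetical.

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity from the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)

def best_ppl(per_epoch_nll):
    """Return (lowest perplexity, 1-based epoch at which it occurred)."""
    ppls = [perplexity(x) for x in per_epoch_nll]
    best_idx = min(range(len(ppls)), key=ppls.__getitem__)
    return ppls[best_idx], best_idx + 1

# Hypothetical per-epoch test losses, for illustration only (not real results).
losses = [6.2, 5.7, 5.5, 5.6, 5.8]

best, best_epoch = best_ppl(losses)
final = perplexity(losses[-1])  # assumed meaning of final_ppl
print(f"best={best:.1f} at epoch {best_epoch}, final={final:.1f}")
```

When the final value is higher than the best value, the run has passed its best epoch, which is the overfitting pattern described in the Perplexity vs Epoch section.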

1  Run Table

Click any column header to sort. best_val_ppl is sorted ascending (lowest PPL = best) by default. active_frac — fraction of blocks active (1.0 for dense; routed runs use <1.0). status: done = all epochs complete; partial = still running or stopped early.

Columns: family · router · k-of-N · d_model · n_train · epochs · best_val_ppl · final_ppl · active_frac · params · status · #anom · wall_s · seed · tag

2  Perplexity vs Epoch

Primary metric: test_ppl per epoch. Lower = better. The dense baseline is shown in blue for reference. Toggle log-y to see fine differences when values span a wide range. If a run's PPL rises after its best epoch, that is expected overfitting on the small 8 k-sequence pilot corpus.

PPL vs Epoch (lower = better)
One line per run. Hover for exact values. Legend click to toggle individual runs.
log-y compresses large ranges; useful when comparing dense (~250) vs diverging (~1 000+) runs on the same axes.

3  Secondary Curves

Supporting views: token-prediction accuracy and routing activity over training. test_acc is next-token accuracy (low for LM — vocab size ~50 k, so chance ≈ 0.002%). active_frac — fraction of SSM blocks active per token; dense runs are always 1.0; routed runs self-organize below 1.0.
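The chance figure quoted above follows directly from the vocabulary size. The sketch below reproduces it, taking the approximate "~50 k" vocabulary as exactly 50 000 for the arithmetic. It also notes the perplexity of a uniform predictor, which equals the vocabulary size and gives a useful ceiling when reading the PPL plots.

```python
import math

vocab_size = 50_000  # approximation of the "~50 k" vocabulary mentioned above

# A uniform random guess is correct with probability 1 / vocab_size.
chance_acc = 1.0 / vocab_size
chance_pct = 100.0 * chance_acc

# A uniform predictor assigns log-probability -log(V) to every token,
# so its perplexity is exp(log(V)) = V.
uniform_ppl = math.exp(math.log(vocab_size))

print(f"chance accuracy = {chance_pct:.3f}%  uniform-predictor PPL = {uniform_ppl:.0f}")
```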

Token-prediction Accuracy vs Epoch
Higher = better. Note: LM accuracy is low by nature (50 k vocab).
Active Fraction vs Epoch
Dense runs are always 1.0. Routed runs converge to their k/N ratio.
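To make the k/N target concrete, here is a minimal sketch of how an active fraction could be measured from per-token routing masks. It assumes hard top-k selection over N blocks, which yields exactly k/N at every step; the routers on this dashboard are described as converging to that ratio, so they presumably use a softer or thresholded rule that this sketch does not model. The helper names, the random scores, and k=2 are all hypothetical.

```python
import random

def topk_mask(scores, k):
    """Return a 0/1 mask that marks the k highest-scoring blocks as active."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]

def active_frac(masks):
    """Mean fraction of active blocks across all tokens."""
    active = sum(sum(m) for m in masks)
    return active / (len(masks) * len(masks[0]))

random.seed(0)
N, k = 8, 2          # N=8 matches the page header; k=2 is illustrative
seq_len = 128

# One random score vector per token stands in for real router logits.
masks = [topk_mask([random.random() for _ in range(N)], k) for _ in range(seq_len)]
frac = active_frac(masks)
print(f"active_frac = {frac}")
```

A dense run corresponds to k = N, which gives an active fraction of 1.0 under the same measurement.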

4  Best PPL by Config

One bar per run, sorted by best_val_ppl ascending (best first). Colour = family. The dense baseline is the primary comparison target.

Best Validation PPL per Run (lower = better)
Bars shorter than the dense bar indicate better perplexity than dense at equal or lower compute.