WikiText-103 Dashboard

8 k train / 1 024 test seqs · seq_len=128 · d_model=512 · N=8 blocks · d_state=32 · 10 epochs

Scope note: Results use an 8 000-sequence subset (128 tokens each) of WikiText-103, not the full corpus (over 100 M tokens). Perplexity figures are not comparable to published full-corpus numbers. The purpose is relative comparison across routing strategies on a tractable pilot run.
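To put the scope note in numbers, the short sketch below works out the size of the pilot subset from the figures quoted above. The full-corpus size is taken as 100 M tokens, which the note gives only as a lower bound, so the computed share is an upper bound.

```python
# Figures quoted in the scope note; the full-corpus size is approximate.
N_TRAIN_SEQS = 8_000
SEQ_LEN = 128
FULL_CORPUS_TOKENS = 100_000_000  # "over 100 M tokens", so a lower bound

subset_tokens = N_TRAIN_SEQS * SEQ_LEN
subset_share = subset_tokens / FULL_CORPUS_TOKENS

print(f"subset tokens: {subset_tokens:,}")           # 1,024,000
print(f"share of full corpus: {subset_share:.2%}")   # at most about 1.02%
```

The training subset is therefore roughly 1% of the full corpus, which is why absolute perplexities here sit far above published numbers.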

Lower PPL = better. The headline metric is best_val_ppl: the lowest perplexity reached at any epoch. Despite the "val" in its name, it is measured on the test split. Each row is one training run; colours group runs by family (dense / routed / top1-ablation / route-math / control).
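As a minimal sketch of how these summary metrics relate to a per-epoch curve: perplexity is conventionally the exponential of the mean per-token cross-entropy in nats, and best_val_ppl is the minimum of that curve. The function names, the loss values, and the reading of final_ppl as the last-epoch value are assumptions for illustration, not taken from the dashboard's code.

```python
import math

def perplexity(mean_nll_nats):
    """Perplexity from the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll_nats)

def summarize_run(ppl_per_epoch):
    """Return best_val_ppl, the 1-based epoch it occurred at, and final_ppl."""
    best_idx = min(range(len(ppl_per_epoch)), key=lambda i: ppl_per_epoch[i])
    return {
        "best_val_ppl": ppl_per_epoch[best_idx],
        "best_epoch": best_idx + 1,
        "final_ppl": ppl_per_epoch[-1],  # assumed: value at the last epoch
    }

# Made-up per-epoch test losses (nats/token), for illustration only.
losses = [6.2, 5.8, 5.6, 5.7]
curve = [perplexity(x) for x in losses]
summary = summarize_run(curve)
print(summary)
```

Because best_val_ppl is selected on the same split it is reported on, it is slightly optimistic; that is acceptable for ranking runs against each other but is another reason not to compare it with published figures.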

1 Run Table

Click any column header to sort. By default the table is sorted by best_val_ppl ascending (lowest PPL first). active_frac: fraction of blocks active (1.0 for dense; below 1.0 for routed runs). status: done = all epochs complete; partial = still running or stopped early.

family	router	k-of-N	route-math	d_model	n_train	epochs	best_val_ppl	final_ppl	active_frac	params	status	#anom	wall_s	seed	tag

2 Perplexity vs Epoch

Primary metric: test_ppl per epoch. Lower = better. The dense baseline keeps one fixed colour across every chart on this page (the same colour as its dense pill in the table above). Toggle log-y to see fine differences when values span a wide range. If a run overshoots (PPL rises after its best epoch), that is expected overfitting on the small 8 k-sequence pilot corpus.
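One simple way to quantify the overshoot described above is the ratio of the last-epoch PPL to the best PPL. The sketch below is an illustration only: `overshoot_ratio` is a hypothetical helper and the curve values are made up.

```python
def overshoot_ratio(ppl_per_epoch):
    """Final PPL divided by best PPL; 1.0 means the last epoch was the best."""
    best = min(ppl_per_epoch)
    return ppl_per_epoch[-1] / best

# Made-up curve: best at epoch 3, then a rise typical of overfitting.
curve = [400.0, 320.0, 300.0, 310.0, 330.0]
ratio = overshoot_ratio(curve)
print(f"final PPL is {ratio - 1:.0%} above the best epoch")
```

A ratio well above 1.0 suggests that early stopping or fewer epochs would suit that configuration on this small corpus.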

PPL vs Epoch (lower = better)

One line per run. Hover for exact values. Click a legend entry to toggle an individual run.

log-y compresses large ranges; useful when comparing dense (~250) vs diverging (~1 000+) runs on the same axes.

3 Secondary Curves

Supporting views: token-prediction accuracy and routing activity over training. test_acc is next-token accuracy, which is low for language modelling: with a vocabulary of ~50 k tokens, uniform guessing scores ≈ 0.002%. active_frac is the fraction of SSM blocks active per token; dense runs are always 1.0, while routed runs self-organize below 1.0.
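The chance level quoted above follows directly from the vocabulary size. The sketch below reproduces it, taking the vocabulary as exactly 50 000 tokens (the page only says "~50 k"), and also shows the matching perplexity of a uniform guesser, which equals the vocabulary size.

```python
import math

VOCAB_SIZE = 50_000  # assumed round figure; the page says "~50 k"

# A uniform guesser picks the right token with probability 1/V.
chance_acc = 1 / VOCAB_SIZE
print(f"chance accuracy: {chance_acc:.3%}")  # 0.002%

# Its per-token loss is log(V) nats, so its perplexity is V itself.
uniform_ppl = math.exp(math.log(VOCAB_SIZE))
print(f"uniform-guess perplexity: {uniform_ppl:,.0f}")
```

Any run whose PPL approaches the vocabulary size is doing no better than guessing, which gives a useful ceiling when reading diverging curves.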

Token-prediction Accuracy vs Epoch

Higher = better. Note: LM accuracy is low by nature (50 k vocab).

Active Fraction vs Epoch

Dense runs are always 1.0. Routed runs converge to their k/N ratio.
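To illustrate why a k-of-N router settles at an active fraction of k/N, here is a minimal sketch assuming hard top-k selection over per-block router scores. The dashboard does not state how its routers choose blocks, so `topk_mask`, `active_frac`, and the scores below are hypothetical.

```python
def topk_mask(scores, k):
    """Boolean mask marking the k highest-scoring blocks as active."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]

def active_frac(masks):
    """Fraction of (token, block) pairs that are active."""
    total = sum(len(m) for m in masks)
    active = sum(sum(m) for m in masks)
    return active / total

# Two tokens, N=8 blocks, k=2; router scores are made up.
scores_per_token = [
    [0.1, 0.9, 0.3, 0.2, 0.8, 0.0, 0.4, 0.5],
    [0.7, 0.1, 0.6, 0.2, 0.3, 0.9, 0.0, 0.4],
]
masks = [topk_mask(s, k=2) for s in scores_per_token]
print(active_frac(masks))  # 0.25, i.e. k/N = 2/8
```

With hard top-k the fraction is exactly k/N for every token; a router that thresholds or samples instead would only approach that ratio on average.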

4 Best PPL by Config

One bar per run, sorted by best_val_ppl ascending (best first). Colour = family. The dense baseline is the primary comparison target.

Best Validation PPL per Run (lower = better)

Bars shorter than the dense bar indicate better perplexity than dense at equal or less compute.
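The comparison against the dense bar can be expressed as a relative PPL difference. The sketch below uses made-up run records; the field names mirror the table columns, but the values and the `rel_to_dense` helper are hypothetical.

```python
def rel_to_dense(best_ppl, dense_ppl):
    """Relative PPL difference versus dense; negative means better than dense."""
    return (best_ppl - dense_ppl) / dense_ppl

# Made-up runs, for illustration only.
runs = [
    {"tag": "dense", "best_val_ppl": 250.0, "active_frac": 1.0},
    {"tag": "routed-a", "best_val_ppl": 240.0, "active_frac": 0.5},
    {"tag": "routed-b", "best_val_ppl": 275.0, "active_frac": 0.25},
]
dense_ppl = next(r["best_val_ppl"] for r in runs if r["tag"] == "dense")

# Same ordering as the bar chart: best (lowest) PPL first.
ranked = sorted(runs, key=lambda r: r["best_val_ppl"])
for r in ranked:
    delta = rel_to_dense(r["best_val_ppl"], dense_ppl)
    print(f'{r["tag"]:<10} {r["best_val_ppl"]:7.1f} {delta:+.1%} active={r["active_frac"]}')
```

Reading the delta alongside active_frac shows whether a routed run beats dense outright or trades some perplexity for fewer active blocks.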
