Lower PPL = better. The headline metric is best_val_ppl: the lowest test-set perplexity (test_ppl) reached at any epoch. Each row is one training run; colours group runs by family (dense / routed / top1-ablation / control).
Click any column header to sort. By default, rows are sorted by best_val_ppl ascending (lowest PPL = best). active_frac is the fraction of blocks active (1.0 for dense; routed runs use <1.0). status: done = all epochs complete; partial = still running or stopped early.
| family | router | k-of-N | d_model | n_train | epochs | best_val_ppl | final_ppl | active_frac | params | status | #anom | wall_s | seed | tag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
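As a rough illustration of how the two perplexity columns relate, the sketch below derives best_val_ppl and final_ppl from a per-epoch perplexity series and applies the default ascending sort. The `summarize_run` helper, the run tags, and all numbers are hypothetical; they are not taken from the dashboard code or from any real run.

```python
from typing import Dict, List


def summarize_run(ppl_per_epoch: List[float]) -> Dict[str, float]:
    """Collapse a per-epoch perplexity series into the two table columns."""
    if not ppl_per_epoch:
        raise ValueError("run has no completed epochs")
    return {
        "best_val_ppl": min(ppl_per_epoch),  # lowest PPL seen at any epoch
        "final_ppl": ppl_per_epoch[-1],      # PPL at the last recorded epoch
    }


# Hypothetical runs with made-up numbers, for illustration only.
runs = {
    "dense-a": [120.0, 95.0, 90.0, 93.0],
    "routed-a": [130.0, 99.0, 88.0, 87.5],
}

rows = [{"tag": tag, **summarize_run(series)} for tag, series in runs.items()]
rows.sort(key=lambda r: r["best_val_ppl"])  # ascending: best run first
```

In this toy data the dense run has final_ppl above best_val_ppl (it peaked before the last epoch), while the routed run is still improving at its final epoch, so the two columns coincide.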
Primary metric: test_ppl per epoch. Lower = better. Dense baseline shown in blue for reference. Toggle log-y to see fine differences when values span a wide range. If a run overshoots (PPL rises after the best epoch), that is expected overfitting on the small 8k-sequence pilot corpus.
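The overshoot described above can be checked numerically. The sketch below assumes perplexity is computed in the usual way, as the exponential of the mean per-token negative log-likelihood in nats; whether this dashboard computes it exactly that way is an assumption. The `overshoot` helper and the loss values are hypothetical.

```python
import math
from typing import List, Tuple


def perplexity(mean_nll: float) -> float:
    """Standard definition: exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll)


def overshoot(ppl_per_epoch: List[float]) -> Tuple[int, float]:
    """Return (index of the best epoch, final PPL minus best PPL)."""
    best_epoch = min(range(len(ppl_per_epoch)), key=ppl_per_epoch.__getitem__)
    return best_epoch, ppl_per_epoch[-1] - ppl_per_epoch[best_epoch]


# Made-up per-epoch mean losses: improves, then degrades slightly.
losses = [4.8, 4.5, 4.6]
ppl = [perplexity(x) for x in losses]
best_epoch, gap = overshoot(ppl)  # gap > 0 means PPL rose after the best epoch
```

A gap of zero means the last epoch was also the best one; a positive gap is the overfitting pattern the chart makes visible.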
Supporting views: token-prediction accuracy and routing activity over training. test_acc is next-token accuracy, which is low in absolute terms for language modelling: with a vocabulary of ~50k tokens, chance is ≈ 0.002%. active_frac is the fraction of SSM blocks active per token; dense runs are always 1.0, while routed runs self-organize below 1.0.
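Both quantities above reduce to simple arithmetic. The sketch below reproduces the chance-accuracy figure from the approximate vocabulary size, and computes active_frac from a boolean routing mask. The mask layout (`mask[t][b]` is true when block `b` ran for token `t`) and the `active_frac` helper are assumptions made for illustration, not the dashboard's actual data format.

```python
from typing import List

VOCAB_SIZE = 50_000  # approximate figure from the text, not an exact vocab size

chance_acc = 1.0 / VOCAB_SIZE    # uniform guessing over the vocabulary
chance_pct = 100.0 * chance_acc  # about 0.002 percent


def active_frac(mask: List[List[bool]]) -> float:
    """Fraction of (token, block) pairs where the block was active."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("empty routing mask")
    return sum(sum(row) for row in mask) / total


# Toy masks: 3 tokens, 4 blocks each.
dense = [[True] * 4 for _ in range(3)]  # every block runs for every token
routed = [
    [True, False, True, False],
    [True, True, False, False],
    [False, True, False, False],
]
```

Here `active_frac(dense)` is exactly 1.0, matching the dense case in the text, and the toy routed mask activates 5 of 12 token-block pairs.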
One bar per run, sorted by best_val_ppl ascending (best first). Colour = family. The dense baseline is the primary comparison target.