· moe-gauge

paper · v4 · 2026-04-24

Phase collapse is the pathology. Learned rotations are the cure.

Phase collapse — the drift of K-cache coordinates away from the sign lattice — is an emergent diagnostic for 1-bit K-cache quantization viability in Mixture-of-Experts transformers. (V-cache quantization is deferred to future work.) A single learned orthogonal rotation lifts the moment-ratio cosine β from the SRHT floor $\sqrt{2/\pi} \approx 0.80$ to 0.92–0.97, turning catastrophic layers into viable ones.

Headline numbers:
· β_SRHT floor: 0.798 = √(2/π), the data-free ceiling for any data-independent rotation
· β_learned + centered: 0.919–0.965 across four architectures
· Phase-collapse angle spectrum: 45–53° (Gemma-4 e4b → 26B MoE)
· PPL gap SRHT → learned: 36.6× (FP16 2.2 → SRHT 942 → learned 25.7)

§1 Universal β lift

One learned rotation per matrix. Four architectures. One pattern.

The moment-ratio cosine $\beta = \mathbb{E}|K_1| / \sqrt{\mathbb{E}[K_1^2]}$ is the single number that governs 1-bit sign-quantization quality as head dimension grows. By Cauchy–Schwarz, $\beta \leq 1$; the asymptotic residual energy is $1-\beta^2$. A data-independent SRHT rotation is pinned to the Gaussian floor $\sqrt{2/\pi} \approx 0.798$. A learned $L^1$-max rotation with DC-mean centering clears 0.92 on every model we measured.
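Both quantities can be sanity-checked numerically. A minimal sketch (helper names and toy vectors are ours, not the paper's code): Gaussian coordinates sit at the data-free floor, and with the optimal per-vector scale $\alpha = \mathbb{E}|K|$ the relative residual energy of sign quantization is exactly $1-\beta^2$.

```python
# Monte-Carlo check of beta = E|K| / sqrt(E[K^2]) and the 1 - beta^2 residual.
# (Illustrative sketch; names are ours, not the paper's codebase.)
import numpy as np

rng = np.random.default_rng(0)

def beta(k: np.ndarray) -> float:
    """Moment-ratio cosine of a coordinate vector k."""
    return np.mean(np.abs(k)) / np.sqrt(np.mean(k ** 2))

# Gaussian coordinates sit at the floor sqrt(2/pi) ~ 0.798: a data-independent
# rotation such as SRHT leaves coordinates Gaussian, so it cannot do better.
g = rng.standard_normal(1_000_000)
print(beta(g))  # ~ 0.798

# Sign-quantize with the optimal scale alpha = mean|k|; the relative residual
# energy is exactly 1 - beta^2 (expand the square and substitute alpha).
k = rng.standard_normal(4096)
alpha = np.mean(np.abs(k))
residual = np.mean((k - alpha * np.sign(k)) ** 2) / np.mean(k ** 2)
print(np.isclose(residual, 1 - beta(k) ** 2))  # True
```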

model · β_raw · β_cen · β_SRHT · β_learned · Δβ · PPL (fp16) · PPL (1-bit, learned)
loading…

β values from _data/site_data.json. Δβ is color-intensity-coded: hotter = larger lift. PPL entries marked "—" are pending (see ppl_ablation.json). The 36.6× PPL gap on Gemma-4 26B MoE is the compounding-across-layers penalty SRHT pays and learned-R avoids.

§2 Convergence dashboard

β gets to 0.85 for free. The last mile is quadratically hard.

Per-model β vs training step, with thresholds at β = 0.85 / 0.90 / 0.92. All models reach 0.85 by step 50; OLMoE (d=2048) clears 0.92 by step 200, Gemma-4 e4b (d=512) takes 500, and Gemma-4 e2b (d=256) stalls just below 0.92. Theorems beta_last_mile_quadratic_hardness and beta_hessian_nonneg formalize the quadratic cost near the sign lattice.

loading convergence curves…

Dashed cyan: SRHT floor. Vermillion dots: step where β first exceeds 0.85, 0.90, 0.92.
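The quadratic last mile has a closed-form toy (our sketch, not the paper's theorem statement). Drag the 2-D sign lattice off-lattice by angle θ and the moment-ratio cosine is literally a cosine, β = cos θ, so the deficit 1 − β grows like θ²/2 near the lattice: halving the remaining deficit buys only a √2 reduction in angle.

```python
# Toy check: rotating the 2-D sign lattice by theta gives beta = cos(theta),
# so 1 - beta ~ theta^2 / 2 near the lattice (the quadratic last mile).
# (Our illustration; not the paper's formal statement.)
import numpy as np

def beta(Y: np.ndarray) -> float:
    return np.mean(np.abs(Y)) / np.sqrt(np.mean(Y ** 2))

def rot(theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# All four corners of the 2-D sign lattice {-1, +1}^2.
lattice = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])

for deg in (1, 5, 15, 45):
    t = np.deg2rad(deg)
    b = beta(lattice @ rot(t).T)
    assert np.isclose(b, np.cos(t))  # beta = cos(theta) for theta in [0, 45 deg]
```

This is also why reading β as the cosine of a phase-collapse angle is natural: the 45–53° angles in the hero numbers correspond to β values near the Gaussian floor.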

§3 FFN extension · v4

Dense MLPs yield. MoE expert banks push back.

Extending learned-R from K-cache to feed-forward weights works cleanly for dense MLP matrices (β_eff 0.92–0.94 uniformly across 30 layers, gate/up/down). It hits a structural wall on MoE expert matrices — β plateaus at ≈ 0.83. We call this Stiefel frustration: 128 experts occupy mutually orthogonal-ish bases that no single rotation can simultaneously align. The phenomenon is as much an architectural discovery about MoE latent geometry as a quantization recipe.

MLP victory

β_eff for gate_proj, up_proj, down_proj across 30 layers. The figure is flat-in-depth: no layer is meaningfully harder than any other.

The Stiefel frustration

For MoE expert banks, β saturates near 0.83 regardless of rotation training budget. Geometrically, the 128 experts occupy mutually near-orthogonal subspaces on the Stiefel manifold; one rotation cannot align them all. Full explainer: explore page.
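The frustration mechanism fits in a two-expert, 2-D toy (our construction, not the paper's 128-expert setting): two sign lattices frozen 45° apart cap the shared-rotation β at cos(π/8) ≈ 0.924, even though each expert alone reaches β = 1 under its own rotation.

```python
# Two "experts" whose sign lattices sit 45 degrees apart: no single rotation
# serves both, and the pooled beta caps at cos(pi/8) ~ 0.924.
# (Toy illustration of the frustration effect; ours, not the paper's setup.)
import numpy as np

def beta(Y: np.ndarray) -> float:
    return np.mean(np.abs(Y)) / np.sqrt(np.mean(Y ** 2))

def rot(theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Expert A: the sign lattice in the standard basis (beta = 1 with R = I).
A = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
# Expert B: the same lattice, frozen in a basis 45 degrees away.
B = A @ rot(np.pi / 4).T
assert np.isclose(beta(A), 1.0)

# Sweep every shared rotation angle; the pooled optimum splits the difference.
pooled = np.vstack([A, B])
thetas = np.linspace(0, np.pi / 2, 181)
best = max(beta(pooled @ rot(t).T) for t in thetas)
print(best)  # ~ cos(pi/8) ~ 0.924, strictly below 1
```

The optimum lands at 22.5°, halfway between the two expert bases: the shared rotation is frustrated by construction, mirroring how the real expert banks pin β near 0.83.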

§4 Interactive probes

Three tweening demos. Same scaffold as the V-cache widgets.

These feel like physics demos, not animations: drag, read the β readout, think. Below 500 px the SVG body falls back to a static snapshot.

§5 Lean status

6 theorems proved. 9 in flight.

Load-bearing theorems on the β framework, centering lift, and residual correction — proved in Lean 4 + Mathlib. Full theorem-by-theorem status lives on the Lean page and the theorems appendix; the totals below are pulled live from _data/lean_status.json.

loading lean counts…

"Deferred" proofs are blocked on a pending Mathlib lemma or a principled analytic sorry. "Stub" entries have a complete statement but an unfinished body. See lean.html for the cross-link matrix between theorems and paper sections.
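For flavor, here is a stub-style sketch in the sense above (complete statement, unfinished body) of the finite-sample β ≤ 1 bound. The theorem name and statement shape are ours for illustration, not the repo's:

```lean
import Mathlib

/-- Illustrative stub (name and statement shape are ours, not the repo's):
    the finite-sample form of β ≤ 1. With β = (Σᵢ |vᵢ| / n) / √(Σᵢ vᵢ² / n),
    squaring and clearing denominators gives the bound below, which is
    Cauchy–Schwarz against the all-ones vector. -/
theorem beta_le_one_finite {n : ℕ} (v : Fin n → ℝ) :
    (∑ i, |v i|) ^ 2 ≤ (n : ℝ) * ∑ i, v i ^ 2 := by
  -- Discrete Cauchy–Schwarz from Mathlib closes this; left as a stub here.
  sorry
```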

§6 Paper artifacts

Paper · code · rotations.

BibTeX

@misc{basu2026phasecollapse,
  title  = {Phase-Collapse Defragmentation: A Moment-Ratio Framework for
            1-Bit KV-Cache Quantization in Mixture-of-Experts Transformers},
  author = {Basu, Debanjan},
  year   = {2026},
  url    = {https://github.com/d3banjan/moe-gauge-paper}
}