● paper · v4 · 2026-04-24
Phase collapse is the pathology. Learned rotations are the cure.
Phase collapse — the drift of K-cache coordinates away from the sign lattice — is an emergent diagnostic for 1-bit K-cache quantization viability in Mixture-of-Experts transformers. (V-cache quantization is deferred to future work.) A single learned orthogonal rotation lifts the moment-ratio cosine β from the SRHT floor $\sqrt{2/\pi} \approx 0.80$ to 0.92–0.97, turning catastrophic layers into viable ones.
four architectures
data-free ceiling
e4b → 26B-MoE
FP16 2.2 → SRHT 942 → learned 25.7
§1 Universal β lift
One learned rotation per matrix. Four architectures. One pattern.
The moment-ratio cosine $\beta = \mathbb{E}|K_1| \,/\, \sqrt{\mathbb{E}[K_1^2]}$ — the cosine between a K-cache coordinate vector and its sign pattern — is the single number that governs 1-bit sign-quantization quality as head dimension grows. By Cauchy–Schwarz, $\beta \leq 1$; the asymptotic residual energy is $1-\beta^2$. A data-independent SRHT rotation is pinned to the Gaussian floor $\sqrt{2/\pi}$. A learned $L^1$-max rotation with DC-mean centering clears 0.92 on every model we measured.
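The Gaussian floor is easy to check numerically. A minimal sketch in plain NumPy, using the cosine form of β that matches the $\sqrt{2/\pi}$ floor:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(k):
    """Moment-ratio cosine: E|K| / sqrt(E[K^2])."""
    return np.mean(np.abs(k)) / np.sqrt(np.mean(k ** 2))

# SRHT output on generic activations is close to i.i.d. Gaussian, and a
# standard Gaussian has E|Z| = sqrt(2/pi), E[Z^2] = 1 -- so beta is pinned
# near sqrt(2/pi) ~= 0.798 no matter how much data you draw.
k_gauss = rng.standard_normal(1_000_000)
print(round(beta(k_gauss), 3))    # ~0.798, the data-free floor

# Coordinates on the sign lattice (all |K_i| equal) attain beta = 1.
k_lattice = rng.choice([-1.0, 1.0], size=1_000_000)
print(beta(k_lattice))            # 1.0
```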
| model | β_raw | β_cen | β_SRHT | β_learned | Δβ | PPL fp16 | PPL 1-bit learned |
|---|---|---|---|---|---|---|---|
β values from _data/site_data.json. Δβ is color-intensity-coded: hotter = larger lift. PPL entries marked "—" are pending (see ppl_ablation.json). The 36.6× PPL gap on Gemma-4 26B MoE is the compounding-across-layers penalty SRHT pays and learned-R avoids.
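How such a rotation could be learned is sketched below on synthetic data. This is an illustrative reconstruction, not the paper's optimizer: plain gradient ascent on $\mathbb{E}|RK|$ with a polar retraction back to the orthogonal group; the data, step size, and iteration count are all toy choices. Because the total energy $\mathbb{E}[(RK)^2]$ is rotation-invariant, maximizing the $L^1$ objective maximizes β directly.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 16, 4096

# Toy K-cache: near-binary hidden coordinates mixed by an unknown
# orthogonal Q, so each raw coordinate looks roughly Gaussian.
Z = rng.choice([-1.0, 1.0], size=(d, n))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
K = Q @ Z
K = K - K.mean(axis=1, keepdims=True)   # DC-mean centering

def beta(x):
    """Moment-ratio cosine: E|x| / sqrt(E[x^2])."""
    return np.mean(np.abs(x)) / np.sqrt(np.mean(x ** 2))

def retract(R):
    """Nearest orthogonal matrix (polar retraction via SVD)."""
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt

# L1-max ascent: maximize mean|RK| over orthogonal R.
R = np.eye(d)
for _ in range(500):
    grad = np.sign(R @ K) @ K.T / n   # gradient of mean|RK|, up to a constant
    R = retract(R + 0.05 * grad)

print(f"beta raw:     {beta(K):.3f}")      # well below 1 for a generic mixing
print(f"beta learned: {beta(R @ K):.3f}")  # clearly above the raw value
```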
§2 Convergence dashboard
β gets to 0.85 for free. The last mile is quadratically hard.
Per-model β vs training step, with thresholds at β = 0.85 / 0.90 / 0.92. All models reach 0.85 by step 50; OLMoE (d=2048) clears 0.92 by step 200, Gemma-4 e4b (d=512) takes 500, and Gemma-4 e2b (d=256) stalls just under 0.92. Theorems beta_last_mile_quadratic_hardness and beta_hessian_nonneg formalize the quadratic cost near the sign lattice.
Dashed cyan: SRHT floor. Vermillion dots: step where β first exceeds 0.85, 0.90, 0.92.
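The quadratic hardness can be seen in a two-line expansion — an informal sketch in the cosine convention for β, not the Lean statement:

```latex
% Write |K_i| = s(1 + \varepsilon_i): small relative deviations
% \varepsilon_i from a common sign-lattice scale s (s cancels from \beta).
\beta
  = \frac{\mathbb{E}|K|}{\sqrt{\mathbb{E}[K^2]}}
  = \frac{1 + \bar{\varepsilon}}
         {\sqrt{1 + 2\bar{\varepsilon} + \mathbb{E}[\varepsilon^2]}}
  \approx 1 - \tfrac{1}{2}\operatorname{Var}(\varepsilon)
```

So the shortfall $1-\beta$ is quadratic in the coordinate spread around the lattice: halving the remaining gap requires shrinking the spread by $\sqrt{2}$, which is why the last mile costs so much more than the first.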
§3 FFN extension · v4
Dense MLPs yield. MoE expert banks push back.
Extending learned-R from K-cache to feed-forward weights works cleanly for dense MLP matrices (β_eff 0.92–0.94 uniformly across 30 layers, gate/up/down). It hits a structural wall on MoE expert matrices — β plateaus at ≈ 0.83. We call this Stiefel frustration: 128 experts occupy mutually orthogonal-ish bases that no single rotation can simultaneously align. The phenomenon is as much an architectural discovery about MoE latent geometry as a quantization recipe.
MLP victory
β_eff for gate_proj, up_proj, down_proj across 30 layers. The figure is flat-in-depth: no layer is meaningfully harder than any other.
The Stiefel frustration
For MoE expert banks, β saturates near 0.83 regardless of rotation training budget. Geometrically, the 128 experts occupy mutually near-orthogonal subspaces on the Stiefel manifold; one rotation cannot align them all. Full explainer: explore page.
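The frustration can be reproduced in a toy construction (illustrative only — the expert count, dimension, and bases here are made up, not the paper's measurements): give each expert near-binary weights in its own random orthogonal basis, and note that a shared rotation can align at most one basis at a time.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, E = 16, 2048, 4   # E toy experts, each with its own hidden basis

def beta(x):
    """Moment-ratio cosine: E|x| / sqrt(E[x^2])."""
    return np.mean(np.abs(x)) / np.sqrt(np.mean(x ** 2))

# Each expert is exactly on the sign lattice in a *different* basis.
bases = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(E)]
experts = [Q @ rng.choice([-1.0, 1.0], size=(d, n)) for Q in bases]

# A shared rotation aligned with expert 0's basis fixes expert 0 but
# leaves the others near the Gaussian floor: the per-expert betas cannot
# all be high at once, so the bank-average beta plateaus.
R = bases[0].T
betas = [beta(R @ W) for W in experts]
print([round(b, 3) for b in betas])        # first ~1.0, rest well below
print(round(float(np.mean(betas)), 3))     # averaged beta stays capped
```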
§4 Interactive probes
Three tweening demos. Same scaffold as the V-cache widgets.
These feel like physics demos, not animations: drag, read the β readout, think. Below 500 px the SVG body falls back to a static snapshot.
§5 Lean status
6 theorems proved. 9 in flight.
Load-bearing theorems on the β framework, centering lift, and residual correction — proved in Lean 4 + Mathlib. Full theorem-by-theorem status lives on the Lean page and the theorems appendix; the totals below are pulled live from _data/lean_status.json.
"Deferred" proofs are blocked on a pending Mathlib lemma or a principled analytic sorry. "Stub" entries have a complete statement but an unfinished body. See lean.html for the cross-link matrix between theorems and paper sections.
§6 Paper artifacts
Paper · code · rotations.
BibTeX
@misc{basu2026phasecollapse,
  title  = {Phase-Collapse Defragmentation: A Moment-Ratio Framework for
            1-Bit KV-Cache Quantization in Mixture-of-Experts Transformers},
  author = {Basu, Debanjan},
  year   = {2026},
  url    = {https://github.com/d3banjan/moe-gauge-paper}
}