Phase-Collapse Defragmentation

A learned orthogonal rotation with DC-mean centering drives 1-bit K-cache cosine alignment (β) to 0.919–0.965 across four architectures, cutting the perplexity degradation of SRHT by a factor of 36.6 on Gemma-4 26B MoE. Centering lifts β by +0.098–0.139 by freeing the rotation to use the full O(d) group.

| Value | Metric | Detail |
|---|---|---|
| 0.919–0.965 | learned + centered β | four architectures |
| 0.7979 | SRHT floor ≈ √(2/π) | all models, all d |
| +0.098–0.139 | DC-mean centering lift | Δ(β_cen − β_raw) |
| 36.6× | PPL gap, SRHT vs. learned | Gemma-4 26B MoE |

S1 — 1-bit KV for MoE inference

At long context, the KV cache dominates the memory footprint of autoregressive inference in Mixture-of-Experts models. Pushing K-cache quantization to a single sign bit per coordinate requires that rotated keys lie close to the sign lattice $\{\pm\alpha\}^d$. The degree of closeness is captured by the cosine alignment $\beta$: a value of 0.7979 (the SRHT floor) leaves a residual energy of $1-\beta^2 \approx 36\%$ per layer, compounding across layers. A value above 0.919, achievable with a learned rotation, dramatically reduces this loss.

A learned $L^1$-max rotation with DC-mean centering achieves β = 0.919–0.965 across the four evaluated architectures (per-model values in cross-check/toy-narrow-d/real_beta_table.json). Theorem attention_score_gauge_invariance (lean.md) proves the rotation absorbs into $W_K$ at deployment: it is a free gauge choice.


S2 — A one-number fingerprint

Let $K = (K_1, \ldots, K_d)$ be a rotated key with identically distributed, symmetric coordinates. The moment ratio

$$ \beta \;=\; \frac{\mu_1}{\sqrt{\mu_2}} \;=\; \frac{\mathbb{E}|K_1|}{\sqrt{\mathbb{E}[K_1^2]}} \;\in\; (0, 1] $$

is the single number that determines 1-bit sign-quantization quality as head dimension grows. By Cauchy–Schwarz, $\beta \leq 1$, with equality only on the sign lattice. The asymptotic residual energy is $1 - \beta^2$.
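
These moment ratios are easy to check numerically. The sketch below (plain Python; sample sizes and seeds are arbitrary choices for illustration) estimates $\beta$ by Monte Carlo for Laplace, Gaussian, and uniform coordinates, and verifies the per-vector bridge $\cos\theta(k, \mathrm{sign}(k)) = \lVert k\rVert_1/(\sqrt{d}\,\lVert k\rVert_2)$ on a random vector.

```python
import math
import random

random.seed(0)

def beta(samples):
    """Moment ratio beta = E|K| / sqrt(E[K^2])."""
    m1 = sum(abs(x) for x in samples) / len(samples)
    m2 = sum(x * x for x in samples) / len(samples)
    return m1 / math.sqrt(m2)

N = 200_000
laplace  = [random.choice((-1, 1)) * random.expovariate(1.0) for _ in range(N)]
gaussian = [random.gauss(0.0, 1.0) for _ in range(N)]
uniform  = [random.uniform(-1.0, 1.0) for _ in range(N)]

print(round(beta(laplace), 3))   # ~0.707 = 1/sqrt(2)
print(round(beta(gaussian), 3))  # ~0.798 = sqrt(2/pi), the SRHT floor
print(round(beta(uniform), 3))   # ~0.866 = sqrt(3)/2

# Per-vector bridge (cf. cos_sign_quant_formula): the cosine between k and its
# sign pattern equals ||k||_1 / (sqrt(d) * ||k||_2), since <k, sign(k)> = ||k||_1.
d = 64
k = [random.gauss(0.0, 1.0) for _ in range(d)]
l1 = sum(abs(x) for x in k)
l2 = math.sqrt(sum(x * x for x in k))
cosine = sum(x * math.copysign(1.0, x) for x in k) / (l2 * math.sqrt(d))
assert abs(cosine - l1 / (math.sqrt(d) * l2)) < 1e-12
```

Heavy tails (Laplace) sit below the Gaussian floor, bounded support (uniform) above it.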

The per-vector algebraic bridge is proved as cos_sign_quant_formula (JensenFloor.lean). The Gaussian first-absolute-moment ratio gives the theoretical SRHT floor: $\sqrt{2/\pi} \approx 0.7979$, proved as gaussian_moment_ratio_floor_eq.

(Interactive widget: a slider morphs the coordinate distribution through Laplace, Gaussian, measured-MoE, and uniform regimes as the target β varies.)


S3 — The SRHT floor ≈ 0.7979

Any data-independent rotation acts as a mixing operator on key coordinates. By the Central Limit Theorem, rotated marginals converge toward Gaussian, dragging $\beta$ toward the floor $\sqrt{2/\pi}$. All four evaluated models confirm this floor with tight precision. The floor is dimension-independent.
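
The CLT pull toward the floor can be reproduced with a toy stand-in for SRHT: random sign flips followed by a normalized fast Walsh–Hadamard transform. The sketch below (plain Python; the dimension and sample counts are illustrative, not the paper's setup) starts from heavy-tailed Laplace keys at $\beta \approx 0.707$ and lands near $\sqrt{2/\pi}$ after mixing.

```python
import math
import random

random.seed(1)

def fwht(x):
    """In-place normalized fast Walsh-Hadamard transform (len must be a power of 2)."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    inv = 1.0 / math.sqrt(n)
    for i in range(n):
        x[i] *= inv

def beta(samples):
    m1 = sum(abs(v) for v in samples) / len(samples)
    m2 = sum(v * v for v in samples) / len(samples)
    return m1 / math.sqrt(m2)

d, n_vecs = 256, 2000
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]  # the random D in (1/sqrt(d)) H D x

raw, rotated = [], []
for _ in range(n_vecs):
    k = [random.choice((-1, 1)) * random.expovariate(1.0) for _ in range(d)]  # Laplace keys
    raw.extend(k)
    r = [signs[i] * k[i] for i in range(d)]
    fwht(r)
    rotated.extend(r)

print(round(beta(raw), 3))      # ~0.707, heavy-tailed input
print(round(beta(rotated), 3))  # ~0.798 = sqrt(2/pi), the floor
```

Because the transform is orthogonal and data-independent, essentially any starting distribution ends near the same value, which is the universality srht_achieves_isotropic_floor formalizes.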

The deviation from the theoretical floor is at most 0.003 across all models and all d. The Lean theorem srht_achieves_isotropic_floor formalizes universality.

Reading the β scale. β = 0.7979 is the free baseline: any model, any d, no training. β = 0.90 is where learned rotations land without centering. β ≥ 0.919 is what the full pipeline reaches at convergence, and the remaining gap to 1.0 costs quadratically more training per unit β — residual correction closes it cheaper than more steps.


S4 — Learned $L^1$-max rotation

A learned $L^1$-max rotation maximizes $\mathbb{E}[\lVert Rk \rVert_1 / (\sqrt{d}\lVert Rk \rVert_2)]$ over a calibration set by gradient ascent. Without centering, this optimizer barely clears the SRHT floor. With DC-mean centering, it lifts $\beta$ by +0.098–0.139 across all evaluated architectures.
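
A minimal version of the objective can be run in two dimensions, where $O(2)$ is a single angle. The sketch below (plain Python, finite-difference gradient ascent; the synthetic calibration distribution, step size, and step count are invented for illustration) maximizes $\mathbb{E}[\lVert Rk \rVert_1 / (\sqrt{2}\lVert Rk \rVert_2)]$ over the rotation angle and recovers near-lattice keys hidden behind a fixed rotation.

```python
import math
import random

random.seed(2)

def rot(theta, k):
    c, s = math.cos(theta), math.sin(theta)
    return (c * k[0] - s * k[1], s * k[0] + c * k[1])

def beta_obj(theta, keys):
    """E[ ||R k||_1 / (sqrt(2) ||R k||_2) ] over the calibration set."""
    tot = 0.0
    for k in keys:
        a, b = rot(theta, k)
        tot += (abs(a) + abs(b)) / (math.sqrt(2) * math.hypot(a, b))
    return tot / len(keys)

# Synthetic calibration keys: near-sign-lattice vectors behind a hidden rotation phi.
phi = 0.6
keys = []
for _ in range(4000):
    v = (random.choice((-1, 1)) + random.gauss(0, 0.1),
         random.choice((-1, 1)) + random.gauss(0, 0.1))
    keys.append(rot(phi, v))

theta, lr, eps = 0.0, 0.5, 1e-4
b0 = beta_obj(theta, keys)
for _ in range(200):  # gradient ascent on the single O(2) parameter
    g = (beta_obj(theta + eps, keys) - beta_obj(theta - eps, keys)) / (2 * eps)
    theta += lr * g

b1 = beta_obj(theta, keys)
print(round(b0, 3), round(b1, 3))  # beta rises from the mixed baseline toward ~1
```

In full dimension the paper's optimizer searches over $O(d)$; the single-angle case only makes the objective and its ascent visible.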

The centering lift is the primary source of perplexity improvement. Without centering, learned $\beta$ falls only modestly above the SRHT floor. The per-model breakdown is in the S1 source table (cross-check/toy-narrow-d/real_beta_table.json).


S5 — Why centering unlocks the ceiling

A key vector decomposes as $k = \mu \cdot e + v$ where $e$ is the normalised all-ones direction and $v$ is the centered residual ($\sum_i v_i = 0$). For any orthogonal $R$:

$$ \lVert Rk \rVert_2^2 = \mu^2 + \lVert Rv \rVert_2^2 $$

because $Re$ and $Rv$ remain orthogonal when $v$ is centered. Without centering, the rotation must also manage the mean direction, effectively restricting it to the subgroup fixing $e$ (isomorphic to $O(d-1)$). Centering removes this constraint: the optimizer is free to use the full $O(d)$ group. Formally, $\sup_R \beta(Rv) \geq \sup_R \beta(R(\mu e + v))$, which is dc_centering_lifts_rotational_ceiling in DCMeanCentering.lean. The per-model deltas of +0.098–0.139 are the empirical manifestation.
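
The mechanism can be verified numerically. The sketch below (plain Python; the dimension $d = 128$ and DC offset $\mu = 0.7$ are invented for illustration) takes a vector sitting exactly on the sign lattice, adds a DC component, and shows that the per-vector alignment drops for the raw vector but returns to 1 after subtracting the coordinate mean, consistent with $\lVert Rk \rVert_2^2 = \mu^2 + \lVert Rv \rVert_2^2$.

```python
import math
import random

random.seed(3)

def beta_vec(k):
    """Per-vector alignment: ||k||_1 / (sqrt(d) ||k||_2)."""
    d = len(k)
    l1 = sum(abs(x) for x in k)
    l2 = math.sqrt(sum(x * x for x in k))
    return l1 / (math.sqrt(d) * l2)

d, mu = 128, 0.7
v = [1.0] * (d // 2) + [-1.0] * (d // 2)  # exactly on the sign lattice, sum(v) = 0
random.shuffle(v)
k = [x + mu for x in v]                   # add a DC component along the all-ones direction

# Pythagoras across the all-ones direction (unnormalized form of the identity):
# ||k||^2 = d*mu^2 + ||v||^2, since <1, v> = 0.
assert abs(sum(x * x for x in k) - (d * mu * mu + sum(x * x for x in v))) < 1e-6

mean = sum(k) / d
centered = [x - mean for x in k]

print(round(beta_vec(v), 4))         # 1.0: on the lattice
print(round(beta_vec(k), 4))         # < 1: the DC component wastes alignment
print(round(beta_vec(centered), 4))  # back to ~1.0 after DC-mean centering
```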


S6 — Convergence tempo scales with d

All models reach $\beta \geq 0.85$ by step 50, regardless of head dimension. The difference emerges in the last mile: reaching $\beta = 0.92$ requires 200 steps at d=2048 (OLMoE) but 500 steps at d=512 (Gemma-4-e4b). Gemma-4-e2b (d=256) stalls at $\beta \approx$ 0.9192. Theorems beta_last_mile_quadratic_hardness and beta_hessian_nonneg formalize the quadratic cost near the sign lattice.
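
Where the quadratic cost comes from can be seen with a two-line expansion (a sketch consistent with the theorem names above, not a reproduction of the Lean proofs). Write a near-converged key as $k_i = \alpha s_i (1 + \varepsilon_i)$ with $s_i \in \{\pm 1\}$ and small mean-zero residuals $\varepsilon_i$. Then

$$ \beta(k) \;=\; \frac{\lVert k \rVert_1}{\sqrt{d}\,\lVert k \rVert_2} \;=\; \frac{\tfrac{1}{d}\sum_i (1+\varepsilon_i)}{\sqrt{\tfrac{1}{d}\sum_i (1+\varepsilon_i)^2}} \;\approx\; 1 - \tfrac{1}{2}\operatorname{Var}(\varepsilon), $$

so $1 - \beta$ is quadratic in the residual: the gradient vanishes linearly as the lattice is approached, and each additional unit of $\beta$ costs correspondingly more optimization.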

Source: cross-check/toy-narrow-d/hessian_scan_results.json.


S7 — Memory budget at deployment

1-bit K with FP16 V and DC-mean metadata reduces the KV footprint by approximately 1.87× at sequence length 8192. K memory shrinks by 16×; V stays FP16; DC-mean metadata is negligible at long context.

The 1.87× figure is near-constant across architectures because the 1-bit K saves exactly a 16× factor on half the KV storage. Full technical details on the explore page.
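
The arithmetic behind the figure can be sketched directly (plain Python; the head dimension $d = 128$ and an FP16 per-vector scale are assumptions for illustration, with DC-mean metadata amortizing to roughly zero at long context, as stated above):

```python
def kv_bits_per_coord(k_bits, v_bits, meta_bits_per_vec=0, d=128):
    """Bits per (K, V) coordinate pair, amortizing per-vector metadata over d."""
    return k_bits + meta_bits_per_vec / d + v_bits

fp16 = kv_bits_per_coord(16, 16)                       # baseline: FP16 K and FP16 V
ours = kv_bits_per_coord(1, 16, meta_bits_per_vec=16)  # 1-bit K + FP16 scale, FP16 V
print(round(fp16 / ours, 2))  # ~1.87x under these assumptions
```

The ratio is insensitive to architecture because the FP16 V half of the budget dominates; driving it higher means touching V, which is the open question in S8.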


S8 — What about Q and V?

Q. Q is recomputed at each decode step and is not cached. Signing both Q and K collapses attention logit fidelity; compressing Q saves compute but not memory. Dedicated per-head Q-rotation training would be required.

V. V is cached and has the same shape as K, so 1-bit V would give a combined ~3.5× KV compression. The obstacle is that V distortion propagates directly into the hidden state, bypassing the attention-sparsity argument that makes 1-bit K viable. Papers supporting asymmetric treatment (KIVI, AsymKV) report V needs at least 2 bits for coherent generation.

[PENDING: V-quantization ablation. Whether 1-bit V is viable after a learned $R_V$ is an open question requiring a dedicated ablation.]


S9 — Further reading

For full geometry, derivations, and interactive widgets: explore page. For the Lean proof inventory, theorem signatures, and the cross-link matrix: lean page.


Cite this work

@misc{basu2026phasecollapse,
  title  = {Phase-Collapse Defragmentation: A Moment-Ratio Framework for 1-bit {KV}-Cache Quantization in Mixture-of-Experts Transformers},
  author = {Basu, Debanjan},
  year   = {2026},
  url    = {https://github.com/d3banjan/moe-gauge-paper}
}

References. Zandieh et al. (2026) TurboQuant; Chee et al. (2023) QuIP; Tseng et al. (2024) QuIP#; Liu et al. (2024) KIVI; Hooper et al. (2024) KVQuant; Ashkboos et al. (2024) QuaRot; Liu et al. (2024) SpinQuant; Jiang et al. (2024) Mixtral; Dai et al. (2024) DeepSeek-MoE; Lean 4 + Mathlib.