Phase-Collapse Defragmentation

The K-cache of large MoE transformers is drifting out of the 1-bit quantization-safe zone, and MoE routing is where the pathology breaks catastrophically. (The V-cache is explicitly out of scope.) We prove a floor on the residual energy and show how to beat it.

At one bit per entry, attention depends on the angle between each key vector and its sign-quantised image. Across four Gemma-4 variants we measure a clean ordinal fragmentation spectrum: the per-layer phase-collapse angle $\bar{\theta}_\ell$ drifts toward the quantisation-unsafe regime as the architecture adds complexity. The 26B MoE variant is already past the failure boundary. Data-independent rotations (the Randomized Hadamard Transform used by TurboQuant’s PolarQuant stage) cannot save it — the Central Limit Theorem pins them at the Gaussian attractor $\beta = 2/\pi$. A learned $L^1$-max rotation escapes this attractor to $\beta = 0.702$ and the 1-bit perplexity drops from $942$ to $26$.

Each panel is one architecture. Points are coloured by the bit budget the horizontal-slice allocator assigned to that layer (blue = 1-bit pressure-release, green = 2-bit ballroom, orange = 3-bit fragmented). Dashed line is the 44° ballroom cutoff. MoE is the only variant with a majority of layers above the cutoff.

e4b dense: 45.47°, 13% fragmented
31B dense: 48.73°, 58% fragmented
e2b dense: 50.20°, 60% fragmented
26B MoE: 52.67°, 77% fragmented

Layer-0 phase-collapse angles. Fragmented means $\bar{\theta}_\ell > 44^\circ$ — the horizontal-slice allocator assigns 3 bits rather than 2. Only e4b has an in-spec headline angle; the MoE sits 7.2° above it.

36.6×: empirical PPL gap, SRHT vs learned rotation
18.4×: geometric prediction, β-compounding over 30 layers
81%: log-gap explained by the moment-ratio theorem
6.24×: KV-cache compression at PPL = 3.00 (Pareto sweet spot)

Why 1-bit, why MoE, why now

At long context, the KV cache dominates the memory footprint of autoregressive inference. Production quantizers currently live at 2–4 bits per entry: TurboQuant reports near-optimal distortion at 3.5 bits and marginal degradation at 2.5 bits. Pushing below 2 bits — to a single sign bit per coordinate plus a sparse fp16 exception list — is a $2\times$ compression on top of that, and it is the aggressive regime where the published rotation-based tricks (QuIP, QuIP#, QuaRot, SpinQuant) start to show their seams.

The seam that this paper widens is geometric. A 1-bit sign quantiser replaces each coordinate $k_i$ with $\operatorname{sign}(k_i) \cdot \alpha$, where $\alpha = \lVert k \rVert_1 / d$. The angle $\theta_k$ between $k$ and its reconstruction controls the residual energy:

$$ \frac{\lVert k - \hat{k} \rVert^2}{\lVert k \rVert^2} \;=\; \sin^2 \theta_k \;=\; 1 \;-\; \frac{\lVert k \rVert_1^2}{d \, \lVert k \rVert_2^2}. $$
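The identity is easy to verify numerically; a minimal sketch with a random Gaussian key (NumPy; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
k = rng.standard_normal(d)

# 1-bit sign quantiser: k_hat_i = sign(k_i) * alpha with alpha = ||k||_1 / d
alpha = np.abs(k).sum() / d
k_hat = np.sign(k) * alpha

# Left side: relative residual energy ||k - k_hat||^2 / ||k||^2
lhs = np.sum((k - k_hat) ** 2) / np.sum(k ** 2)
# Right side: 1 - ||k||_1^2 / (d * ||k||_2^2)
rhs = 1 - np.abs(k).sum() ** 2 / (d * np.sum(k ** 2))

assert np.isclose(lhs, rhs)  # the identity holds exactly, not just asymptotically
```

The equality is exact per vector because $\alpha = \lVert k \rVert_1 / d$ is the least-squares projection coefficient of $k$ onto its sign direction.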

The upper bound $\cos \theta_k = 1$ is attained only if $k$ lies exactly on the sign lattice $\{\pm \alpha\}^d$, a measure-zero event for any continuous distribution. So some energy is always lost. The quantitative amount is set by the distribution, and this is where MoE routing breaks the picture.


A one-number fingerprint of 1-bit quality

Let $K = (K_1, \ldots, K_d)$ be a rotated key with identically distributed, symmetric coordinates. Define the moment ratio

$$ \beta \;=\; \frac{\mu_1^2}{\mu_2} \;=\; \frac{(\mathbb{E}|K_1|)^2}{\mathbb{E}[K_1^2]} \;\in\; (0, 1]. $$

By Cauchy–Schwarz, $\beta \in (0, 1]$ with $\beta = 1$ only on the sign lattice. The central result of the paper: under marginal-identity (not independence) and the strong law of large numbers for $\lvert K_1 \rvert$ and $K_1^2$,

$$ \lim_{d \to \infty} \mathbb{E}\!\left[\sin^2 \theta_K\right] \;=\; 1 - \beta. $$

This is the asymptotic moment-ratio floor (iid_coord_moment_ratio_floor, MomentRatioFloor.lean:155). On Gemma-4 26B MoE we measure $\beta = 0.702$, predicting a floor of $29.80\%$; the directly-measured $\mathbb{E}[\sin^2 \bar{\theta}]$ is $29.84\%$, matching to 0.05 percentage points.
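As a sanity check on the Gaussian specialisation (a sketch of ours, not the paper's measurement pipeline): for standard-normal keys the theorem predicts $\beta = 2/\pi$ and a floor of $1 - 2/\pi \approx 0.3634$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 256, 4096
K = rng.standard_normal((n, d))  # n keys with iid Gaussian coordinates

# Moment ratio beta = (E|K_1|)^2 / E[K_1^2]; for N(0,1) this is 2/pi
beta_hat = np.mean(np.abs(K)) ** 2 / np.mean(K ** 2)

# Per-key residual energy sin^2(theta) = 1 - ||k||_1^2 / (d ||k||_2^2)
sin2 = 1 - np.abs(K).sum(axis=1) ** 2 / (d * np.sum(K ** 2, axis=1))

print(beta_hat)     # close to 2/pi ~ 0.6366
print(sin2.mean())  # close to the floor 1 - 2/pi ~ 0.3634
```

At $d = 256$ the mean residual sits within $O(1/d)$ of the asymptotic floor, consistent with the 0.05-percentage-point agreement quoted above.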

Below is a live reference table: move the slider to pick a target $\beta$ and watch the distribution morph through the Laplace (heavy-tailed), Gaussian (CLT attractor), measured-MoE, and uniform (light-tailed) regimes. The $1 - \beta$ floor updates in real time.

Distribution shape is a symmetric generalised Gaussian with exponent $p$ chosen so that $\mu_1^2/\mu_2 = \beta$. The chip "floor $1-\beta$" is the asymptotic expected residual energy at 1-bit.

The attractor problem. Any data-independent orthogonal rotation acts as a summing operator on the coordinates of $k$. By the Central Limit Theorem the rotated marginals converge to Gaussian, dragging $\beta$ toward $2/\pi \approx 0.637$. Randomized Hadamard is always pulled to this value on high-dimensional data — it cannot escape the attractor. A learned rotation can, because the marginal-identity hypothesis is strictly weaker than independence. On low-effective-rank activations (MoE K-cache, stable rank $\approx 16$ out of $d = 256$) the learned $L^1$-max rotation pushes $\beta$ past Gaussian toward the uniform limit $3/4$. We call this the anti-kurtosis effect.
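The attractor is reproducible in a few lines. A sketch under one substitution made for brevity: a Haar-random orthogonal matrix stands in for the Randomized Hadamard Transform (both are data-independent). Start from heavy-tailed Laplace coordinates, where $\beta = 1/2$, and rotate:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 256, 2048

def beta(X):
    """Moment ratio (E|X|)^2 / E[X^2], estimated over all entries."""
    return np.mean(np.abs(X)) ** 2 / np.mean(X ** 2)

K = rng.laplace(size=(n, d))  # Laplace: mu1 = b, mu2 = 2 b^2, so beta = 1/2

# Data-independent rotation: QR of a Gaussian matrix is Haar-orthogonal
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
K_rot = K @ Q.T

print(beta(K))      # ~ 0.5: below the Gaussian attractor
print(beta(K_rot))  # ~ 2/pi ~ 0.637: CLT pulls the rotated marginals Gaussian
```

Each rotated coordinate is a weighted sum of $d$ independent Laplace variables, so its marginal is near-Gaussian regardless of the input's tails; escaping $2/\pi$ therefore requires a rotation that sees the data.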

Six points of β compound to 18× at depth 30

The moment-ratio gap between SRHT and learned rotations looks small — only $0.702 - 0.637 = 0.065$ in $\beta$. But attention signals compound multiplicatively across layers of residual propagation. Over $L = 30$ MoE layers,

$$ \mathcal{G} \;=\; \left(\frac{\beta_{\text{learned}}}{\beta_{\text{SRHT}}}\right)^{L} \;=\; \left(\frac{0.702}{0.637}\right)^{30} \;\approx\; 18.4\times. $$

Press play below to watch the three survival curves $\beta^L$ unroll from $L = 0$ to $L = 30$. The gap between SRHT (grey) and Learned (red) widens exponentially; at layer 30 it is the $18.4\times$ that the theory predicts.

Y-axis is log-scaled. Survival factor $\beta^L$ is the expected $\cos^2 \theta$ of the attention signal under 1-bit quantisation; under the moment-ratio theorem it equals $\beta$ asymptotically, and compounds multiplicatively through the residual stream.

Head-to-head on Gemma-4 26B MoE. Holding the inference pipeline fixed (RoPE, attention kernel, bf16 everywhere) and swapping only the rotation matrix: learned $L^1$-max at 1 bit yields WikiText-2 PPL $= 25.74$; SRHT at 1 bit yields PPL $= 942.42$. Ratio: $36.6\times$. Comparing the two gaps in log space, $\ln 18.4 / \ln 36.6 = 0.809$: the moment-ratio theorem accounts for 81% of the log-PPL gap. The remaining 19% is Lean-attributable to a Frobenius-average amplification bound on the output projection $W_o$ (WoAmplification.lean, fully proved).
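The arithmetic behind these headline numbers, as a worked check (values copied from the text):

```python
import math

beta_learned, beta_srht, n_layers = 0.702, 0.637, 30

gap_geometric = (beta_learned / beta_srht) ** n_layers  # beta-compounding prediction
gap_empirical = 942.42 / 25.74                          # measured PPL ratio

frac_explained = math.log(gap_geometric) / math.log(gap_empirical)

print(round(gap_geometric, 1))   # 18.4
print(round(gap_empirical, 1))   # 36.6
print(round(frac_explained, 2))  # 0.81
```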

Beating the floor with a top-$d/8$ sparse correction

The moment-ratio floor is a lower bound on residual energy within the orthogonal group. Breaching it requires a non-orthogonal primitive. We use the simplest one available: a top-$k$ fp16 exception list. Per token, we store the $k$ largest-magnitude coordinates of the rotated key in fp16 and the remaining $d - k$ in 1 bit.
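A minimal sketch of the hybrid quantiser (illustrative only; the function names and packing are ours, not the paper's kernel, and index-storage overhead is ignored):

```python
import numpy as np

def quantize_topk(k, n_exc):
    """1-bit sign quantisation plus a top-n_exc fp16 exception list."""
    idx = np.argsort(np.abs(k))[-n_exc:]      # exceptions: largest-|k_i| coordinates
    mask = np.ones(k.shape, dtype=bool)
    mask[idx] = False
    alpha = np.abs(k[mask]).mean()            # scale fitted to the 1-bit coordinates
    return np.sign(k), alpha, idx, k[idx].astype(np.float16)

def dequantize_topk(signs, alpha, idx, vals):
    k_hat = signs * alpha                     # 1-bit reconstruction everywhere...
    k_hat[idx] = vals.astype(k_hat.dtype)     # ...overridden by fp16 exceptions
    return k_hat

rng = np.random.default_rng(3)
d = 256
k = rng.standard_normal(d)
k_hat = dequantize_topk(*quantize_topk(k, d // 8))

# Compare against pure 1-bit reconstruction
k_1bit = np.sign(k) * np.abs(k).mean()
res = lambda a: np.sum((k - a) ** 2) / np.sum(k ** 2)
assert res(k_hat) < res(k_1bit)  # exceptions strictly reduce residual energy here
```

Fitting $\alpha$ only to the non-exception coordinates matters: the exceptions remove exactly the coordinates that inflate a global scale.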

Each dot is one operating point on Gemma-4 26B MoE + WikiText-2 (10-sequence evaluation). The sweet spot $k = d/8$ gives 2.56 bits/entry at PPL = 3.00 — a 36% PPL overhead over FP16 at 6.24× compression. Dotted line: the top-$k$ frontier. The unrotated 1-bit baseline (not shown; PPL $\approx 10^6$) would be off the log scale.

The interesting part: when we measured the 1-bit quantization residual $\varepsilon = \tilde{k} - \hat{k}_{1\text{bit}}$ directly, we expected it to concentrate on a few coordinates (so that top-$k$ exceptions could strip away the dominant error). Both forms of that hypothesis are false: the top-$d/16$ coordinates of $\lvert \varepsilon \rvert$ carry only $32\%$ of $\lVert \varepsilon \rVert^2$, and the top-$d/16$ principal directions of $\mathrm{Cov}(\varepsilon)$ carry only $14\%$. The residual is essentially full-dimensional. Yet top-$d/8$ exceptions still cut PPL by $8.6\times$.

The reconciliation: top-$k$ does not minimise $\lVert \varepsilon \rVert^2$. It minimises the score perturbation $q^\top (k - \hat{k})$ on random queries, which weights each coordinate by the key’s magnitude rather than the residual’s. Preserving the top-$k$ by key magnitude protects the score-dominant directions, even though the residual itself is dense. This is why the primitive works.


The rotation is a gauge degree of freedom

Rotating queries and keys by the same orthogonal $R$ is an exact no-op on attention scores: the dot product $q^\top k = (Rq)^\top (Rk)$ is invariant, so $R$ can be absorbed into the projections $W_Q$ and $W_K$. We Lean-verify this as moe_gauge_invariance in MoEGauge.lean: the rotation is a gauge degree of freedom of the attention head, and the learned $L^1$-max rotation is a specific gauge fix, the element of the orthogonal group that minimises the expected phase-collapse angle.
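The invariance is a one-line numerical check (a sketch; a Haar-random orthogonal matrix stands in for the learned rotation):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 64
q, k = rng.standard_normal(d), rng.standard_normal(d)

# QR of a Gaussian matrix yields an orthogonal R
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Rotating query and key by the same R leaves the attention score unchanged
assert np.isclose(q @ k, (R @ q) @ (R @ k))
```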

One consequence, operational: the rotation costs zero at inference time if we absorb it into $W_K$ post-training. We did not do so in our benchmark because we wanted to measure the $R$–quantise–$R^\top$ round-trip’s numerical cost directly; a deployment run would strip the matmul.

A second consequence, scientific: the reason MoE specifically needs a data-dependent rotation is that sparse expert routing breaks the single-coordinate-frame assumption of the dense transformer. Each of the $E = 128$ experts writes its output into a potentially distinct subframe of the residual stream; the downstream $W_K$ has to learn a single map whose rotation absorbs the disagreement. The resulting projection is contorted: mass ends up on directions where the $L^1$-to-$L^2$ ratio is low. The learned rotation re-aligns the frame.


Committed predictions before Mixtral and DeepSeek-MoE runs

Pre-registration. We pre-registered cross-architecture predictions for Mixtral and DeepSeek-MoE at commit 54ae566, frozen before running any measurement on those models. The predictions are archived at predictions_preregistered_20260419.json. The target criterion is Kendall-τ ≥ 0.60 on the fragmentation-fraction ordering across architectures. This pre-registration prevents post-hoc fitting of the moment-ratio prediction to newly observed data.

The committed prediction is that a learned $L^1$-max rotation on any MoE model with $E \geq 8$ experts will push $\beta$ past $2/\pi$ toward the uniform-distribution limit. The moment-ratio compounding argument is architecture-agnostic modulo expert count and head dimension; the pre-registered ordering prediction makes this falsifiable before measurement.


Where does the recovery come from?

The end-to-end improvement from unrotated 1-bit (PPL $\approx 968,762$) to the Pareto sweet spot (PPL $= 3.00$) has two distinct sources, with neither alone sufficient:

$3.7 \times 10^4$: PPL improvement from the learned rotation alone (968,762 → 25.74)
8.6×: further improvement from sparse exceptions (25.74 → 3.00)

The rotation is necessary: without it, 1-bit is useless (PPL $\approx 10^6$); even with it, 1-bit alone remains roughly $12\times$ worse than FP16. The sparse exceptions are powerful: they close an $8.6\times$ gap despite the quantization residual being essentially full-dimensional (effective rank $\approx 245$ out of $d = 256$). The reconciliation is that the top-$k$ correction operates on key magnitude, not residual magnitude, protecting the attention-score-dominant directions even when the residual’s $L^2$ mass is spread uniformly.

Dense-vs-MoE matched histogram across four Gemma-4 variants (T2.2, 2026-04-19).

What is Lean-verified and what is not

The theorem body — per-vector identity, distribution-free sandwich, reference-table specialisations, gauge invariance, $W_o$ amplification bound — is mechanically verified in Lean 4 + Mathlib. The measure-theoretic plumbing that lifts the SLLN limit from almost-sure to expected value contains two sorry placeholders in MomentRatioFloor.lean; we flag them as non-load-bearing for the paper’s quantitative claims, since every specialised bound (Gaussian, uniform, Laplace) is proved directly. As of 2026-04-19, JensenFloor.lean and MoEGauge.lean are both zero-sorry.

Paper-critical sorry count. Zero on the reference-table theorems (gaussian_moment_ratio_floor_eq, uniform_moment_ratio_floor_eq, laplace_moment_ratio_floor_eq). Zero on the gauge-invariance claim (MoEGauge.lean fully closed 2026-04-19, commit 2c5c4d2). Zero on the $W_o$ Frobenius-average amplification identity. Zero on the per-vector phase-collapse identities (JensenFloor.lean fully closed 2026-04-19, commit 9460665). Two on the measure-theoretic core of the main asymptotic theorem; these are deferred to follow-up Lean work. See the Lean status page for the full declaration list.

What we did not study

Value-cache quantization. This paper is a K-cache-only result. V-cache quantization is not addressed here. The moment-ratio framework applies symmetrically to V, but V enters the attention update additively rather than multiplicatively. The compounding argument (the $\beta^L$ geometric gap) exploits the multiplicative structure of per-layer attention-signal decay, which does not carry over to additive V perturbations. A separate analysis is required. Papers comparing to ours on full KV-cache compression should do so at the 1-bit-K + fp16-V operating point.

Gemma-4 scope. All perplexity numbers are on a 10-sequence WikiText-2 slice of the Gemma-4 model family (e4b dense, 31B dense, e2b dense, 26B MoE). The e2b PPL numbers carry a caveat: measurements use a 4-bit-NF4 baseline, making direct PPL comparison to the MoE’s true-fp16 baseline non-trivial. The $\beta = 0.6746$ fragmentation statistic for e2b is clean and unaffected.

Training-time quantization. All results are post-training. Whether a moment-ratio regulariser during training (explicitly targeting $\beta \to 3/4$) would unlock higher post-quant quality is an open question.

Cross-architecture generalisation. The theorem depends only on $\beta$, which is a measurable per-layer statistic of any model’s K-cache. Cross-architecture predictions for Mixtral and DeepSeek-MoE are pre-registered (see Pre-registered predictions section above). A failure to confirm would be a non-trivial discovery motivating a stronger $\beta$-characterization theorem.

Sub-1-bit regimes. Ternary $\{-1, 0, +1\}$ and higher-cardinality vector codebooks are candidates for beating the $1-\beta$ floor. The moment-ratio argument does not cover them; a direct extension would characterise $1 - \beta^{(3)}$ for ternary and compare compounding decays.


Cite this work

@misc{basu2026phasecollapse,
  title  = {Phase-Collapse Defragmentation: A Moment-Ratio Framework for 1-bit {KV}-Cache Quantization in Mixture-of-Experts Transformers},
  author = {Basu, Debanjan},
  year   = {2026},
  url    = {https://github.com/d3banjan/moe-gauge-paper}
}

References. Zandieh et al. (2026) TurboQuant; Chee et al. (2023) QuIP; Tseng et al. (2024) QuIP#; Liu et al. (2024) KIVI; Hooper et al. (2024) KVQuant; Ashkboos et al. (2024) QuaRot; Liu et al. (2024) SpinQuant; Jiang et al. (2024) Mixtral; Dai et al. (2024) DeepSeek-MoE; Lean 4 + Mathlib.