Phase-Collapse Defragmentation

The key cache of every large transformer is drifting out of the quantisation-safe zone, and MoE routing is where the drift breaks catastrophically. We prove a floor on the 1-bit error and show how to beat it.

At one bit per entry, attention depends on the angle between each key vector and its sign-quantised image. Across four Gemma-4 variants we measure a clean ordinal fragmentation spectrum: the per-layer phase-collapse angle $\bar{\theta}_\ell$ drifts toward the quantisation-unsafe regime as the architecture adds complexity. The 26B MoE variant is already past the failure boundary. Data-independent rotations (the Randomized Hadamard Transform used by TurboQuant’s PolarQuant stage) cannot save it — the Central Limit Theorem pins them at the Gaussian attractor $\beta = 2/\pi$. A learned $L^1$-max rotation escapes this attractor to $\beta = 0.702$ and the 1-bit perplexity drops from $942$ to $26$.

Each panel is one architecture. Points are coloured by the bit budget the horizontal-slice allocator assigned to that layer (blue = 1-bit pressure-release, green = 2-bit ballroom, orange = 3-bit fragmented). Dashed line is the 44° ballroom cutoff. MoE is the only variant with a majority of layers above the cutoff.

e4b dense: 45.47°, 13% fragmented
31B dense: 48.73°, 58% fragmented
e2b dense: 50.20°, 60% fragmented
26B MoE: 52.67°, 77% fragmented

Layer-0 phase-collapse angles. Fragmented means $\bar{\theta}_\ell > 44^\circ$ — the horizontal-slice allocator assigns 3 bits rather than 2. Only e4b has an in-spec headline angle; the MoE sits 7.2° above it.

36.6×: empirical PPL gap, SRHT vs learned rotation
18.4×: geometric prediction, β-compounding over 30 layers
81%: of the log-gap explained by the moment-ratio theorem
6.24×: KV-cache compression at PPL = 3.00 (Pareto sweet spot)

Why 1-bit, why MoE, why now

At long context, the KV cache dominates the memory footprint of autoregressive inference. Production quantizers currently live at 2–4 bits per entry: TurboQuant reports near-optimal distortion at 3.5 bits and marginal degradation at 2.5 bits. Pushing below 2 bits — to a single sign bit per coordinate plus a sparse fp16 exception list — is a $2\times$ compression on top of that, and it is the aggressive regime where the published rotation-based tricks (QuIP, QuIP#, QuaRot, SpinQuant) start to show their seams.

The seam that this paper widens is geometric. A 1-bit sign quantiser replaces each coordinate $k_i$ with $\operatorname{sign}(k_i) \cdot \alpha$, where $\alpha = \lVert k \rVert_1 / d$. The angle $\theta_k$ between $k$ and its reconstruction controls the residual energy:

$$ \frac{\lVert k - \hat{k} \rVert^2}{\lVert k \rVert^2} \;=\; \sin^2 \theta_k \;=\; 1 \;-\; \frac{\lVert k \rVert_1^2}{d \, \lVert k \rVert_2^2}. $$
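The identity is easy to check numerically. A minimal sketch on a synthetic Gaussian key (NumPy; $d = 256$ matches the head dimension quoted later, but the data here is synthetic, not the measured cache):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
k = rng.standard_normal(d)

# 1-bit sign quantiser with scale alpha = ||k||_1 / d
alpha = np.abs(k).sum() / d
k_hat = np.sign(k) * alpha

# Relative residual energy vs. the closed form 1 - ||k||_1^2 / (d ||k||_2^2)
lhs = np.sum((k - k_hat) ** 2) / np.sum(k ** 2)
rhs = 1 - np.abs(k).sum() ** 2 / (d * np.sum(k ** 2))

assert abs(lhs - rhs) < 1e-12
print(lhs)   # close to 1 - 2/pi for Gaussian coordinates
```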

The upper bound $\cos \theta_k = 1$ is only attained if $k$ lies exactly on the sign lattice $\{\pm \alpha\}^d$ — a measure-zero event for any continuous distribution. So some energy is always lost. The quantitative amount is set by the distribution, and this is where MoE routing breaks the picture.


A one-number fingerprint of 1-bit quality

Let $K = (K_1, \ldots, K_d)$ be a rotated key with identically distributed, symmetric coordinates. Define the moment ratio

$$ \beta \;=\; \frac{\mu_1^2}{\mu_2} \;=\; \frac{(\mathbb{E}|K_1|)^2}{\mathbb{E}[K_1^2]} \;\in\; (0, 1]. $$

By Cauchy–Schwarz, $\beta \in (0, 1]$, with $\beta = 1$ only on the sign lattice. The central result of the paper: assuming only identically distributed marginals (not independence), and granting the strong law of large numbers for the empirical means of $\lvert K_i \rvert$ and $K_i^2$,

$$ \lim_{d \to \infty} \mathbb{E}\!\left[\sin^2 \theta_K\right] \;=\; 1 - \beta. $$

This is the asymptotic moment-ratio floor (iid_coord_moment_ratio_floor, MomentRatioFloor.lean:155). On Gemma-4 26B MoE we measure $\beta = 0.702$, predicting a floor of $29.80\%$; the directly-measured $\mathbb{E}[\sin^2 \bar{\theta}]$ is $29.84\%$, matching to 0.05 percentage points.
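The floor can also be checked by Monte Carlo at the Gaussian reference point (synthetic iid coordinates, not the Gemma-4 cache; the tolerance is our allowance for the $O(1/d)$ finite-dimension bias):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4096, 2000
K = rng.standard_normal((n, d))   # n keys with iid Gaussian coordinates

# Per-vector cos^2 theta = ||k||_1^2 / (d ||k||_2^2)
cos2 = np.abs(K).sum(axis=1) ** 2 / (d * (K ** 2).sum(axis=1))
measured = 1 - cos2.mean()        # empirical E[sin^2 theta]
floor = 1 - 2 / np.pi             # 1 - beta at the Gaussian attractor

assert abs(measured - floor) < 2e-3
```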

Below is a live reference table: move the slider to pick a target $\beta$ and watch the distribution morph through the Laplace (heavy-tailed), Gaussian (CLT attractor), measured-MoE, and uniform (light-tailed) regimes. The $1 - \beta$ floor updates in real time.

Distribution shape is a symmetric generalised Gaussian with exponent $p$ chosen so that $\mu_1^2/\mu_2 = \beta$. The chip "floor $1-\beta$" is the asymptotic expected residual energy at 1-bit.
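For the generalised Gaussian with density proportional to $\exp(-|x/a|^p)$, the absolute moments satisfy $\mathbb{E}|X|^k \propto \Gamma((k+1)/p)$, giving the closed form $\beta(p) = \Gamma(2/p)^2 / (\Gamma(1/p)\,\Gamma(3/p))$ (our derivation, consistent with the reference regimes above). A stdlib-only sketch that recovers the Laplace and Gaussian values and inverts $\beta(p)$ at the measured MoE value:

```python
from math import gamma, pi

def beta(p: float) -> float:
    """Moment ratio mu_1^2 / mu_2 of a symmetric generalised Gaussian
    with density proportional to exp(-|x/a|^p)."""
    return gamma(2 / p) ** 2 / (gamma(1 / p) * gamma(3 / p))

assert abs(beta(1) - 1 / 2) < 1e-12    # Laplace (heavy-tailed)
assert abs(beta(2) - 2 / pi) < 1e-12   # Gaussian (CLT attractor)

# beta(p) increases in p toward the uniform limit 3/4,
# so invert by bisection at the measured MoE value 0.702
lo, hi = 2.0, 50.0
for _ in range(80):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if beta(mid) < 0.702 else (lo, mid)
p_moe = (lo + hi) / 2

print(p_moe, 1 - beta(p_moe))   # exponent between Gaussian and uniform; floor ~0.298
```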

The attractor problem. Any data-independent orthogonal rotation acts as a summing operator on the coordinates of $k$. By the Central Limit Theorem the rotated marginals converge to Gaussian, dragging $\beta$ toward $2/\pi \approx 0.637$. Randomized Hadamard is always pulled to this value on high-dimensional data — it cannot escape the attractor. A learned rotation can, because the marginal-identity hypothesis is strictly weaker than independence. On low-effective-rank activations (MoE K-cache, stable rank $\approx 16$ out of $d = 256$) the learned $L^1$-max rotation pushes $\beta$ past Gaussian toward the uniform limit $3/4$. We call this the anti-kurtosis effect.

Six points of β compound to 18× at depth 30

The moment-ratio gap between SRHT and learned rotations looks small — only $0.702 - 0.637 = 0.065$ in $\beta$. But attention signals compound multiplicatively across layers of residual propagation. Over $L = 30$ MoE layers,

$$ \mathcal{G} \;=\; \left(\frac{\beta_{\text{learned}}}{\beta_{\text{SRHT}}}\right)^{L} \;=\; \left(\frac{0.702}{0.637}\right)^{30} \;\approx\; 18.4\times. $$

Press play below to watch the three survival curves $\beta^L$ unroll from $L = 0$ to $L = 30$. The gap between SRHT (grey) and Learned (red) widens exponentially; at layer 30 it is the $18.4\times$ that the theory predicts.

Y-axis is log-scaled. Survival factor $\beta^L$ is the expected $\cos^2 \theta$ of the attention signal under 1-bit quantisation; under the moment-ratio theorem it equals $\beta$ asymptotically, and compounds multiplicatively through the residual stream.

Head-to-head on Gemma-4 26B MoE. Holding the inference pipeline fixed (RoPE, attention kernel, bf16 everywhere) and swapping only the rotation matrix: learned $L^1$-max at 1 bit yields WikiText-2 PPL $= 25.74$; SRHT at 1 bit yields PPL $= 942.42$. Ratio: $36.6\times$. The ratio of ratios, in log space, is $\ln 18.4 / \ln 36.6 = 0.809$: the moment-ratio theorem accounts for 81% of the log-PPL gap. The remaining 19% is Lean-attributable to a Frobenius-average amplification bound on the output projection $W_o$ (WoAmplification.lean, fully proved).
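Both headline numbers, and the explained fraction, are one-line computations:

```python
from math import log

beta_learned, beta_srht, L = 0.702, 0.637, 30

gain = (beta_learned / beta_srht) ** L   # geometric prediction
ppl_ratio = 942.42 / 25.74               # empirical SRHT / learned PPL gap

explained = log(gain) / log(ppl_ratio)   # share of the log-PPL gap
print(gain, ppl_ratio, explained)        # ~18.4, ~36.6, ~0.81
```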

Beating the floor with a top-$d/8$ sparse correction

The moment-ratio floor is a lower bound on residual energy within the orthogonal group. Breaching it requires a non-orthogonal primitive. We use the simplest one available: a top-$k$ fp16 exception list. Per token, we store the $k$ largest-magnitude coordinates of the rotated key in fp16 and the remaining $d - k$ in 1 bit.
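A sketch of the primitive on synthetic keys (hypothetical helper; the storage cost of the exception indices and the real kernel layout are ignored here):

```python
import numpy as np

def quantise_topk(k: np.ndarray, n_exc: int) -> np.ndarray:
    """1-bit sign quantisation with an fp16 exception list on the
    n_exc largest-magnitude coordinates."""
    exc = np.argsort(np.abs(k))[k.size - n_exc:]   # fp16 exception indices
    rest = np.ones(k.size, dtype=bool)
    rest[exc] = False
    alpha = np.abs(k[rest]).mean()                 # scale from the 1-bit part only
    k_hat = np.sign(k) * alpha
    k_hat[exc] = k[exc].astype(np.float16)         # exceptions kept verbatim
    return k_hat

rng = np.random.default_rng(2)
d = 256
k = rng.standard_normal(d)

err_1bit = np.sum((k - quantise_topk(k, 0)) ** 2)
err_topk = np.sum((k - quantise_topk(k, d // 8)) ** 2)
assert err_topk < err_1bit   # the non-orthogonal primitive breaches the floor
```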

Each dot is one operating point on Gemma-4 26B MoE + WikiText-2 (10-sequence evaluation). The sweet spot $k = d/8$ gives 2.56 bits/entry at PPL = 3.00 — a 36% PPL overhead over FP16 at 6.24× compression. Dotted line: the top-$k$ frontier. The unrotated 1-bit baseline (not shown; PPL $\approx 10^6$) would be off the log scale.

The interesting part: when we measured the 1-bit quantisation residual $\varepsilon = \tilde{k} - \hat{k}_{1\text{bit}}$ directly, we expected it to concentrate on a few coordinates (so that top-$k$ exceptions could strip away the dominant error). Both forms of that hypothesis (concentration in coordinates, and concentration in a low-dimensional subspace) are false: the top-$d/16$ coordinates of $\lvert \varepsilon \rvert$ carry only $32\%$ of $\lVert \varepsilon \rVert^2$, and the top-$d/16$ principal directions of $\mathrm{Cov}(\varepsilon)$ carry only $14\%$. The residual is essentially full-dimensional. Yet top-$d/8$ exceptions still cut PPL by $8.6\times$.
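The coordinate-density measurement is easy to reproduce in miniature on synthetic Gaussian keys (an illustration of the effect only, not the paper's MoE measurement; the share that comes out is of the same order as the reported 32%):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 256, 512
K = rng.standard_normal((n, d))
alpha = np.abs(K).mean(axis=1, keepdims=True)
eps = K - np.sign(K) * alpha                  # 1-bit quantisation residuals

# Average energy share of the top-d/16 residual coordinates per vector
top = np.sort(eps ** 2, axis=1)[:, -(d // 16):]
share = (top.sum(axis=1) / (eps ** 2).sum(axis=1)).mean()

print(share)   # far below 1: the residual is essentially full-dimensional
```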

The reconciliation: top-$k$ does not minimise $\lVert \varepsilon \rVert^2$. It minimises the score perturbation $(q^\top)(k - \hat{k})$ on random queries, which weights each coordinate by the key’s magnitude rather than the residual’s. Preserving the top-$k$ by key magnitude protects the score-dominant directions, even though the residual itself is dense. This is why the primitive works.


The rotation is a gauge degree of freedom

Rotating queries and keys by the same orthogonal $R$ is an exact no-op on attention scores, since the dot product $q^\top k = (Rq)^\top (Rk)$ is invariant; $R$ can therefore be absorbed into $W_Q$ and $W_K$ with no change to the model. We Lean-verify this as moe_gauge_invariance in MoEGauge.lean: the rotation is a gauge degree of freedom of the attention head, and the learned $L^1$-max rotation is a specific gauge fix, namely the element of the orthogonal group that minimises the expected phase-collapse angle.
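The invariance is a two-line check with a random orthogonal matrix (NumPy sketch; any $R$ from a QR factorisation works):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 64
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal R

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# The attention score is invariant under a shared rotation of q and k
assert np.allclose(q @ k, (R @ q) @ (R @ k))
```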

One consequence, operational: the rotation costs zero at inference time if we absorb it into $W_K$ and $W_Q$ post-training. We did not do so in our benchmark because we wanted to measure the $R$–quantise–$R^\top$ round-trip’s numerical cost directly; a deployment run would strip the matmul.

A second consequence, scientific: the reason MoE specifically needs a data-dependent rotation is that sparse expert routing breaks the single-coordinate-frame assumption of the dense transformer. Each of the $E = 128$ experts writes its output into a potentially distinct subframe of the residual stream; the downstream $W_K$ has to learn a single map whose rotation absorbs the disagreement. The resulting projection is contorted: mass ends up on directions where the $L^1$-to-$L^2$ ratio is low. The learned rotation re-aligns the frame.


What is Lean-verified and what is not

The theorem body — per-vector identity, distribution-free sandwich, reference-table specialisations, gauge invariance, $W_o$ amplification bound — is mechanically verified in Lean 4 + Mathlib. The measure-theoretic plumbing that lifts the SLLN limit from almost-sure to expected value contains two sorry placeholders in MomentRatioFloor.lean; we flag them as non-load-bearing for the paper’s quantitative claims, since every specialised bound (Gaussian, uniform, Laplace) is proved directly.

Paper-critical sorry count. Zero on the reference-table theorems (gaussian_moment_ratio_floor_eq, uniform_moment_ratio_floor_eq, laplace_moment_ratio_floor_eq). Zero on the gauge-invariance claim. Zero on the $W_o$ Frobenius-average amplification identity. Two on the measure-theoretic core of the main asymptotic theorem; these are deferred to follow-up Lean work. See the Lean status page for the full declaration list.

What we did not study

Value-cache quantization. We quantise only the K cache; V stays in bf16. The moment-ratio framework applies symmetrically to V, but V enters the attention update additively rather than multiplicatively, so the compounding argument does not carry over unchanged.

Training-time quantization. All results are post-training. Whether a moment-ratio regulariser during training (explicitly targeting $\beta \to 3/4$) would unlock higher post-quant quality is an open question.

Cross-architecture generalisation. The theorem depends only on $\beta$, which is a measurable per-layer statistic of any model’s K-cache. We have not yet run Mixtral, DeepSeek-MoE, or Grok. The MoE-gauge argument is architecture-agnostic modulo expert count, so the null hypothesis is that the prediction holds; a failure there would be a more interesting positive result than a confirmation.

Sub-1-bit regimes. Ternary $\{-1, 0, +1\}$ and higher-cardinality vector codebooks are candidates for beating the $1-\beta$ floor. The moment-ratio argument does not cover them; a direct extension would characterise $1 - \beta^{(3)}$ for ternary and compare compounding decays.


Cite this work

@misc{basu2026phasecollapse,
  title  = {Phase-Collapse Defragmentation: A Moment-Ratio Framework for 1-bit {KV}-Cache Quantization in Mixture-of-Experts Transformers},
  author = {Basu, Debanjan},
  year   = {2026},
  url    = {https://github.com/d3banjan/moe-gauge-paper}
}

References. Zandieh et al. (2026) TurboQuant; Chee et al. (2023) QuIP; Tseng et al. (2024) QuIP#; Liu et al. (2024) KIVI; Hooper et al. (2024) KVQuant; Ashkboos et al. (2024) QuaRot; Liu et al. (2024) SpinQuant; Jiang et al. (2024) Mixtral; Dai et al. (2024) DeepSeek-MoE; Lean 4 + Mathlib.