Phase-Collapse Defragmentation
The key cache of every large transformer is drifting out of the quantization-safe zone. MoE routing is where this breaks catastrophically. We prove a floor and show how to beat it.
At one bit per entry, attention depends on the angle between each key vector and its sign-quantised image. Across four Gemma-4 variants we measure a clean ordinal fragmentation spectrum: the per-layer phase-collapse angle $\bar{\theta}_\ell$ drifts toward the quantisation-unsafe regime as the architecture adds complexity. The 26B MoE variant is already past the failure boundary. Data-independent rotations (the Randomized Hadamard Transform used by TurboQuant’s PolarQuant stage) cannot save it — the Central Limit Theorem pins them at the Gaussian attractor $\beta = 2/\pi$. A learned $L^1$-max rotation escapes this attractor to $\beta = 0.702$ and the 1-bit perplexity drops from $942$ to $26$.
Each panel is one architecture. Points are coloured by the bit budget the horizontal-slice allocator assigned to that layer (blue = 1-bit pressure-release, green = 2-bit ballroom, orange = 3-bit fragmented). Dashed line is the 44° ballroom cutoff. MoE is the only variant with a majority of layers above the cutoff.
Layer-0 phase-collapse angles. Fragmented means $\bar{\theta}_\ell > 44^\circ$ — the horizontal-slice allocator assigns 3 bits rather than 2. Only e4b has an in-spec headline angle; the MoE sits 7.2° above it.
Highlights: SRHT vs learned rotation · β-compounding over 30 layers · an $18.4\times$ signal-survival gap by the moment-ratio theorem · $2.56$ bits/entry at PPL $= 3.00$ (Pareto sweet spot).
Background
Why 1-bit, why MoE, why now
At long context, the KV cache dominates the memory footprint of autoregressive inference. Production quantizers currently live at 2–4 bits per entry: TurboQuant reports near-optimal distortion at 3.5 bits and marginal degradation at 2.5 bits. Pushing below 2 bits — to a single sign bit per coordinate plus a sparse fp16 exception list — is a $2\times$ compression on top of that, and it is the aggressive regime where the published rotation-based tricks (QuIP, QuIP#, QuaRot, SpinQuant) start to show their seams.
The seam that this paper widens is geometric. A 1-bit sign quantiser replaces each coordinate $k_i$ with $\operatorname{sign}(k_i) \cdot \alpha$, where $\alpha = \lVert k \rVert_1 / d$. Because $\alpha$ is exactly the least-squares scale for the sign pattern, the angle $\theta_k$ between $k$ and its reconstruction $\hat{k}$ controls the residual energy:

$$\frac{\lVert k - \hat{k} \rVert^2}{\lVert k \rVert^2} = \sin^2 \theta_k, \qquad \cos \theta_k = \frac{\lVert k \rVert_1}{\sqrt{d}\,\lVert k \rVert_2}.$$
The upper bound $\cos \theta_k = 1$ is only attained if $k$ lies exactly on the sign lattice $\{\pm \alpha\}^d$ — a measure-zero event for any continuous distribution. So some energy is always lost. The quantitative amount is set by the distribution, and this is where MoE routing breaks the picture.
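The geometry can be checked numerically. A minimal NumPy sketch, with a random Gaussian vector standing in for a rotated key; for Gaussian coordinates, $\cos^2 \theta$ should land near the attractor $2/\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096
k = rng.standard_normal(d)           # stand-in for a rotated key vector

# 1-bit sign quantiser: one sign bit per coordinate plus a shared scale
alpha = np.abs(k).sum() / d          # alpha = ||k||_1 / d
k_hat = alpha * np.sign(k)

# angle between k and its reconstruction
cos_theta = k @ k_hat / (np.linalg.norm(k) * np.linalg.norm(k_hat))

# residual energy fraction equals sin^2(theta): alpha is the least-squares
# scale for the sign pattern, so the residual is orthogonal to k_hat
resid = np.linalg.norm(k - k_hat) ** 2 / np.linalg.norm(k) ** 2
assert abs(resid - (1 - cos_theta ** 2)) < 1e-12
```

For this Gaussian draw, `cos_theta ** 2` sits within sampling noise of $2/\pi \approx 0.6366$, the CLT attractor described above.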
The theorem
A one-number fingerprint of 1-bit quality
Let $K = (K_1, \ldots, K_d)$ be a rotated key with identically distributed, symmetric coordinates. Define the moment ratio

$$\beta \;=\; \frac{\mathbb{E}\!\left[\lvert K_1 \rvert\right]^2}{\mathbb{E}\!\left[K_1^2\right]}.$$
By Cauchy–Schwarz, $\beta \in (0, 1]$ with $\beta = 1$ only on the sign lattice. The central result of the paper: under marginal identity (not independence) and the strong law of large numbers for $\lvert K_1 \rvert$ and $K_1^2$,

$$\lim_{d \to \infty} \mathbb{E}\!\left[\sin^2 \theta_K\right] = 1 - \beta.$$
This is the asymptotic moment-ratio floor (iid_coord_moment_ratio_floor, MomentRatioFloor.lean:155). On Gemma-4 26B MoE we measure $\beta = 0.702$, predicting a floor of $29.80\%$; the directly measured $\mathbb{E}[\sin^2 \bar{\theta}]$ is $29.84\%$, matching to within $0.05$ percentage points.
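The same fingerprint can be computed for any key matrix: estimate $\beta$ from the raw coordinates and compare it with the directly measured residual energy. A sketch on synthetic Laplace keys, where $\beta = 1/2$ exactly (real runs would use dumped K-cache tensors instead):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4096, 256
K = rng.laplace(size=(n, d))         # synthetic keys; Laplace has beta = 1/2

# moment-ratio estimate from raw coordinates: E[|K|]^2 / E[K^2]
beta_hat = np.abs(K).mean() ** 2 / (K ** 2).mean()

# directly measured residual energy of the 1-bit sign quantiser, per key
alpha = np.abs(K).mean(axis=1, keepdims=True)   # per-key scale ||k||_1 / d
K_hat = alpha * np.sign(K)
sin2 = 1 - (np.sum(K * K_hat, axis=1) ** 2
            / (np.sum(K * K, axis=1) * np.sum(K_hat * K_hat, axis=1)))

# measured E[sin^2 theta] should match the 1 - beta floor at large d
assert abs(sin2.mean() - (1 - beta_hat)) < 5e-3
```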
Below is a live reference table: move the slider to pick a target $\beta$ and watch the distribution morph through the Laplace (heavy-tailed), Gaussian (CLT attractor), measured-MoE, and uniform (light-tailed) regimes. The $1 - \beta$ floor updates in real time.
Distribution shape is a symmetric generalised Gaussian with exponent $p$ chosen so that $\mu_1^2/\mu_2 = \beta$. The chip "floor $1-\beta$" is the asymptotic expected residual energy at 1-bit.
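The reference regimes admit a closed form. For a symmetric generalised Gaussian with shape exponent $p$ (density $\propto e^{-\lvert x/s \rvert^p}$), the absolute moments are $\mathbb{E}\lvert X \rvert^k = s^k\,\Gamma((k+1)/p)/\Gamma(1/p)$, so the scale cancels in $\beta$ and only $p$ matters; a sketch:

```python
import math

def gen_gaussian_beta(p: float) -> float:
    """Moment ratio beta = E[|X|]^2 / E[X^2] for a symmetric
    generalised Gaussian with shape exponent p (scale cancels)."""
    lg = math.lgamma  # log-gamma avoids overflow at small p
    return math.exp(2 * lg(2 / p) - lg(1 / p) - lg(3 / p))

print(gen_gaussian_beta(1.0))   # Laplace: beta = 1/2, floor 50%
print(gen_gaussian_beta(2.0))   # Gaussian: beta = 2/pi, floor ~36.3%
print(gen_gaussian_beta(50.0))  # near-uniform: beta -> 3/4, floor -> 25%
```

Inverting this map numerically (solving $\beta(p) = \beta^\ast$ for $p$) is what the slider above does to pick the displayed distribution shape.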
Compounding
Six points of β compound to 18× at depth 30
The moment-ratio gap between SRHT and learned rotations looks small — only $0.702 - 0.637 = 0.065$ in $\beta$. But attention signals compound multiplicatively across layers of residual propagation. Over $L = 30$ MoE layers,

$$\left(\frac{\beta_{\text{learned}}}{\beta_{\text{SRHT}}}\right)^{L} = \left(\frac{0.702}{0.637}\right)^{30} \approx 18.4\times.$$
Press play below to watch the three survival curves $\beta^L$ unroll from $L = 0$ to $L = 30$. The gap between SRHT (grey) and Learned (red) widens exponentially; at layer 30 it is the $18.4\times$ that the theory predicts.
Y-axis is log-scaled. Survival factor $\beta^L$ is the expected $\cos^2 \theta$ of the attention signal under 1-bit quantisation; under the moment-ratio theorem it equals $\beta$ asymptotically, and compounds multiplicatively through the residual stream.
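A quick sanity check of the compounding arithmetic, using the measured moment ratios quoted in the text:

```python
# Survival factor beta^L for each rotation at depth L = 30;
# beta values are the measured moment ratios from the text.
beta_srht, beta_learned, L = 0.637, 0.702, 30

surv_srht = beta_srht ** L
surv_learned = beta_learned ** L
gap = surv_learned / surv_srht      # = (0.702 / 0.637)^30

print(f"SRHT survival:    {surv_srht:.2e}")
print(f"learned survival: {surv_learned:.2e}")
print(f"gap at layer {L}: {gap:.1f}x")   # ~18.4x
```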
The $W_o$ amplification bound is formalised in WoAmplification.lean, fully proved.

Engineering
Beating the floor with a top-$d/8$ sparse correction
The moment-ratio floor is a lower bound on residual energy within the orthogonal group. Breaching it requires a non-orthogonal primitive. We use the simplest one available: a top-$k$ fp16 exception list. Per token, we store the $k$ largest-magnitude coordinates of the rotated key in fp16 and the remaining $d - k$ in 1 bit.
Each dot is one operating point on Gemma-4 26B MoE + WikiText-2 (10-sequence evaluation). The sweet spot $k = d/8$ gives 2.56 bits/entry at PPL = 3.00 — a 36% PPL overhead over FP16 at 6.24× compression. Dotted line: the top-$k$ frontier. The unrotated 1-bit baseline (not shown; PPL $\approx 10^6$) would be off the log scale.
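The exception-list primitive can be sketched in a few lines. A minimal NumPy version on a synthetic heavy-tailed key; the storage layout (index list plus fp16 values) is our assumption, not the paper's:

```python
import numpy as np

def encode_topk_1bit(k: np.ndarray, n_exc: int) -> np.ndarray:
    """Sign-quantise a key, keeping the n_exc largest-magnitude
    coordinates as fp16 exceptions (n_exc >= 1). Returns the
    dequantised reconstruction for residual measurement."""
    idx = np.argsort(np.abs(k))[-n_exc:]      # exception indices
    mask = np.ones(k.size, dtype=bool)
    mask[idx] = False
    alpha = np.abs(k[mask]).mean()            # scale over 1-bit coords only
    k_hat = alpha * np.sign(k)
    k_hat[idx] = k[idx].astype(np.float16)    # fp16-rounded exception values
    return k_hat

rng = np.random.default_rng(2)
d = 512
k = rng.laplace(size=d)                       # heavy-tailed stand-in key

# pure 1-bit baseline vs top-d/8 exception list
resid_1bit = np.linalg.norm(k - np.abs(k).mean() * np.sign(k)) ** 2 \
             / np.linalg.norm(k) ** 2
resid_topk = np.linalg.norm(k - encode_topk_1bit(k, d // 8)) ** 2 \
             / np.linalg.norm(k) ** 2
assert resid_topk < resid_1bit                # exceptions cut residual energy
```

A deployed codec would pack the sign bits and indices; this sketch only measures the residual-energy effect of the exceptions.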
The interesting part: when we measured the 1-bit quantization residual $\varepsilon = \tilde{k} - \hat{k}_{1\text{bit}}$ directly, we expected it to concentrate on a few coordinates (so that top-$k$ exceptions could strip away the dominant error). Both forms of that hypothesis are false: the top-$d/16$ coordinates of $\lvert \varepsilon \rvert$ carry only $32\%$ of $\lVert \varepsilon \rVert^2$, and the top-$d/16$ principal directions of $\mathrm{Cov}(\varepsilon)$ carry only $14\%$. The residual is essentially full-dimensional. Yet top-$d/8$ exceptions still cut PPL by $8.6\times$.
The reconciliation: top-$k$ does not minimise $\lVert \varepsilon \rVert^2$. It minimises the score perturbation $q^\top (k - \hat{k})$ on random queries, which weights each coordinate by the key’s magnitude rather than the residual’s. Preserving the top-$k$ by key magnitude protects the score-dominant directions, even though the residual itself is dense. This is why the primitive works.
Gauge formalisation
The rotation is a gauge degree of freedom
Rotating queries and keys by the same orthogonal $R$ is an exact no-op on attention scores — the dot product $q^\top k = (Rq)^\top (Rk)$ is invariant — so $R$ can be absorbed into $W_Q$ and $W_K$. We Lean-verify this as moe_gauge_invariance in MoEGauge.lean: the rotation is a gauge degree of freedom of the attention head, and the learned $L^1$-max rotation is a specific gauge fix — the element of the orthogonal group that minimises the expected phase-collapse angle.
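The invariance is easy to spot-check numerically with a random orthogonal matrix (a Haar sample via QR); a sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 64
q, k = rng.standard_normal(d), rng.standard_normal(d)

# random orthogonal matrix from the QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

score = q @ k
score_rot = (R @ q) @ (R @ k)           # rotate both sides of the dot product
assert abs(score - score_rot) < 1e-9    # equal up to float roundoff
```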
One consequence, operational: the rotation costs zero at inference time if we absorb it into $W_K$ post-training. We did not do so in our benchmark because we wanted to measure the $R$–quantise–$R^\top$ round-trip’s numerical cost directly; a deployment run would strip the matmul.
A second consequence, scientific: the reason MoE specifically needs a data-dependent rotation is that sparse expert routing breaks the single-coordinate-frame assumption of the dense transformer. Each of the $E = 128$ experts writes its output into a potentially distinct subframe of the residual stream; the downstream $W_K$ has to learn a single map whose rotation absorbs the disagreement. The resulting projection is contorted: mass ends up on directions where the $L^1$-to-$L^2$ ratio is low. The learned rotation re-aligns the frame.
Formal coverage
What is Lean-verified and what is not
The theorem body — per-vector identity, distribution-free sandwich, reference-table specialisations, gauge invariance, $W_o$ amplification bound — is mechanically verified in Lean 4 + Mathlib. The measure-theoretic plumbing that lifts the SLLN limit from almost-sure to expected value contains two sorry placeholders in MomentRatioFloor.lean; we flag them as non-load-bearing for the paper’s quantitative claims, since every specialised bound (Gaussian, uniform, Laplace) is proved directly.
Sorry count by declaration group: zero on the reference-table specialisations (gaussian_moment_ratio_floor_eq, uniform_moment_ratio_floor_eq, laplace_moment_ratio_floor_eq). Zero on the gauge-invariance claim. Zero on the $W_o$ Frobenius-average amplification identity. Two on the measure-theoretic core of the main asymptotic theorem; these are deferred to follow-up Lean work. See the Lean status page for the full declaration list.

Scope
What we did not study
Value-cache quantization. We quantise only the K cache; V stays in bf16. The moment-ratio framework applies symmetrically to V, but V enters the attention update additively rather than multiplicatively, so the compounding argument does not carry over unchanged.
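To make the asymmetry concrete (our restatement, not a result from the paper): with attention weights $a_i$, the head output is

$$o = \sum_i a_i v_i, \qquad a_i = \operatorname{softmax}_i\!\left(q^\top k_i / \sqrt{d}\right),$$

so a value-quantisation error $\Delta v_i$ perturbs $o$ linearly, $\Delta o = \sum_i a_i \Delta v_i$, while a key error perturbs the $a_i$ inside the softmax and is then re-amplified through the residual stream at every subsequent layer.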
Training-time quantization. All results are post-training. Whether a moment-ratio regulariser during training (explicitly targeting $\beta \to 3/4$) would unlock higher post-quant quality is an open question.
Cross-architecture generalisation. The theorem depends only on $\beta$, which is a measurable per-layer statistic of any model’s K-cache. We have not yet run Mixtral, DeepSeek-MoE, or Grok. The MoE-gauge argument is architecture-agnostic modulo expert count, so the null hypothesis is that the prediction holds; a failure there would be a more interesting positive result than a confirmation.
Sub-1-bit regimes. Ternary $\{-1, 0, +1\}$ and higher-cardinality vector codebooks are candidates for beating the $1-\beta$ floor. The moment-ratio argument does not cover them; a direct extension would characterise $1 - \beta^{(3)}$ for ternary and compare compounding decays.
Citation
Cite this work
References. Zandieh et al. (2026) TurboQuant; Chee et al. (2023) QuIP; Tseng et al. (2024) QuIP#; Liu et al. (2024) KIVI; Hooper et al. (2024) KVQuant; Ashkboos et al. (2024) QuaRot; Liu et al. (2024) SpinQuant; Jiang et al. (2024) Mixtral; Dai et al. (2024) DeepSeek-MoE; Lean 4 + Mathlib.