Phase-Collapse Defragmentation
The K-cache of large MoE transformers is drifting out of the 1-bit quantization-safe zone; the V-cache is explicitly out of scope. MoE routing is where the K-cache pathology becomes catastrophic. We prove a floor and show how to beat it.
At one bit per entry, attention depends on the angle between each key vector and its sign-quantised image. Across four Gemma-4 variants we measure a clean ordinal fragmentation spectrum: the per-layer phase-collapse angle $\bar{\theta}_\ell$ drifts toward the quantisation-unsafe regime as the architecture adds complexity. The 26B MoE variant is already past the failure boundary. Data-independent rotations (the Randomized Hadamard Transform used by TurboQuant’s PolarQuant stage) cannot save it — the Central Limit Theorem pins them at the Gaussian attractor $\beta = 2/\pi$. A learned $L^1$-max rotation escapes this attractor to $\beta = 0.702$ and the 1-bit perplexity drops from $942$ to $26$.
Each panel is one architecture. Points are coloured by the bit budget the horizontal-slice allocator assigned to that layer (blue = 1-bit pressure-release, green = 2-bit ballroom, orange = 3-bit fragmented). Dashed line is the 44° ballroom cutoff. MoE is the only variant with a majority of layers above the cutoff.
Layer-0 phase-collapse angles. Fragmented means $\bar{\theta}_\ell > 44^\circ$ — the horizontal-slice allocator assigns 3 bits rather than 2. Only e4b has an in-spec headline angle; the MoE sits 7.2° above it.
Headline numbers: SRHT vs learned rotation, $\beta = 0.637$ vs $0.702$; $18.4\times$ β-compounding over 30 layers; a $29.8\%$ residual-energy floor by the moment-ratio theorem; and $2.56$ bits/entry at PPL = 3.00 (the Pareto sweet spot).
Background
Why 1-bit, why MoE, why now
At long context, the KV cache dominates the memory footprint of autoregressive inference. Production quantizers currently live at 2–4 bits per entry: TurboQuant reports near-optimal distortion at 3.5 bits and marginal degradation at 2.5 bits. Pushing below 2 bits — to a single sign bit per coordinate plus a sparse fp16 exception list — is a $2\times$ compression on top of that, and it is the aggressive regime where the published rotation-based tricks (QuIP, QuIP#, QuaRot, SpinQuant) start to show their seams.
The seam that this paper widens is geometric. A 1-bit sign quantiser replaces each coordinate $k_i$ with $\operatorname{sign}(k_i) \cdot \alpha$, where $\alpha = \lVert k \rVert_1 / d$. The angle $\theta_k$ between $k$ and its reconstruction $\hat{k} = \alpha \operatorname{sign}(k)$ controls the residual energy:

$$\cos \theta_k \;=\; \frac{k^\top \hat{k}}{\lVert k \rVert_2 \, \lVert \hat{k} \rVert_2} \;=\; \frac{\lVert k \rVert_1}{\sqrt{d}\, \lVert k \rVert_2},$$

so the fraction of $\lVert k \rVert_2^2$ lost to quantisation is $\sin^2 \theta_k$.
The upper bound $\cos \theta_k = 1$ is only attained if $k$ lies exactly on the sign lattice $\{\pm \alpha\}^d$ — a measure-zero event for any continuous distribution. So some energy is always lost. The quantitative amount is set by the distribution, and this is where MoE routing breaks the picture.
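A minimal NumPy sketch of this quantiser and the closed form for $\cos \theta_k$ (function names are ours, not the paper's code):

```python
import numpy as np

def sign_quantise(k):
    """1-bit sign quantiser: each coordinate becomes sign(k_i) * alpha,
    with the shared scale alpha = ||k||_1 / d."""
    alpha = np.abs(k).mean()          # equals ||k||_1 / d
    return np.sign(k) * alpha

def cos_theta(k):
    """Closed form for the angle to the sign image:
    cos(theta_k) = ||k||_1 / (sqrt(d) * ||k||_2)."""
    d = k.shape[-1]
    return np.abs(k).sum() / (np.sqrt(d) * np.linalg.norm(k))

rng = np.random.default_rng(0)
k = rng.standard_normal(256)
k_hat = sign_quantise(k)
# the directly computed cosine matches the closed form
direct = k @ k_hat / (np.linalg.norm(k) * np.linalg.norm(k_hat))
```

The identity holds because $k^\top \hat{k} = \alpha \lVert k \rVert_1$ and $\lVert \hat{k} \rVert_2 = \alpha \sqrt{d}$, so the scale $\alpha$ cancels.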
The theorem
A one-number fingerprint of 1-bit quality
Let $K = (K_1, \ldots, K_d)$ be a rotated key with identically distributed, symmetric coordinates. Define the moment ratio

$$\beta \;=\; \frac{\mu_1^2}{\mu_2} \;=\; \frac{\mathbb{E}[\lvert K_1 \rvert]^2}{\mathbb{E}[K_1^2]}.$$
By Cauchy–Schwarz, $\beta \in (0, 1]$ with $\beta = 1$ only on the sign lattice. The central result of the paper: under marginal-identity (not independence) and the strong law of large numbers for $\lvert K_1 \rvert$ and $K_1^2$,

$$\cos^2 \theta_K \xrightarrow{\;\text{a.s.}\;} \beta \quad \text{as } d \to \infty, \qquad \text{equivalently} \quad \mathbb{E}[\sin^2 \theta_K] \to 1 - \beta.$$
This is the asymptotic moment-ratio floor (iid_coord_moment_ratio_floor, MomentRatioFloor.lean:155). On Gemma-4 26B MoE we measure $\beta = 0.702$, predicting a floor of $29.80\%$; the directly measured $\mathbb{E}[\sin^2 \bar{\theta}]$ is $29.84\%$, agreeing to within 0.05 percentage points.
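As a sanity check on the theorem (not the paper's measurement pipeline), a Monte Carlo sketch with synthetic Gaussian keys, where $\beta$ should sit at the CLT attractor $2/\pi$ and the measured residual energy should match the $1 - \beta$ floor:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 4096
K = rng.standard_normal((n, d))     # synthetic Gaussian keys (assumption)

# moment ratio beta = E[|K_1|]^2 / E[K_1^2], pooled over all coordinates
beta = np.mean(np.abs(K)) ** 2 / np.mean(K ** 2)
floor = 1.0 - beta                  # predicted asymptotic residual energy

# directly measured residual energy E[sin^2 theta] at 1 bit, using
# cos(theta) = ||k||_1 / (sqrt(d) * ||k||_2) per key vector
cos2 = (np.abs(K).sum(axis=1) / (np.sqrt(d) * np.linalg.norm(K, axis=1))) ** 2
measured = 1.0 - cos2.mean()
```

At $d = 4096$ the per-vector $\cos^2 \theta$ concentrates tightly, so `measured` and `floor` agree to well under a percentage point.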
Below is a live reference table: move the slider to pick a target $\beta$ and watch the distribution morph through the Laplace (heavy-tailed), Gaussian (CLT attractor), measured-MoE, and uniform (light-tailed) regimes. The $1 - \beta$ floor updates in real time.
Distribution shape is a symmetric generalised Gaussian with exponent $p$ chosen so that $\mu_1^2/\mu_2 = \beta$. The chip "floor $1-\beta$" is the asymptotic expected residual energy at 1-bit.
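The slider's closed form can be sketched directly: for a symmetric generalised Gaussian with shape exponent $p$ (density $\propto \exp(-\lvert x/s \rvert^p)$), the moments are $\mathbb{E}\lvert X \rvert^k = s^k \, \Gamma((k+1)/p)/\Gamma(1/p)$, so the scale $s$ cancels in $\beta$ and the $p$ matching a target $\beta$ falls out of a bisection. This is our derivation, not the page's source:

```python
from math import gamma, pi

def beta_gen_gaussian(p):
    """Moment ratio beta = mu_1^2 / mu_2 for a symmetric generalised
    Gaussian with shape exponent p; the scale parameter cancels."""
    return gamma(2 / p) ** 2 / (gamma(1 / p) * gamma(3 / p))

# beta is increasing in p: Laplace (p=1) gives 1/2, Gaussian (p=2)
# gives 2/pi, and p -> infinity approaches the uniform value 3/4.
# Bisect for the p matching the measured MoE beta = 0.702.
target, lo, hi = 0.702, 2.0, 20.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if beta_gen_gaussian(mid) < target:
        lo = mid
    else:
        hi = mid
p_moe = 0.5 * (lo + hi)
```

The measured-MoE regime lands between the Gaussian and uniform endpoints, consistent with the slider's ordering.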
Compounding
6.5 points of β compound to 18× at depth 30
The moment-ratio gap between SRHT and learned rotations looks small — only $0.702 - 0.637 = 0.065$ in $\beta$. But attention signals compound multiplicatively across layers of residual propagation. Over $L = 30$ MoE layers,

$$\left( \frac{\beta_{\text{learned}}}{\beta_{\text{SRHT}}} \right)^{L} = \left( \frac{0.702}{0.637} \right)^{30} \approx 18.4\times.$$
Press play below to watch the three survival curves $\beta^L$ unroll from $L = 0$ to $L = 30$. The gap between SRHT (grey) and Learned (red) widens exponentially; at layer 30 it is the $18.4\times$ that the theory predicts.
Y-axis is log-scaled. Survival factor $\beta^L$ is the expected $\cos^2 \theta$ of the attention signal under 1-bit quantisation; under the moment-ratio theorem it equals $\beta$ asymptotically, and compounds multiplicatively through the residual stream.
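The arithmetic behind the animation is a one-liner; a sketch using the $\beta$ values quoted above:

```python
# beta values from the text: SRHT sits at the Gaussian attractor 2/pi,
# the learned L1-max rotation escapes to 0.702
beta_srht, beta_learned = 0.637, 0.702

L = 30
gap = (beta_learned / beta_srht) ** L    # relative advantage after 30 layers
surv_srht = beta_srht ** L               # expected cos^2 survival, SRHT
surv_learned = beta_learned ** L         # expected cos^2 survival, learned
```

The gap compounds to roughly $18.4\times$ at layer 30, matching the curve endpoint.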
The compounding bound is mechanically verified (the $W_o$ amplification bound in WoAmplification.lean, fully proved).

Engineering
Beating the floor with a top-$d/8$ sparse correction
The moment-ratio floor is a lower bound on residual energy within the orthogonal group. Breaching it requires a non-orthogonal primitive. We use the simplest one available: a top-$k$ fp16 exception list. Per token, we store the $k$ largest-magnitude coordinates of the rotated key in fp16 and the remaining $d - k$ in 1 bit.
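A hedged sketch of this primitive (function names are ours; the paper's bit allocator and index encoding are not modelled):

```python
import numpy as np

def quantise_topk(k, n_exc):
    """1-bit sign quantisation with the n_exc largest-magnitude
    coordinates stored exactly in fp16 instead (exception list)."""
    d = k.shape[-1]
    idx = np.argsort(np.abs(k))[d - n_exc:]        # exception indices
    rest = np.delete(k, idx)                       # coords kept at 1 bit
    alpha = np.abs(rest).mean()                    # scale over 1-bit coords only
    k_hat = np.sign(k) * alpha                     # 1-bit part
    k_hat[idx] = k[idx].astype(np.float16)         # fp16 exception list
    return k_hat

rng = np.random.default_rng(2)
k = rng.standard_normal(256)
err_plain = np.linalg.norm(k - quantise_topk(k, 0))        # pure 1-bit
err_topk = np.linalg.norm(k - quantise_topk(k, 256 // 8))  # top-d/8 exceptions
```

Note the scale $\alpha$ is fit only over the coordinates that stay at 1 bit, since the exceptions carry their own fp16 values.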
Each dot is one operating point on Gemma-4 26B MoE + WikiText-2 (10-sequence evaluation). The sweet spot $k = d/8$ gives 2.56 bits/entry at PPL = 3.00 — a 36% PPL overhead over FP16 at 6.24× compression. Dotted line: the top-$k$ frontier. The unrotated 1-bit baseline (not shown; PPL $\approx 10^6$) would be off the log scale.
The interesting part: when we measured the 1-bit quantization residual $\varepsilon = \tilde{k} - \hat{k}_{1\text{bit}}$ directly, we expected it to concentrate on a few coordinates (so that top-$k$ exceptions could strip away the dominant error). Both forms of that hypothesis are false: the top-$d/16$ coordinates of $\lvert \varepsilon \rvert$ carry only $32\%$ of $\lVert \varepsilon \rVert^2$, and the top-$d/16$ principal directions of $\mathrm{Cov}(\varepsilon)$ carry only $14\%$. The residual is essentially full-dimensional. Yet top-$d/8$ exceptions still cut PPL by $8.6\times$.
The reconciliation: top-$k$ does not minimise $\lVert \varepsilon \rVert^2$. It minimises the score perturbation $q^\top (k - \hat{k})$ on random queries, which weights each coordinate by the key’s magnitude rather than the residual’s. Preserving the top-$k$ by key magnitude protects the score-dominant directions, even though the residual itself is dense. This is why the primitive works.
Gauge formalisation
The rotation is a gauge degree of freedom
Applying the same orthogonal $R$ to queries and keys — absorbing $R$ into both $W_Q$ and $W_K$ — is an exact no-op on attention scores: the dot product is invariant, $q^\top k = (Rq)^\top (Rk)$. We Lean-verify this as moe_gauge_invariance in MoEGauge.lean: the rotation is a gauge degree of freedom of the attention head, and the learned $L^1$-max rotation is a specific gauge fix — the element of the orthogonal group that minimises the expected phase-collapse angle.
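The invariance is a one-liner to check numerically; a sketch with a random orthogonal $R$ drawn via QR (illustrative, not the Lean verification itself):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 64
# random orthogonal matrix via QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)
score = q @ k                      # attention score in the original frame
score_rot = (R @ q) @ (R @ k)      # same score in the rotated gauge
```

The two scores agree to floating-point precision for any orthogonal $R$, which is exactly why the rotation can be absorbed into the projection weights at no inference cost.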
One consequence, operational: the rotation costs zero at inference time if we absorb it into $W_Q$ and $W_K$ post-training. We did not do so in our benchmark because we wanted to measure the $R$–quantise–$R^\top$ round-trip’s numerical cost directly; a deployment run would strip the matmul.
A second consequence, scientific: the reason MoE specifically needs a data-dependent rotation is that sparse expert routing breaks the single-coordinate-frame assumption of the dense transformer. Each of the $E = 128$ experts writes its output into a potentially distinct subframe of the residual stream; the downstream $W_K$ has to learn a single map whose rotation absorbs the disagreement. The resulting projection is contorted: mass ends up on directions where the $L^1$-to-$L^2$ ratio is low. The learned rotation re-aligns the frame.
Pre-registered predictions
Committed predictions before Mixtral and DeepSeek-MoE runs
All predictions are pinned at commit 54ae566, frozen before running any measurement on those models. The predictions are archived at predictions_preregistered_20260419.json. The target criterion is Kendall-τ ≥ 0.60 on the fragmentation-fraction ordering across architectures. This pre-registration prevents post-hoc fitting of the moment-ratio prediction to newly observed data.

The registered hypothesis is that a learned $L^1$-max rotation on any MoE model with $E \geq 8$ experts will push $\beta$ past $2/\pi$ toward the uniform-distribution limit. The moment-ratio compounding argument is architecture-agnostic modulo expert count and head dimension; the pre-registered ordering prediction makes this falsifiable before measurement.
Decomposition
Where does the recovery come from?
The end-to-end improvement from unrotated 1-bit (PPL $\approx 968,762$) to the Pareto sweet spot (PPL $= 3.00$) has two distinct sources, with neither alone sufficient:
$\sim\!37{,}600\times$ from the learned rotation alone (PPL $968{,}762 \to 25.74$)
$8.6\times$ from the sparse exceptions (PPL $25.74 \to 3.00$)
The rotation is necessary: without it, 1-bit is useless (PPL $\approx 10^6$, five orders of magnitude above FP16). The sparse exceptions are powerful: they close an $8.6\times$ gap despite the quantization residual being essentially full-dimensional (effective rank $\approx 245$ out of $d = 256$). The reconciliation is that the top-$k$ correction operates on key magnitude, not residual magnitude, protecting the attention-score-dominant directions even when the residual’s $L^2$ mass is spread uniformly.
Formal coverage
What is Lean-verified and what is not
The theorem body — per-vector identity, distribution-free sandwich, reference-table specialisations, gauge invariance, $W_o$ amplification bound — is mechanically verified in Lean 4 + Mathlib. The measure-theoretic plumbing that lifts the SLLN limit from almost-sure to expected value contains two sorry placeholders in MomentRatioFloor.lean; we flag them as non-load-bearing for the paper’s quantitative claims, since every specialised bound (Gaussian, uniform, Laplace) is proved directly. As of 2026-04-19, JensenFloor.lean and MoEGauge.lean are both zero-sorry.
Sorry count: zero on the three reference-table specialisations (gaussian_moment_ratio_floor_eq, uniform_moment_ratio_floor_eq, laplace_moment_ratio_floor_eq). Zero on the gauge-invariance claim (MoEGauge.lean fully closed 2026-04-19, commit 2c5c4d2). Zero on the $W_o$ Frobenius-average amplification identity. Zero on the per-vector phase-collapse identities (JensenFloor.lean fully closed 2026-04-19, commit 9460665). Two on the measure-theoretic core of the main asymptotic theorem; these are deferred to follow-up Lean work. See the Lean status page for the full declaration list.

Scope
What we did not study
Value-cache quantization. This paper is a K-cache-only result. V-cache quantization is not addressed here. The moment-ratio framework applies symmetrically to V, but V enters the attention update additively rather than multiplicatively. The compounding argument (the $\beta^L$ geometric gap) exploits the multiplicative structure of per-layer attention-signal decay, which does not carry over to additive V perturbations. A separate analysis is required. Papers comparing to ours on full KV-cache compression should do so at the 1-bit-K + fp16-V operating point.
Gemma-4 scope. All perplexity numbers are on a 10-sequence WikiText-2 slice of the Gemma-4 model family (e4b dense, 31B dense, e2b dense, 26B MoE). The e2b PPL numbers carry a caveat: measurements use a 4-bit-NF4 baseline, making direct PPL comparison to the MoE’s true-fp16 baseline non-trivial. The $\beta = 0.6746$ fragmentation statistic for e2b is clean and unaffected.
Training-time quantization. All results are post-training. Whether a moment-ratio regulariser during training (explicitly targeting $\beta \to 3/4$) would unlock higher post-quant quality is an open question.
Cross-architecture generalisation. The theorem depends only on $\beta$, which is a measurable per-layer statistic of any model’s K-cache. Cross-architecture predictions for Mixtral and DeepSeek-MoE are pre-registered (see Pre-registered predictions section above). A failure to confirm would be a non-trivial discovery motivating a stronger $\beta$-characterization theorem.
Sub-1-bit regimes. Ternary $\{-1, 0, +1\}$ and higher-cardinality vector codebooks are candidates for beating the $1-\beta$ floor. The moment-ratio argument does not cover them; a direct extension would characterise $1 - \beta^{(3)}$ for ternary and compare compounding decays.
Citation
Cite this work
References. Zandieh et al. (2026) TurboQuant; Chee et al. (2023) QuIP; Tseng et al. (2024) QuIP#; Liu et al. (2024) KIVI; Hooper et al. (2024) KVQuant; Ashkboos et al. (2024) QuaRot; Liu et al. (2024) SpinQuant; Jiang et al. (2024) Mixtral; Dai et al. (2024) DeepSeek-MoE; Lean 4 + Mathlib.