Phase-Collapse Defragmentation
The key cache of every large transformer is drifting out of the quantization-safe zone. MoE routing is where this breaks catastrophically. We prove a floor and show how to beat it.
At one bit per entry, attention depends on the angle between each key vector and its sign-quantised image. Across four Gemma-4 variants we measure a clean ordinal fragmentation spectrum: the per-layer phase-collapse angle $\bar{\theta}_\ell$ drifts toward the quantisation-unsafe regime as the architecture adds complexity. The 26B MoE variant is already past the failure boundary. Data-independent rotations (the Randomized Hadamard Transform used by TurboQuant’s PolarQuant stage) cannot save it — the Central Limit Theorem pins them at the Gaussian attractor $\beta = 2/\pi$. A learned $L^1$-max rotation escapes this attractor to $\beta = 0.702$ and the 1-bit perplexity drops from $942$ to $26$.
Each panel is one architecture. Points are coloured by the bit budget the horizontal-slice allocator assigned to that layer (blue = 1-bit pressure-release, green = 2-bit ballroom, orange = 3-bit fragmented). Dashed line is the 44° ballroom cutoff. MoE is the only variant with a majority of layers above the cutoff.
Layer-0 phase-collapse angles. Fragmented means $\bar{\theta}_\ell > 44^\circ$ — the horizontal-slice allocator assigns 3 bits rather than 2. Only e4b has an in-spec headline angle; the MoE sits 7.2° above it.
Highlights: SRHT vs learned rotation · β-compounding over 30 layers · an $18.4\times$ signal-survival gap by the moment-ratio theorem · $2.56$ bits/entry at PPL $= 3.00$ (Pareto sweet spot).
Background
Why 1-bit, why MoE, why now
At long context, the KV cache dominates the memory footprint of autoregressive inference. Production quantizers currently live at 2–4 bits per entry: TurboQuant reports near-optimal distortion at 3.5 bits and marginal degradation at 2.5 bits. Pushing below 2 bits — to a single sign bit per coordinate plus a sparse fp16 exception list — is a $2\times$ compression on top of that, and it is the aggressive regime where the published rotation-based tricks (QuIP, QuIP#, QuaRot, SpinQuant) start to show their seams.
The seam that this paper widens is geometric. A 1-bit sign quantiser replaces each coordinate $k_i$ with $\operatorname{sign}(k_i) \cdot \alpha$, where $\alpha = \lVert k \rVert_1 / d$. Because $\alpha$ is exactly the least-squares scale for the sign pattern, the angle $\theta_k$ between $k$ and its reconstruction $\hat{k}$ controls the residual energy:

$$\frac{\lVert k - \hat{k} \rVert^2}{\lVert k \rVert^2} = \sin^2 \theta_k, \qquad \cos \theta_k = \frac{\lVert k \rVert_1}{\sqrt{d}\,\lVert k \rVert_2}.$$
The upper bound $\cos \theta_k = 1$ is only attained if $k$ lies exactly on the sign lattice $\{\pm \alpha\}^d$ — a measure-zero event for any continuous distribution. So some energy is always lost. The quantitative amount is set by the distribution, and this is where MoE routing breaks the picture.
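The geometry can be checked numerically. A minimal NumPy sketch, with a random Gaussian vector standing in for a rotated key; for Gaussian coordinates, $\cos^2 \theta$ should land near the attractor $2/\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096
k = rng.standard_normal(d)           # stand-in for a rotated key vector

# 1-bit sign quantiser: one sign bit per coordinate plus a shared scale
alpha = np.abs(k).sum() / d          # alpha = ||k||_1 / d
k_hat = alpha * np.sign(k)

# angle between k and its reconstruction
cos_theta = k @ k_hat / (np.linalg.norm(k) * np.linalg.norm(k_hat))

# residual energy fraction equals sin^2(theta): alpha is the least-squares
# scale for the sign pattern, so the residual is orthogonal to k_hat
resid = np.linalg.norm(k - k_hat) ** 2 / np.linalg.norm(k) ** 2
assert abs(resid - (1 - cos_theta ** 2)) < 1e-12
```

For this Gaussian draw, `cos_theta ** 2` sits within sampling noise of $2/\pi \approx 0.6366$, the CLT attractor described above.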
The theorem
A one-number fingerprint of 1-bit quality
Let $K = (K_1, \ldots, K_d)$ be a rotated key with identically distributed, symmetric coordinates. Define the moment ratio

$$\beta \;=\; \frac{\mathbb{E}\!\left[\lvert K_1 \rvert\right]^2}{\mathbb{E}\!\left[K_1^2\right]}.$$
By Cauchy–Schwarz, $\beta \in (0, 1]$ with $\beta = 1$ only on the sign lattice. The central result of the paper: under marginal identity (not independence) and the strong law of large numbers for $\lvert K_1 \rvert$ and $K_1^2$,

$$\lim_{d \to \infty} \mathbb{E}\!\left[\sin^2 \theta_K\right] = 1 - \beta.$$
This is the asymptotic moment-ratio floor (iid_coord_moment_ratio_floor, MomentRatioFloor.lean:155). On Gemma-4 26B MoE we measure $\beta = 0.702$, predicting a floor of $29.80\%$; the directly measured $\mathbb{E}[\sin^2 \bar{\theta}]$ is $29.84\%$, matching to within $0.05$ percentage points.
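The same fingerprint can be computed for any key matrix: estimate $\beta$ from the raw coordinates and compare it with the directly measured residual energy. A sketch on synthetic Laplace keys, where $\beta = 1/2$ exactly (real runs would use dumped K-cache tensors instead):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4096, 256
K = rng.laplace(size=(n, d))         # synthetic keys; Laplace has beta = 1/2

# moment-ratio estimate from raw coordinates: E[|K|]^2 / E[K^2]
beta_hat = np.abs(K).mean() ** 2 / (K ** 2).mean()

# directly measured residual energy of the 1-bit sign quantiser, per key
alpha = np.abs(K).mean(axis=1, keepdims=True)   # per-key scale ||k||_1 / d
K_hat = alpha * np.sign(K)
sin2 = 1 - (np.sum(K * K_hat, axis=1) ** 2
            / (np.sum(K * K, axis=1) * np.sum(K_hat * K_hat, axis=1)))

# measured E[sin^2 theta] should match the 1 - beta floor at large d
assert abs(sin2.mean() - (1 - beta_hat)) < 5e-3
```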
Below is a live reference table: move the slider to pick a target $\beta$ and watch the distribution morph through the Laplace (heavy-tailed), Gaussian (CLT attractor), measured-MoE, and uniform (light-tailed) regimes. The $1 - \beta$ floor updates in real time.
Distribution shape is a symmetric generalised Gaussian with exponent $p$ chosen so that $\mu_1^2/\mu_2 = \beta$. The chip "floor $1-\beta$" is the asymptotic expected residual energy at 1-bit.
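The reference regimes admit a closed form. For a symmetric generalised Gaussian with shape exponent $p$ (density $\propto e^{-\lvert x/s \rvert^p}$), the absolute moments are $\mathbb{E}\lvert X \rvert^k = s^k\,\Gamma((k+1)/p)/\Gamma(1/p)$, so the scale cancels in $\beta$ and only $p$ matters; a sketch:

```python
import math

def gen_gaussian_beta(p: float) -> float:
    """Moment ratio beta = E[|X|]^2 / E[X^2] for a symmetric
    generalised Gaussian with shape exponent p (scale cancels)."""
    lg = math.lgamma  # log-gamma avoids overflow at small p
    return math.exp(2 * lg(2 / p) - lg(1 / p) - lg(3 / p))

print(gen_gaussian_beta(1.0))   # Laplace: beta = 1/2, floor 50%
print(gen_gaussian_beta(2.0))   # Gaussian: beta = 2/pi, floor ~36.3%
print(gen_gaussian_beta(50.0))  # near-uniform: beta -> 3/4, floor -> 25%
```

Inverting this map numerically (solving $\beta(p) = \beta^\ast$ for $p$) is what the slider above does to pick the displayed distribution shape.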
Compounding
Six points of β compound to 18× at depth 30
The moment-ratio gap between SRHT and learned rotations looks small — only $0.702 - 0.637 = 0.065$ in $\beta$. But attention signals compound multiplicatively across layers of residual propagation. Over $L = 30$ MoE layers,

$$\left(\frac{\beta_{\text{learned}}}{\beta_{\text{SRHT}}}\right)^{L} = \left(\frac{0.702}{0.637}\right)^{30} \approx 18.4\times.$$
Press play below to watch the three survival curves $\beta^L$ unroll from $L = 0$ to $L = 30$. The gap between SRHT (grey) and Learned (red) widens exponentially; at layer 30 it is the $18.4\times$ that the theory predicts.
Y-axis is log-scaled. Survival factor $\beta^L$ is the expected $\cos^2 \theta$ of the attention signal under 1-bit quantisation; under the moment-ratio theorem it equals $\beta$ asymptotically, and compounds multiplicatively through the residual stream.
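A quick sanity check of the compounding arithmetic, using the measured moment ratios quoted in the text:

```python
# Survival factor beta^L for each rotation at depth L = 30;
# beta values are the measured moment ratios from the text.
beta_srht, beta_learned, L = 0.637, 0.702, 30

surv_srht = beta_srht ** L
surv_learned = beta_learned ** L
gap = surv_learned / surv_srht      # = (0.702 / 0.637)^30

print(f"SRHT survival:    {surv_srht:.2e}")
print(f"learned survival: {surv_learned:.2e}")
print(f"gap at layer {L}: {gap:.1f}x")   # ~18.4x
```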
The $W_o$ amplification bound is formalised in WoAmplification.lean, fully proved.

Engineering
Beating the floor with a top-$d/8$ sparse correction
The moment-ratio floor is a lower bound on residual energy within the orthogonal group. Breaching it requires a non-orthogonal primitive. We use the simplest one available: a top-$k$ fp16 exception list. Per token, we store the $k$ largest-magnitude coordinates of the rotated key in fp16 and the remaining $d - k$ in 1 bit.
Each dot is one operating point on Gemma-4 26B MoE + WikiText-2 (10-sequence evaluation). The sweet spot $k = d/8$ gives 2.56 bits/entry at PPL = 3.00 — a 36% PPL overhead over FP16 at 6.24× compression. Dotted line: the top-$k$ frontier. The unrotated 1-bit baseline (not shown; PPL $\approx 10^6$) would be off the log scale.
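The exception-list primitive can be sketched in a few lines. A minimal NumPy version on a synthetic heavy-tailed key; the storage layout (index list plus fp16 values) is our assumption, not the paper's:

```python
import numpy as np

def encode_topk_1bit(k: np.ndarray, n_exc: int) -> np.ndarray:
    """Sign-quantise a key, keeping the n_exc largest-magnitude
    coordinates as fp16 exceptions (n_exc >= 1). Returns the
    dequantised reconstruction for residual measurement."""
    idx = np.argsort(np.abs(k))[-n_exc:]      # exception indices
    mask = np.ones(k.size, dtype=bool)
    mask[idx] = False
    alpha = np.abs(k[mask]).mean()            # scale over 1-bit coords only
    k_hat = alpha * np.sign(k)
    k_hat[idx] = k[idx].astype(np.float16)    # fp16-rounded exception values
    return k_hat

rng = np.random.default_rng(2)
d = 512
k = rng.laplace(size=d)                       # heavy-tailed stand-in key

# pure 1-bit baseline vs top-d/8 exception list
resid_1bit = np.linalg.norm(k - np.abs(k).mean() * np.sign(k)) ** 2 \
             / np.linalg.norm(k) ** 2
resid_topk = np.linalg.norm(k - encode_topk_1bit(k, d // 8)) ** 2 \
             / np.linalg.norm(k) ** 2
assert resid_topk < resid_1bit                # exceptions cut residual energy
```

A deployed codec would pack the sign bits and indices; this sketch only measures the residual-energy effect of the exceptions.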
The interesting part: when we measured the 1-bit quantization residual $\varepsilon = \tilde{k} - \hat{k}_{1\text{bit}}$ directly, we expected it to concentrate on a few coordinates (so that top-$k$ exceptions could strip away the dominant error). Both forms of that hypothesis are false: the top-$d/16$ coordinates of $\lvert \varepsilon \rvert$ carry only $32\%$ of $\lVert \varepsilon \rVert^2$, and the top-$d/16$ principal directions of $\mathrm{Cov}(\varepsilon)$ carry only $14\%$. The residual is essentially full-dimensional. Yet top-$d/8$ exceptions still cut PPL by $8.6\times$.
The reconciliation: top-$k$ does not minimise $\lVert \varepsilon \rVert^2$. It minimises the score perturbation $q^\top (k - \hat{k})$ on random queries, which weights each coordinate by the key’s magnitude rather than the residual’s. Preserving the top-$k$ by key magnitude protects the score-dominant directions, even though the residual itself is dense. This is why the primitive works.
Gauge formalisation
The rotation is a gauge degree of freedom
Rotating queries and keys by the same orthogonal $R$ is an exact no-op on attention scores — the dot product $q^\top k = (Rq)^\top (Rk)$ is invariant — so $R$ can be absorbed into $W_Q$ and $W_K$. We Lean-verify this as moe_gauge_invariance in MoEGauge.lean: the rotation is a gauge degree of freedom of the attention head, and the learned $L^1$-max rotation is a specific gauge fix — the element of the orthogonal group that minimises the expected phase-collapse angle.
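The invariance is easy to spot-check numerically with a random orthogonal matrix (a Haar sample via QR); a sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 64
q, k = rng.standard_normal(d), rng.standard_normal(d)

# random orthogonal matrix from the QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

score = q @ k
score_rot = (R @ q) @ (R @ k)           # rotate both sides of the dot product
assert abs(score - score_rot) < 1e-9    # equal up to float roundoff
```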
One consequence, operational: the rotation costs zero at inference time if we absorb it into $W_K$ post-training. We did not do so in our benchmark because we wanted to measure the $R$–quantise–$R^\top$ round-trip’s numerical cost directly; a deployment run would strip the matmul.
A second consequence, scientific: the reason MoE specifically needs a data-dependent rotation is that sparse expert routing breaks the single-coordinate-frame assumption of the dense transformer. Each of the $E = 128$ experts writes its output into a potentially distinct subframe of the residual stream; the downstream $W_K$ has to learn a single map whose rotation absorbs the disagreement. The resulting projection is contorted: mass ends up on directions where the $L^1$-to-$L^2$ ratio is low. The learned rotation re-aligns the frame.
Formal coverage
What is Lean-verified and what is not
The theorem body — per-vector identity, distribution-free sandwich, reference-table specialisations, gauge invariance, $W_o$ amplification bound — is mechanically verified in Lean 4 + Mathlib. The measure-theoretic plumbing that lifts the SLLN limit from almost-sure to expected value contains two sorry placeholders in MomentRatioFloor.lean; we flag them as non-load-bearing for the paper’s quantitative claims, since every specialised bound (Gaussian, uniform, Laplace) is proved directly.
Sorry count by declaration group: zero on the reference-table specialisations (gaussian_moment_ratio_floor_eq, uniform_moment_ratio_floor_eq, laplace_moment_ratio_floor_eq). Zero on the gauge-invariance claim. Zero on the $W_o$ Frobenius-average amplification identity. Two on the measure-theoretic core of the main asymptotic theorem; these are deferred to follow-up Lean work. See the Lean status page for the full declaration list.

Scope
What we did not study
Value-cache quantization. We quantise only the K cache; V stays in bf16. The moment-ratio framework applies symmetrically to V, but V enters the attention update additively rather than multiplicatively, so the compounding argument does not carry over unchanged.
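To make the asymmetry concrete (our restatement, not a result from the paper): with attention weights $a_i$, the head output is

$$o = \sum_i a_i v_i, \qquad a_i = \operatorname{softmax}_i\!\left(q^\top k_i / \sqrt{d}\right),$$

so a value-quantisation error $\Delta v_i$ perturbs $o$ linearly, $\Delta o = \sum_i a_i \Delta v_i$, while a key error perturbs the $a_i$ inside the softmax and is then re-amplified through the residual stream at every subsequent layer.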
Training-time quantization. All results are post-training. Whether a moment-ratio regulariser during training (explicitly targeting $\beta \to 3/4$) would unlock higher post-quant quality is an open question.
Cross-architecture generalisation. The theorem depends only on $\beta$, which is a measurable per-layer statistic of any model’s K-cache. We have not yet run Mixtral, DeepSeek-MoE, or Grok. The MoE-gauge argument is architecture-agnostic modulo expert count, so the null hypothesis is that the prediction holds; a failure there would be a more interesting positive result than a confirmation.
Sub-1-bit regimes. Ternary $\{-1, 0, +1\}$ and higher-cardinality vector codebooks are candidates for beating the $1-\beta$ floor. The moment-ratio argument does not cover them; a direct extension would characterise $1 - \beta^{(3)}$ for ternary and compare compounding decays.
Citation
Cite this work
References. Zandieh et al. (2026) TurboQuant; Chee et al. (2023) QuIP; Tseng et al. (2024) QuIP#; Liu et al. (2024) KIVI; Hooper et al. (2024) KVQuant; Ashkboos et al. (2024) QuaRot; Liu et al. (2024) SpinQuant; Jiang et al. (2024) Mixtral; Dai et al. (2024) DeepSeek-MoE; Lean 4 + Mathlib.