● paper · v4 · 2026-04-24
Phase collapse is the pathology. Structured rotations are the lever.
Phase collapse — the drift of K-cache coordinates away from the sign lattice — is an emergent diagnostic for 1-bit K-cache quantization viability in Mixture-of-Experts transformers. (V-cache quantization is deferred to future work.) The moment-ratio floor is $\beta=2/\pi\approx0.637$; its one-bit cosine form is $\sqrt{\beta}\approx0.798$. Verified centered learned measurements span 0.919–0.965, while the supported deployable mechanism is structural conditioning with cond=1 SRHT-style rotations.
four architectures
β floor = 2/π
e4b → 26B-MoE
FP16 2.2 → SRHT 942 → learned 25.7
§1 Universal β lift
One centered audit per matrix. Four architectures. One pattern.
The moment ratio $\beta = (\mathbb{E}|K_1|)^2/\mathbb{E}[K_1^2]$ is the single number that governs 1-bit sign-quantization quality as head dimension grows. By Cauchy–Schwarz, $\beta \leq 1$; the asymptotic residual energy is $1-\beta$. A data-independent SRHT rotation is pinned to the Gaussian moment-ratio floor $2/\pi\approx0.637$ (one-bit cosine $\sqrt{2/\pi}\approx0.798$). Learned $L^1$-max with DC-mean centering clears 0.92 in the verified centered measurements, but the current mechanism claim is structural conditioning, not a data-dependent advantage over random orthogonal rotations.
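The two constants quoted above are just the half-normal moments of a centered Gaussian coordinate, which is what a data-independent rotation produces in the large-dimension limit; a short derivation:

$$
K_1 \sim \mathcal{N}(0,\sigma^2)\ \Rightarrow\ \mathbb{E}|K_1| = \sigma\sqrt{2/\pi},\quad \mathbb{E}[K_1^2] = \sigma^2
\ \Rightarrow\ \beta = \frac{(\mathbb{E}|K_1|)^2}{\mathbb{E}[K_1^2]} = \frac{2}{\pi} \approx 0.637,
$$

and the one-bit cosine is the correlation between $K_1$ and $\mathrm{sign}(K_1)$:

$$
\frac{\mathbb{E}[K_1\,\mathrm{sign}(K_1)]}{\sqrt{\mathbb{E}[K_1^2]}\,\sqrt{\mathbb{E}[\mathrm{sign}(K_1)^2]}} = \frac{\mathbb{E}|K_1|}{\sigma} = \sqrt{2/\pi} \approx 0.798.
$$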
| model | β_raw | β_cen | β_SRHT | β_learned | Δβ | PPL fp16 | PPL 1-bit learned |
|---|---|---|---|---|---|---|---|
β values from _data/site_data.json. Δβ is color-intensity-coded: hotter = larger lift. PPL entries marked "—" are pending (see ppl_ablation.json). The 36.6× PPL gap on Gemma-4 26B MoE is the compounding-across-layers penalty SRHT pays and learned-R avoids.
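As a minimal sketch of what a centered β audit measures (not the repo's audit code: the synthetic K below, the function names, and the Haar-random orthogonal matrix standing in for an SRHT-style cond=1 rotation are all assumptions for illustration):

```python
import numpy as np

def beta(K: np.ndarray) -> float:
    """Moment ratio beta = (E|k|)^2 / E[k^2] over all cache entries."""
    k = K.ravel()
    return float(np.abs(k).mean() ** 2 / (k ** 2).mean())

def centered_beta(K: np.ndarray) -> float:
    """Remove the per-coordinate (DC) mean before taking the moment ratio (sketch)."""
    return beta(K - K.mean(axis=0, keepdims=True))

def rotated_beta(K: np.ndarray, seed: int = 0) -> float:
    """beta after a random orthogonal rotation, standing in for an SRHT-style rotation."""
    rng = np.random.default_rng(seed)
    d = K.shape[-1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # Haar-ish orthogonal matrix
    return beta(K @ Q)

if __name__ == "__main__":
    # Synthetic stand-in for a (tokens x head_dim) K-cache slice with unequal scales.
    K = np.random.default_rng(1).standard_normal((4096, 256)) * np.linspace(0.5, 2.0, 256)
    print(f"beta_raw      = {beta(K):.3f}")
    print(f"beta_centered = {centered_beta(K):.3f}")
    print(f"beta_rotated  = {rotated_beta(K):.3f}   (Gaussian floor 2/pi ~ {2 / np.pi:.3f})")
```

In this synthetic construction the raw value sits below 2/π and the rotated value lands near it; the real per-model numbers in the table are whatever the audit reports for actual K-caches.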
§2 Convergence dashboard
β gets to 0.85 for free. The last mile is quadratically hard.
Per-model β vs training step, with thresholds at β = 0.85 / 0.90 / 0.92. All models reach 0.85 by step 50; OLMoE (d=2048) clears 0.92 by step 200, Gemma-4 e4b (d=512) takes 500, and Gemma-4 e2b (d=256) stalls just under 0.92. Theorems beta_last_mile_quadratic_hardness and beta_hessian_nonneg formalize the quadratic cost near the sign lattice.
Dashed cyan: SRHT floor. Vermillion dots: step where β first exceeds 0.85, 0.90, 0.92.
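One plausible informal reading of those two theorem names (the exponential-map parameterization and the symbols $A$, $\delta$, $H$ below are this sketch's, not the Lean statements): near a stationary rotation $R^\star$, write $R = R^\star e^{\delta A}$ for a skew-symmetric direction $A$ and step size $\delta$; then

$$
1-\beta\!\left(R^\star e^{\delta A}\right) \;=\; \bigl(1-\beta(R^\star)\bigr) \;+\; \tfrac{\delta^2}{2}\,\langle A,\, H A\rangle \;+\; O(\delta^3), \qquad H \succeq 0,
$$

so the first-order term vanishes and the residual deficit is locally a quadratic form with a positive-semidefinite Hessian, which is the sense in which the final approach in the dashboard is quadratically hard.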
§3 FFN extension · v4
Dense MLPs yield. MoE expert banks push back.
Extending learned-R from K-cache to feed-forward weights works cleanly for dense MLP matrices (β_eff 0.92–0.94 uniformly across 30 layers, gate/up/down). It hits a structural wall on MoE expert matrices, where β plateaus at ≈ 0.83. We call this Stiefel frustration: 128 experts occupy mutually near-orthogonal bases that no single rotation can simultaneously align. The phenomenon is as much an architectural discovery about MoE latent geometry as it is a quantization recipe.
MLP victory
β_eff for gate_proj, up_proj, down_proj across 30 layers. The figure is flat-in-depth: no layer is meaningfully harder than any other.
The Stiefel frustration
For MoE expert banks, β saturates near 0.83 regardless of rotation training budget. Geometrically, the bank's 128 experts occupy mutually near-orthogonal subspaces on the Stiefel manifold; no single rotation can align them all. Full explainer on the explore page; a measurement sketch follows below.
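A sketch of the measurement behind that claim, assuming access to a list of per-expert weight matrices and candidate rotations (the names experts, R, and per_expert_R are placeholders, and right-multiplying by the rotation is a convention choice here, not the repo's):

```python
import numpy as np

def beta(W: np.ndarray) -> float:
    """Moment ratio of the matrix entries: (E|w|)^2 / E[w^2]."""
    w = W.ravel()
    return float(np.abs(w).mean() ** 2 / (w ** 2).mean())

def shared_vs_per_expert_beta(experts: list[np.ndarray], R: np.ndarray,
                              per_expert_R: list[np.ndarray]) -> tuple[float, float]:
    """Mean beta under one shared rotation vs under one rotation per expert."""
    shared = float(np.mean([beta(W @ R) for W in experts]))
    individual = float(np.mean([beta(W @ Re) for W, Re in zip(experts, per_expert_R)]))
    return shared, individual
```

A persistent gap between the shared and per-expert means is the frustration signature: each expert can be aligned in isolation, but no single rotation serves them all.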
§4 RWKV replacement branch
Algebra passed. Replacement now has a serial kill ladder.
The finite-feature receptacle result is now separated from the stronger architecture-replacement claim. RW0 passed as an algebraic feasibility check; RW1–RW3 are queued so that generalization, carrier choice, and live model insertion fail fast before downstream compute is spent.
| rung | status | success gate | kill gate | artifact |
|---|---|---|---|---|
| RW0 | PASS | Algebraic feasibility: exact replay max rel L2 <= 0.00857 and m=64 output rel L2 = 0.0856/0.1419/0.0316 on layers 0/8/15. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.r0.v1.json |
| RW1.0 | KILL | Same RW1 held-out gate with no task-trained feature map: replay <= 1e-2, all layers monotone, final max rel L2 <= 0.10, gap <= 0.04. | Frozen task-independent features still have monotone layers < 2 or final held-out max rel L2 > 0.14. | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p2.v1.json |
| RW1.1 | KILL | Replay <= 1e-2, held-out ladder monotone on all sampled layers, final max rel L2 <= 0.10, train/validation gap <= 0.04. | Replay passes but monotone layers < 2 or final held-out max rel L2 > 0.14. | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1.v1.json |
| RW1.2 | KILL | Same RW1 held-out gate after trimming the feature search: replay <= 1e-2, all layers monotone, final max rel L2 <= 0.10, gap <= 0.04. | Replay passes but shared-head anchored features still have monotone layers < 2 or final held-out max rel L2 > 0.14. | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p1.v1.json |
| RW1.3 | KILL | Same RW1 held-out gate with unchanged recurrent state size and smoother head-aware parameterization. | Shared basis plus per-head gates still has monotone layers < 2 or final held-out max rel L2 > 0.14. | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p3.v1.json |
| RW1.4 | KILL | Source-side before/after layer mean field plus first-order corrections passes RW1 held-out gate: replay <= 1e-2, all layers monotone, final max rel L2 <= 0.10, gap <= 0.04. | Source-layer mean field still has monotone layers < 2 or final held-out max rel L2 > 0.14; second-order correction audit also fails if layer 8 remains high. | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4.v1.json |
| RW1.4.1 | KILL | Analytic source-interaction expansion around RW1.4 mean field: mean field plus first-order fluctuation response plus second-order pair/correlation correction. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4p1.layer8.v1.json |
| RW1.4.2 | KILL | Constrained nonlocal history propagator after RW1.4.1 local source expansion, inspired by a sum-over-histories/action correction but emitted only as a byte-accounted diagnostic. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4p2.layer8.v1.json |
| RW1.4.3 | KILL | Finite quadratic action-derived nonlocal history propagator after RW1.4.1 local source expansion. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4p3.layer8.v1.json |
| RW1.4.4 | KILL | Noether-style invariant gate on the RW1.4.3 action-derived nonlocal propagator. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4p4.layer8.v1.json |
| RW1.4.5 | KILL | Constrained Noether-tangent action basis for the RW1.4.4 action branch: project the correction space before fitting/evaluation instead of auditing leaks only after fitting. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4p5.layer8.v1.json |
| RW1.4.6 | KILL | Constrained variational action solve inside the Noether tangent manifold from the start. This is the postmortem opening after RW1.4.5: solve in the constrained manifold, not solve first and project afterward. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4p6.layer8.v1.json |
| RW1.4.7 | KILL | Global all-layer source-field diagnostic for the hypothesis that replacing a correlated transformer layer requires a sufficient statistic over the whole source stack, approximating an RG-like fixed point rather than a local source neighborhood. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4p7.layer8.v1.json |
| RW1.4.8 | KILL | Oracle full-fidelity all-layer source receptacle: preserve every non-target source layer as layer-resolved recurrent source events instead of compressing them to a mean field. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p4p8.layer8.v1.json |
| RW1.5 | KILL | Layer-8-first cross-fitted structured kernel mixture with fit/guard/locked splits, gap penalty, denominator stability penalty, and Lean GatePass certificate emission. | already passed | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw1p5.layer8.v1.json |
| RW2 | BLOCKED_RW1X_NO_PASS | Decayed carrier remains within +0.015 rel L2 of RW1.4 final max, replay passes, denominators stay stable, and at least two layers are monotone. | Decayed carrier is worse than RW1.4 by >0.05 rel L2 or denominator stability fails. | cross-check/rwkv-receptacle/olmoe_attention_receptacle_probe.rw2.v1.json |
| RW3 | BLOCKED_RW1X_NO_PASS | Single patched layer has mean NLL inflation <= 2.0% and max logit rel L2 <= 0.10. | Mean NLL inflation > 5.0% or max logit rel L2 > 0.25. | cross-check/rwkv-receptacle/olmoe_attention_receptacle_block_swap.rw3.v1.json |
Queue command: `uv run python cross-check/rwkv-receptacle/rw_replacement_ladder_queue.py --run --stop RW3`. The queue stops on KILL or INCONCLUSIVE: RW2 does not launch unless RW1 passes, and RW3 does not launch unless RW2 passes.
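The serial gating is simple enough to sketch; this is a hand-written approximation of the behavior described above (stop on the first KILL or INCONCLUSIVE, never launch a later rung past a failure), not the queue script's source:

```python
from dataclasses import dataclass
from typing import Callable

# Statuses that stop the ladder, mirroring the table above.
STOP_STATUSES = {"KILL", "INCONCLUSIVE"}

@dataclass
class Rung:
    name: str                # e.g. "RW1.1"
    run: Callable[[], str]   # returns "PASS", "KILL", or "INCONCLUSIVE"

def run_ladder(rungs: list[Rung], stop_at: str = "RW3") -> dict[str, str]:
    """Run rungs in order; stop on the first non-PASS status or after `stop_at`.

    Because execution is strictly serial, RW2 never launches unless every
    earlier rung passed, and RW3 never launches unless RW2 passed: the
    "fail fast before downstream compute" property.
    """
    results: dict[str, str] = {}
    for rung in rungs:
        status = rung.run()
        results[rung.name] = status
        if status in STOP_STATUSES or rung.name == stop_at:
            break
    return results
```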
§5 Interactive probes
Three tweening demos. Same scaffold as the V-cache widgets.
These feel like physics demos, not animations: drag, read the β readout, think. Below 500 px the SVG body falls back to a static snapshot.
§6 Lean status
111 theorems proved. 20 in flight.
Load-bearing theorems on the β framework, centering lift, and residual correction — proved in Lean 4 + Mathlib. Full theorem-by-theorem status lives on the Lean page and the theorems appendix; the totals below are pulled live from _data/lean_status.json.
"Deferred" proofs are blocked on a pending Mathlib lemma or a principled analytic sorry. "Stub" entries have a complete statement but an unfinished body. See lean.html for the cross-link matrix between theorems and paper sections.
§7 Paper artifacts
Paper · code · rotations.
BibTeX
@misc{basu2026phasecollapse,
title = {Phase-Collapse Defragmentation: A Moment-Ratio Framework for
1-Bit KV-Cache Quantization in Mixture-of-Experts Transformers},
author = {Basu, Debanjan},
year = {2026},
url = {https://github.com/d3banjan/moe-gauge-paper}
}