Explore the Data
Technical deep-dive into measurements, geometry, and Lean-verified claims. Each widget has a three-line annotation: what it shows, the dominant trend, and the outlier to watch.
1 — Moment geometry
The moment ratio $\beta = \mu_1^2/\mu_2$ is a scale-invariant, distribution-free summary of how aligned a random vector is with the sign lattice $\{\pm\alpha\}^d$. For a vector $k$ on the unit sphere, the 1-bit cosine is
$$\cos\theta_k = \frac{\lVert k \rVert_1}{\sqrt{d}\,\lVert k \rVert_2},$$
proved as cos_sign_quant_formula (lean.md). The residual energy identity $\lVert k - \hat{k} \rVert^2 / \lVert k \rVert^2 = \sin^2\theta_k$ follows from residual_energy_formula (lean.md). Equality $\cos\theta_k = 1$ holds if and only if $k \in \{\pm\alpha\}^d$ (cos_eq_one_iff_sign_lattice, lean.md).
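The cosine formula and the residual energy identity can be checked numerically. A minimal sketch, assuming the per-vector 1-bit scale is $\alpha = \lVert k \rVert_1 / d$ (the scale that minimises the residual):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                      # put k on the unit sphere

# 1-bit quantization: k_hat = alpha * sign(k), with alpha = ||k||_1 / d
alpha = np.abs(k).sum() / d
k_hat = alpha * np.sign(k)

cos_theta = np.abs(k).sum() / (np.sqrt(d) * np.linalg.norm(k))  # cos_sign_quant_formula
residual = np.linalg.norm(k - k_hat) ** 2 / np.linalg.norm(k) ** 2

# residual_energy_formula: relative residual energy equals sin^2(theta_k)
assert np.isclose(residual, 1 - cos_theta ** 2)
```

The assertion holds exactly (up to float rounding) because $\lVert k - \alpha\,\mathrm{sign}(k)\rVert^2 = \lVert k\rVert^2 - \lVert k\rVert_1^2/d$ when $\alpha = \lVert k\rVert_1/d$.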
The asymptotic link to $\beta$ comes from the strong law of large numbers: if the coordinates $K_i$ are i.i.d., symmetric, with finite second moment, and $\mu_1 = \mathbb{E}|K_1|$, $\mu_2 = \mathbb{E}[K_1^2]$, then $\cos\theta_K \to \mu_1/\sqrt{\mu_2} = \sqrt{\beta}$ as $d \to \infty$. Reference values: Laplace $\beta = 1/2$; Gaussian $\beta = 2/\pi \approx 0.637$; uniform on $[-1,1]$ $\beta = 3/4$; sign lattice $\beta = 1$.
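The reference values can be reproduced by Monte Carlo, since $\beta$ is just a ratio of empirical moments (sample size and seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

def beta(x):
    """Empirical moment ratio mu_1^2 / mu_2 = (E|X|)^2 / E[X^2]."""
    return np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)

samples = {
    "laplace":  rng.laplace(size=n),       # analytic beta = 1/2
    "gaussian": rng.standard_normal(n),    # analytic beta = 2/pi
    "uniform":  rng.uniform(-1, 1, n),     # analytic beta = 3/4
}
for name, x in samples.items():
    print(name, round(beta(x), 3))
```

Scale invariance means the Laplace and uniform scale parameters are irrelevant; any choice gives the same $\beta$.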
The distribution shown is a symmetric generalised Gaussian with exponent $p$ chosen so that $\mu_1^2/\mu_2 = \beta$; the "floor $1-\beta$" chip is the asymptotic expected residual energy after 1-bit quantization.
2 — Isotropic floor and SRHT
Any data-independent orthogonal rotation acts as a mixing operator on key coordinates. By the Central Limit Theorem, rotated marginals converge toward Gaussian as $d$ grows, driving $\beta$ toward $2/\pi$ and the cosine floor toward $\sqrt{2/\pi} \approx 0.7979$.
↳ formalized in srht_achieves_isotropic_floor, isotropic_floor_pos, gaussian_moment_ratio_floor_eq
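The mixing effect is visible in a few lines. This sketch applies only the HD step of SRHT (random sign flip, then an orthonormal Walsh–Hadamard transform) to a heavy-tailed vector; the `fwht` helper and lognormal test vector are illustrative assumptions, not the project's implementation:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform with orthonormal scaling (d must be a power of 2)."""
    x = x.copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

def cos1bit(v):
    return np.abs(v).sum() / (np.sqrt(len(v)) * np.linalg.norm(v))

rng = np.random.default_rng(2)
d = 4096
k = rng.standard_normal(d) * np.exp(rng.standard_normal(d))  # heavy-tailed: low beta
signs = rng.choice([-1.0, 1.0], size=d)                      # random sign diagonal D
rotated = fwht(signs * k)                                    # HD mixing pass

print(cos1bit(k), cos1bit(rotated))   # rotated cosine approaches sqrt(2/pi)
```

A single HD pass already pushes the cosine close to the isotropic floor; SRHT proper adds subsampling on top of this.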
3 — Learned $L^1$-max rotation
A learned $L^1$-max rotation maximises $\mathbb{E}[\lVert Rk \rVert_1 / (\sqrt{d}\lVert Rk \rVert_2)]$ over a calibration set. With DC-mean centering, it universally lifts β to 0.919–0.965.
↳ formalized in dc_centering_lifts_rotational_ceiling, l2Norm_uncentered_gt
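In two dimensions the orthogonal group reduces to a single angle, so the $L^1$-max objective can be maximised by a grid scan. A toy sketch under made-up assumptions (anisotropic synthetic calibration keys, 512-point scan); the real optimizer works in $d$ dimensions with gradient methods:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy calibration set: 2-D keys concentrated along one anisotropic axis
K = rng.standard_normal((256, 2)) * np.array([3.0, 0.3])

def objective(theta, K):
    """Mean 1-bit cosine E[||Rk||_1 / (sqrt(d) ||Rk||_2)] for rotation angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    X = K @ R.T
    return float(np.mean(np.abs(X).sum(axis=1) / (np.sqrt(2) * np.linalg.norm(X, axis=1))))

thetas = np.linspace(0.0, np.pi / 2, 512)
best_theta = thetas[int(np.argmax([objective(t, K) for t in thetas]))]
print(f"identity: {objective(0.0, K):.3f}  learned: {objective(best_theta, K):.3f}")
```

The best angle lands near $\pi/4$, which maps the dominant axis onto the diagonal $\pm 1$ direction of the sign lattice, exactly the alignment $\beta$ measures.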
4 — DC-mean centering mechanism
Centering removes the mean direction from the key before rotation, freeing the optimizer to use the full $O(d)$ group instead of the $O(d-1)$ subgroup that stabilizes the mean axis. The formal ceiling result is
$$\sup_{R \in O(d)} \mathbb{E}\!\left[\frac{\lVert R v \rVert_1}{\sqrt{d}\,\lVert R v \rVert_2}\right] \;\ge\; \sup_{R \in O(d)} \mathbb{E}\!\left[\frac{\lVert R k \rVert_1}{\sqrt{d}\,\lVert R k \rVert_2}\right],$$
where $v = k - \bar{k}\mathbf{1}$ is the centered residual. See dc_centering_lifts_rotational_ceiling for the proof.
dc_centering_lifts_rotational_ceiling
proven
theorem dc_centering_lifts_rotational_ceiling
Centering a key before rotation raises the supremum of achievable β: the optimizer is freed from the O(d-1) subgroup that stabilizes the mean direction.
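The norm side of this (l2Norm_uncentered_gt) is elementary Pythagoras along the mean axis and can be verified directly; the DC offset 0.8 below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 256
k = rng.standard_normal(d) + 0.8   # key with a strong DC (mean) component
kbar = k.mean()
v = k - kbar                       # centered residual

# Decomposition along the mean axis: ||k||^2 = d*kbar^2 + ||v||^2,
# so centering strictly shrinks the l2 norm whenever kbar != 0.
assert np.isclose(np.linalg.norm(k) ** 2, d * kbar ** 2 + np.linalg.norm(v) ** 2)
assert np.linalg.norm(v) < np.linalg.norm(k)
```

The shrunken denominator is what gives the rotation optimizer headroom: the same $\lVert \cdot \rVert_1$ mass is divided by a smaller $\lVert \cdot \rVert_2$.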
5 — Last-mile quadratic hardness
All three models in the convergence scan reach β ≥ 0.85 by step 50, regardless of head dimension. The difference emerges in the last mile.
↳ formalized in beta_last_mile_quadratic_hardness, beta_local_quadratic_expansion, beta_hessian_nonneg
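A hedged sketch of why the last mile is hard, in notation consistent with beta_local_quadratic_expansion (the exponential-map parameterisation is an assumption about the lemma's form, not a quote of it): at a maximizer $R^*$ the first-order term vanishes, so progress is governed by a quadratic form with positive-semidefinite Hessian.

```latex
% Perturb a maximizer R^* along a skew-symmetric direction A with step eps:
\beta\!\left(R^{*} e^{\varepsilon A}\right)
  = \beta^{*} \;-\; \tfrac{\varepsilon^{2}}{2}\,
    \langle A,\; H_{R^{*}} A \rangle \;+\; O(\varepsilon^{3}),
\qquad H_{R^{*}} \succeq 0 \quad \text{(beta\_hessian\_nonneg)}.
```

Because the gradient shrinks linearly with the distance to $R^*$, each optimizer step closes only a fixed fraction of the remaining gap $1-\beta$, which is why all models stall in the same regime after the easy gains are exhausted.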
6 — Rank-$r$ residual correction
After 1-bit quantization, the per-sample residual still carries significant energy. The effective quality after rank-$r$ correction is $\beta_{\text{eff}} = 1 - (1-\beta_{1b})(1-\gamma_r)$, where $\gamma_r$ is the fraction of residual energy captured by the top-$r$ singular directions.
beta_eff_formula
proven
theorem beta_eff_formula (β_1b γ_r : ℝ) :
β_eff = 1 − (1 − β_1b)(1 − γ_r) combines 1-bit alignment with rank-r residual correction into one effective quality metric.
↳ formalized in beta_eff_formula, svd_topk_maximizes_gammaR, residual_topk_eckart_young
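The whole pipeline behind these lemmas fits in a short numpy sketch, assuming a per-row 1-bit scale $\alpha = \lVert k \rVert_1 / d$ and synthetic Gaussian keys (shapes and $r$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
K = rng.standard_normal((1024, 64))         # toy key matrix, one key per row

# 1-bit quantization per row: k_hat = alpha * sign(k), alpha = ||k||_1 / d
alpha = np.abs(K).mean(axis=1, keepdims=True)
E = K - alpha * np.sign(K)                  # residual matrix

# gamma_r: fraction of residual energy in the top-r singular directions.
# By Eckart-Young, the truncated SVD is the best rank-r approximation of E,
# so this gamma_r is the maximum achievable (svd_topk_maximizes_gammaR).
s = np.linalg.svd(E, compute_uv=False)
r = 8
gamma_r = (s[:r] ** 2).sum() / (s ** 2).sum()

beta_1b = 1 - (E ** 2).sum() / (K ** 2).sum()     # matrix-level 1-bit alignment
beta_eff = 1 - (1 - beta_1b) * (1 - gamma_r)      # beta_eff_formula
print(beta_1b, gamma_r, beta_eff)
```

The multiplicative form makes the trade-off explicit: rank-$r$ correction can only shrink the residual factor $(1-\beta_{1b})$, never touch the aligned part.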
8 — Per-layer β profile (Gemma-4 26B MoE)
9 — Pareto frontier (Gemma-4 26B MoE)
Each dot is one operating point. The dashed orange curve is the frontier through uniform-k points. The sweet spot k = d/8 is ringed.
↳ formalized in beta_eff_formula, top_k_optimal
10 — Front-token protection
FFN weight rotation β transfer (Part E, in flight)
11 — Part E: FFN weight β lift on Gemma-4-26B
Does the learned-rotation β-lift that works on K-cache transfer to FFN weight matrices? A single-layer validator (L10, mlp.{gate,down}_proj) reaches β_learned ∈ [0.916, 0.942], clearing the 0.90 go-threshold. Full pilot V3 then trains per-matrix R on layers [0, 7, 15, 22, 29], five matrix types per layer.
L0 + L7 β table (β_raw → β_batch → β_learned)
| matrix | shape | β_raw | β_batch (10k-row subsample) | β_learned (full) | lift |
|---|---|---|---|---|---|
| L0 mlp_gate_proj | 2112×2816 | 0.780 | 0.941 | 0.941 | +0.16 |
| L0 mlp_up_proj | 2112×2816 | 0.780 | 0.941 | 0.941 | +0.16 |
| L0 mlp_down_proj | 2816×2112 | 0.759 | 0.919 | 0.919 | +0.16 |
| L0 experts_gate_up | 180224×2816 | 0.792 | 0.885 | 0.803 | +0.011 |
| L0 experts_down | 360448×704 | 0.794 | 0.850 | 0.799 | +0.004 |
| L7 mlp_gate_proj | 2112×2816 | 0.773 | 0.942 | 0.942 | +0.17 |
Data: results/part_e_pilot_v3.json. Validator: results/part_e_one_layer_validator.json.
Interpretation
MLP is clean. Each MLP weight matrix has coherent hidden anisotropy that a single learned R captures. β = 0.94 matches the K-cache ceiling after DC-mean centering. This validates end-to-end 1-bit FFN matmul for the dense MLP block with one R per matrix per layer (three 2816×2816 R matrices per layer, 15 total across depth).
Experts are heterogeneous. The batch/full β gap is signal, not noise: R trained on a 10k-row subsample overfits the local sample and fails to generalize to the full 180k/360k-row expert population. This is the same failure mode as the pilot V1 shared-R attempt (where R was shared across all FFN matrix types and collapsed to near-zero lift).
Strategy options for experts:
- Per-expert R: train one R per (layer, expert, matrix-type) triple, i.e. 128 R-matrices per layer per matrix type (~100× storage overhead). Cost: 128 × 2816² × fp16 ≈ 2 GB per layer per matrix type.
- Full-matrix training: lift subsample to full rows. Residual materialization at 180k × 2816 × fp32 = 1.9 GB OOMs on 12 GB GPU unless β gradient is chunked.
- Accept experts at FP16: keep MLP at 1-bit + experts at bf16/fp16. Storage win is smaller but the architecture is clean.
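The chunking idea in the full-matrix option can be illustrated on the β objective itself: accumulate row-wise sums chunk by chunk so only one chunk of rotated rows is ever materialized. Sizes, names, and the numpy stand-in for the GPU path are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((20_000, 128)).astype(np.float32)  # stand-in for a tall expert matrix
R = np.linalg.qr(rng.standard_normal((128, 128)))[0].astype(np.float32)

def beta_chunked(W, R, chunk=4096):
    """Mean 1-bit cosine of W @ R.T, computed without materializing the full product."""
    d = W.shape[1]
    total, n = 0.0, 0
    for i in range(0, W.shape[0], chunk):
        X = W[i:i + chunk] @ R.T                       # only this chunk lives in memory
        total += float((np.abs(X).sum(axis=1)
                        / (np.sqrt(d) * np.linalg.norm(X, axis=1))).sum())
        n += X.shape[0]
    return total / n

# Reference: the same objective computed in one shot
full = np.abs(W @ R.T).sum(axis=1) / (np.sqrt(128) * np.linalg.norm(W @ R.T, axis=1))
assert np.isclose(beta_chunked(W, R), full.mean(), atol=1e-4)
```

The same pattern applies to the β gradient: per-row terms are independent, so gradients accumulate chunk-wise and the 1.9 GB residual buffer never needs to exist at once.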
Pilot V3 continues on L15, L22, L29 to confirm this pattern across depth. V4 will add CPU-path γ_r recovery for expert matrices via torch.svd_lowrank(residual_cpu, q=8, niter=4).