Explore the Data
Technical deep-dive into measurements, geometry, and Lean-verified claims. Each widget has a three-line annotation: what it shows, the dominant trend, and the outlier to watch.
1 — Moment geometry
The moment ratio $\beta = \mu_1^2/\mu_2$ is a scale-invariant, distribution-free summary of how aligned a random vector is with the sign lattice $\{\pm\alpha\}^d$. For a vector $k$ on the unit sphere, the 1-bit cosine is
$$\cos\theta_k = \frac{\lVert k \rVert_1}{\sqrt{d}\,\lVert k \rVert_2},$$
proved as cos_sign_quant_formula (lean.md). The residual energy identity $\lVert k - \hat{k} \rVert^2 / \lVert k \rVert^2 = \sin^2\theta_k$ follows from residual_energy_formula (lean.md). Equality $\cos\theta_k = 1$ holds if and only if $k \in \{\pm\alpha\}^d$ (cos_eq_one_iff_sign_lattice, lean.md).
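The cosine formula and the residual energy identity can be checked numerically. A minimal sketch, assuming the per-vector 1-bit scale is $\alpha = \lVert k \rVert_1 / d$ (the scale that minimises the residual):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                      # put k on the unit sphere

# 1-bit quantization: k_hat = alpha * sign(k), with alpha = ||k||_1 / d
alpha = np.abs(k).sum() / d
k_hat = alpha * np.sign(k)

cos_theta = np.abs(k).sum() / (np.sqrt(d) * np.linalg.norm(k))  # cos_sign_quant_formula
residual = np.linalg.norm(k - k_hat) ** 2 / np.linalg.norm(k) ** 2

# residual_energy_formula: relative residual energy equals sin^2(theta_k)
assert np.isclose(residual, 1 - cos_theta ** 2)
```

The assertion holds exactly (up to float rounding) because $\lVert k - \alpha\,\mathrm{sign}(k)\rVert^2 = \lVert k\rVert^2 - \lVert k\rVert_1^2/d$ when $\alpha = \lVert k\rVert_1/d$.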
The asymptotic link to $\beta$ comes from the strong law of large numbers: if the coordinates $K_i$ are i.i.d., symmetric, with finite second moment, and $\mu_1 = \mathbb{E}|K_1|$, $\mu_2 = \mathbb{E}[K_1^2]$, then $\cos\theta_K \to \mu_1/\sqrt{\mu_2} = \sqrt{\beta}$ as $d \to \infty$. Reference values: Laplace $\beta = 1/2$; Gaussian $\beta = 2/\pi \approx 0.637$; uniform on $[-1,1]$ $\beta = 3/4$; sign lattice $\beta = 1$.
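The reference values can be reproduced by Monte Carlo, since $\beta$ is just a ratio of empirical moments (sample size and seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

def beta(x):
    """Empirical moment ratio mu_1^2 / mu_2 = (E|X|)^2 / E[X^2]."""
    return np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)

samples = {
    "laplace":  rng.laplace(size=n),       # analytic beta = 1/2
    "gaussian": rng.standard_normal(n),    # analytic beta = 2/pi
    "uniform":  rng.uniform(-1, 1, n),     # analytic beta = 3/4
}
for name, x in samples.items():
    print(name, round(beta(x), 3))
```

Scale invariance means the Laplace and uniform scale parameters are irrelevant; any choice gives the same $\beta$.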
The distribution shown is a symmetric generalised Gaussian with exponent $p$ chosen so that $\mu_1^2/\mu_2 = \beta$; the "floor $1-\beta$" chip is the asymptotic expected residual energy after 1-bit quantization.
2 — Isotropic floor and SRHT
Any data-independent orthogonal rotation acts as a mixing operator on key coordinates. By the Central Limit Theorem, rotated marginals converge toward Gaussian as $d$ grows, driving $\beta$ toward $2/\pi$ and the cosine floor toward $\sqrt{2/\pi} \approx 0.7979$.
↳ formalized in srht_achieves_isotropic_floor, isotropic_floor_pos, gaussian_moment_ratio_floor_eq
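The mixing effect is visible in a few lines. This sketch applies only the HD step of SRHT (random sign flip, then an orthonormal Walsh–Hadamard transform) to a heavy-tailed vector; the `fwht` helper and lognormal test vector are illustrative assumptions, not the project's implementation:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform with orthonormal scaling (d must be a power of 2)."""
    x = x.copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

def cos1bit(v):
    return np.abs(v).sum() / (np.sqrt(len(v)) * np.linalg.norm(v))

rng = np.random.default_rng(2)
d = 4096
k = rng.standard_normal(d) * np.exp(rng.standard_normal(d))  # heavy-tailed: low beta
signs = rng.choice([-1.0, 1.0], size=d)                      # random sign diagonal D
rotated = fwht(signs * k)                                    # HD mixing pass

print(cos1bit(k), cos1bit(rotated))   # rotated cosine approaches sqrt(2/pi)
```

A single HD pass already pushes the cosine close to the isotropic floor; SRHT proper adds subsampling on top of this.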
3 — Learned $L^1$-max rotation
A learned $L^1$-max rotation maximises $\mathbb{E}[\lVert Rk \rVert_1 / (\sqrt{d}\lVert Rk \rVert_2)]$ over a calibration set. With DC-mean centering, it universally lifts β to 0.919–0.965.
↳ formalized in dc_centering_lifts_rotational_ceiling, l2Norm_uncentered_gt
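In two dimensions the orthogonal group reduces to a single angle, so the $L^1$-max objective can be maximised by a grid scan. A toy sketch under made-up assumptions (anisotropic synthetic calibration keys, 512-point scan); the real optimizer works in $d$ dimensions with gradient methods:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy calibration set: 2-D keys concentrated along one anisotropic axis
K = rng.standard_normal((256, 2)) * np.array([3.0, 0.3])

def objective(theta, K):
    """Mean 1-bit cosine E[||Rk||_1 / (sqrt(d) ||Rk||_2)] for rotation angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    X = K @ R.T
    return float(np.mean(np.abs(X).sum(axis=1) / (np.sqrt(2) * np.linalg.norm(X, axis=1))))

thetas = np.linspace(0.0, np.pi / 2, 512)
best_theta = thetas[int(np.argmax([objective(t, K) for t in thetas]))]
print(f"identity: {objective(0.0, K):.3f}  learned: {objective(best_theta, K):.3f}")
```

The best angle lands near $\pi/4$, which maps the dominant axis onto the diagonal $\pm 1$ direction of the sign lattice, exactly the alignment $\beta$ measures.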
4 — DC-mean centering mechanism
Centering removes the mean direction from the key before rotation, freeing the optimizer to use the full $O(d)$ group instead of the $O(d-1)$ subgroup that stabilizes the mean axis. The formal ceiling result is
$$\sup_{R \in O(d)} \mathbb{E}\!\left[\frac{\lVert R v \rVert_1}{\sqrt{d}\,\lVert R v \rVert_2}\right] \;\ge\; \sup_{R \in O(d)} \mathbb{E}\!\left[\frac{\lVert R k \rVert_1}{\sqrt{d}\,\lVert R k \rVert_2}\right],$$
where $v = k - \bar{k}\mathbf{1}$ is the centered residual. See dc_centering_lifts_rotational_ceiling for the proof.
dc_centering_lifts_rotational_ceiling
proven
theorem dc_centering_lifts_rotational_ceiling
Centering a key before rotation raises the supremum of achievable β: the optimizer is freed from the O(d-1) subgroup that stabilizes the mean direction.
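The norm side of this (l2Norm_uncentered_gt) is elementary Pythagoras along the mean axis and can be verified directly; the DC offset 0.8 below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 256
k = rng.standard_normal(d) + 0.8   # key with a strong DC (mean) component
kbar = k.mean()
v = k - kbar                       # centered residual

# Decomposition along the mean axis: ||k||^2 = d*kbar^2 + ||v||^2,
# so centering strictly shrinks the l2 norm whenever kbar != 0.
assert np.isclose(np.linalg.norm(k) ** 2, d * kbar ** 2 + np.linalg.norm(v) ** 2)
assert np.linalg.norm(v) < np.linalg.norm(k)
```

The shrunken denominator is what gives the rotation optimizer headroom: the same $\lVert \cdot \rVert_1$ mass is divided by a smaller $\lVert \cdot \rVert_2$.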
5 — Last-mile quadratic hardness
All three models in the convergence scan reach β ≥ 0.85 by step 50, regardless of head dimension. The difference emerges in the last mile.
↳ formalized in beta_last_mile_quadratic_hardness, beta_local_quadratic_expansion, beta_hessian_nonneg
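A hedged sketch of why the last mile is hard, in notation consistent with beta_local_quadratic_expansion (the exponential-map parameterisation is an assumption about the lemma's form, not a quote of it): at a maximizer $R^*$ the first-order term vanishes, so progress is governed by a quadratic form with positive-semidefinite Hessian.

```latex
% Perturb a maximizer R^* along a skew-symmetric direction A with step eps:
\beta\!\left(R^{*} e^{\varepsilon A}\right)
  = \beta^{*} \;-\; \tfrac{\varepsilon^{2}}{2}\,
    \langle A,\; H_{R^{*}} A \rangle \;+\; O(\varepsilon^{3}),
\qquad H_{R^{*}} \succeq 0 \quad \text{(beta\_hessian\_nonneg)}.
```

Because the gradient shrinks linearly with the distance to $R^*$, each optimizer step closes only a fixed fraction of the remaining gap $1-\beta$, which is why all models stall in the same regime after the easy gains are exhausted.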
6 — Rank-$r$ residual correction
After 1-bit quantization, the per-sample residual still carries significant energy. The effective quality after rank-$r$ correction is $\beta_{\text{eff}} = 1 - (1-\beta_{1b})(1-\gamma_r)$, where $\gamma_r$ is the fraction of residual energy captured by the top-$r$ singular directions.
beta_eff_formula
proven
theorem beta_eff_formula (β_1b γ_r : ℝ) :
β_eff = 1 − (1 − β_1b)(1 − γ_r) combines 1-bit alignment with rank-r residual correction into one effective quality metric.
↳ formalized in beta_eff_formula, svd_topk_maximizes_gammaR, residual_topk_eckart_young
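The whole pipeline behind these lemmas fits in a short numpy sketch, assuming a per-row 1-bit scale $\alpha = \lVert k \rVert_1 / d$ and synthetic Gaussian keys (shapes and $r$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
K = rng.standard_normal((1024, 64))         # toy key matrix, one key per row

# 1-bit quantization per row: k_hat = alpha * sign(k), alpha = ||k||_1 / d
alpha = np.abs(K).mean(axis=1, keepdims=True)
E = K - alpha * np.sign(K)                  # residual matrix

# gamma_r: fraction of residual energy in the top-r singular directions.
# By Eckart-Young, the truncated SVD is the best rank-r approximation of E,
# so this gamma_r is the maximum achievable (svd_topk_maximizes_gammaR).
s = np.linalg.svd(E, compute_uv=False)
r = 8
gamma_r = (s[:r] ** 2).sum() / (s ** 2).sum()

beta_1b = 1 - (E ** 2).sum() / (K ** 2).sum()     # matrix-level 1-bit alignment
beta_eff = 1 - (1 - beta_1b) * (1 - gamma_r)      # beta_eff_formula
print(beta_1b, gamma_r, beta_eff)
```

The multiplicative form makes the trade-off explicit: rank-$r$ correction can only shrink the residual factor $(1-\beta_{1b})$, never touch the aligned part.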
8 — Per-layer β profile (Gemma-4 26B MoE)
9 — Pareto frontier (Gemma-4 26B MoE)
Each dot is one operating point. The dashed orange curve is the frontier through uniform-k points. The sweet spot k = d/8 is ringed.
↳ formalized in beta_eff_formula, top_k_optimal
10 — Front-token protection
FFN weight rotation β transfer (Part E, in flight)
11 — Part E: FFN weight β lift on Gemma-4-26B
Does the learned-rotation β-lift that works on K-cache transfer to FFN weight matrices? A single-layer validator (L10, mlp.{gate,down}_proj) reaches β_learned ∈ [0.916, 0.942], clearing the 0.90 go-threshold. Full pilot V3 then trains per-matrix R on layers [0, 7, 15, 22, 29], five matrix types per layer.
L0 + L7 β table (β_raw → β_batch → β_learned)
| matrix | shape | β_raw | β_batch (10k-row subsample) | β_learned (full) | lift |
|---|---|---|---|---|---|
| L0 mlp_gate_proj | 2112×2816 | 0.780 | 0.941 | 0.941 | +0.16 |
| L0 mlp_up_proj | 2112×2816 | 0.780 | 0.941 | 0.941 | +0.16 |
| L0 mlp_down_proj | 2816×2112 | 0.759 | 0.919 | 0.919 | +0.16 |
| L0 experts_gate_up | 180224×2816 | 0.792 | 0.885 | 0.803 | +0.011 |
| L0 experts_down | 360448×704 | 0.794 | 0.850 | 0.799 | +0.004 |
| L7 mlp_gate_proj | 2112×2816 | 0.773 | 0.942 | 0.942 | +0.17 |
Data: results/part_e_pilot_v3.json. Validator: results/part_e_one_layer_validator.json.
Interpretation
MLP is clean. Each MLP weight matrix has coherent hidden anisotropy that a single learned R captures. β = 0.94 matches the K-cache ceiling after DC-mean centering. This validates end-to-end 1-bit FFN matmul for the dense MLP block with one R per matrix per layer (three 2816×2816 R matrices per layer, 15 total across depth).
Experts are heterogeneous. The batch/full β gap is signal, not noise: R trained on a 10k-row subsample overfits the local sample and fails to generalize to the full 180k/360k-row expert population. This is the same failure mode as the pilot V1 shared-R attempt (where R was shared across all FFN matrix types and collapsed to near-zero lift).
Strategy options for experts:
- Per-expert R: train one R per (layer, expert, matrix-type) triple, i.e. 128 R-matrices per layer per matrix type (~100× storage overhead). Cost: 128 × 2816² × fp16 ≈ 2 GB per layer per matrix type.
- Full-matrix training: lift subsample to full rows. Residual materialization at 180k × 2816 × fp32 = 1.9 GB OOMs on 12 GB GPU unless β gradient is chunked.
- Accept experts at FP16: keep MLP at 1-bit + experts at bf16/fp16. Storage win is smaller but the architecture is clean.
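The chunking idea in the full-matrix option can be illustrated on the β objective itself: accumulate row-wise sums chunk by chunk so only one chunk of rotated rows is ever materialized. Sizes, names, and the numpy stand-in for the GPU path are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((20_000, 128)).astype(np.float32)  # stand-in for a tall expert matrix
R = np.linalg.qr(rng.standard_normal((128, 128)))[0].astype(np.float32)

def beta_chunked(W, R, chunk=4096):
    """Mean 1-bit cosine of W @ R.T, computed without materializing the full product."""
    d = W.shape[1]
    total, n = 0.0, 0
    for i in range(0, W.shape[0], chunk):
        X = W[i:i + chunk] @ R.T                       # only this chunk lives in memory
        total += float((np.abs(X).sum(axis=1)
                        / (np.sqrt(d) * np.linalg.norm(X, axis=1))).sum())
        n += X.shape[0]
    return total / n

# Reference: the same objective computed in one shot
full = np.abs(W @ R.T).sum(axis=1) / (np.sqrt(128) * np.linalg.norm(W @ R.T, axis=1))
assert np.isclose(beta_chunked(W, R), full.mean(), atol=1e-4)
```

The same pattern applies to the β gradient: per-row terms are independent, so gradients accumulate chunk-wise and the 1.9 GB residual buffer never needs to exist at once.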
Pilot V3 continues on L15, L22, L29 to confirm this pattern across depth. V4 will add CPU-path γ_r recovery for expert matrices via torch.svd_lowrank(residual_cpu, q=8, niter=4).