Explore the Data

Technical deep-dive into measurements, geometry, and Lean-verified claims. Each widget has a three-line annotation: what it shows, the dominant trend, and the outlier to watch.


1 — Moment geometry

The moment ratio $\beta = \mu_1^2/\mu_2$ is a scale-invariant, distribution-free summary of how aligned a random vector is with the sign lattice $\{\pm\alpha\}^d$. For a vector $k$ on the unit sphere, the 1-bit cosine is

$$ \cos\theta_k \;=\; \frac{\lVert k \rVert_1}{\sqrt{d} \cdot \lVert k \rVert_2}, $$

proved as cos_sign_quant_formula (lean.md). The residual energy identity $\lVert k - \hat{k} \rVert^2 / \lVert k \rVert^2 = \sin^2\theta_k$ follows from residual_energy_formula (lean.md). Equality $\cos\theta_k = 1$ holds if and only if $k \in \{\pm\alpha\}^d$ (cos_eq_one_iff_sign_lattice, lean.md).
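Both identities are easy to check numerically. A minimal sketch: the closed-form scale $\alpha = \lVert k \rVert_1 / d$ used below is the least-squares optimum over the sign lattice, an assumption of this illustration rather than a quotation from the Lean development.

```python
import math
import random

random.seed(0)
d = 64
k = [random.gauss(0.0, 1.0) for _ in range(d)]

l1 = sum(abs(x) for x in k)
l2 = math.sqrt(sum(x * x for x in k))
cos_theta = l1 / (math.sqrt(d) * l2)  # 1-bit cosine ||k||_1 / (sqrt(d) ||k||_2)

# Best sign-lattice approximant k_hat = alpha * sign(k), with the
# least-squares scale alpha = ||k||_1 / d (assumed here).
alpha = l1 / d
residual = sum((x - alpha * math.copysign(1.0, x)) ** 2 for x in k)
rel_residual = residual / (l2 * l2)

# Residual energy identity: ||k - k_hat||^2 / ||k||^2 = sin^2(theta_k).
assert abs(rel_residual - (1.0 - cos_theta ** 2)) < 1e-12
```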

The asymptotic link to $\beta$ comes from the strong law of large numbers: if coordinates $K_i$ are i.i.d., symmetric, with finite second moment, then $\cos\theta_K \to \mu_1/\sqrt{\mu_2} = \sqrt\beta$ as $d \to \infty$. Reference values: Laplace $\beta = 1/2$; Gaussian $\beta = 2/\pi \approx 0.637$; uniform on $[-1,1]$ $\beta = 3/4$; sign lattice $\beta = 1$.
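The convergence can be sanity-checked by sampling. A minimal sketch, assuming unit-scale parameterizations of each reference distribution (scale drops out of the ratio anyway):

```python
import math
import random

random.seed(1)
d = 100_000

def cos_theta(k):
    """1-bit cosine ||k||_1 / (sqrt(d) ||k||_2)."""
    l1 = sum(abs(x) for x in k)
    l2 = math.sqrt(sum(x * x for x in k))
    return l1 / (math.sqrt(len(k)) * l2)

def laplace():
    """Unit-scale Laplace via inverse CDF."""
    u = random.random() - 0.5
    return -math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

cases = [
    (laplace, 0.5),                                   # Laplace: beta = 1/2
    (lambda: random.gauss(0.0, 1.0), 2.0 / math.pi),  # Gaussian: beta = 2/pi
    (lambda: random.uniform(-1.0, 1.0), 0.75),        # uniform on [-1,1]: beta = 3/4
]

results = []
for sample, beta in cases:
    c = cos_theta([sample() for _ in range(d)])
    results.append((c, math.sqrt(beta)))  # empirical cosine vs sqrt(beta)

for c, target in results:
    assert abs(c - target) < 0.01
```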

↓ formalized in iid_coord_moment_ratio_floor

Distribution shape is a symmetric generalised Gaussian with exponent $p$ chosen so that $\mu_1^2/\mu_2 = \beta$. The chip "floor $1-\beta$" shows the asymptotic expected residual energy under 1-bit quantization.
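The exponent-to-$\beta$ map has a closed form from the generalized-Gaussian absolute moments, $\beta(p) = \Gamma(2/p)^2 / (\Gamma(1/p)\,\Gamma(3/p))$. The bisection inverter below is an illustrative assumption, not necessarily the widget's actual solver:

```python
import math

def beta_gg(p):
    """Moment ratio mu_1^2 / mu_2 of a symmetric generalized Gaussian, exponent p."""
    return math.gamma(2.0 / p) ** 2 / (math.gamma(1.0 / p) * math.gamma(3.0 / p))

# Reference exponents: Laplace (p=1), Gaussian (p=2), near-uniform (large p).
assert abs(beta_gg(1.0) - 0.5) < 1e-12
assert abs(beta_gg(2.0) - 2.0 / math.pi) < 1e-12
assert abs(beta_gg(200.0) - 0.75) < 0.01

def exponent_for_beta(target, lo=0.2, hi=50.0):
    """Invert beta(p) by bisection; beta_gg is increasing in p on this range."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if beta_gg(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

assert abs(exponent_for_beta(2.0 / math.pi) - 2.0) < 1e-6
```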


2 — Isotropic floor and SRHT

Any data-independent orthogonal rotation acts as a mixing operator on key coordinates. By the Central Limit Theorem, rotated marginals converge toward Gaussian as $d$ grows, driving $\beta$ toward $2/\pi$ and the cosine floor toward $\sqrt{2/\pi} \approx 0.7979$.
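A toy SRHT-style pass shows the floor emerging on a deliberately heavy-tailed key. This sketch keeps only the sign-flip and Walsh–Hadamard mixing stages (the subsampling step of a full SRHT is omitted), and the heavy-tailed input is an assumption for illustration:

```python
import math
import random

def fwht(x):
    """In-place fast Walsh-Hadamard transform (unnormalized); len(x) a power of 2."""
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2

def cos_theta(k):
    return sum(abs(v) for v in k) / (math.sqrt(len(k)) * math.sqrt(sum(v * v for v in k)))

random.seed(2)
d = 4096
# Heavy-tailed key: a product of two Gaussians is leptokurtic, so beta_raw < 2/pi.
k = [random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d)]
before = cos_theta(k)

# SRHT-style rotation: random sign flip, then Hadamard mixing.
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]
rk = [s * v for s, v in zip(signs, k)]
fwht(rk)  # cos_theta is scale-invariant, so the 1/sqrt(d) normalization is optional

after = cos_theta(rk)
assert after > before
assert abs(after - math.sqrt(2.0 / math.pi)) < 0.05  # near the isotropic floor
```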

↓ formalized in srht_achieves_isotropic_floor  •  β scale

↳ formalized in srht_achieves_isotropic_floor, isotropic_floor_pos, gaussian_moment_ratio_floor_eq


3 — Learned $L^1$-max rotation

A learned $L^1$-max rotation maximises $\mathbb{E}[\lVert Rk \rVert_1 / (\sqrt{d}\lVert Rk \rVert_2)]$ over a calibration set. With DC-mean centering, it lifts $\beta$ to 0.919–0.965 across the models tested.
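As a toy stand-in for the learned rotation: a brute-force angle search over a synthetic 2-D calibration set, maximising the same objective. The real optimizer presumably works over the full $O(d)$ group with gradients; everything below (data, grid, thresholds) is an assumption for illustration.

```python
import math
import random

random.seed(3)

def cos1bit(k):
    """1-bit cosine of a 2-D vector."""
    return (abs(k[0]) + abs(k[1])) / (math.sqrt(2.0) * math.hypot(k[0], k[1]))

def rotate(k, t):
    c, s = math.cos(t), math.sin(t)
    return (c * k[0] - s * k[1], s * k[0] + c * k[1])

# Calibration set: anisotropic keys concentrated near the x-axis.
keys = [(random.gauss(0, 1), random.gauss(0, 0.05)) for _ in range(500)]

def objective(t):
    """E[ ||Rk||_1 / (sqrt(d) ||Rk||_2) ] over the calibration set."""
    return sum(cos1bit(rotate(k, t)) for k in keys) / len(keys)

# "Learning" by brute-force search over the rotation angle.
best_t = max((i / 1000 * math.pi for i in range(1000)), key=objective)

base, learned = objective(0.0), objective(best_t)
assert learned > base + 0.15  # the learned rotation lifts the 1-bit cosine
```

In 2-D the optimum is the 45° rotation that moves the dominant axis onto the sign-lattice diagonal; the learned $R$ plays the same role in $d$ dimensions.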

↓ formalized in dc_centering_lifts_rotational_ceiling

↳ formalized in dc_centering_lifts_rotational_ceiling, l2Norm_uncentered_gt


4 — DC-mean centering mechanism

Centering removes the mean direction from the key before rotation, freeing the optimizer to use the full $O(d)$ group instead of the $O(d-1)$ subgroup that stabilizes the mean axis. The formal ceiling result is:

$$ \sup_R \cos\theta(R, v) \;\geq\; \sup_R \cos\theta(R, \mu e + v) $$

where $v$ is the centered residual. See dc_centering_lifts_rotational_ceiling for the proof.
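One step behind the ceiling inequality deserves spelling out. With $e = \mathbf{1}/\sqrt{d}$ the unit mean direction and $v$ the centered residual, centering forces $v \perp e$, so the Pythagorean identity gives

$$ \lVert \mu e + v \rVert^2 \;=\; \mu^2 + \lVert v \rVert^2 \;>\; \lVert v \rVert^2 \qquad (\mu \neq 0), $$

which is presumably the content of l2Norm_uncentered_gt: the uncentered key spends norm on the mean axis, norm that a rotation is then obliged to account for rather than convert into sign-lattice alignment.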

↓ lean.md

dc_centering_lifts_rotational_ceiling proven
theorem dc_centering_lifts_rotational_ceiling

Centering a key before rotation raises the supremum of achievable β: the optimizer is freed from the O(d-1) subgroup that stabilizes the mean direction.

View in Lean Proofs →

5 — Last-mile quadratic hardness

All three models in the convergence scan reach β ≥ 0.85 by step 50, regardless of head dimension. The difference emerges in the last mile.
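The last-mile behaviour is visible even in one dimension: near the optimum the 1-bit cosine is locally quadratic in the rotation angle, so halving the remaining deficit requires shrinking the angular error by $\sqrt{2}$. A hedged 2-D illustration (the function below is the unit-vector case, not the full model objective):

```python
import math

def f(phi):
    """1-bit cosine of a 2-D unit vector at angle phi: (|cos| + |sin|) / sqrt(2)."""
    return (abs(math.cos(phi)) + abs(math.sin(phi))) / math.sqrt(2.0)

opt = math.pi / 4  # maximum: the vector sits on the sign lattice, f = 1
eps = 1e-3

drop1 = f(opt) - f(opt + eps)
drop2 = f(opt) - f(opt + 2 * eps)

# Quadratic local model: doubling the distance to the optimum quadruples the deficit.
assert abs(drop2 / drop1 - 4.0) < 0.01
assert abs(f(opt) - 1.0) < 1e-12
```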

↓ formalized in beta_last_mile_quadratic_hardness

↳ formalized in beta_last_mile_quadratic_hardness, beta_local_quadratic_expansion, beta_hessian_nonneg


6 — Rank-$r$ residual correction

After 1-bit quantization, the per-sample residual still carries significant energy. The effective quality after rank-r correction is $\beta_{\text{eff}} = 1 - (1-\beta_{1b})(1-\gamma_r)$.
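The combination rule is easy to exercise directly. The numbers below are illustrative assumptions, not measured values:

```python
def beta_eff(beta_1b, gamma_r):
    """Effective quality: 1-bit alignment plus rank-r recovery of the residual."""
    return 1.0 - (1.0 - beta_1b) * (1.0 - gamma_r)

# A rank-r step that recovers half the residual energy halves the remaining error.
assert abs(beta_eff(0.94, 0.5) - 0.97) < 1e-9
assert abs(beta_eff(0.94, 0.0) - 0.94) < 1e-9  # no correction leaves beta_1b unchanged
assert abs(beta_eff(0.94, 1.0) - 1.0) < 1e-9   # perfect residual recovery
```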

↓ lean.md

beta_eff_formula proven
theorem beta_eff_formula (β_1b γ_r : ℝ) :

β_eff = 1 − (1 − β_1b)(1 − γ_r) combines 1-bit alignment with rank-r residual correction into one effective quality metric.

View in Lean Proofs →

↳ formalized in beta_eff_formula, svd_topk_maximizes_gammaR, residual_topk_eckart_young


7 — Cross-model PPL and memory

β scale  •  gauge invariance

↓ gauge invariance justifies free-at-deploy property

↳ formalized in attention_score_gauge_invariance


8 — Per-layer β profile (Gemma-4 26B MoE)

↓ iid_coord_moment_ratio_floor


9 — Pareto frontier (Gemma-4 26B MoE)

↓ beta_eff_formula formalizes the frontier shape

Each dot is one operating point. The dashed orange curve is the frontier through uniform-k points. The sweet spot k = d/8 is ringed.

↳ formalized in beta_eff_formula, top_k_optimal


10 — Front-token protection

↓ no direct Lean formalization (empirical only)


11 — Part E: FFN weight β lift on Gemma-4-26B

Does the learned-rotation β-lift that works on K-cache transfer to FFN weight matrices? A single-layer validator (L10, mlp.{gate,down}_proj) reaches β_learned ∈ [0.916, 0.942], clearing the 0.90 go-threshold. Full pilot V3 then trains per-matrix R on layers [0, 7, 15, 22, 29], five matrix types per layer.

↓ formal target: ffn_weight_beta_lift_under_learned_R (open), expert_heterogeneity_batch_full_gap (conjecture)

L0 + L7 β table (β_raw → β_batch → β_learned)

| matrix | shape | β_raw | β_batch (10k-row subsample) | β_learned (full) | lift |
| --- | --- | --- | --- | --- | --- |
| L0 mlp_gate_proj | 2112×2816 | 0.780 | 0.941 | 0.941 | +0.16 |
| L0 mlp_up_proj | 2112×2816 | 0.780 | 0.941 | 0.941 | +0.16 |
| L0 mlp_down_proj | 2816×2112 | 0.759 | 0.919 | 0.919 | +0.16 |
| L0 experts_gate_up | 180224×2816 | 0.792 | 0.885 | 0.803 | +0.011 |
| L0 experts_down | 360448×704 | 0.794 | 0.850 | 0.799 | +0.004 |
| L7 mlp_gate_proj | 2112×2816 | 0.773 | 0.942 | 0.942 | +0.17 |

Data: results/part_e_pilot_v3.json. Validator: results/part_e_one_layer_validator.json.

Interpretation

MLP is clean. Each MLP weight matrix has coherent hidden anisotropy that a single learned R captures. β = 0.94 matches the K-cache ceiling after DC-mean centering. This validates end-to-end 1-bit FFN matmul for the dense MLP block with one R per matrix per layer (three 2816×2816 R matrices per layer, 15 total across the five pilot layers).

Experts are heterogeneous. The batch/full β gap is signal, not noise: R trained on a 10k-row subsample overfits the local sample and fails to generalize to the full 180k/360k-row expert population. This is the same failure mode as the pilot V1 shared-R attempt (where R was shared across all FFN matrix types and collapsed to near-zero lift).

Strategy options for experts:

  1. Per-expert R: train one R per layer-expert-matrix-type combination, i.e. 128 R matrices per layer per matrix type (≈100× storage overhead). Costs 128 × 2816² × 2 bytes (fp16) ≈ 2.0 GB per layer per matrix type.
  2. Full-matrix training: train R on all rows instead of the 10k subsample. Residual materialization at 180k × 2816 × fp32 ≈ 1.9 GB OOMs on a 12 GB GPU unless the β gradient is chunked.
  3. Accept experts at FP16: keep MLP at 1-bit + experts at bf16/fp16. Storage win is smaller but the architecture is clean.

Pilot V3 continues on L15, L22, L29 to confirm this pattern across depth. V4 will add CPU-path γ_r recovery for expert matrices via torch.svd_lowrank(residual_cpu, q=8, niter=4).
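A CPU-friendly way to estimate γ_r without torch is plain power iteration. The sketch below computes the rank-1 fraction γ_1 = σ₁²/‖E‖_F² on a synthetic residual; the matrix construction and tolerances are assumptions for illustration, not the V4 pipeline.

```python
import math
import random

random.seed(4)

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_t(A, x):
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]

def top_singular_value(A, iters=50):
    """Power iteration on A^T A for the leading singular value of A."""
    v = [random.gauss(0, 1) for _ in range(len(A[0]))]
    for _ in range(iters):
        w = matvec_t(A, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return math.sqrt(sum(x * x for x in matvec(A, v)))

# Synthetic residual: rank-1 signal sigma * u v^T plus small noise.
m, n, sigma = 40, 30, 5.0
u = [math.sin(i) for i in range(m)]
v = [math.cos(j) for j in range(n)]
nu = math.sqrt(sum(x * x for x in u))
nv = math.sqrt(sum(x * x for x in v))
E = [[sigma * u[i] * v[j] / (nu * nv) + 0.01 * random.gauss(0, 1)
      for j in range(n)] for i in range(m)]

s1 = top_singular_value(E)
fro2 = sum(x * x for row in E for x in row)
gamma_1 = s1 * s1 / fro2  # fraction of residual energy captured at rank 1

assert gamma_1 > 0.95  # near-rank-1 residual: rank-1 correction recovers almost all energy
```

torch.svd_lowrank generalizes this to rank q with randomized subspace iteration; the pure-Python version above is only meant to make the γ_r definition concrete.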