· moe-gauge

appendix · formal statements · 33 load-bearing entries

Theorems.

Every load-bearing result in the paper, with its formal statement, Lean signature, and proof status. Numbers here mirror lean_status.json — the source of truth for the landing page's status panel.

schema: lean_status_paper2_v2 · source commit · generated

Note on LaTeX statements. The JSON carries Lean signatures and one-sentence natural-language descriptions, not formal LaTeX. The statements below are reconstructions written from those descriptions and the paper's convention (β = μ₁² / μ₂, k ∈ ℝ^d, R ∈ O(d)). For the authoritative Lean source, follow each source link.

English explanations

What the theorems say, in plain language.

MoE Gauge Invariance

Attention scores are unchanged if you rotate both queries and keys by the same orthogonal matrix. This is "gauge freedom": the neural network doesn't care about the coordinate system you pick for the latent space, so we are free to pick the one that makes 1-bit quantization easiest.
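A minimal numeric sketch of the invariance (my own illustration, not the paper's Lean development): rotating both the query and the key by the same orthogonal matrix leaves the score q·k unchanged. A 2D rotation stands in for a general R ∈ O(d).

```python
import math

# Illustration: attention scores q.k are invariant when the SAME
# rotation is applied to both q and k, since (Rq).(Rk) = q.(R^T R)k = q.k.
def rotate2d(v, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [0.3, -1.2], [2.0, 0.7]  # arbitrary example vectors
theta = 0.9                      # any rotation angle
score_before = dot(q, k)
score_after = dot(rotate2d(q, theta), rotate2d(k, theta))
assert abs(score_before - score_after) < 1e-12
```

The same cancellation RᵀR = I is what licenses picking whichever coordinate system makes quantization easiest.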

Jensen Floor

The 1-bit cosine has a closed form: a vector's L1 norm divided by √d times its L2 norm. It equals 1 exactly when every coordinate has the same absolute value — the "sign lattice." Any deviation from equal-magnitude coordinates costs residual energy, and that cost is sin² θ, where θ is the angle between the vector and its sign vector.
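The closed form above can be checked directly (an illustrative sketch, not the Lean proof): cosine = ‖x‖₁ / (√d · ‖x‖₂), which hits 1 only on the sign lattice.

```python
import math

# 1-bit cosine: ||x||_1 / (sqrt(d) * ||x||_2).
# Equals 1 iff every coordinate has the same absolute value.
def one_bit_cosine(x):
    d = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return l1 / (math.sqrt(d) * l2)

on_lattice = [1.0, -1.0, 1.0, -1.0]    # equal magnitudes -> cosine = 1
off_lattice = [3.0, -0.1, 0.2, -0.05]  # unequal magnitudes -> cosine < 1
assert abs(one_bit_cosine(on_lattice) - 1.0) < 1e-12
assert one_bit_cosine(off_lattice) < 1.0
```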

Moment-Ratio Floor

For high-dimensional i.i.d. coordinates, the 1-bit cosine converges almost surely to the square root of β = (E|X|)² / E[X²]. This is a single-number fingerprint of the entire coordinate distribution. For Gaussians it is 2/π ≈ 0.637, giving the famous √(2/π) ≈ 0.798 floor.
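A Monte Carlo sanity check of the Gaussian case (an illustration under i.i.d. sampling assumptions, not the almost-sure convergence proof): with many standard-normal coordinates, the 1-bit cosine lands near √(2/π).

```python
import math
import random

# For i.i.d. N(0,1) coordinates, beta = (E|X|)^2 / E[X^2] = 2/pi,
# so the 1-bit cosine concentrates near sqrt(2/pi) ~ 0.798.
random.seed(0)
d = 200_000
x = [random.gauss(0.0, 1.0) for _ in range(d)]
cosine = sum(abs(v) for v in x) / (math.sqrt(d) * math.sqrt(sum(v * v for v in x)))
floor = math.sqrt(2.0 / math.pi)
assert abs(cosine - floor) < 0.01
```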

Isotropic Floor

If your key coordinates are already isotropic (like after a random rotation), no data-independent rotation can beat the Gaussian floor. The Central Limit Theorem is a trap: it forces every coordinate toward Gaussianity, and Gaussianity pins β at 0.637.
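One way to see the trap numerically (a sketch under my own simulation assumptions): a Haar-random rotation sends any fixed unit vector to a uniformly random point on the sphere, whose coordinates are Gaussian up to normalization. So even a perfect sign-lattice vector falls back to the ≈ 0.798 cosine after a data-independent random rotation.

```python
import math
import random

def one_bit_cosine(x):
    d = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return l1 / (math.sqrt(d) * l2)

random.seed(1)
d = 100_000

# Start on the sign lattice: cosine is exactly 1.
sign_lattice = [1.0 if i % 2 == 0 else -1.0 for i in range(d)]

# A Haar-random rotation of a fixed direction is uniform on the sphere,
# which we can sample directly as a normalized Gaussian vector.
rotated = [random.gauss(0.0, 1.0) for _ in range(d)]

assert abs(one_bit_cosine(sign_lattice) - 1.0) < 1e-9
assert abs(one_bit_cosine(rotated) - math.sqrt(2.0 / math.pi)) < 0.01
```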

DC-Mean Centering

Subtracting the mean before rotating frees up an extra degree of freedom. Because the mean vector is collinear with the all-ones vector, removing it lets the rotation act on a (d−1)-dimensional subspace instead of d. The supremum of β over all rotations is always at least as large after centering, and strictly larger whenever the mean is non-zero.
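A hedged LaTeX reconstruction of the claim above, in the spirit of the reconstructions in the cards below (the notation is my assumption, not the Lean source): writing k̄ for the coordinate mean of k,

```latex
% Reconstruction from the prose, not the authoritative Lean statement.
\sup_{R \in O(d)} \beta\!\left(R\,(k - \bar{k}\,\mathbf{1})\right)
  \;\ge\; \sup_{R \in O(d)} \beta(Rk),
\qquad \bar{k} = \tfrac{1}{d}\sum_{i=1}^{d} k_i,
```

with strict inequality whenever k̄ ≠ 0, because the mean component is collinear with 𝟙 and centering releases that direction to the rotation.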

Local Hessian & Last-Mile Hardness

Near the sign lattice, β behaves like a quadratic function of the rotation parameters. The Hessian is non-negative, which means the landscape is locally convex — good for convergence — but the curvature vanishes as you approach β = 1. This creates a "last mile" where each additional 0.01 of β requires quadratically more optimization effort.
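A toy 2D illustration of the quadratic behavior (my own sketch, not the Lean statement): rotate the sign-lattice vector (1, 1)/√2 by a small angle θ and the cosine loss 1 − β grows like θ², which is exactly the locally convex but flattening landscape described above.

```python
import math

# Rotate the 2D sign-lattice direction (1, 1)/sqrt(2) by theta and
# measure beta = (||x||_1 / (sqrt(2) * ||x||_2))^2.
def beta_after_rotation(theta):
    x = (math.cos(theta) - math.sin(theta), math.cos(theta) + math.sin(theta))
    l1 = abs(x[0]) + abs(x[1])
    l2 = math.sqrt(x[0] ** 2 + x[1] ** 2)
    return (l1 / (math.sqrt(2) * l2)) ** 2

# Near the lattice the loss is quadratic: 1 - beta(theta) ~ theta^2.
for theta in (0.1, 0.01, 0.001):
    loss = 1.0 - beta_after_rotation(theta)
    assert abs(loss / theta ** 2 - 1.0) < 0.01
```

The vanishing prefactor near β = 1 is the "last mile": halving the remaining loss needs ever finer rotations.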

Rank-r Residual Correction

If you can't align the whole matrix perfectly, align the top-r singular vectors instead. Truncated SVD gives the optimal rank-r approximation by Eckart-Young. Composing 1-bit quantization with a rank-r residual correction gives an effective β of 1 − (1−β₁b)(1−γ_r), where γ_r is the fraction of Frobenius energy captured by the top-r singular values.
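A worked numeric example of the composition rule, with illustrative numbers that are assumptions rather than values from the paper: the uncaptured energies multiply, so β_eff = 1 − (1 − β₁b)(1 − γ_r).

```python
# Residual energies compose multiplicatively:
# beta_eff = 1 - (1 - beta_1b) * (1 - gamma_r).
def effective_beta(beta_1b, gamma_r):
    return 1.0 - (1.0 - beta_1b) * (1.0 - gamma_r)

beta_1b = 0.64   # assumed 1-bit beta (illustrative)
gamma_r = 0.90   # assumed Frobenius energy in the top-r singular values
assert abs(effective_beta(beta_1b, gamma_r) - 0.964) < 1e-9
```

With 64% captured by 1-bit quantization and 90% of the residual handled by the rank-r correction, only 3.6% of the energy is lost.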

Attention Distortion

Quantizing the K-cache introduces error into attention scores, but the error is controlled. The infinity-norm distortion of the full attention block is bounded by the Lipschitz constant of softmax times the per-coordinate quantization error, plus the V-cache error. On Gemma-4 at d=128, this gives an explicit constant C₁₂₈,₄.
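A hedged LaTeX reconstruction of the bound as described in the prose (the symbols Lₛₒfₜₘₐₓ, ε_K, ε_V are my assumptions; the explicit constant C₁₂₈,₄ lives in the Lean source):

```latex
% Reconstruction from the prose, not the authoritative Lean statement.
\bigl\| \mathrm{Attn}(\hat{K}, \hat{V}) - \mathrm{Attn}(K, V) \bigr\|_\infty
  \;\le\; L_{\mathrm{softmax}} \,\varepsilon_K \;+\; \varepsilon_V,
```

where ε_K is the per-coordinate K-cache quantization error and ε_V the V-cache error.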

Reading guide. Each card below shows: (1) a one-sentence natural-language summary from the Lean source, (2) a reconstructed LaTeX statement where available, (3) the Lean signature, and (4) crosslinks to paper sections and interactive widgets. Status chips are live from lean_status.json.

● index

By family.

● legend

proven Lean proof closed, zero sorry in this theorem's derivation.
deferred Statement compiles; body cites a documented blocker (usually a Mathlib gap).
stub Statement written; proof body is sorry.
axiom Asserted without proof; used as an interface to classical results pending Mathlib support.