● appendix · formal statements · 33 load-bearing entries
Theorems.
Every load-bearing result in the paper, with its formal statement, Lean signature, and proof status. Numbers here mirror lean_status.json — the source of truth for the landing page's status panel.
Note on LaTeX statements. The JSON carries Lean signatures and one-sentence natural-language descriptions, not formal LaTeX. The statements below are reconstructions written from those descriptions and the paper's convention (β = μ₁² / μ₂, k ∈ ℝ^d, R ∈ O(d)). For the authoritative Lean source, follow each source link.
● English explanations
What the theorems say, in plain language.
MoE Gauge Invariance
Attention scores are unchanged if you rotate both queries and keys by the same orthogonal matrix. This is "gauge freedom": the neural network doesn't care about the coordinate system you pick for the latent space, so we are free to pick the one that makes 1-bit quantization easiest.
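A minimal numeric sketch of the invariance (plain Python, d = 2; the vectors, angle, and helper names are illustrative, not from the Lean source): applying the same orthogonal rotation to query and key leaves the score ⟨q, k⟩ unchanged.

```python
import math

def rotate(v, theta):
    """Apply the 2-D rotation R(theta), an orthogonal matrix, to v."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [1.0, 2.0], [3.0, -1.0]   # toy query / key
theta = 0.7                      # arbitrary rotation angle
score_before = dot(q, k)
score_after = dot(rotate(q, theta), rotate(k, theta))
print(abs(score_before - score_after))  # ~0: the score is gauge-invariant
```

Any orthogonal R works the same way, since ⟨Rq, Rk⟩ = qᵀRᵀRk = ⟨q, k⟩.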
Jensen Floor
The 1-bit cosine (whose square is β) has a closed form: it is the vector's L1 norm divided by √d times its L2 norm. It equals 1 exactly when every coordinate has the same absolute value — the "sign lattice." Any deviation from equal-magnitude coordinates costs residual energy, and that cost is sin² θ, where θ is the angle between the vector and its sign vector.
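The closed form is a one-liner to check (plain Python; the vectors are toy examples): the cosine hits 1 on the sign lattice and drops as soon as coordinate magnitudes diverge.

```python
import math

def one_bit_cosine(x):
    """cos(theta) between x and sign(x): ||x||_1 / (sqrt(d) * ||x||_2)."""
    d = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return l1 / (math.sqrt(d) * l2)

print(one_bit_cosine([1.0, -1.0, 1.0, 1.0]))  # 1.0: equal magnitudes, on the sign lattice
print(one_bit_cosine([3.0, 0.1, -0.5, 1.0]))  # < 1: unequal magnitudes cost residual energy
```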
Moment-Ratio Floor
For high-dimensional i.i.d. coordinates, the 1-bit cosine converges almost surely to the square root of β = (E|X|)² / E[X²]. This is a single-number fingerprint of the entire coordinate distribution. For Gaussians it is 2/π ≈ 0.637, giving the famous √(2/π) ≈ 0.798 floor.
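A quick Monte Carlo sanity check of the Gaussian value (plain Python; the sample size and seed are arbitrary choices for illustration): estimate β = (E|X|)² / E[X²] for standard normal coordinates and compare against 2/π.

```python
import math
import random

random.seed(0)
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

# Sample moment ratio: (mean of |X|)^2 over mean of X^2.
beta = (sum(abs(x) for x in xs) / n) ** 2 / (sum(x * x for x in xs) / n)
print(beta, 2 / math.pi)  # both close to 0.6366
print(math.sqrt(beta))    # close to 0.798, the 1-bit cosine floor
```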
Isotropic Floor
If your key coordinates are already isotropic (like after a random rotation), no data-independent rotation can beat the Gaussian floor. The Central Limit Theorem is a trap: it forces every coordinate toward Gaussianity, and Gaussianity pins β at 0.637.
DC-Mean Centering
Subtracting the mean before rotating frees up an extra degree of freedom. Because the mean vector is collinear with the all-ones vector, removing it lets the rotation act on a (d−1)-dimensional subspace instead of d. The supremum of β over all rotations is always at least as large after centering, and strictly larger whenever the mean is non-zero.
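A small sketch of why centering is "free" (toy numbers, plain Python): the subtracted component μ·(1, …, 1) lies entirely along the all-ones direction, so the centered vector is orthogonal to it and the rotation only has a (d−1)-dimensional subspace left to handle.

```python
d = 8
x = [float(i) for i in range(1, d + 1)]  # toy vector with non-zero mean
mu = sum(x) / d
centered = [v - mu for v in x]

# The removed component mu * [1, ..., 1] is collinear with the all-ones
# vector, so the centered vector has zero inner product with it:
print(sum(centered))  # ~0
```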
Local Hessian & Last-Mile Hardness
Near the sign lattice, β behaves like a quadratic function of the rotation parameters. The Hessian is non-negative, which means the landscape is locally convex — good for convergence — but the curvature vanishes as you approach β = 1. This creates a "last mile" where each additional 0.01 of β requires quadratically more optimization effort.
Rank-r Residual Correction
If you can't align the whole matrix perfectly, align the top-r singular vectors instead. Truncated SVD gives the optimal rank-r approximation by Eckart-Young. Composing 1-bit quantization with a rank-r residual correction gives an effective β of 1 − (1−β₁b)(1−γ_r), where γ_r is the fraction of Frobenius energy captured by the top-r singular values.
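The composition rule is simple enough to state as a one-liner (the numeric values below are illustrative; in practice γ_r would come from the truncated SVD of the residual, per Eckart-Young):

```python
def effective_beta(beta_1b, gamma_r):
    """Composite quality of 1-bit quantization plus a rank-r correction:
    the residual energy fractions (1 - beta_1b) and (1 - gamma_r) multiply."""
    return 1.0 - (1.0 - beta_1b) * (1.0 - gamma_r)

# Gaussian floor, plus a rank-r correction capturing half the residual energy:
print(effective_beta(0.637, 0.5))  # ≈ 0.8185
```

Each mechanism only has to clean up the energy the other one missed, which is why the residuals multiply rather than add.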
Attention Distortion
Quantizing the K-cache introduces error into attention scores, but the error is controlled. The infinity-norm distortion of the full attention block is bounded by the Lipschitz constant of softmax times the per-coordinate quantization error, plus the V-cache error. On Gemma-4 at d=128, this gives an explicit constant C₁₂₈,₄.
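A hedged reconstruction of the bound's shape, written only from the description above (the symbol names are ours, not the Lean source's; ε_K and ε_V denote the per-coordinate K-cache and V-cache quantization errors, and L_softmax the softmax Lipschitz constant):

```latex
\bigl\| \mathrm{Attn}(Q, \hat K, \hat V) - \mathrm{Attn}(Q, K, V) \bigr\|_\infty
  \;\le\; L_{\mathrm{softmax}} \, \varepsilon_K \;+\; \varepsilon_V
```

At d = 128 the Lipschitz and dimensional factors collapse into the explicit constant the paper calls C₁₂₈,₄; consult the linked Lean source for the authoritative statement.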
Reading guide. Each card below shows: (1) a one-sentence natural-language summary from the Lean source, (2) a reconstructed LaTeX statement where available, (3) the Lean signature, and (4) crosslinks to paper sections and interactive widgets. Status chips are live from lean_status.json.
● index
By family.
● legend
sorry · a sorry remains in this theorem's derivation.