The Formal Proofs
We proved the load-bearing theorems in a proof assistant. The computer verified every step — for the proven subset.
A proof assistant is a programming language where the compiler checks mathematical arguments, not just syntax. If the code compiles, the theorem is proved — no gaps, no hand-waving. We used Lean 4 with the Mathlib library. The theorems cited inline in the manuscript live in LeanMining/NeuralGeometry/.
Load-bearing theorems
What the paper’s quantitative claims rest on
iid_coord_moment_ratio_floor
Statement. If the coordinates $K_i$ of a rotated key are identically distributed, symmetric about zero, with finite second moment, and satisfy the strong law of large numbers for $\lvert K_i \rvert$ and $K_i^2$, then the expected residual energy converges to the moment-ratio floor:
$$\mathbb{E}[\sin^2 \theta_K] \longrightarrow 1 - \beta, \qquad \beta = \frac{\mu_1^2}{\mu_2} = \frac{(\mathbb{E}\lvert K_1 \rvert)^2}{\mathbb{E}[K_1^2]},$$
as the head dimension $d \to \infty$.
Plain English. The expected residual energy of 1-bit sign quantisation converges to $1 - \beta$ as the head dimension grows. The paper’s entire compounding story rides on this.
Why it matters. No orthogonal rotation can drive the expected residual below $1 - \beta$ for the $\beta$ it induces, so the floor is structural. Breaching it requires a non-orthogonal primitive (the top-$k$ fp16 correction).
Status caveat. The statement and the distributional specialisations (Gaussian/uniform/Laplace) are fully proved. Two sorry placeholders remain in the measure-theoretic plumbing that lifts the SLLN limit from almost-sure to expected value.
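The limit is easy to sanity-check numerically. Below is an illustrative sketch (not part of the Lean development) that samples a Gaussian key, uses the per-vector identity $\cos \theta_k = \lVert k \rVert_1 / (\sqrt{d}\,\lVert k \rVert_2)$, and watches the residual energy approach $1 - 2/\pi \approx 0.363$ as $d$ grows:

```python
import random, math

random.seed(0)

def residual_energy(d: int) -> float:
    """sin^2 of the angle between a Gaussian key and its 1-bit sign image."""
    k = [random.gauss(0.0, 1.0) for _ in range(d)]
    l1 = sum(abs(x) for x in k)          # ||k||_1
    l2_sq = sum(x * x for x in k)        # ||k||_2^2
    cos_sq = l1 * l1 / (d * l2_sq)       # cos^2 theta_k
    return 1.0 - cos_sq

# Residual energy drifts toward 1 - 2/pi ≈ 0.363 as d grows.
for d in (16, 256, 4096, 65536):
    print(d, residual_energy(d))
```

A single draw suffices at large $d$ because $\cos^2 \theta_K$ concentrates under the SLLN hypotheses of the theorem.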
gaussian_moment_ratio_floor_eq / uniform_moment_ratio_floor_eq / laplace_moment_ratio_floor_eq
Statement. Closed-form values of $\beta = \mu_1^2 / \mu_2$ for canonical distributions:
- Gaussian $\mathcal{N}(0, \sigma^2)$: $\beta = 2/\pi$
- Uniform on $[-1, 1]$: $\beta = 3/4$
- Laplace scale $b$: $\beta = 1/2$
Plain English. The CLT attractor that data-independent rotations are pinned to is exactly $2/\pi \approx 0.637$. The light-tail limit that a learned $L^1$-max rotation approaches is $3/4$. The worst-case heavy-tail is Laplace at $1/2$.
Why it matters. These three numbers frame the reference table in the paper, define the anti-kurtosis interpretation, and are what the $\beta$-slider interpolates between. All three are proved without sorry.
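The three closed forms can also be checked empirically. A minimal Monte Carlo sketch (illustrative, not from the repository; the Laplace draw uses the standard difference-of-exponentials representation):

```python
import random

random.seed(1)
N = 400_000

def beta(samples):
    # beta = (E|X|)^2 / E[X^2], the scale-invariant moment ratio
    m1 = sum(abs(x) for x in samples) / len(samples)
    m2 = sum(x * x for x in samples) / len(samples)
    return m1 * m1 / m2

gauss = [random.gauss(0.0, 1.0) for _ in range(N)]
unif = [random.uniform(-1.0, 1.0) for _ in range(N)]
# Laplace(0, 1) as the difference of two independent Exp(1) draws
lap = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(N)]

print(beta(gauss), beta(unif), beta(lap))  # ≈ 0.6366, 0.75, 0.5
```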
attention_score_gauge_invariance (MoEGauge.lean)
Statement. Inserting an orthogonal rotation $R$ before $W_K$ and absorbing $R^\top$ into the downstream $W_K$ leaves all attention scores exactly unchanged.
Plain English. The orthogonal rotation $R$ is a gauge degree of freedom of the attention head. Learning a better rotation is a gauge fix, not a change of the model’s function.
Why it matters. This is why the deployment cost of the learned rotation can be driven to zero — it is absorbable into $W_K$ post-training. The key invariance property is machine-checked.
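The invariance is a direct consequence of $R^\top R = I$, and can be demonstrated numerically. A self-contained sketch in pure Python (the orthogonal $R$ is built as a Householder reflector; all names here are illustrative, not the repository's):

```python
import random

random.seed(0)
d = 8

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Orthogonal R as a Householder reflector: R = I - 2 v v^T / ||v||^2
v = [random.gauss(0.0, 1.0) for _ in range(d)]
nv = dot(v, v)
R = [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / nv
      for j in range(d)] for i in range(d)]
Rt = [[R[j][i] for j in range(d)] for i in range(d)]

W_K = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
q = [random.gauss(0.0, 1.0) for _ in range(d)]  # query (post-projection)
k = [random.gauss(0.0, 1.0) for _ in range(d)]  # key input

# Rotate the key input by R, absorb R^T into W_K: score is unchanged.
score = dot(q, matvec(W_K, k))
score_gauged = dot(q, matvec(matmul(W_K, Rt), matvec(R, k)))
print(score, score_gauged)
```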
avg_amplification_eq_frob_div_sqrt (WoAmplification.lean)
Statement. For any linear map $W : \mathbb{R}^n \to \mathbb{R}^m$, the root-mean-squared amplification of the standard basis equals $\lVert W \rVert_F / \sqrt{n}$.
Plain English. The output projection $W_o$’s average per-coordinate amplification is exactly its Frobenius norm normalised by head dimension. This is the “average” part of the log-space 19% residual analysis.
Why it matters. The moment-ratio theorem explains 81% of the SRHT–vs–learned log-PPL gap. The remaining 19% is split between softmax and $W_o$; this theorem pins down the $W_o$ contribution analytically.
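The identity follows because $\sum_i \lVert W e_i \rVert^2$ is the sum of squared column norms, which equals $\lVert W \rVert_F^2$. A quick numeric check (illustrative sketch):

```python
import random, math

random.seed(2)
n, m = 6, 4  # W : R^n -> R^m, stored as an m x n matrix

W = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

# ||W e_i|| is the norm of column i; take the RMS over the n basis vectors.
col_norms_sq = [sum(W[r][i] ** 2 for r in range(m)) for i in range(n)]
rms_amp = math.sqrt(sum(col_norms_sq) / n)

frob = math.sqrt(sum(W[r][c] ** 2 for r in range(m) for c in range(n)))
print(rms_amp, frob / math.sqrt(n))  # equal up to float rounding
```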
cos_sign_quant_formula / residual_energy_formula (JensenFloor.lean)
Statement. For any nonzero $k \in \mathbb{R}^d$ with sign image $\hat{k} = \operatorname{sign}(k) \cdot \alpha$ (any scale $\alpha > 0$),
$$\cos \theta_k = \frac{\lVert k \rVert_1}{\sqrt{d}\,\lVert k \rVert_2}, \qquad \sin^2 \theta_k = 1 - \frac{\lVert k \rVert_1^2}{d\,\lVert k \rVert_2^2}.$$
Plain English. The per-vector algebraic bridge between the geometric quantity (phase-collapse angle) and the engineering quantity (1-bit recovery error). Both identities are algebraic, independent of any distributional assumption on $k$.
Why it matters. Every distributional claim in the paper starts from one of these two identities. Both are fully proved without sorry.
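Assuming the identities take their standard algebraic form, $\cos \theta_k = \lVert k \rVert_1 / (\sqrt{d}\,\lVert k \rVert_2)$ (an assumption of this sketch; it follows from $\langle k, \operatorname{sign}(k) \rangle = \lVert k \rVert_1$ and $\lVert \operatorname{sign}(k) \rVert = \sqrt{d}$), they can be verified per vector:

```python
import random, math

random.seed(3)
d = 32
k = [random.gauss(0.0, 1.0) for _ in range(d)]

l1 = sum(abs(x) for x in k)                # ||k||_1
l2 = math.sqrt(sum(x * x for x in k))      # ||k||_2

# Angle between k and sign(k); the positive scale alpha cancels in the angle.
sign_k = [math.copysign(1.0, x) for x in k]
cos_theta = sum(a * b for a, b in zip(k, sign_k)) / (l2 * math.sqrt(d))

print(cos_theta, l1 / (math.sqrt(d) * l2))  # identical
print(1.0 - cos_theta ** 2)                 # residual energy, in [0, 1]
```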
Glossary
Terms used in the theorems
- Phase-collapse angle $\theta_k$
- The angle between a key vector $k \in \mathbb{R}^d$ and its 1-bit sign-quantised image $\hat{k} = \operatorname{sign}(k) \cdot \alpha$. $\theta_k = 0$ means $k$ lies on the sign lattice; $\theta_k > 44^\circ$ means the horizontal-slice allocator tags the layer as "fragmented" and assigns 3 bits.
- Moment ratio $\beta$
- $\beta = \mu_1^2 / \mu_2$ where $\mu_1 = \mathbb{E}\lvert K_1 \rvert$ and $\mu_2 = \mathbb{E}[K_1^2]$. A scale-invariant, distribution-free summary statistic that completely determines the 1-bit floor under the marginal-identity hypothesis. $\beta \in (0, 1]$, with $\beta = 1$ only on the sign lattice.
- 1-bit floor $1 - \beta$
- The asymptotic expected residual energy $\mathbb{E}[\sin^2 \theta_K]$ under 1-bit sign quantisation of a rotated key $K$ with moment ratio $\beta$. Lower is better; $2/\pi \approx 0.637$ gives a floor of $36.3\%$ (SRHT-unreachable), $3/4$ gives $25\%$ (uniform-limit best), measured MoE gives $29.84\%$.
- Gauge invariance
- The property that an orthogonal rotation $R$ applied before $W_K$ and absorbed into a downstream $W_K$ leaves attention scores exactly unchanged. Makes precise the sense in which $R$ is a free parameter of the attention head.
- Horizontal-slice allocator
- A pre-registered per-layer bit-budget rule: 1 bit if the head is a pressure-release layer ($d_\ell \geq 512$ and angle $\leq 40^\circ$); 2 bits if angle $\leq 44^\circ$ (ballroom); 3 bits otherwise (fragmented). The schedule is purely geometric — no PPL feedback — so it is falsifiable.
- Anti-kurtosis effect
- The observation that a learned $L^1$-max rotation on low-effective-rank MoE activations pushes the moment ratio $\beta$ past the Gaussian CLT attractor toward the uniform-distribution limit $3/4$, producing a lighter-tailed, quantisation-friendlier marginal than any data-independent rotation can.
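The horizontal-slice allocator defined above is simple enough to state as code. A hypothetical sketch (the function name and signature are illustrative, not the paper's implementation):

```python
def bits_for_layer(d_head: int, angle_deg: float) -> int:
    """Pre-registered per-layer bit budget; purely geometric, no PPL feedback."""
    if d_head >= 512 and angle_deg <= 40.0:
        return 1  # pressure-release layer
    if angle_deg <= 44.0:
        return 2  # ballroom
    return 3      # fragmented

print(bits_for_layer(512, 38.0), bits_for_layer(512, 43.0), bits_for_layer(512, 45.0))
```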
Full status table
All declarations
Each row is one theorem / lemma / def / axiom across the six LeanMining/NeuralGeometry/*.lean files. Proven means no sorry anywhere in the body. Deferred means the statement is real but the proof body contains a sorry. Stub means : True := sorry — a placeholder reserving a name for future work. Axiom marks imports the file takes on faith (measure-theoretic plumbing not yet formalised in Mathlib).
| Name | Kind | Status | Description |
|---|---|---|---|
- proven: complete proof, no sorry
- deferred: real statement, proof body contains sorry
- stub: : True := sorry placeholder
- axiom: imported axiom