The Formal Proofs
26 theorems proved, 3 deferred, 10 stubs, 5 axioms. axiom = imported gap; sorry = unfinished body. All cited theorems live in LeanMining/NeuralGeometry/.
JensenFloor
JensenFloor.lean — Deterministic sign-quantization geometry
Per-vector algebraic core: no probability, no distribution assumptions. Zero sorry, zero axiom.
cos_sign_quant_formula
theorem cos_sign_quant_formula {d : ℕ} (k : HeadVec d) (hk : k ≠ 0) (hd : 0 < d) :
hDot k (signQuant k) / (l2Norm k * l2Norm (signQuant k)) =
l1Norm k / (Real.sqrt d * l2Norm k)
The 1-bit cosine of key k equals ‖k‖₁ / (√d · ‖k‖₂). Status: proved. ↳ cites S2; visualised by widget-moment-ratio-ref, cross-model-table
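The identity is elementary, since ⟨k, sign k⟩ = ‖k‖₁ and ‖sign k‖₂ = √d. A minimal numerical check in pure Python (an illustration, not part of the Lean development; cosine is scale-invariant, so any scaling convention for signQuant gives the same value):

```python
import math
import random

def cos_sign_quant(k):
    """Cosine between k and its sign vector s (s_i = ±1, so ||s||_2 = sqrt(d))."""
    d = len(k)
    s = [1.0 if x >= 0 else -1.0 for x in k]
    dot = sum(a * b for a, b in zip(k, s))
    l2 = math.sqrt(sum(x * x for x in k))
    return dot / (l2 * math.sqrt(d))

def closed_form(k):
    """Right-hand side of cos_sign_quant_formula: ||k||_1 / (sqrt(d) * ||k||_2)."""
    d = len(k)
    l1 = sum(abs(x) for x in k)
    l2 = math.sqrt(sum(x * x for x in k))
    return l1 / (math.sqrt(d) * l2)

random.seed(0)
k = [random.gauss(0, 1) for _ in range(64)]
assert abs(cos_sign_quant(k) - closed_form(k)) < 1e-12
```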
cos_eq_one_iff_sign_lattice
theorem cos_eq_one_iff_sign_lattice {d : ℕ} (k : HeadVec d) (hk : k ≠ 0) (hd : 0 < d) :
l1Norm k / (Real.sqrt d * l2Norm k) = 1 ↔ ∀ i j : Fin d, |k i| = |k j|
Cosine alignment equals 1 if and only if all |k_i| are equal — key lies exactly on the sign hypercube. Status: proved. ↳ cites S2; visualised by widget-moment-ratio-ref
residual_energy_formula
theorem residual_energy_formula {d : ℕ} (k : HeadVec d) (hk : k ≠ 0) (hd : 0 < d) :
l2SqNorm (fun i => k i - signQuant k i) / l2SqNorm k =
1 - l1Norm k ^ 2 / (d * l2SqNorm k)
The fractional energy lost to 1-bit quantization equals sin²θ_k. Status: proved. ↳ cites S2; visualised by ppl-degradation
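A numerical check of the residual-energy identity. This assumes signQuant carries the least-squares optimal scale ‖k‖₁/d on the sign vector; the identity fails for the raw ±1 quantizer, so this reading of the Lean definition is an assumption:

```python
import math
import random

def residual_energy(k):
    """||k - signQuant(k)||^2 / ||k||^2, with signQuant assumed to be
    alpha * sign(k), alpha = ||k||_1 / d (hypothetical reading of the Lean def)."""
    d = len(k)
    alpha = sum(abs(x) for x in k) / d
    s = [alpha if x >= 0 else -alpha for x in k]
    num = sum((a - b) ** 2 for a, b in zip(k, s))
    den = sum(x * x for x in k)
    return num / den

def sin_sq_theta(k):
    """Closed form from residual_energy_formula: 1 - ||k||_1^2 / (d * ||k||_2^2)."""
    d = len(k)
    l1 = sum(abs(x) for x in k)
    l2sq = sum(x * x for x in k)
    return 1 - l1 ** 2 / (d * l2sq)

random.seed(1)
k = [random.gauss(0, 1) for _ in range(128)]
assert abs(residual_energy(k) - sin_sq_theta(k)) < 1e-12
```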
MomentRatioFloor
MomentRatioFloor.lean — Asymptotic floor theorems
Closes the per-vector algebra to the population-level β. Two sorrys remain in the SLLN measure-theoretic lifting (non-load-bearing).
iid_coord_moment_ratio_floor
theorem iid_coord_moment_ratio_floor {α : Type*} [MeasureSpace α] ...
: Tendsto (fun d : ℕ => ∫ ω : Ω, sinSqTheta d K ω) atTop
(𝓝 (1 - momentRatio mu1 mu2))
For iid symmetric coordinates, the expected sin²θ converges to 1 − β as d → ∞, i.e. the asymptotic 1-bit cosine is √β. Status: deferred (SLLN lifting). ↳ cites S2; visualised by widget-moment-ratio-ref
gaussian_moment_ratio_floor_eq
theorem gaussian_moment_ratio_floor_eq :
momentRatio (Real.sqrt (2 / Real.pi)) 1 = 2 / Real.pi
For Gaussian marginals, β = 2/π exactly — the isotropic floor is √(2/π) ≈ 0.7979. Status: proved. ↳ cites S3; visualised by widget-moment-ratio-ref, cross-model-table
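A Monte Carlo sanity check of the Gaussian value (illustrative only, not the Lean proof): for iid standard normal coordinates, the empirical squared 1-bit cosine (‖k‖₁/(√d·‖k‖₂))² concentrates near 2/π ≈ 0.6366 as d grows.

```python
import math
import random

# For large d, beta_hat = (||k||_1 / (sqrt(d) * ||k||_2))^2 should approach
# 2/pi, since E|Z| = sqrt(2/pi) and E[Z^2] = 1 for a standard normal Z.
random.seed(2)
d = 200_000
k = [random.gauss(0, 1) for _ in range(d)]
l1 = sum(abs(x) for x in k)
l2 = math.sqrt(sum(x * x for x in k))
beta_hat = (l1 / (math.sqrt(d) * l2)) ** 2
assert abs(beta_hat - 2 / math.pi) < 0.02
```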
IsotropicFloor
IsotropicFloor.lean — Isotropic distribution floor bounds
Two axioms covering the probabilistic isotropic distribution definition and the CLT asymptotic.
isotropic_floor_pos
theorem isotropic_floor_pos : (0 : ℝ) < 1 - 2 / Real.pi
The residual-energy gap at the isotropic floor, 1 − 2/π, is strictly positive: any data-free rotation leaves a nonzero fraction of quantization error. Status: proved. ↳ cites S3; links β-legend (SRHT floor band)
srht_achieves_isotropic_floor
axiom srht_achieves_isotropic_floor {d : ℕ} (K : CoordinateIsotropic d) :
∀ R : HeadMatrix d, IsOrthogonal R → ...
For coordinate-isotropic inputs, any data-independent rotation achieves exactly the isotropic floor — no data-free trick can beat √(2/π). Status: axiom (imported gap). ↳ cites S3; visualised by cross-model-table
MoEGauge
MoEGauge.lean — Gauge invariance and deployment free-pass
Zero sorry, zero axiom. Load-bearing for the deployment argument.
attention_score_gauge_invariance
theorem attention_score_gauge_invariance {d : ℕ} (R : HeadMatrix d) (k q : HeadVec d)
(hR : IsOrthogonal R) : (R *ᵥ k) ⬝ (R *ᵥ q) = k ⬝ q
Orthogonal rotation of keys and queries leaves attention scores invariant: the rotation is a pure gauge degree of freedom, absorbed into W_K at deployment. Status: proved. ↳ cites S1, S5; visualised by memory-breakdown, ppl-degradation
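The invariance is easy to confirm numerically with a random orthogonal matrix drawn from a QR factorization (a numerical illustration of the proved identity, not part of the Lean development):

```python
import numpy as np

# An orthogonal R preserves inner products: <Rk, Rq> = <k, q>,
# so attention scores are unchanged and R can be absorbed into W_K.
rng = np.random.default_rng(0)
d = 64
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
k = rng.standard_normal(d)
q = rng.standard_normal(d)
assert np.isclose((R @ k) @ (R @ q), k @ q)
```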
top_k_optimal
theorem top_k_optimal {n : ℕ} (eps : Fin n → ℝ) (k : ℕ) ...
: top_k eps k minimises expected score perturbation
Correcting the k largest-magnitude entries of eps is optimal: among all size-k selections, top-k minimises the expected residual score perturbation. Status: proved. ↳ cites S7; visualised by pareto-frontier
DCMeanCentering
DCMeanCentering.lean — DC-mean centering ceiling lift
The structural result proving centering raises the β ceiling. One axiom (constructive witness).
dc_centering_lifts_rotational_ceiling
axiom dc_centering_lifts_rotational_ceiling_axiom {d : ℕ} (hd : 0 < d)
(mu : ℝ) (v : HeadVec d) (hv : IsCentered v) (hv_ne : v ≠ 0) :
∃ R' : HeadMatrix d, IsOrthogonal R' ∧
cos_angle R' v ≥ cos_ceiling_uncentered_ub d mu v
Centering a key before rotation raises the supremum of achievable β: the optimizer is freed from the O(d-1) subgroup that stabilizes the mean direction. Status: axiom (witness construction deferred). ↳ cites S5; visualised by dc-lift-table, cross-model-table
l2Norm_uncentered_gt
theorem l2Norm_uncentered_gt {d : ℕ} (hd : (0 : ℝ) < d)
(mu : ℝ) (hmu : mu ≠ 0) (v : HeadVec d) (hv : IsCentered v) (hv_ne : v ≠ 0) :
l2Norm (fun i => mu / Real.sqrt d + v i) > l2Norm v
The L2 norm of an uncentered key strictly exceeds the norm of the centered residual, formalizing the denominator-inflation penalty. Status: proved. ↳ cites S5; visualised by dc-lift-table
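The inequality is pure Pythagoras: a centered v is orthogonal to the constant DC direction, so the squared norms add exactly. A quick numerical illustration:

```python
import math
import random

# A centered v satisfies sum(v) = 0, so v is orthogonal to the constant
# vector (1/sqrt(d), ..., 1/sqrt(d)). Hence
# ||mu/sqrt(d) + v||^2 = mu^2 + ||v||^2 > ||v||^2 whenever mu != 0.
random.seed(3)
d, mu = 32, 0.7
v = [random.gauss(0, 1) for _ in range(d)]
m = sum(v) / d
v = [x - m for x in v]                    # center v
u = [mu / math.sqrt(d) + x for x in v]    # re-add the DC component
nu = math.sqrt(sum(x * x for x in u))
nv = math.sqrt(sum(x * x for x in v))
assert nu > nv
assert abs(nu ** 2 - (mu ** 2 + nv ** 2)) < 1e-9
```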
BetaLocalHessian
BetaLocalHessian.lean — Last-mile quadratic hardness
Formal treatment of the Hessian barrier near the sign lattice. One axiom (quadratic expansion).
beta_last_mile_quadratic_hardness
theorem beta_last_mile_quadratic_hardness {d : ℕ} (hd : (0 : ℝ) < d)
(k : HeadVec d) (R : HeadMatrix d) (hR : IsOrthogonal R) ...
: steps_to_convergence ≥ C * d * (1 - beta_final)
Near the sign lattice, each additional unit of β requires quadratically more rotation accuracy. Status: proved. ↳ cites S6; visualised by convergence-curve
beta_local_quadratic_expansion
axiom beta_local_quadratic_expansion {d : ℕ} (k : HeadVec d)
(R : HeadMatrix d) (A : HeadMatrix d) (hA : IsSkew A) ...
: β(rotFlow A t * R, k) = 1 - (t^2/2) * betaHessian k R A + O(t^3)
β(exp(tA)R, k) = 1 − (t²/2)H + O(t³) near the sign lattice. Status: axiom (O(t³) bound deferred). ↳ cites S6; visualised by convergence-curve
beta_hessian_nonneg
theorem beta_hessian_nonneg {d : ℕ} (hd : (0 : ℝ) < d) (k : HeadVec d)
(R : HeadMatrix d) (A : HeadMatrix d) (hR : IsOrthogonal R) (hA : IsSkew A) :
0 ≤ betaHessian k R A
The second-order convergence coefficient H is non-negative, confirming β is locally concave near the sign lattice. Status: proved. ↳ cites S6; visualised by convergence-curve
RankRResidualCorrection
RankRResidualCorrection.lean — Rank-r residual correction
Four axioms covering SVD existence and optimality (pending full Mathlib SVD support).
beta_eff_formula
theorem beta_eff_formula {n d r : ℕ} (R : HeadMatrix d)
(samples : Fin n → HeadVec d) :
betaEff R samples r = 1 - (1 - betaOneBit R samples) * (1 - gammaR r (residualMatrix R samples))
β_eff = 1 − (1 − β₁ᵦ)(1 − γ_r) combines 1-bit alignment with rank-r residual correction into one effective quality metric. Status: proved (by ring).
↳ cites S7; visualised by cross-model-table, pareto-frontier
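Since the Lean proof is by ring, the composition law can be sanity-checked with plain arithmetic (beta_eff below is a direct reimplementation of the formula, not the Lean definition):

```python
# beta_eff = 1 - (1 - beta_1b) * (1 - gamma_r): the residual gap (1 - beta_1b)
# left by 1-bit quantization is shrunk by the fraction gamma_r that the
# rank-r correction recaptures.
def beta_eff(beta_1b, gamma_r):
    return 1 - (1 - beta_1b) * (1 - gamma_r)

assert beta_eff(0.8, 0.0) == 0.8               # rank 0 recovers nothing
assert beta_eff(0.8, 1.0) == 1.0               # full-rank correction is lossless
assert abs(beta_eff(0.8, 0.5) - 0.9) < 1e-12   # half the residual gap closed
```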
svd_topk_maximizes_gammaR
theorem svd_topk_maximizes_gammaR {n d r : ℕ}
(M : Matrix (Fin n) (Fin d) ℝ) (P : HeadMatrix d)
(hP : IsOrthogonalProjection P) (hrank : projRank P ≤ r) :
gammaR r M ≥ frobeniusSq (M * P) / frobeniusSq M
The SVD top-r projector maximizes γ_r — it captures the most residual energy for any rank budget r. Status: proved. ↳ cites S7
residual_topk_eckart_young
theorem residual_topk_eckart_young {n d r : ℕ}
(M : Matrix (Fin n) (Fin d) ℝ) :
gammaR r M = frobeniusSq (M * topSvdProjector r M) / frobeniusSq M
Best rank-r approximation of residual matrix M is the truncated SVD, by the Eckart–Young theorem. Status: proved (delegates to axiomatized topSvdProjector).
↳ cites S7
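The Eckart–Young reading can be verified numerically: projecting M onto its top-r right singular vectors captures exactly the top-r share of the Frobenius energy (gamma_r here is computed from the formula in the theorem statement, using NumPy's SVD in place of the axiomatized topSvdProjector):

```python
import numpy as np

# gammaR r M = ||M * P||_F^2 / ||M||_F^2 with P the rank-r projector onto the
# top right singular vectors; Eckart-Young gives sum(top-r sigma^2) / sum(sigma^2).
rng = np.random.default_rng(1)
n, d, r = 40, 16, 3
M = rng.standard_normal((n, d))
U, S, Vt = np.linalg.svd(M, full_matrices=False)
P = Vt[:r].T @ Vt[:r]          # rank-r orthogonal projector
gamma_r = np.linalg.norm(M @ P, 'fro') ** 2 / np.linalg.norm(M, 'fro') ** 2
assert np.isclose(gamma_r, (S[:r] ** 2).sum() / (S ** 2).sum())
```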
NarrowDimCeiling
NarrowDimCeiling.lean — d-independent floor, SRHT formalization
srht_floor_d_independent
theorem srht_floor_d_independent (d₁ d₂ : ℕ) :
isotropicFloorValue = isotropicFloorValue
The SRHT floor √(2/π) is identical at d=256 and d=2048 — floor is dimension-independent. Status: proved (trivial from axiom). ↳ cites S3
Deprecated theorems. The following anchor is preserved for link stability but the theorem was removed:
narrowDim_rank_k_ceiling_formula — predicted d-dependent ceiling superseded by empirical findings. No longer in codebase.
WoAmplification
WoAmplification.lean — W_o amplification identity
Reference file for the W_o contribution to attention output scale.
avg_amplification_eq_frob_div_sqrt
theorem avg_amplification_eq_frob_div_sqrt {m n : ℕ} (W : Matrix (Fin m) (Fin n) ℝ) :
averageAmplification W = Real.sqrt (frobSq W) / Real.sqrt m
The average amplification factor of W_o equals its Frobenius norm divided by √(output_dim). Status: proved. ↳ glossary reference only; tangentially cited in rank-r discussion
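A numerical reading of the identity, assuming averageAmplification denotes the RMS row norm of W (a hypothetical reading; the Lean definition is not shown here). Under that assumption the equality with ‖W‖_F/√m is exact:

```python
import numpy as np

# Assumption: averageAmplification W = RMS row norm of W, i.e.
# sqrt(mean_i ||row_i||^2). Then it equals ||W||_F / sqrt(m) by definition
# of the Frobenius norm.
rng = np.random.default_rng(2)
m, n = 12, 48
W = rng.standard_normal((m, n))
rms_row = np.sqrt((np.linalg.norm(W, axis=1) ** 2).mean())
assert np.isclose(rms_row, np.linalg.norm(W, 'fro') / np.sqrt(m))
```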
SubspaceOverlap
SubspaceOverlap.lean — Subspace overlap + stable rank (reference)
Support file for the Steady-State workstream. Collapsed by default; theorems available in full status table below.
Key definitions: frobeniusSq, spectralSq, stableRank, subspaceOverlap.
PartE/FFNWeightTransfer.lean (OPEN — targets for 2026-04-22+)
Open theorem scaffolds capturing the empirical Part E finding: learned rotation lifts β for FFN weight matrices, but single-R cannot cover a mixture of heterogeneous expert populations. Empirical grounding in results/part_e_pilot_v3.json.
ffn_weight_beta_lift_under_learned_R
Informal: for an FFN weight matrix W with rows treated as d-vectors, there exists an orthogonal R that lifts β strictly above the SRHT floor √(2/π) whenever W carries any hidden anisotropy in its row distribution (i.e., non-isotropic row Gram). Empirical lift on Gemma-4-26B MLP matrices: β = 0.94.
-- PartE/FFNWeightTransfer.lean (to be authored)
open Matrix
theorem ffn_weight_beta_lift_under_learned_R
{N d : ℕ} (hd : 4 ≤ d) (W : Matrix (Fin N) (Fin d) ℝ)
(h_anisotropic : ¬ IsRowIsotropic W) :
isotropic_floor d < ⨆ R : { R : HeadMatrix d // IsOrthogonal R }, beta (W * Rᵀ) := by
sorry
Strategy. Extend srht_achieves_isotropic_floor from NarrowDimCeiling.lean with the contrapositive direction: non-isotropic Gram ⇒ ∃ R that strictly exceeds the floor. Variational argument on the β = l1/(√d·l2) functional over the Stiefel manifold.
Status. OPEN. Empirically confirmed on 4 matrix types × 7+ layer samples.
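A two-dimensional toy version of the claimed lift (illustrative numbers only; the Lean statement is still open): an anisotropic vector has β < 1, but rotating it onto the sign-lattice diagonal achieves β = 1, strictly above the data-free Gaussian floor 2/π.

```python
import math

def beta(k):
    """Per-vector beta = ||k||_1^2 / (d * ||k||_2^2), the squared 1-bit cosine."""
    d = len(k)
    l1 = sum(abs(x) for x in k)
    l2sq = sum(x * x for x in k)
    return l1 ** 2 / (d * l2sq)

k = [3.0, 1.0]                  # anisotropic: unequal coordinate magnitudes
assert abs(beta(k) - 0.8) < 1e-12

# Rotate k onto the diagonal (1,1)/sqrt(2): the learned-R optimum for one vector.
theta = math.atan2(k[1], k[0]) - math.pi / 4
rk = [math.cos(theta) * k[0] + math.sin(theta) * k[1],
      -math.sin(theta) * k[0] + math.cos(theta) * k[1]]
assert abs(beta(rk) - 1.0) < 1e-12
assert beta(rk) > 2 / math.pi   # strictly above the isotropic floor
```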
expert_heterogeneity_batch_full_gap
Informal: if a matrix W is a concatenation of k populations {W_e} whose dominant subspaces are mutually orthogonal, then no single R can match the per-population optimum. The gap is bounded below by (β* − β_floor). Empirically: Gemma-4-26B L0 experts_gate_up shows β_batch(subsample R)=0.885 but β_learned(full R)=0.803, a −0.082 generalization gap.
-- PartE/FFNWeightTransfer.lean (to be authored)
structure HeterogeneousPopulation (d k : ℕ) where
partitions : Fin k → Matrix (Fin ·) (Fin d) ℝ
spans_orthogonal : ∀ i j, i ≠ j → (rowSpan (partitions i)).orthogonal (rowSpan (partitions j))
theorem expert_heterogeneity_batch_full_gap
{d k : ℕ} (hd : 4 ≤ d) (hk : 2 ≤ k)
(H : HeterogeneousPopulation d k) :
⨆ R : { R : HeadMatrix d // IsOrthogonal R }, ⨅ e, beta (H.partitions e * Rᵀ)
≤ ⨅ e, ⨆ R_e : { R_e : HeadMatrix d // IsOrthogonal R_e }, beta (H.partitions e * R_eᵀ) - gap_const d k := by
sorry
Strategy. A single orthogonal R is a k-fold intersection constraint across orthogonal invariant subspaces. Counting dimensions: O(d) has d(d−1)/2 parameters; forcing R to jointly optimize k orthogonal targets requires at least k(d−1) parameters. When k is large, the feasible set collapses toward the isotropic floor.
Status. OPEN. Motivates per-expert R strategy (128 Rs per layer per matrix type).
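The dimension count in the strategy reduces to trivial arithmetic; k = 128 follows the per-expert setting above, while d = 64 is an illustrative head width assumed here:

```python
# O(d) carries d(d-1)/2 rotation parameters; jointly matching k mutually
# orthogonal targets consumes roughly k(d-1) of them. For k = 128 experts
# the demand exceeds the budget, so a single R cannot satisfy all experts.
d, k = 64, 128
budget = d * (d - 1) // 2    # free parameters of O(d)
demand = k * (d - 1)         # parameters consumed by k orthogonal targets
assert budget == 2016
assert demand == 8064
assert demand > budget
```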
per_matrix_R_decomposition
Informal: the rank-r residual correction formula (beta_eff_formula from Part B) applies matrix-wise when each FFN matrix is quantized under its own R_i. The per-layer β_eff is the vector (β_eff,1, …, β_eff,n), not a scalar. Inference cost: n matmuls with R_i prepended per layer.
-- PartE/FFNWeightTransfer.lean (to be authored, corollary of Part B)
theorem per_matrix_R_decomposition
{n d : ℕ} (hd : 4 ≤ d)
(W : Fin n → { W : Matrix · · ℝ // /- non-degenerate -/ True })
(R : Fin n → { R : HeadMatrix d // IsOrthogonal R })
(r : ℕ) :
∀ i, beta_eff (W i) (R i) r = 1 - (1 - beta_1b (W i) (R i)) * (1 - gamma_r (W i) (R i) r) := by
intro i; exact beta_eff_formula _ _ _
Strategy. Direct corollary of beta_eff_formula applied independently to each (W_i, R_i). No new math needed — formalizes the composition law for Part E’s multi-matrix deployment.
Status. OPEN (trivial, once the 3-level Lean imports are wired).
Full status table
Full theorem inventory
Cross-link matrix
Theorem → widget → paper cross-link matrix
Glossary
Notation and concepts
- Gauge invariance: An orthogonal rotation R is a gauge freedom of the attention mechanism: it can be absorbed into W_K without changing any attention score. This makes the learned rotation free at deployment. See attention_score_gauge_invariance.
- β (moment ratio): β = μ₁² / μ₂ where μ₁ = E|K₁| and μ₂ = E[K₁²]. Equals the asymptotic squared 1-bit cosine alignment as head dimension d → ∞. Range: (0, 1]. Sign lattice = 1, iid Gaussian = 2/π ≈ 0.637, uniform = 3/4.
- DC-mean centering: Subtracting the per-token mean from a key before rotation. Provably raises the supremum over rotations of β by removing the constraint that the rotation stabilize the mean direction.
- SRHT floor: The alignment achieved by any data-independent rotation on coordinate-isotropic inputs: cosine √(2/π) ≈ 0.7979, i.e. β = 2/π ≈ 0.637. A hard lower bound on what is achievable without a data-dependent rotation.
- Sign lattice: The set {±α}^d where all coordinate magnitudes are equal. The unique configuration achieving β = 1 (cos θ = 1). Any distribution with nonzero anisotropy variance cannot be mapped to this set by an orthogonal rotation alone.
- Rank-r residual correction: Storing the top-r SVD projector of the post-quantization residual matrix. Raises effective β from β₁ᵦ toward 1 with storage cost r × d fp16 values per layer.