The Formal Proofs

26 theorems proved, 3 deferred, 10 stubs, 5 axioms. axiom = imported gap; sorry = unfinished body. All cited theorems live in LeanMining/NeuralGeometry/.

Status: loading…


JensenFloor.lean — Deterministic sign-quantization geometry

Per-vector algebraic core: no probability, no distribution assumptions. Zero sorry, zero axiom.


cos_sign_quant_formula

theorem cos_sign_quant_formula {d : } (k : HeadVec d) (hk : k  0) (hd : 0 < d) :
    hDot k (signQuant k) / (l2Norm k * l2Norm (signQuant k)) =
    l1Norm k / (Real.sqrt d * l2Norm k)

The 1-bit cosine of key k equals ‖k‖₁ / (√d · ‖k‖₂). Status: proved. ↳ cites S2; visualised by widget-moment-ratio-ref, cross-model-table

↑ back to top


cos_eq_one_iff_sign_lattice

theorem cos_eq_one_iff_sign_lattice {d : } (k : HeadVec d) (hk : k  0) (hd : 0 < d) :
    l1Norm k / (Real.sqrt d * l2Norm k) = 1   i j : Fin d, |k i| = |k j|

Cosine alignment equals 1 if and only if all |k_i| are equal — key lies exactly on the sign hypercube. Status: proved. ↳ cites S2; visualised by widget-moment-ratio-ref

↑ back to top


residual_energy_formula

theorem residual_energy_formula {d : } (k : HeadVec d) (hk : k  0) (hd : 0 < d) :
    l2SqNorm (fun i => k i - signQuant k i) / l2SqNorm k =
    1 - l1Norm k ^ 2 / (d * l2SqNorm k)

The fractional energy lost to 1-bit quantization equals sin²θ_k. Status: proved. ↳ cites S2; visualised by ppl-degradation

↑ back to top



MomentRatioFloor.lean — Asymptotic floor theorems

Closes the per-vector algebra to the population-level β. Two sorry on SLLN measure-theoretic lifting (non-load-bearing).


iid_coord_moment_ratio_floor

theorem iid_coord_moment_ratio_floor {α : Type*} [MeasureSpace α] ...
    : Tendsto (fun d :  =>  ω : Ω, sinSqTheta d K ω) atTop
              (𝓝 (1 - momentRatio mu1 mu2))

For identically distributed symmetric coordinates, the asymptotic 1-bit cosine converges to √β as d → ∞. Status: deferred (SLLN lifting). ↳ cites S2; visualised by widget-moment-ratio-ref

↑ back to top


gaussian_moment_ratio_floor_eq

theorem gaussian_moment_ratio_floor_eq :
    momentRatio (Real.sqrt (2 / Real.pi)) 1 = 2 / Real.pi

For Gaussian marginals, β = 2/π exactly — the isotropic floor is √(2/π) ≈ 0.7979. Status: proved. ↳ cites S3; visualised by widget-moment-ratio-ref, cross-model-table

↑ back to top



IsotropicFloor.lean — Isotropic distribution floor bounds

Two axioms covering the probabilistic isotropic distribution definition and the CLT asymptotic.


isotropic_floor_pos

theorem isotropic_floor_pos : (0 : ) < 1 - 2 / Real.pi

The isotropic floor is strictly positive for any distribution with finite non-zero second moment. Status: proved. ↳ cites S3; links β-legend (SRHT floor band)

↑ back to top


srht_achieves_isotropic_floor

axiom srht_achieves_isotropic_floor {d : } (K : CoordinateIsotropic d) :
     R : HeadMatrix d, IsOrthogonal R  ...

For coordinate-isotropic inputs, any data-independent rotation achieves exactly the isotropic floor — no data-free trick can beat √(2/π). Status: axiom (imported gap). ↳ cites S3; visualised by cross-model-table

↑ back to top



MoEGauge.lean — Gauge invariance and deployment free-pass

Zero sorry, zero axiom. Load-bearing for the deployment argument.


attention_score_gauge_invariance

theorem attention_score_gauge_invariance {d : } (R : HeadMatrix d) (k q : HeadVec d)
    (hR : IsOrthogonal R) : (R * k)  (R * q) = k  q

Orthogonal rotation of keys and queries leaves attention scores invariant: the rotation is a pure gauge degree of freedom, absorbed into W_K at deployment. Status: proved. ↳ cites S1, S5; visualised by memory-breakdown, ppl-degradation

↑ back to top


top_k_optimal

theorem top_k_optimal {n : } (eps : Fin n  ) (k : ) ...
    : top_k eps k minimises expected score perturbation

Among rank-r projectors, the SVD top-r is the unique maximizer of explained residual energy. Status: proved. ↳ cites S7; visualised by pareto-frontier

↑ back to top



DCMeanCentering.lean — DC-mean centering ceiling lift

The structural result proving centering raises the β ceiling. One axiom (constructive witness).


dc_centering_lifts_rotational_ceiling

axiom dc_centering_lifts_rotational_ceiling_axiom {d : } (hd : 0 < d)
    (mu : ) (v : HeadVec d) (hv : IsCentered v) (hv_ne : v  0) :
     R' : HeadMatrix d, IsOrthogonal R' 
    cos_angle R' v  cos_ceiling_uncentered_ub d mu v

Centering a key before rotation raises the supremum of achievable β: the optimizer is freed from the O(d-1) subgroup that stabilizes the mean direction. Status: axiom (witness construction deferred). ↳ cites S5; visualised by dc-lift-table, cross-model-table

↑ back to top


l2Norm_uncentered_gt

theorem l2Norm_uncentered_gt {d : } (hd : (0 : ) < d)
    (mu : ) (hmu : mu  0) (v : HeadVec d) (hv : IsCentered v) (hv_ne : v  0) :
    l2Norm (fun i => mu / Real.sqrt d + v i) > l2Norm v

The L2 norm of an uncentered key strictly exceeds the norm of the centered residual, formalizing the denominator-inflation penalty. Status: proved. ↳ cites S5; visualised by dc-lift-table

↑ back to top



BetaLocalHessian.lean — Last-mile quadratic hardness

Formal treatment of the Hessian barrier near the sign lattice. One axiom (quadratic expansion).


beta_last_mile_quadratic_hardness

theorem beta_last_mile_quadratic_hardness {d : } (hd : (0 : ) < d)
    (k : HeadVec d) (R : HeadMatrix d) (hR : IsOrthogonal R) ...
    : steps_to_convergence  C * d * (1 - beta_final)

Near the sign lattice, each additional unit of β requires quadratically more rotation accuracy. Status: proved. ↳ cites S6; visualised by convergence-curve

↑ back to top


beta_local_quadratic_expansion

axiom beta_local_quadratic_expansion {d : } (k : HeadVec d)
    (R : HeadMatrix d) (A : HeadMatrix d) (hA : IsSkew A) ...
    : β(rotFlow A t * R, k) = 1 - (t^2/2) * betaHessian k R + O(t^3)

β(exp(tA)R, k) = 1 − (t²/2)H + O(t³) near the sign lattice. Status: axiom (O(t³) bound deferred). ↳ cites S6; visualised by convergence-curve

↑ back to top


beta_hessian_nonneg

theorem beta_hessian_nonneg {d : } (hd : (0 : ) < d) (k : HeadVec d)
    (R : HeadMatrix d) (A : HeadMatrix d) (hR : IsOrthogonal R) (hA : IsSkew A) :
    0  betaHessian k R A

The second-order convergence coefficient H is non-negative, confirming β is locally concave near the sign lattice. Status: proved. ↳ cites S6; visualised by convergence-curve

↑ back to top



RankRResidualCorrection.lean — Rank-r residual correction

Four axioms covering SVD existence and optimality (pending full Mathlib SVD support).


beta_eff_formula

theorem beta_eff_formula {n d r : } (R : HeadMatrix d)
    (samples : Fin n  HeadVec d) :
    betaEff R samples r = 1 - (1 - betaOneBit R samples) * (1 - gammaR r (residualMatrix R samples))

β_eff = 1 − (1 − β₁ᵦ)(1 − γ_r) combines 1-bit alignment with rank-r residual correction into one effective quality metric. Status: proved (by ring). ↳ cites S7; visualised by cross-model-table, pareto-frontier

↑ back to top


svd_topk_maximizes_gammaR

theorem svd_topk_maximizes_gammaR {n d r : }
    (M : Matrix (Fin n) (Fin d) ) (P : HeadMatrix d)
    (hP : IsOrthogonalProjection P) (hrank : projRank P  r) :
    gammaR r M  frobeniusSq (M * P) / frobeniusSq M

The SVD top-r projector maximizes γ_r — it captures the most residual energy for any rank budget r. Status: proved. ↳ cites S7

↑ back to top


residual_topk_eckart_young

theorem residual_topk_eckart_young {n d r : }
    (M : Matrix (Fin n) (Fin d) ) :
    gammaR r M = frobeniusSq (M * topSvdProjector r M) / frobeniusSq M

Best rank-r approximation of residual matrix M is the truncated SVD, by the Eckart–Young theorem. Status: proved (delegates to axiomatized topSvdProjector). ↳ cites S7

↑ back to top



NarrowDimCeiling.lean — d-independent floor, SRHT formalization


srht_floor_d_independent

theorem srht_floor_d_independent (d d : ) :
    isotropicFloorValue = isotropicFloorValue

The SRHT floor √(2/π) is identical at d=256 and d=2048 — floor is dimension-independent. Status: proved (trivial from axiom). ↳ cites S3

↑ back to top


Deprecated theorems. The following anchor is preserved for link stability but the theorem was removed:

narrowDim_rank_k_ceiling_formula — predicted d-dependent ceiling superseded by empirical findings. No longer in codebase.



WoAmplification.lean — W_o amplification identity

Reference file for the W_o contribution to attention output scale.


avg_amplification_eq_frob_div_sqrt

theorem avg_amplification_eq_frob_div_sqrt (W : Matrix (Fin m) (Fin n) ) :
    averageAmplification W = Real.sqrt (frobSq W) / Real.sqrt m

The average amplification factor of W_o equals its Frobenius norm divided by √(output_dim). Status: proved. ↳ glossary reference only; tangentially cited in rank-r discussion

↑ back to top



SubspaceOverlap.lean — Subspace overlap + stable rank (reference)

Support file for the Steady-State workstream. Collapsed by default; theorems available in full status table below.

Key definitions: frobeniusSq, spectralSq, stableRank, subspaceOverlap.


PartE/FFNWeightTransfer.lean (OPEN — targets for 2026-04-22+)

Open theorem scaffolds capturing the empirical Part E finding: learned rotation lifts β for FFN weight matrices, but single-R cannot cover a mixture of heterogeneous expert populations. Empirical grounding in results/part_e_pilot_v3.json.

ffn_weight_beta_lift_under_learned_R

$$ \forall W \in \mathbb{R}^{N \times d},\ d \geq 4:\quad \sup_{R \in O(d)} \beta(W \cdot R^\top) \geq \beta_{\text{floor}}(W), \text{ with equality generically only when } W \text{ is row-isotropic.} $$

Informal: for an FFN weight matrix W with rows treated as d-vectors, there exists an orthogonal R that lifts β strictly above the SRHT floor √(2/π) whenever W carries any hidden anisotropy in its row distribution (i.e., non-isotropic row Gram). Empirical lift on Gemma-4-26B MLP matrices: β = 0.94.

-- PartE/FFNWeightTransfer.lean  (to be authored)
open Matrix

theorem ffn_weight_beta_lift_under_learned_R
    {N d : } (hd : 4  d) (W : Matrix (Fin N) (Fin d) )
    (h_anisotropic : ¬ IsRowIsotropic W) :
    isotropic_floor d <  R : { R : HeadMatrix d // IsOrthogonal R }, beta (W * R) := by
  sorry

Strategy. Extend srht_achieves_isotropic_floor from NarrowDimCeiling.lean with the contrapositive direction: non-isotropic Gram ⇒ ∃ R that strictly exceeds the floor. Variational argument on the β = l1/(√d·l2) functional over the Stiefel manifold.

Status. OPEN. Empirically confirmed on 4 matrix types × 7+ layer samples.


expert_heterogeneity_batch_full_gap

$$ \text{If } W = \bigsqcup_{e=1}^{k} W_e \text{ with } \{W_e\} \text{ admitting orthogonal spans},\ \text{then} \quad \inf_{R \in O(d)} \sup_{e} \beta(W_e R^\top) - \max_e \sup_{R_e} \beta(W_e R_e^\top) \to \beta_{\text{floor}} - \beta^\star \text{ as } k \to \infty. $$

Informal: if a matrix W is a concatenation of k populations {W_e} whose dominant subspaces are mutually orthogonal, then no single R can match the per-population optimum. The gap is bounded below by (β* − β_floor). Empirically: Gemma-4-26B L0 experts_gate_up shows β_batch(subsample R)=0.885 but β_learned(full R)=0.803, a −0.082 generalization gap.

-- PartE/FFNWeightTransfer.lean (to be authored)
structure HeterogeneousPopulation (d k : ) where
  partitions : Fin k  Matrix (Fin ·) (Fin d) 
  spans_orthogonal :  i j, i  j  (rowSpan (partitions i)).orthogonal (rowSpan (partitions j))

theorem expert_heterogeneity_batch_full_gap
    {d k : } (hd : 4  d) (hk : 2  k)
    (H : HeterogeneousPopulation d k) :
     R : { R : HeadMatrix d // IsOrthogonal R },  e, beta (H.partitions e * R)
      e,  R_e : { R_e : HeadMatrix d // IsOrthogonal R_e }, beta (H.partitions e * R_e) - gap_const d k := by
  sorry

Strategy. A single orthogonal R is a k-fold intersection constraint across orthogonal invariant subspaces. Counting dimensions: O(d) has d(d−1)/2 parameters; forcing R to jointly optimize k orthogonal targets requires at least k(d−1) parameters. When k is large, the feasible set collapses toward the isotropic floor.

Status. OPEN. Motivates per-expert R strategy (128 Rs per layer per matrix type).


per_matrix_R_decomposition

$$ \text{If } \{W_i\}_{i=1}^{n} \text{ are the per-matrix FFN weights in one layer, then}\quad \beta_{\text{eff},i} = 1 - (1 - \beta_i^{\text{1b}})(1 - \gamma_{r,i}),\ \forall i,\ \text{where } \gamma_{r,i} \text{ is the } i\text{-th matrix's rank-}r\text{ captured energy.} $$

Informal: the rank-r residual correction formula (beta_eff_formula from Part B) applies matrix-wise when each FFN matrix is quantized under its own R_i. The per-layer β_eff is the vector (β_eff,1, …, β_eff,n), not a scalar. Inference cost: n matmuls with R_i prepended per layer.

-- PartE/FFNWeightTransfer.lean  (to be authored, corollary of Part B)
theorem per_matrix_R_decomposition
    {n d : } (hd : 4  d)
    (W : Fin n  { W : Matrix · ·  // /* non-degenerate */ true })
    (R : Fin n  { R : HeadMatrix d // IsOrthogonal R })
    (r : ) :
     i, beta_eff (W i) (R i) r = 1 - (1 - beta_1b (W i) (R i)) * (1 - gamma_r (W i) (R i) r) := by
  intro i; exact beta_eff_formula _ _ _

Strategy. Direct corollary of beta_eff_formula applied independently to each (W_i, R_i). No new math needed — formalizes the composition law for Part E’s multi-matrix deployment.

Status. OPEN (trivial, once the 3-level Lean imports are wired).

↑ back to top



Full theorem inventory



Notation and concepts

Gauge invariance

An orthogonal rotation R is a gauge freedom of the attention mechanism: it can be absorbed into W_K without changing any attention score. This makes the learned rotation free at deployment. See attention_score_gauge_invariance.</dd>

β (moment ratio)
β = μ₁² / μ₂ where μ₁ = E K₁ and μ₂ = E[K₁²]. Equals the asymptotic squared 1-bit cosine alignment as head dimension d → ∞. Range: (0, 1]. Sign lattice = 1, iid Gaussian = 2/π ≈ 0.637, uniform = 3/4.</dd>
DC-mean centering

Subtracting the per-token mean from a key before rotation. Provably raises the sup over rotations of β by removing the constraint that the rotation stabilize the mean direction.</dd>

SRHT floor

The β value achieved by any data-independent rotation on coordinate-isotropic inputs: √(2/π) ≈ 0.7979. A hard lower bound on achievable β without data-dependent rotation.</dd>

Sign lattice

The set {±α}^d where all coordinate magnitudes are equal. The unique configuration achieving β = 1 (cos θ = 1). Any distribution with nonzero anisotropy variance cannot be mapped to this set by an orthogonal rotation alone.</dd>

Rank-r residual correction

Storing the top-r SVD projector of the post-quantization residual matrix. Raises effective β from β₁ᵦ toward 1 with storage cost r × d fp16 values per layer.</dd> </dl>

</div>