The Formal Proofs
We proved the load-bearing theorems in a proof assistant. The computer verified every step — for the proven subset.
A proof assistant is a programming language where the compiler checks mathematical arguments, not just syntax. If the code compiles, the theorem is proved — no gaps, no hand-waving. We used Lean 4 with the Mathlib library. The theorems cited inline in the manuscript live in LeanMining/NeuralGeometry/.
Load-bearing theorems
What the paper’s quantitative claims rest on
iid_coord_moment_ratio_floor
Statement. If the coordinates $K_i$ of a rotated key are identically distributed, symmetric about zero, with finite second moment, and satisfy the strong law of large numbers for $\lvert K_i \rvert$ and $K_i^2$, then the expected residual energy converges to the moment-ratio floor:
$$\mathbb{E}[\sin^2 \theta_K] \longrightarrow 1 - \beta, \qquad \beta = \frac{\mu_1^2}{\mu_2} = \frac{(\mathbb{E}\lvert K_1 \rvert)^2}{\mathbb{E}[K_1^2]},$$
as the head dimension $d \to \infty$.
Plain English. The expected residual energy of 1-bit sign quantisation converges to $1 - \beta$ as the head dimension grows. The paper’s entire compounding story rides on this.
Why it matters. No orthogonal rotation can drive the expected residual below $1 - \beta$ for the $\beta$ it induces, so the floor is structural. Breaching it requires a non-orthogonal primitive (the top-$k$ fp16 correction).
Status caveat. The statement and the distributional specialisations (Gaussian/uniform/Laplace) are fully proved. Two sorry placeholders remain in the measure-theoretic plumbing that lifts the SLLN limit from almost-sure to expected value.
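The limit is easy to sanity-check numerically. Below is an illustrative sketch (not part of the Lean development) that samples a Gaussian key, uses the per-vector identity $\cos \theta_k = \lVert k \rVert_1 / (\sqrt{d}\,\lVert k \rVert_2)$, and watches the residual energy approach $1 - 2/\pi \approx 0.363$ as $d$ grows:

```python
import random, math

random.seed(0)

def residual_energy(d: int) -> float:
    """sin^2 of the angle between a Gaussian key and its 1-bit sign image."""
    k = [random.gauss(0.0, 1.0) for _ in range(d)]
    l1 = sum(abs(x) for x in k)          # ||k||_1
    l2_sq = sum(x * x for x in k)        # ||k||_2^2
    cos_sq = l1 * l1 / (d * l2_sq)       # cos^2 theta_k
    return 1.0 - cos_sq

# Residual energy drifts toward 1 - 2/pi ≈ 0.363 as d grows.
for d in (16, 256, 4096, 65536):
    print(d, residual_energy(d))
```

A single draw suffices at large $d$ because $\cos^2 \theta_K$ concentrates under the SLLN hypotheses of the theorem.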
gaussian_moment_ratio_floor_eq / uniform_moment_ratio_floor_eq / laplace_moment_ratio_floor_eq
Statement. Closed-form values of $\beta = \mu_1^2 / \mu_2$ for canonical distributions:
- Gaussian $\mathcal{N}(0, \sigma^2)$: $\beta = 2/\pi$
- Uniform on $[-1, 1]$: $\beta = 3/4$
- Laplace scale $b$: $\beta = 1/2$
Plain English. The CLT attractor that data-independent rotations are pinned to is exactly $2/\pi \approx 0.637$. The light-tail limit that a learned $L^1$-max rotation approaches is $3/4$. The worst-case heavy-tail is Laplace at $1/2$.
Why it matters. These three numbers frame the reference table in the paper, define the anti-kurtosis interpretation, and are what the $\beta$-slider interpolates between. All three are proved without sorry.
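The three closed forms can also be checked empirically. A minimal Monte Carlo sketch (illustrative, not from the repository; the Laplace draw uses the standard difference-of-exponentials representation):

```python
import random

random.seed(1)
N = 400_000

def beta(samples):
    # beta = (E|X|)^2 / E[X^2], the scale-invariant moment ratio
    m1 = sum(abs(x) for x in samples) / len(samples)
    m2 = sum(x * x for x in samples) / len(samples)
    return m1 * m1 / m2

gauss = [random.gauss(0.0, 1.0) for _ in range(N)]
unif = [random.uniform(-1.0, 1.0) for _ in range(N)]
# Laplace(0, 1) as the difference of two independent Exp(1) draws
lap = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(N)]

print(beta(gauss), beta(unif), beta(lap))  # ≈ 0.6366, 0.75, 0.5
```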
attention_score_gauge_invariance (MoEGauge.lean)
Statement. Inserting an orthogonal rotation $R$ before $W_K$ and absorbing $R^\top$ into the downstream $W_K$ leaves all attention scores exactly unchanged.
Plain English. The orthogonal rotation $R$ is a gauge degree of freedom of the attention head. Learning a better rotation is a gauge fix, not a change of the model’s function.
Why it matters. This is why the deployment cost of the learned rotation can be driven to zero — it is absorbable into $W_K$ post-training. The key invariance property is machine-checked.
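The invariance is a direct consequence of $R^\top R = I$, and can be demonstrated numerically. A self-contained sketch in pure Python (the orthogonal $R$ is built as a Householder reflector; all names here are illustrative, not the repository's):

```python
import random

random.seed(0)
d = 8

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Orthogonal R as a Householder reflector: R = I - 2 v v^T / ||v||^2
v = [random.gauss(0.0, 1.0) for _ in range(d)]
nv = dot(v, v)
R = [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / nv
      for j in range(d)] for i in range(d)]
Rt = [[R[j][i] for j in range(d)] for i in range(d)]

W_K = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
q = [random.gauss(0.0, 1.0) for _ in range(d)]  # query (post-projection)
k = [random.gauss(0.0, 1.0) for _ in range(d)]  # key input

# Rotate the key input by R, absorb R^T into W_K: score is unchanged.
score = dot(q, matvec(W_K, k))
score_gauged = dot(q, matvec(matmul(W_K, Rt), matvec(R, k)))
print(score, score_gauged)
```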
avg_amplification_eq_frob_div_sqrt (WoAmplification.lean)
Statement. For any linear map $W : \mathbb{R}^n \to \mathbb{R}^m$, the root-mean-squared amplification of the standard basis equals $\lVert W \rVert_F / \sqrt{n}$.
Plain English. The output projection $W_o$’s average per-coordinate amplification is exactly its Frobenius norm normalised by head dimension. This is the “average” part of the log-space 19% residual analysis.
Why it matters. The moment-ratio theorem explains 81% of the SRHT–vs–learned log-PPL gap. The remaining 19% is split between softmax and $W_o$; this theorem pins down the $W_o$ contribution analytically.
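The identity follows because $\sum_i \lVert W e_i \rVert^2$ is the sum of squared column norms, which equals $\lVert W \rVert_F^2$. A quick numeric check (illustrative sketch):

```python
import random, math

random.seed(2)
n, m = 6, 4  # W : R^n -> R^m, stored as an m x n matrix

W = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

# ||W e_i|| is the norm of column i; take the RMS over the n basis vectors.
col_norms_sq = [sum(W[r][i] ** 2 for r in range(m)) for i in range(n)]
rms_amp = math.sqrt(sum(col_norms_sq) / n)

frob = math.sqrt(sum(W[r][c] ** 2 for r in range(m) for c in range(n)))
print(rms_amp, frob / math.sqrt(n))  # equal up to float rounding
```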
cos_sign_quant_formula / residual_energy_formula (JensenFloor.lean)
Statement. For any nonzero $k \in \mathbb{R}^d$ with sign image $\hat{k} = \operatorname{sign}(k) \cdot \alpha$ (any scale $\alpha > 0$),
$$\cos \theta_k = \frac{\lVert k \rVert_1}{\sqrt{d}\,\lVert k \rVert_2}, \qquad \sin^2 \theta_k = 1 - \frac{\lVert k \rVert_1^2}{d\,\lVert k \rVert_2^2}.$$
Plain English. The per-vector algebraic bridge between the geometric quantity (phase-collapse angle) and the engineering quantity (1-bit recovery error). Both identities are algebraic, independent of any distributional assumption on $k$.
Why it matters. Every distributional claim in the paper starts from one of these two identities. Both are fully proved without sorry.
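Assuming the identities take their standard algebraic form, $\cos \theta_k = \lVert k \rVert_1 / (\sqrt{d}\,\lVert k \rVert_2)$ (an assumption of this sketch; it follows from $\langle k, \operatorname{sign}(k) \rangle = \lVert k \rVert_1$ and $\lVert \operatorname{sign}(k) \rVert = \sqrt{d}$), they can be verified per vector:

```python
import random, math

random.seed(3)
d = 32
k = [random.gauss(0.0, 1.0) for _ in range(d)]

l1 = sum(abs(x) for x in k)                # ||k||_1
l2 = math.sqrt(sum(x * x for x in k))      # ||k||_2

# Angle between k and sign(k); the positive scale alpha cancels in the angle.
sign_k = [math.copysign(1.0, x) for x in k]
cos_theta = sum(a * b for a, b in zip(k, sign_k)) / (l2 * math.sqrt(d))

print(cos_theta, l1 / (math.sqrt(d) * l2))  # identical
print(1.0 - cos_theta ** 2)                 # residual energy, in [0, 1]
```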
Glossary
Terms used in the theorems
- Phase-collapse angle $\theta_k$
- The angle between a key vector $k \in \mathbb{R}^d$ and its 1-bit sign-quantised image $\hat{k} = \operatorname{sign}(k) \cdot \alpha$. $\theta_k = 0$ means $k$ lies on the sign lattice; $\theta_k > 44^\circ$ means the horizontal-slice allocator tags the layer as "fragmented" and assigns 3 bits.
- Moment ratio $\beta$
- $\beta = \mu_1^2 / \mu_2$ where $\mu_1 = \mathbb{E}\lvert K_1 \rvert$ and $\mu_2 = \mathbb{E}[K_1^2]$. A scale-invariant, distribution-free summary statistic that completely determines the 1-bit floor under the marginal-identity hypothesis. $\beta \in (0, 1]$, with $\beta = 1$ only on the sign lattice.
- 1-bit floor $1 - \beta$
- The asymptotic expected residual energy $\mathbb{E}[\sin^2 \theta_K]$ under 1-bit sign quantisation of a rotated key $K$ with moment ratio $\beta$. Lower is better; $2/\pi \approx 0.637$ gives a floor of $36.3\%$ (SRHT-unreachable), $3/4$ gives $25\%$ (uniform-limit best), measured MoE gives $29.84\%$.
- Gauge invariance
- The property that an orthogonal rotation $R$ applied before $W_K$ and absorbed into a downstream $W_K$ leaves attention scores exactly unchanged. Makes precise the sense in which $R$ is a free parameter of the attention head.
- Horizontal-slice allocator
- A pre-registered per-layer bit-budget rule: 1 bit if the head is a pressure-release layer ($d_\ell \geq 512$ and angle $\leq 40^\circ$); 2 bits if angle $\leq 44^\circ$ (ballroom); 3 bits otherwise (fragmented). The schedule is purely geometric — no PPL feedback — so it is falsifiable.
- Anti-kurtosis effect
- The observation that a learned $L^1$-max rotation on low-effective-rank MoE activations pushes the moment ratio $\beta$ past the Gaussian CLT attractor toward the uniform-distribution limit $3/4$, producing a lighter-tailed, quantisation-friendlier marginal than any data-independent rotation can.
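The horizontal-slice allocator defined above is simple enough to state as code. A hypothetical sketch (the function name and signature are illustrative, not the paper's implementation):

```python
def bits_for_layer(d_head: int, angle_deg: float) -> int:
    """Pre-registered per-layer bit budget; purely geometric, no PPL feedback."""
    if d_head >= 512 and angle_deg <= 40.0:
        return 1  # pressure-release layer
    if angle_deg <= 44.0:
        return 2  # ballroom
    return 3      # fragmented

print(bits_for_layer(512, 38.0), bits_for_layer(512, 43.0), bits_for_layer(512, 45.0))
```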
Full status table
All declarations
Each row is one theorem / lemma / def / axiom across the six LeanMining/NeuralGeometry/*.lean files. Proven means no sorry anywhere in the body. Deferred means the statement is real but the proof body contains a sorry. Stub means : True := sorry — a placeholder reserving a name for future work. Axiom marks imports the file takes on faith (measure-theoretic plumbing not yet formalised in Mathlib).
| Name | Kind | Status | Description |
|---|---|---|---|
- proven: complete proof, no sorry
- deferred: real statement, proof body contains sorry
- stub: : True := sorry placeholder
- axiom: imported axiom