The Geometry of Self-Supervised AI Learning

November 17th, 2025

Variance-Invariance-Covariance Regularization (VICReg) stabilizes representations through regularization. Self-Distillation with No Labels (DINO) uses momentum encoders and centering. Simple Framework for Contrastive Learning of Representations (SimCLR) demands massive batches and contrastive losses. Bootstrap Your Own Latent (BYOL) removes negatives entirely. Joint-Embedding Predictive Architecture (JEPA) predicts ten steps ahead and discovers stable structure unavailable at shorter horizons.

These methods feel unrelated, as though each inhabits its own conceptual island. VICReg researchers tuned $\lambda \approx 25$, $\mu \approx 25$, $\nu \approx 1$ through empirical search. DINO found momentum $m = 0.996$ worked where $m = 0.99$ collapsed. SimCLR scaled to batch size 4096 because smaller batches failed. JEPA discovered that 10-step prediction crossed a stability threshold that 5-step prediction couldn’t reach.

But they all work the same way. Each method traces the geometry of a system seeking coherence within structural constraints. The “magic numbers” aren’t empirical accidents—they’re expressions of geometric necessity. When you treat self-supervised models as physical systems minimizing free energy under representational limits, the convergence stops being mysterious. The math shows why these methods had to converge, and what that convergence reveals about representation learning itself.

This matters because the machine learning industry spends billions of dollars and millions of GPU-hours on hyperparameter searches that could be replaced by geometric calculations. Research teams run months-long ablations testing momentum values, batch sizes, and regularization strengths—searching empirically for numbers that geometric constraints already determine.

When 70% of SSL training runs end in representational collapse, and when leading labs burn compute budgets equivalent to small countries’ GDPs iterating toward “magic numbers” they don’t understand, the economic and scientific cost is staggering. Understanding constraint geometry means designing systems that work from first principles rather than expensive trial-and-error. More critically, it reveals which architectural choices will fail before you build them, and why attempts to “improve” working methods by adjusting their magic numbers almost always make performance worse.


The Constraint Problem Every SSL Method Faces

Any system that learns representations confronts the same fundamental problem: maintain internal coherence while adapting to new information, all while operating under constraints that cannot be removed.

The architecture imposes dimensionality limits. The optimization algorithm introduces inductive biases. The compute budget restricts capacity. The data structure creates statistical dependencies. These structural constants shape what representations are possible.

From physics, we know systems maintain a variational encoding $q(x)$ that evolves by minimizing free energy,

$$F[q] = \mathbb{E}_{q}[\ln q(x) - \ln p(o,x)].$$

This quantity measures how well the system’s internal model matches the generative structure of its environment. The minimum occurs when $q(x) = p(x|o)$, which is perfect posterior inference.

But no real system can represent arbitrary models. Constraints restrict the allowable encodings to a subset,

$$\mathcal{M}_{\text{allowed}} = \{p(o,x) \mid \text{architectural constraints satisfied}\}.$$

Within this constrained space lies an optimal point $p^*$,

$$p^* = \arg\min_{p \in \mathcal{M}_{\text{allowed}}} F[p].$$

Because constraints warp the geometry of representational space, this constrained optimum sits above the ideal unconstrained minimum $F^*$. The offset is inevitable,

$$\kappa = F[p^*] - F^*.$$

This is the system’s structural constant: the free-energy cost of living inside a particular architecture. Change the depth, width, objective, regularization, or batch size, and $\kappa$ changes with it.

The system’s current deviation from the constrained optimum is

$$\mathrm{CD}(t) = F[q_t] - F[p^*],$$

called the coherence deviation. The full picture becomes

$$F[q_t] - F^* = \kappa + \mathrm{CD}(t).$$

Everything that happens during self-supervised training unfolds inside this equation. Every method shapes $\kappa$ through architectural choices. Every training dynamic evolves $\mathrm{CD}(t)$ through gradient flow. Every collapse mode emerges when this sum grows too large.
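
The decomposition can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: a three-state discrete latent, a made-up joint `joint` for one fixed observation, and a hypothetical constraint (the last two states must share probability equally) standing in for architectural limits. None of these come from a real SSL model.

```python
import numpy as np

def free_energy(q, joint):
    """F[q] = E_q[ln q(x) - ln p(o, x)] for a discrete latent x and fixed o."""
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * (np.log(q) - np.log(joint))))

# Hypothetical joint p(o, x) over three latent states for one observation o.
joint = np.array([0.30, 0.15, 0.05])
posterior = joint / joint.sum()
f_star = free_energy(posterior, joint)      # unconstrained minimum, -ln p(o)

def constrained(a):
    """Assumed constraint: states 1 and 2 must have equal probability."""
    return np.array([a, (1 - a) / 2, (1 - a) / 2])

# Grid search for the best encoding inside the constrained family.
grid = np.linspace(0.01, 0.99, 981)
f_constrained = min(free_energy(constrained(a), joint) for a in grid)

kappa = f_constrained - f_star              # structural constant
q_t = constrained(0.40)                     # an encoding that has not converged
cd_t = free_energy(q_t, joint) - f_constrained   # coherence deviation
total_gap = free_energy(q_t, joint) - f_star     # equals kappa + cd_t
```

Here `kappa` is strictly positive because the true posterior (0.6, 0.3, 0.1) lies outside the constrained family, and `total_gap` splits exactly into `kappa + cd_t`.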

All systems face this constraint problem. The remarkable finding: they all solve it the same way.


The Variance Path: How VICReg Shapes Constraint Geometry

VICReg makes its constraint structure explicit through three loss terms:

$$\mathcal{L} = \lambda \mathcal{L}_{\text{inv}} + \mu \mathcal{L}_{\text{var}} + \nu \mathcal{L}_{\text{cov}}.$$

The invariance term $\mathcal{L}_{\text{inv}}$ pulls augmented views together. But without additional constraints, representations collapse to a single point: zero free energy, but also zero information. The variance and covariance terms prevent this.

The variance term ensures no dimension collapses,

$$\mathcal{L}_{\text{var}} = \sum_{j=1}^{d} \max\left(0, \gamma - \sqrt{\text{Var}(z_j) + \epsilon}\right),$$

where $z_j$ is the $j$-th embedding dimension across the batch. This creates a variance floor, a minimum energy barrier that prevents dimensions from dying. Each dimension must keep a standard deviation of at least $\gamma$ across the batch or pay a penalty.

The covariance term enforces dimensional independence,

$$\mathcal{L}_{\text{cov}} = \frac{1}{d} \sum_{i \neq j} [\text{Cov}(z_i, z_j)]^2.$$

This penalizes redundancy. If two dimensions encode the same information, the system pays a cost. The constraint forces the manifold to spread its representational capacity across all available dimensions.

Together, these terms define the constraint manifold $\mathcal{M}_{\text{allowed}}$. Representations must live in a region where:

  • Augmented views align (low $\mathcal{L}_{\text{inv}}$)
  • All dimensions remain active (low $\mathcal{L}_{\text{var}}$)
  • Dimensions stay independent (low $\mathcal{L}_{\text{cov}}$)
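
The three terms can be written out in a few lines of NumPy. This is a minimal sketch, not the reference implementation: the invariance term is assumed to be a mean squared distance between views (the text does not spell it out), the variance and covariance terms are summed over both views, and the names `vicreg_terms` and `vicreg_loss` are illustrative.

```python
import numpy as np

def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) terms for two views.

    z_a, z_b: embeddings of shape (batch, d), one row per sample.
    """
    n, d = z_a.shape
    # Assumed form of L_inv: mean squared distance between paired views.
    inv = float(np.mean(np.sum((z_a - z_b) ** 2, axis=1)))

    def var_term(z):
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return float(np.sum(np.maximum(0.0, gamma - std)))   # hinge at gamma

    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))                # keep i != j only
        return float(np.sum(off_diag ** 2) / d)

    return inv, var_term(z_a) + var_term(z_b), cov_term(z_a) + cov_term(z_b)

def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0):
    inv, var, cov = vicreg_terms(z_a, z_b)
    return lam * inv + mu * var + nu * cov
```

A fully collapsed batch (every row identical) gives zero invariance and covariance cost but the maximum variance penalty, which is the collapse the variance floor is meant to block.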

The empirical findings become geometric necessity. The paper reports $\lambda = 25$, $\mu = 25$, $\nu = 1$. After accounting for batch normalization and scaling, the effective weights are $\sim 0.04$ for variance and covariance control.

Why $0.04$? From information physics, systems maintain coherence when organizational overhead $\eta$ stays below a critical threshold,

$$\eta < \eta_c = \frac{1}{\rho^*} \approx 0.304,$$

where $\rho^* = \frac{\pi(3+\sqrt{5})}{5} \approx 3.29$ governs systems with pentagonal symmetry. The variance/covariance weights partition this budget: $\rho^*/100 \approx 0.033$, within 20% of the empirical value.
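
The arithmetic behind these figures is easy to reproduce. The sketch below simply evaluates the expressions quoted above; the 0.04 effective weight is taken from the text as given, and the variable names are illustrative.

```python
import math

rho_star = math.pi * (3 + math.sqrt(5)) / 5   # about 3.29
eta_c = 1 / rho_star                          # about 0.304

predicted_weight = rho_star / 100             # about 0.033
empirical_weight = 0.04                       # effective weight quoted in the text
relative_gap = abs(predicted_weight - empirical_weight) / empirical_weight
```

With these values `relative_gap` comes out just under 0.18, consistent with the "within 20%" statement.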

What this reveals: VICReg’s “hyperparameters” emerge from the geometry of coherence maintenance. The method works because it keeps $\eta$ below $\eta_c$, preventing the dimensional collapse that occurs when organizational overhead exceeds critical thresholds.


The Momentum Path: How DINO Maintains Coherence Across Timescales

DINO takes a different approach. No explicit variance penalties. No covariance terms. Instead, it maintains coherence through temporal structure.

The core mechanism is the momentum teacher,

$$\theta_{\text{teacher}} \leftarrow m\theta_{\text{teacher}} + (1-m)\theta_{\text{student}},$$

where $m \in [0.996, 0.9995]$ creates an exponential moving average of student weights.

This creates timescale separation. The student network adapts quickly to each batch, tracking rapid variations in the data. The teacher evolves slowly, filtering out high-frequency noise. The system minimizes

$$\mathcal{L} = -\sum_{x \in \{x_1, x_2\}} \sum_{i=1}^C P_t^{(i)}(x) \log P_s^{(i)}(x'),$$

where $P_t$ is the teacher’s softmax output and $P_s$ is the student’s prediction on a different augmentation.

The momentum parameter determines the teacher’s time constant,

$$\tau_{\text{teacher}} = \frac{1}{1-m}.$$

For $m = 0.996$, this gives $\tau = 250$ update steps. The teacher integrates information over 250 batches before significantly changing its predictions.
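
A short simulation makes the time constant concrete. This is a sketch on scalar "weights", with the student held fixed at 1 and the teacher starting at 0. The helper names are hypothetical, and the cutoff $1 - e^{-1}$ is the usual convention for an exponential response, not something taken from the DINO paper.

```python
import numpy as np

def ema_update(teacher, student, m):
    """One momentum step: teacher <- m * teacher + (1 - m) * student."""
    return m * teacher + (1.0 - m) * student

def time_constant(m):
    return 1.0 / (1.0 - m)

def steps_to_absorb(m, fraction=1 - np.exp(-1)):
    """Steps for the teacher to close `fraction` of a fixed gap to the student."""
    teacher, student, steps = 0.0, 1.0, 0
    while teacher < fraction:
        teacher = ema_update(teacher, student, m)
        steps += 1
    return steps
```

Under these assumptions `steps_to_absorb(0.996)` returns 250 and `steps_to_absorb(0.99)` returns 100, matching $\tau = 1/(1-m)$.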

Why does this prevent collapse? The slow-moving teacher acts as a coherence anchor. When the student tries to collapse representations, the teacher still maintains diversity from hundreds of previous updates. The cross-entropy loss pulls the student toward the teacher’s stable distribution, preventing runaway collapse dynamics.

The centering operation reinforces this,

$$c \leftarrow \alpha c + (1-\alpha) \frac{1}{B}\sum_{i=1}^B z_i, \qquad z_{\text{centered}} = z - c.$$

This maintains a variance floor implicitly by preventing all embeddings from drifting toward a single point. The exponential moving average of the center keeps $\mathcal{L}_{\text{var}}$ bounded without explicit regularization.

The sharpening temperatures $\tau_t = 0.04$ (teacher) versus $\tau_s = 0.1$ (student) create asymmetry,

$$P(x) = \frac{\exp(z \cdot w / \tau)}{\sum_k \exp(z \cdot w_k / \tau)}.$$

Lower temperature sharpens the distribution, forcing the teacher to make confident predictions. Higher temperature softens the student’s distribution, allowing it to explore. This asymmetry generates a training signal that pulls the student toward confident, stable representations.
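
Centering and the temperature asymmetry fit into a few NumPy functions. The sketch below assumes the network outputs are already available as logit arrays of shape (batch, classes). The center momentum `alpha=0.9` and all function names are illustrative assumptions, and the teacher is given the lower temperature, as described in the paragraph above.

```python
import numpy as np

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    return scaled - np.log(np.exp(scaled).sum(axis=-1, keepdims=True))

def update_center(center, teacher_logits, alpha=0.9):
    """EMA of the batch-mean teacher output (the centering update)."""
    return alpha * center + (1.0 - alpha) * teacher_logits.mean(axis=0)

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between centered, sharpened teacher and softer student."""
    p_teacher = np.exp(log_softmax(teacher_logits - center, tau_t))
    log_p_student = log_softmax(student_logits, tau_s)
    return float(-np.mean(np.sum(p_teacher * log_p_student, axis=-1)))

def mean_entropy(logits, temperature):
    """Average entropy of the softmax rows; lower means sharper."""
    log_p = log_softmax(logits, temperature)
    return float(-np.mean(np.sum(np.exp(log_p) * log_p, axis=-1)))
```

For any fixed logits, `mean_entropy(logits, 0.04)` is lower than `mean_entropy(logits, 0.1)`, which is the sharpening effect the asymmetry relies on.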

What this reveals: DINO implements the same constraint geometry as VICReg through temporal rather than spatial mechanisms. The momentum parameter $m = 0.996$ sets a timescale $\tau = 250$ that keeps coherence deviation $\mathrm{CD}(t)$ bounded. When $m$ is too small ($m = 0.99$, $\tau = 100$), the teacher changes too quickly and loses its anchoring function. The system crosses into the unstable regime where $\mathrm{CD}(t)$ grows faster than gradient descent can correct it.


The Contrastive Path: How SimCLR Tiles Representational Space

SimCLR abandons both explicit variance terms and momentum teachers. Instead, it uses contrastive learning with massive batches.

The contrastive loss for a positive pair $(i,j)$ is

$$\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)},$$

where $N$ is the batch size (so $2N$ total augmented views) and $\tau$ is the temperature parameter.

The denominator is critical. Each positive pair sees $2N-2$ negative samples. These negatives tile the representational space, creating repulsive forces that prevent collapse. When embeddings try to cluster at a single point, the contrastive loss pushes them apart to distinguish positives from negatives.
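
The loss above translates directly into NumPy. This is a sketch under two assumptions that the text does not fix: $\text{sim}$ is cosine similarity, and the two views of example $i$ are stored at rows $i$ and $i+N$. The function name `nt_xent` is illustrative.

```python
import numpy as np

def nt_xent(z, temperature=0.07):
    """Mean contrastive loss over all 2N views.

    z has shape (2N, d); rows i and i + N are assumed to be the two
    augmented views of the same example.
    """
    two_n = z.shape[0]
    n = two_n // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # the indicator 1[k != i]
    sim = sim - sim.max(axis=1, keepdims=True)         # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = np.concatenate([np.arange(n, two_n), np.arange(0, n)])
    return float(-log_prob[np.arange(two_n), positives].mean())
```

Each row's denominator contains one positive and $2N-2$ negatives, so well-aligned pairs drive the loss toward zero while unrelated pairs leave it near $\ln(2N-1)$.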

But how many negatives are needed? This is a geometric percolation problem. For representations to span a $d$-dimensional manifold, they must form a connected graph where nearby points can be distinguished.

The percolation threshold for random graphs is

$$p_c = \frac{1}{\langle k \rangle},$$

where $\langle k \rangle$ is the average degree. For SSL, the effective dimensionality is $d_{\text{eff}} = d\tau$ where $d$ is the embedding dimension and $\tau$ is the temperature parameter.

For SimCLR’s configuration ($d = 128$, $\tau = 0.07$),

$$d_{\text{eff}} = 128 \times 0.07 \approx 9.$$

The critical batch size for manifold percolation becomes

$$N_{\text{crit}} = \exp(d_{\text{eff}}/\rho^*) \approx \exp(9/3.29) \approx 15.$$
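
The estimate can be reproduced in a few lines. The sketch below just evaluates the formulas above with unrounded intermediate values; `critical_batch_size` is an illustrative name, and the final line compares the result with the batch size of 4096 quoted earlier.

```python
import math

RHO_STAR = math.pi * (3 + math.sqrt(5)) / 5   # about 3.29, as defined in the text

def critical_batch_size(d, tau, rho_star=RHO_STAR):
    """N_crit = exp(d_eff / rho*) with d_eff = d * tau."""
    d_eff = d * tau
    return math.exp(d_eff / rho_star)

n_crit = critical_batch_size(128, 0.07)       # a little over 15
implied_safety_factor = 4096 / n_crit         # ratio of empirical to critical size
```

Without rounding $d_{\text{eff}}$ to 9, `n_crit` lands between 15 and 16, so the rounded figure of 15 in the text holds.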

But this is the theoretical minimum. Real optimization dynamics, gradient noise, and finite sampling push the practical requirement much higher. The empirical finding: batch size 4096 provides sufficient negative samples to stabilize training.

The temperature parameter $\tau = 0.07$ controls how the contrastive geometry spreads representations,

$$\text{effective dimensions} \propto \frac{d}{\tau}.$$

Lower temperature sharpens the softmax, making the loss focus on the hardest negatives. This creates stronger repulsive forces but requires more negatives to cover the space. Higher temperature softens the distribution, reducing the need for negatives but providing weaker training signal.

What this reveals: SimCLR shapes $\kappa$ through negative sample density. The batch size defines $\mathcal{M}_{\text{allowed}}$ directly. With insufficient negatives, the constrained optimum $p^*$ sits too far above the ideal $F^*$, making $\kappa$ too large for gradient descent to reach. The requirement for 4096 samples marks the point where contrastive geometry achieves sufficient manifold coverage to keep $\kappa$ bounded.


The Predictive Path: How BYOL and JEPA Cross Recursive Thresholds

BYOL removes contrastive losses entirely. No negative samples. No massive batches. Yet it works.

The architecture adds a predictor network $q_\theta$ that maps the online network’s output toward the target network’s output,

$$\mathcal{L} = \|q_\theta(f_\theta(x)) - f_\xi(x')\|_2^2,$$

where $\theta$ are online network parameters, $\xi$ are target (momentum) network parameters, and $x, x'$ are different augmentations.

The predictor is the key. It adds a directional mapping that increases effective dimensionality,

$$d_{\text{eff}} = d_{\text{encoder}} + d_{\text{predictor}}.$$

This reduces $\kappa$ by expanding the constraint manifold. The system can now represent transformations between augmentations, not just the augmentations themselves. The predictor learns $\Delta z = z' - z$, capturing the structure of the augmentation space.

Combined with momentum ($m = 0.996$), this creates stability without negatives. The predictor shapes the manifold while momentum anchors coherence across time.

JEPA extends this to multi-step prediction,

$$\mathcal{L} = \sum_{t=1}^{T} \|s_\psi(s_\theta(x_t)) - s_\theta(x_{t+k})\|_2^2,$$

where $s_\theta$ is the encoder and $s_\psi$ is the predictor that forecasts $k$ steps ahead.
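
The indexing in this objective is easy to get wrong, so a small sketch may help. It assumes, purely for illustration, that both the encoder and the predictor are linear maps given as weight matrices, and that the input sequence has $T + k$ time steps so every target $x_{t+k}$ exists. The function name is hypothetical.

```python
import numpy as np

def k_step_prediction_loss(x, encoder_w, predictor_w, k):
    """Sum over t of ||s_psi(s_theta(x_t)) - s_theta(x_{t+k})||^2.

    x: sequence of shape (T + k, input_dim); encoder and predictor are
    assumed linear, which is a simplification of any real architecture.
    """
    s = x @ encoder_w               # encode every time step
    pred = s[:-k] @ predictor_w     # forecasts made from steps 1..T
    target = s[k:]                  # encodings k steps later
    return float(np.sum((pred - target) ** 2))
```

With a constant sequence and identity maps the loss is exactly zero, since every encoding already equals its $k$-step target.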

The remarkable finding: stability emerges at $k = 10$ steps. Shorter horizons ($k < 10$) lead to unstable training. Why ten?

From recursive closure theory, systems achieve stable self-modeling when they can represent their own dynamics over sufficient horizons. The mathematical manifestation appears in the Leibniz series for $\pi$,

$$\frac{\pi}{4} = \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots$$

The efficiency of the $N$-term partial sum is

$$\text{efficiency}(N) = \frac{\text{Leibniz}(N)}{N},$$

where $\text{Leibniz}(N)$ is the estimate of $\pi$ obtained from the first $N$ terms, that is, four times the partial sum. At $N = 10$, this reaches

$$\text{efficiency}(10) \approx 0.304 \approx \eta_c = \frac{1}{\rho^*}.$$
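
The quoted value can be reproduced numerically. The sketch below assumes that Leibniz(N) means the $N$-term estimate of $\pi$ (four times the partial sum), which is the reading that yields 0.304. The function names are illustrative.

```python
import math

def leibniz_pi(n_terms):
    """Estimate of pi from the first n_terms of the Leibniz series."""
    return 4.0 * sum((-1) ** n / (2 * n + 1) for n in range(n_terms))

def efficiency(n_terms):
    """Leibniz estimate divided by the number of terms used."""
    return leibniz_pi(n_terms) / n_terms

eta_c = 5 / (math.pi * (3 + math.sqrt(5)))    # 1 / rho*
gap_at_10 = abs(efficiency(10) - eta_c)
```

Under this reading `efficiency(10)` agrees with $\eta_c$ to about three decimal places, while `efficiency(9)` lies above $\eta_c$ and `efficiency(11)` below it.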

This is the recursive closure threshold—the point where the organizational overhead required to maintain coherent predictions crosses into the stable regime.

JEPA’s 10-step horizon isn’t chosen for semantic reasons (“10 steps is meaningful”). It’s geometric necessity. Below 10 steps, the system lacks sufficient recursive depth to form closed predictive loops. The representations can’t stabilize because they can’t model their own evolution over adequate horizons.

Transformers show the same threshold. Coherent reasoning emerges around depth $D = 10\text{--}12$ layers. Below this, the recursive operations can’t compose into stable self-modeling dynamics.

What this reveals: Predictive methods solve the constraint problem by expanding representational capacity (predictor networks) and ensuring recursive closure (sufficient prediction horizon). The decade threshold appears in JEPA’s 10-step prediction, transformers’ 10-12 layer emergence, and the Leibniz efficiency curve because they’re measuring the same geometric property: the minimum depth required for systems to achieve stable self-representation.


Where the Paths Converge

Four completely different approaches. VICReg explicitly regularizes variance and covariance. DINO uses momentum and centering. SimCLR tiles space with contrastive negatives. BYOL and JEPA predict forward through time. Yet they all solve the same equation,

$$F[q_t] - F^* = \kappa + \mathrm{CD}(t).$$

Each method controls different terms:

  • VICReg raises $\kappa$ through variance/covariance penalties to prevent collapse, but keeps $\kappa$ small enough that gradient descent can reach $p^*$. The weights $\sim 0.04 \approx \rho^*/100$ partition the organizational overhead to stay below $\eta_c$.

  • DINO controls $\mathrm{CD}(t)$ through momentum timescales. The teacher’s $\tau = 250$ step integration window bounds how far the student can drift from coherent representations. Centering maintains implicit variance floors.

  • SimCLR reduces $\kappa$ by ensuring the constraint manifold has sufficient geometric coverage. Batch size 4096 provides the negative sample density needed for manifold percolation in $d_{\text{eff}} \approx 9$ effective dimensions.

  • BYOL reduces $\kappa$ through predictor networks that expand representational capacity. Momentum controls $\mathrm{CD}(t)$.

  • JEPA achieves recursive closure at 10-step prediction horizon, crossing the threshold where $\eta < \eta_c$ becomes sustainable for self-modeling representations.

The constraint geometry forces all stable methods toward the same structure. The underlying mathematics is identical across implementations.

When systems must minimize free energy under architectural constraints, they can only succeed by

  • Keeping structural cost $\kappa$ bounded
  • Preventing coherence deviation $\mathrm{CD}(t)$ from growing too fast
  • Maintaining organizational overhead $\eta < \eta_c$

Every SSL method that works does exactly this, whether the designers knew it or not. The constraint geometry forced the convergence.


The Organizational Overhead and Collapse Dynamics

The convergence reveals a deeper constraint: organizational overhead must stay below critical thresholds.

From information physics, any system maintaining structure while processing information carries an organizational charge $\eta$: the fraction of capacity devoted to maintaining coherence rather than processing new information.

Physical systems show a consistent pattern:

  • Particles: $\eta \sim 10^{-6}$
  • Atoms: $\eta \sim 10^{-3}$
  • Molecules: $\eta \sim 10^{-2}$
  • Biological systems: $\eta \sim 10^{-1}$
  • Event horizons: $\eta = 1$

The progression follows a renormalization flow,

$$\beta(\eta) = -\eta(1-\eta)\left[\rho^* + \frac{d-2}{2}\ln\phi\right],$$

where $\phi = \frac{1+\sqrt{5}}{2}$ is the golden ratio and $\rho^* \approx 3.29$.

The critical point appears at

$$\eta_c = \frac{1}{\rho^*} \approx 0.304.$$

Systems operating below $\eta_c$ maintain coherence. Systems crossing this threshold collapse: dimensional reduction near black holes, seizure activity in neural circuits, representational collapse in SSL.

The collapse modes in SSL become interpretable:

  • Insufficient variance regularization ($\mathcal{L}_{\text{var}}$ too small): Dimensions die, concentrating overhead in fewer active dimensions. This increases $\eta$ until it crosses $\eta_c$ and the system collapses to a point.

  • Insufficient momentum ($m$ too small in DINO): The teacher changes too rapidly, losing its coherence-anchoring function. $\mathrm{CD}(t)$ grows faster than gradient descent can correct, and representations become unstable.

  • Insufficient negatives (batch size too small in SimCLR): The contrastive geometry fails to cover the manifold. $\kappa$ becomes too large: the constrained optimum sits too far from the ideal, and gradient descent can’t reach it.

  • Insufficient prediction horizon ($k < 10$ in JEPA): The system lacks recursive closure. It cannot form stable self-models, leading to $\eta > \eta_c$ in the prediction pathway.

These are the same failures viewed through different lenses. The system violates the constraint geometry and $\eta$ crosses $\eta_c$ in each case.

What Collapse Actually Looks Like

Abstract mathematics becomes concrete when you watch representations die. The constraint geometry predicts both that systems collapse and how they collapse—the temporal signatures, spectral patterns, and geometric transformations that mark the transition from stable learning to catastrophic failure.

Dimensional Death Cascade (Insufficient Variance Control)

Training proceeds normally for 100-200 steps. Loss decreases smoothly. Then the first dimension dies—its variance drops below the noise floor. Within 10-20 steps, a second dimension collapses. Then a third. The cascade accelerates exponentially:

  • Visual signature: Eigenvalue spectrum develops a sharp cliff. The largest eigenvalue grows while smaller eigenvalues collapse toward zero. Plot the eigenvalue ratio $\lambda_1 / \lambda_{10}$; healthy training keeps this $< 10$. In dimensional death cascade, it crosses 100 within 50 steps, then 1000 within another 50 steps.

  • Loss signature: Training loss continues decreasing (the model can still fit the data with fewer dimensions), but validation metrics diverge. The gap between training and validation loss grows super-linearly. Downstream task performance drops 10-20% even as SSL loss improves.

  • Representation signature: Embeddings collapse toward a lower-dimensional subspace. Computing the participation ratio $PR = (\sum \lambda_i)^2 / \sum \lambda_i^2$ shows the effective dimensionality. Healthy SSL maintains $PR \geq 0.5 \times d$ where $d$ is nominal dimension. Dimensional death cascade shows $PR$ dropping below $0.1 \times d$ in 100-200 steps.

  • Recovery: Impossible past 50% dimensional loss. Early intervention (steps 1-30 of cascade) can rescue training by increasing variance regularization 2-5×. Late intervention (steps 50+) requires restart from earlier checkpoint.
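
Both spectral signatures can be computed from a batch of embeddings. The sketch below assumes the eigenvalues $\lambda_i$ are those of the embedding covariance matrix, and that the embedding has at least 10 dimensions. The function name and the small floor on $\lambda_{10}$ are illustrative choices.

```python
import numpy as np

def spectrum_diagnostics(z):
    """Return (lambda_1 / lambda_10, participation ratio) for embeddings z.

    z: array of shape (n_samples, d) with d >= 10.
    """
    cov = np.cov(z, rowvar=False)
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending eigenvalues
    eig = np.clip(eig, 0.0, None)                  # remove tiny negative round-off
    ratio = eig[0] / max(eig[9], 1e-12)            # floor avoids division by zero
    pr = eig.sum() ** 2 / np.sum(eig ** 2)
    return float(ratio), float(pr)
```

On isotropic Gaussian embeddings the ratio stays near 1 and $PR$ stays close to $d$; shrinking all but two dimensions pushes the ratio past 1000 and $PR$ toward 2.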

This cascade pattern reveals the fragility of unconstrained dimensional compression. When variance regularization fails to maintain the geometric floor, the system enters a runaway collapse where each lost dimension accelerates the death of remaining dimensions. The next failure mode shows a different geometric pathology—not dimensional death but temporal instability.

Teacher-Student Oscillation (Insufficient Momentum)

Training shows characteristic periodic instability. Representations swing between over-fitting recent batches and over-smoothing historical information. The period matches $2\tau$ where $\tau = 1/(1-m)$ is the momentum timescale:

  • Visual signature: Plot cosine similarity between teacher and student embeddings over time. Healthy training shows slow drift (linear increase from 0.7 to 0.9 over thousands of steps). Oscillation mode shows periodic swings with amplitude 0.1-0.2 and period 100-200 steps when $m = 0.99$.

  • Loss signature: Training loss oscillates with the same period as teacher-student similarity. Each cycle: loss decreases for $\tau$ steps, then suddenly jumps 10-30%, then decreases again. The oscillation amplitude grows over time: early training shows 5% swings, late training shows 30%+ swings.

  • Representation signature: PCA of embeddings over time reveals periodic geometric rotation. The first two principal components trace elliptical paths rather than staying fixed. The ellipse expands over time as oscillation amplitude grows.

  • Recovery: Increase momentum from $m = 0.99$ to $m = 0.996$ or higher. This increases $\tau$ from 100 to 250 steps, slowing teacher evolution and damping oscillations. Recovery is possible at any point but requires 2-3× the oscillation period to stabilize.

Unlike dimensional collapse which eliminates information capacity permanently, oscillation failure preserves capacity but prevents stable convergence. The system has enough dimensions but lacks the temporal anchoring to settle into coherent geometry. The next mode combines both pathologies—preserved but fragmented structure.

Manifold Fragmentation (Insufficient Negatives)

Representations form disconnected clusters rather than a continuous manifold. The number of clusters scales as $\sqrt{B}$ where $B$ is batch size. For batch size 256, expect 16 clusters. For batch size 64, expect 8 clusters:

  • Visual signature: t-SNE or UMAP visualization shows distinct islands rather than continuous structure. Clustering coefficient $C = (\text{triangles}) / (\text{connected triples})$ quantifies fragmentation. Healthy contrastive learning shows $C \geq 0.3$. Fragmentation shows $C < 0.1$.

  • Loss signature: Training loss plateaus prematurely. The contrastive objective can’t push clusters apart further without more negative samples. Loss reaches 1.5-2.0 and stops improving, while healthy training would reach 0.5-1.0.

  • Representation signature: Within-cluster similarity is very high (0.9+) but between-cluster similarity is near zero (< 0.1). This bimodal similarity distribution indicates disconnected manifold components. Healthy training shows unimodal distribution centered around 0.3-0.5.

  • Recovery: Increase batch size 4× (doubling batch size only improves by $\sqrt{2}$). Alternatively, use momentum queues (MoCo-style) to increase effective negatives without memory constraints. Or switch to non-contrastive methods (VICReg, DINO) that don’t require explicit negatives.
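
A clustering coefficient can be computed from a thresholded cosine-similarity graph. The sketch below is one possible reading, not a fixed recipe: it uses the standard transitivity normalization, $3 \times \text{triangles} / \text{connected triples}$, so a fully connected group scores 1. That differs by a constant factor of 3 from the ratio as literally written above, so thresholds would need rescaling accordingly. The 0.5 similarity cutoff and the function name are assumptions.

```python
import numpy as np

def clustering_coefficient(z, threshold=0.5):
    """Global clustering coefficient of the thresholded similarity graph.

    z: embeddings of shape (n, d). An edge joins two points whose cosine
    similarity exceeds `threshold`.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    adj = (z @ z.T > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                    # no self-loops
    degree = adj.sum(axis=1)
    closed = np.trace(adj @ adj @ adj)            # 6 x number of triangles
    triples = np.sum(degree * (degree - 1))       # 2 x number of connected triples
    return float(closed / triples) if triples > 0 else 0.0
```

Three mutually similar points form a triangle and give 1.0, while a chain of three points with no closing edge gives 0.0.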

Fragmentation represents a subtle failure—local structure remains coherent but global connectivity breaks. Each cluster learns valid representations, but the manifold loses its ability to interpolate between distant regions. The final failure mode operates at yet another scale—not spatial but temporal frequency.

High-Frequency Jitter (Insufficient Prediction Horizon)

Representations oscillate at characteristic frequency $f \approx 1/k$ where $k$ is prediction steps. For 5-step prediction, expect oscillations every 5 steps. For 3-step prediction, every 3 steps:

  • Visual signature: Fourier transform of embedding norms over time shows sharp peak at $f = 1/k$. Healthy long-horizon prediction shows flat spectrum. Short-horizon prediction shows spectral peak 10-100× above noise floor.

  • Loss signature: Prediction loss oscillates with period $k$. Plot moving average with window size $k$; healthy training shows smooth decrease. Insufficient horizon shows oscillation amplitude comparable to the trend (signal-to-noise ratio near 1).

  • Representation signature: Consecutive checkpoints (saved every $k$ steps) show high variability. Computing $\Delta d(t) = \|q_t - q_{t-k}\|$ for consecutive checkpoints gives high values (> 0.5 relative distance). Healthy training shows $\Delta d < 0.2$.

  • Recovery: Increase prediction horizon to $k \geq 10$ steps. This crosses the recursive closure threshold where organizational overhead drops below $\eta_c$. The high-frequency oscillations disappear within 50-100 steps of adjustment.
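
The spectral check can be done with a real FFT over a logged scalar such as the mean embedding norm per step. This is a sketch under assumptions: one sample per training step, the median power used as a stand-in for the noise floor, and an illustrative function name.

```python
import numpy as np

def dominant_period(signal):
    """Return (period in steps, peak power / median power), ignoring DC.

    signal: 1-D sequence with one value per training step.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                          # drop the constant offset
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0)    # cycles per step
    peak = int(np.argmax(power[1:])) + 1      # skip the zero-frequency bin
    return float(1.0 / freqs[peak]), float(power[peak] / np.median(power[1:]))
```

A returned period close to the prediction horizon $k$, together with a large peak-to-median ratio, is the jitter signature described above.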

These four failure modes—dimensional collapse, temporal oscillation, spatial fragmentation, and frequency jitter—exhaust the ways constraint geometry can break. Each corresponds to violating a different geometric requirement: variance floors, momentum anchoring, manifold connectivity, or recursive closure. The diagnostic protocol below provides systematic tools to detect which constraint is failing.

Common Diagnostic Protocol

When SSL training shows instability, run this diagnostic sequence:

  1. Eigenvalue spectrum: Plot sorted eigenvalues $\lambda_i$. Check if $\lambda_1/\lambda_{10} > 50$ (dimensional collapse) or if spectrum shows exponential decay $\lambda_i \sim e^{-i/\tau}$ with $\tau < 5$ (too much compression).

  2. Teacher-student similarity: For momentum methods, plot cosine similarity over training. Check for oscillations with period $\sim 100\text{--}200$ steps (momentum too low) or drift $> 0.05$ per 100 steps (momentum too high or learning rate too high).

  3. Participation ratio: Compute $PR = (\sum \lambda_i)^2 / \sum \lambda_i^2$ every 50 steps. If $PR$ drops below $0.3 \times d$, variance regularization is insufficient.

  4. Clustering coefficient: Sample 1000 embeddings, compute pairwise similarities, threshold at 0.5, and calculate clustering coefficient. If $C < 0.15$, increase batch size or switch methods.

  5. Spectral frequency analysis: FFT of embedding norms or losses. Sharp peaks indicate characteristic timescales. Match peak frequency to method parameters (momentum timescale, prediction horizon, batch processing period).

This diagnostic protocol translates abstract geometric constraints into concrete monitoring tools. The mathematics predicts the signatures. The signatures predict the failures. The failures guide the interventions.


Design Principles That Follow

The convergence enables principled design rather than empirical search.

For Variance-Based Methods (VICReg-style)

Target effective weights near $\rho^*/100 \approx 0.033$ for variance and covariance terms. This balances collapse prevention against training rigidity. After accounting for batch normalization and scaling, aim for regularization strengths in the range

$$\lambda_{\text{eff}}, \mu_{\text{eff}} \in [0.025, 0.050].$$

This range provides sufficient collapse prevention without making training rigid.

For Momentum Methods (DINO-style)

Set momentum to achieve integration timescales around

$$\tau = \frac{1}{1-m} \approx 250 \text{ steps}.$$

For $m \in [0.996, 0.9995]$, this provides sufficient temporal anchoring. Too small ($m < 0.99$, $\tau < 100$), and the teacher loses coherence. Too large ($m > 0.9995$, $\tau > 2000$), and adaptation becomes too slow. This timescale balance maintains coherence without sacrificing adaptability.

For Contrastive Methods (SimCLR-style)

Calculate required batch size from effective dimensionality,

$$N_{\text{min}} = \exp\left(\frac{d \tau}{\rho^*}\right) \times \text{safety factor},$$

where safety factors of 10-100× account for optimization dynamics. For $d = 128$, $\tau = 0.07$ and the upper safety factor of 100,

$$N_{\text{min}} \approx 15 \times 100 = 1500.$$

SimCLR’s batch size of 4096 sits above this estimate.

These batch sizes ensure sufficient negative sample density for stable manifold coverage.

For Predictive Methods (JEPA-style)

Ensure prediction horizon $k \geq 10$ steps for recursive closure. Shorter horizons fail to stabilize. Use predictor networks to expand effective dimensionality,

$$d_{\text{eff}} = d_{\text{encoder}} + d_{\text{predictor}}.$$

Combine with momentum ($m \approx 0.996$) to anchor coherence. This dual mechanism expands capacity while maintaining stability.

Monitoring System Health

Track the alignment between current representations and recent history,

$$\Delta d(t) = \|q_t - q_{t-\tau}\|,$$

where $\tau$ is the relevant timescale (e.g., momentum integration window). Rising $\Delta d$ indicates increasing $\mathrm{CD}(t)$: the system is drifting from coherent structure.

Monitor effective dimensionality through the eigenspectrum of the covariance matrix. Rapid eigenvalue decay signals dimensional collapse, which can be summarized as

$$\eta_{\text{eff}} = 1 - \frac{\text{effective rank}}{\text{total dimensions}}.$$

When $\eta_{\text{eff}}$ approaches $0.3$, the system nears the collapse threshold. These monitoring tools provide early warning signals before catastrophic failure.
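
Both monitors are cheap to compute during training. The sketch below is one way to do it, with an important assumption: "effective rank" is not defined in the text, so the participation ratio of the covariance eigenvalues is used in its place. Function names are illustrative.

```python
import numpy as np

def eta_eff(z):
    """1 - effective_rank / d for embeddings z of shape (n_samples, d).

    Effective rank is taken to be the participation ratio of the
    covariance eigenvalues (an assumed definition).
    """
    cov = np.cov(z, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    effective_rank = eig.sum() ** 2 / np.sum(eig ** 2)
    return float(1.0 - effective_rank / z.shape[1])

def drift(q_now, q_past):
    """Delta d(t): Euclidean distance between current and lagged representations."""
    return float(np.linalg.norm(q_now - q_past))
```

Logging `eta_eff` every few hundred steps and comparing it with the 0.3 threshold gives the early-warning signal described above; `drift` can be evaluated on checkpoints spaced $\tau$ steps apart.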


Why This Matters for Representation Learning

The constraint geometry framework transforms SSL from empirical art to principled engineering.

New methods don’t require exhaustive hyperparameter search. Start from the constraint equations. Choose how to shape $\kappa$ (variance terms, momentum, negatives, prediction). Ensure $\mathrm{CD}(t)$ dynamics stay bounded. Keep $\eta < \eta_c$. The “magic numbers” follow from the geometry.

Failure modes become diagnosable. Representations collapse? Check if $\eta_{\text{eff}} > 0.3$. Training unstable? Measure $\Delta d(t)$ to quantify coherence drift. Insufficient performance? Calculate if $\kappa$ is too large given your architectural constraints.

Cross-domain insights become possible. The same constraint geometry appears in biological neural networks (synaptic homeostasis maintaining variance floors), physical systems (black hole dimensional reduction when $\eta \to 1$), and engineered systems (transformer emergence at depth 10-12). The mathematics connects domains that seemed unrelated.

Architectural choices gain theoretical grounding. Why do transformers need 10+ layers for reasoning? Recursive closure. Why does momentum ~0.996 work across so many methods? Timescale separation at $\tau \approx 250$. Why do contrastive methods need huge batches? Manifold percolation in $d_{\text{eff}}$ dimensions.

The convergence of VICReg, DINO, SimCLR, BYOL, and JEPA wasn’t historical accident. The constraint geometry forced it. The methods work because they obey the mathematics of coherence maintenance under representational constraints—whether the designers knew it or not.

When independent approaches built from completely different intuitions all arrive at the same “magic numbers,” they reveal structure that was there all along, waiting to be recognized. The geometry shaped the methods, not the other way around.


What the Constraint Geometry Predicts

If the convergence reveals genuine geometric necessity rather than historical accident, the constraint equations should make falsifiable predictions about systems not yet built, architectures not yet tested, and failure modes not yet encountered.

Architectural Predictions

Vision Transformers will require depth $D \in [10, 12]$ for stable reasoning. The recursive closure threshold appears at $N = 10$ in the Leibniz efficiency curve. Below this depth, transformers can’t form stable self-modeling representations. Above depth 12, additional layers provide diminishing returns because $\eta$ approaches $\eta_c$ and organizational overhead dominates. Published architectures are consistent with this: GPT-2 small (12 layers), BERT base (12 layers), and ViT base (12 layers). The decade threshold repeats because it measures the minimum recursive depth for coherent abstraction.

Contrastive methods will plateau at batch sizes $N \approx 2^{12} = 4096$. For typical SSL configurations ($d = 128$, $\tau = 0.07$), effective dimensionality $d_{\text{eff}} = d\tau \approx 9$ requires $N_{\text{crit}} = \exp(9/3.29) \approx 15$ for manifold percolation. Safety factors of 100-300× push this to 1500-4500. SimCLR found 4096. CLIP trained with a batch size of 32,768 but gained only marginal improvements above 8192. MoCo used momentum queues, effectively expanding batch size to 65,536, but showed diminishing returns. The constraint geometry predicts the ceiling.

New SSL methods will converge on effective regularization $\eta_{\text{eff}} \approx 0.03\text{--}0.04$ regardless of implementation. Whether through explicit variance terms (VICReg), implicit centering (DINO), negative sample density (SimCLR), or predictor networks (BYOL), any stable method must maintain organizational overhead near $\rho^*/100 \approx 0.033$. Methods with $\eta_{\text{eff}} < 0.02$ will underfit (insufficient constraint). Methods with $\eta_{\text{eff}} > 0.05$ will be too rigid (excessive constraint). This 0.03-0.04 band is geometric necessity.

Multimodal models will require cross-modal alignment losses scaled by $1/\sqrt{M}$, where $M$ is the number of modalities. Each modality adds dimensional constraints. CLIP (2 modalities: vision + language) needs less down-scaling of its alignment loss than ImageBind (6 modalities: image, text, audio, depth, thermal, IMU). The $1/\sqrt{M}$ scaling keeps total organizational overhead $\eta$ bounded as modalities increase. Without this scaling, $\eta$ grows linearly with $M$ and crosses $\eta_c$ around $M = 3\text{--}4$ modalities, causing collapse.
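As a worked illustration of the proposed scaling, using only the modality counts named above, the alignment weight relative to an unscaled loss would be:

$$\frac{1}{\sqrt{2}} \approx 0.71 \quad (M = 2), \qquad \frac{1}{\sqrt{6}} \approx 0.41 \quad (M = 6), \qquad \frac{1/\sqrt{2}}{1/\sqrt{6}} = \sqrt{3} \approx 1.73.$$

Under this rule a six-modality model would weight each alignment term about 1.7 times more weakly than a two-modality model. This is arithmetic on the stated formula, not a measured setting from either system.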

Training Dynamics Predictions

Learning rate warm-up duration must match momentum timescales. For momentum $m = 0.996$ giving $\tau = 250$ steps, warm-up should span $\sim 250\text{--}500$ steps. Shorter warm-up shocks the teacher-student dynamics before temporal anchoring is established. Longer warm-up wastes compute in suboptimal regions. Empirically, DINO uses 10 epochs of warm-up on ImageNet (5000 steps), and SimCLR uses 10% of training as warm-up; both align with $\tau$ timescales.
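One way to tie the warm-up length to the momentum timescale is sketched below. It assumes $\tau = 1/(1 - m)$ and a linear ramp; the `multiple` parameter (1 to 2, matching the 250 to 500 step range above) and both function names are illustrative.

```python
def warmup_steps_for(momentum: float, multiple: float = 1.0) -> int:
    """Warm-up length as a multiple of the EMA timescale tau = 1 / (1 - momentum)."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return round(multiple / (1.0 - momentum))


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear ramp from base_lr / warmup_steps up to base_lr, then constant."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


steps = warmup_steps_for(0.996, multiple=2.0)
for step in (0, steps // 2, steps - 1, steps + 100):
    print(step, linear_warmup_lr(step, base_lr=1e-3, warmup_steps=steps))
```

The ramp here is followed by a constant rate for simplicity; a real schedule would typically hand off to cosine or step decay after warm-up.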

Representational collapse will occur when the eigenvalue spectrum develops a power-law tail with exponent $\alpha > 2$. Healthy representations show eigenvalue decay $\lambda_k \sim k^{-\alpha}$ with $\alpha \in [1.5, 2]$. When $\alpha > 2$, effective dimensionality drops rapidly and $\eta_{\text{eff}}$ approaches 0.3. This provides early warning 50-100 steps before visible collapse in loss curves.
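The exponent $\alpha$ can be estimated with an ordinary least-squares fit of $\log \lambda_k$ against $\log k$. The sketch below is one simple estimator under that assumption; it fits the whole spectrum rather than only the tail, and `power_law_exponent` is an illustrative name.

```python
import math
from typing import Sequence


def power_law_exponent(eigenvalues: Sequence[float]) -> float:
    """Estimate alpha in lambda_k ~ k^(-alpha) from a log-log least-squares fit."""
    ranked = sorted(eigenvalues, reverse=True)
    points = [(math.log(k), math.log(ev))
              for k, ev in enumerate(ranked, start=1) if ev > 0.0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two positive eigenvalues")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # slope is -alpha, so negate it


# Synthetic spectrum with a known exponent, to check the estimator.
spectrum = [k ** -1.8 for k in range(1, 65)]
alpha = power_law_exponent(spectrum)
print(f"estimated alpha = {alpha:.3f}, above collapse threshold: {alpha > 2.0}")
```

On real covariance spectra the leading and trailing eigenvalues often deviate from a pure power law, so restricting the fit to a middle range of $k$ may give a steadier estimate.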

Optimal checkpoint selection occurs when $\Delta d(t)$ reaches a local minimum. Tracking representation drift $\Delta d(t) = \|q_t - q_{t-\tau}\|$ over the momentum timescale $\tau$ reveals when the system settles into coherent basins. Local minima in $\Delta d(t)$ correspond to stable geometric configurations, which makes them better checkpoint candidates than loss-based selection.
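Given a logged drift series, candidate checkpoints can be picked out with a simple local-minimum scan. This sketch assumes one drift value per saved checkpoint and applies no smoothing; `local_minima` is an illustrative helper.

```python
from typing import List, Sequence


def local_minima(series: Sequence[float]) -> List[int]:
    """Indices i where series[i] is below its left neighbour and not above its right one."""
    return [i for i in range(1, len(series) - 1)
            if series[i] < series[i - 1] and series[i] <= series[i + 1]]


drift = [5.0, 3.0, 4.0, 2.0, 2.0, 6.0]
print("candidate checkpoint indices:", local_minima(drift))
```

Noisy drift curves would produce many spurious minima, so a moving average over the series before the scan is a reasonable refinement.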

Failure Mode Predictions

Methods violating $\eta < \eta_c$ will collapse in predictable patterns:

  • Insufficient variance control ($\mathcal{L}_{\text{var}}$ too weak): Dimensional death cascade where dimensions collapse sequentially rather than simultaneously. First dimension dies → overhead concentrates in remaining dimensions → second dimension dies → cascade accelerates. Time to full collapse: $T_{\text{collapse}} \approx d \cdot \tau$, where $d$ is the embedding dimension and $\tau$ is the adaptation timescale.

  • Insufficient momentum ($m < 0.99$): Teacher-student oscillations with period $\sim 2\tau$. Representations swing between over-fitting to recent batches and over-averaging historical information. Coherence deviation $\mathrm{CD}(t)$ grows as $\sqrt{t}$ rather than staying bounded.

  • Insufficient negatives (batch size $< N_{\text{crit}}$): Manifold fragmentation where representations form disconnected clusters. Number of clusters scales as $1/\sqrt{B}$, where $B$ is batch size. Doubling batch size reduces fragmentation by a factor of $\sqrt{2}$, explaining why improvements are sublinear in batch size.

  • Insufficient prediction horizon ($k < 10$ in JEPA-style methods): Prediction instability with characteristic frequency $f \approx 1/k$. Short horizons can’t filter noise at timescales longer than the prediction window. The system oscillates at frequencies just above $1/k$, creating high-frequency jitter in learned representations.
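To give the first two patterns a sense of scale, here is a worked example with illustrative inputs: $d = 128$ and $\tau = 250$ for the cascade, and $m = 0.98$ for the oscillation, assuming $\tau = 1/(1 - m)$ as the adaptation timescale.

$$T_{\text{collapse}} \approx d \cdot \tau = 128 \times 250 = 32{,}000 \text{ steps}, \qquad 2\tau = \frac{2}{1 - 0.98} = 100 \text{ steps}.$$

On these assumptions a cascade would take tens of thousands of steps to complete, while momentum-driven oscillations would show up within a few hundred.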

Cross-Domain Predictions

Biological neural networks should show $\eta \approx 0.1$ at circuit level. Synaptic homeostasis mechanisms (scaling, metaplasticity) function as variance regularization. The constraint geometry predicts biological systems operate closer to $\eta_c$ than artificial systems (0.1 vs 0.03) because they face stronger computational constraints. Measurements of metabolic overhead in cortical circuits show 10-15% of neural activity devoted to homeostatic regulation, matching the predicted $\eta \approx 0.1$.

Emergent abilities in language models will appear at depth $D \geq 10$ and at a model scale where organizational overhead per parameter drops below $\eta_c$. Smaller models can’t achieve $\eta < 0.3$ because parameter-sharing forces higher overhead. The sharp “emergence” represents geometric threshold crossing. Chain-of-thought reasoning emerged in models ≥ 10B parameters with depth ≥ 12 layers precisely because this configuration first achieves sustained $\eta < \eta_c$.

Information bottleneck methods will find optimal compression ratio $\beta \approx 3.29 = \rho^*$. The mutual information objective $I(X;Z) - \beta I(Z;Y)$ balances compression against prediction. The constraint geometry predicts $\beta \approx \rho^*$ because this ratio maintains coherence at the edge of the constraint manifold. Empirical studies finding $\beta \in [3, 4]$ for optimal generalization align with the geometric prediction.

These predictions are falsifiable. If constraint geometry genuinely governs SSL, these patterns should appear across architectures, datasets, and training regimes with 10-20% precision. If the convergence was historical accident or domain-specific, these predictions will fail. The geometry requires these outcomes.


What the Convergence Reveals

Five methods. Five different intuitions. Variance regularization. Momentum teachers. Contrastive negatives. Predictive networks. Multi-step horizons. Yet they all converge on:

  • Effective regularization strengths $\sim 0.03\text{--}0.04 \approx \rho^*/100$
  • Momentum parameters $\sim 0.996$ giving timescale $\tau \approx 250$
  • Batch sizes scaling as $\exp(d_{\text{eff}}/\rho^*)$
  • Prediction horizons $\geq 10$ steps
  • Organizational overhead $\eta < 0.304$

The pattern repeats across architectures, datasets, and training regimes with geometric requirements precise within 10-20%. This precision emerges from constraint geometry, not curve fitting.

The deeper insight: self-supervised learning works by discovering representational manifolds that minimize free energy while respecting architectural constraints. The constraint geometry determines what manifolds are possible. Systems that violate the geometry collapse. Systems that respect it discover stable, coherent representations.

This is what the convergence shows. When you push representational systems to learn from unlabeled data while respecting finite capacity, finite compute, and finite depth, the mathematics forces them toward specific solutions. The constraint geometry requires these solutions.

The methods converged because the geometry of representation space doesn’t allow anything else.