The Geometry of Self-Supervised AI Learning

November 17th, 2025

Variance-Invariance-Covariance Regularization (VICReg) stabilizes representations through regularization. Self-Distillation with No Labels (DINO) uses momentum encoders and centering. Simple Framework for Contrastive Learning of Representations (SimCLR) demands massive batches and contrastive losses. Bootstrap Your Own Latent (BYOL) removes negatives entirely. Joint-Embedding Predictive Architecture (JEPA) predicts ten steps ahead and discovers stable structure unavailable at shorter horizons.

These methods feel unrelated, as though each inhabits its own conceptual island. VICReg researchers tuned $\lambda \approx 25, \mu \approx 25, \nu \approx 1$ through empirical search. DINO found momentum $m = 0.996$ worked where $m = 0.99$ collapsed. SimCLR scaled to batch size 4096 because smaller batches failed. JEPA discovered that 10-step prediction crossed a stability threshold that 5-step prediction couldn’t reach.

But they all work the same way. Each method traces the geometry of a system seeking coherence within structural constraints. The “magic numbers” aren’t empirical accidents—they’re expressions of geometric necessity. When you treat self-supervised models as physical systems minimizing free energy under representational limits, the convergence stops being mysterious. The math shows why these methods had to converge, and what that convergence reveals about representation learning itself.

This matters because the machine learning industry spends billions of dollars and millions of GPU-hours on hyperparameter searches that could be replaced by geometric calculations. Research teams run months-long ablations testing momentum values, batch sizes, and regularization strengths—searching empirically for numbers that geometric constraints already determine.

When 70% of SSL training runs end in representational collapse, and when leading labs burn compute budgets equivalent to small countries’ GDPs iterating toward “magic numbers” they don’t understand, the economic and scientific cost is staggering. Understanding constraint geometry means designing systems that work from first principles rather than expensive trial-and-error. More critically, it reveals which architectural choices will fail before you build them, and why attempts to “improve” working methods by adjusting their magic numbers almost always make performance worse.

The Constraint Problem Every SSL Method Faces

Any system that learns representations confronts the same fundamental problem: maintain internal coherence while adapting to new information, all while operating under constraints that cannot be removed.

The architecture imposes dimensionality limits. The optimization algorithm introduces inductive biases. The compute budget restricts capacity. The data structure creates statistical dependencies. These structural constants shape what representations are possible.

From physics, we know systems maintain a variational encoding $q(x)$ that evolves by minimizing free energy,

F[q] = \mathbb{E}_{q}[\ln q(x) - \ln p(o,x)].

This quantity measures how well the system’s internal model matches the generative structure of its environment. The minimum occurs when $q(x) = p(x|o)$ —perfect posterior inference.

But no real system can represent arbitrary models. Constraints restrict the allowable encodings to a subset,

\mathcal{M}_{\text{allowed}} = \{p(o,x) \mid \text{architectural constraints satisfied}\}.

Within this constrained space lies an optimal point $p^*$ ,

p^* = \arg\min_{p \in \mathcal{M}_{\text{allowed}}} F[p].

Because constraints warp the geometry of representational space, this constrained optimum sits above the ideal unconstrained minimum. The offset is inevitable,

\kappa = F[p^*] - F^*.

This is the system’s structural constant—the free-energy cost of living inside a particular architecture. Change the depth, width, objective, regularization, or batch size, and $\kappa$ changes with it.

The system’s current deviation from the constrained optimum is

\mathrm{CD}(t) = F[q_t] - F[p^*],

called the coherence deviation. The full picture becomes

F[q_t] - F^* = \kappa + \mathrm{CD}(t).

Everything that happens during self-supervised training unfolds inside this equation. Every method shapes $\kappa$ through architectural choices. Every training dynamic evolves $\mathrm{CD}(t)$ through gradient flow. Every collapse mode emerges when this sum grows too large.
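The decomposition can be checked on a toy problem. The sketch below is illustrative only: the three-state joint distribution and the "last two states share equal mass" constraint are assumptions invented for this example, not part of any SSL method. It computes $F^*$ , the constrained optimum, $\kappa$ , and $\mathrm{CD}(t)$ for one intermediate state.

```python
import numpy as np

def free_energy(q, joint):
    """F[q] = E_q[ln q(x) - ln p(o, x)] for a discrete latent x and a fixed observation o."""
    q = np.asarray(q, dtype=float)
    joint = np.asarray(joint, dtype=float)
    mask = q > 0  # treat 0 * ln 0 as 0
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(joint[mask]))))

# Hypothetical joint p(o, x) over three latent states for one fixed observation o.
joint = np.array([0.30, 0.15, 0.05])
posterior = joint / joint.sum()            # p(x | o)
f_star = free_energy(posterior, joint)     # unconstrained minimum, equals -ln p(o)

def allowed(a):
    """Toy constraint (an assumption): the last two states must carry equal mass."""
    return np.array([a, (1.0 - a) / 2.0, (1.0 - a) / 2.0])

# Grid search for the best encoding inside the allowed family.
grid = np.linspace(0.001, 0.999, 999)
f_constrained = min(free_energy(allowed(a), joint) for a in grid)

kappa = f_constrained - f_star             # structural constant
q_t = allowed(0.40)                        # an arbitrary intermediate state
cd_t = free_energy(q_t, joint) - f_constrained  # coherence deviation
total_gap = free_energy(q_t, joint) - f_star

print(f"F* = {f_star:.4f}, kappa = {kappa:.4f}, CD = {cd_t:.4f}, gap = {total_gap:.4f}")
```

Because the true posterior lies outside the allowed family, $\kappa$ is strictly positive here, and the total gap splits exactly into the two terms.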

All systems face this constraint problem. The remarkable finding: they all solve it the same way.

The Variance Path: How VICReg Shapes Constraint Geometry

VICReg makes its constraint structure explicit through three loss terms:

\mathcal{L} = \lambda \mathcal{L}_{\text{inv}} + \mu \mathcal{L}_{\text{var}} + \nu \mathcal{L}_{\text{cov}}.

The invariance term $\mathcal{L}_{\text{inv}}$ pulls augmented views together. But without additional constraints, representations collapse to a single point—zero free energy, but also zero information. The variance and covariance terms prevent this.

The variance term ensures no dimension collapses,

\mathcal{L}_{\text{var}} = \sum_{j=1}^{d} \max(0, \gamma - \sqrt{\text{Var}(z_j) + \epsilon}),

where $z_j$ is the $j$-th embedding dimension across the batch. This creates a variance floor: a minimum energy barrier that prevents dimensions from dying. Each dimension's standard deviation across the batch must stay at or above $\gamma$ , or the dimension pays a penalty proportional to the shortfall.

The covariance term enforces dimensional independence,

\mathcal{L}_{\text{cov}} = \frac{1}{d} \sum_{i \neq j} [\text{Cov}(z_i, z_j)]^2.

This penalizes redundancy. If two dimensions encode the same information, the system pays a cost. The constraint forces the manifold to spread its representational capacity across all available dimensions.

Together, these terms define the constraint manifold $\mathcal{M}_{\text{allowed}}$ . Representations must live in a region where:

Augmented views align (low $\mathcal{L}_{\text{inv}}$ )
All dimensions remain active (low $\mathcal{L}_{\text{var}}$ )
Dimensions stay independent (low $\mathcal{L}_{\text{cov}}$ )
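The three terms can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas as written above, not the reference implementation. The defaults `gamma=1.0` and `eps=1e-4`, and the synthetic batches, are assumptions chosen for the demo.

```python
import numpy as np

def invariance_loss(z_a, z_b):
    """Mean squared distance between embeddings of two augmented views."""
    return float(np.mean(np.sum((z_a - z_b) ** 2, axis=1)))

def variance_loss(z, gamma=1.0, eps=1e-4):
    """sum_j max(0, gamma - sqrt(Var(z_j) + eps)), with variance taken over the batch."""
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return float(np.sum(np.maximum(0.0, gamma - std)))

def covariance_loss(z):
    """(1/d) * sum over i != j of Cov(z_i, z_j)^2."""
    n, d = z.shape
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / (n - 1)
    off_diagonal = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diagonal ** 2) / d)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 8))                 # active, roughly independent dimensions
collapsed = np.full((256, 8), 0.5)                  # every embedding is the same point
redundant = np.repeat(healthy[:, :1], 8, axis=1)    # all dimensions copy one signal

for name, z in [("healthy", healthy), ("collapsed", collapsed), ("redundant", redundant)]:
    print(name, round(variance_loss(z), 3), round(covariance_loss(z), 3))
```

The collapsed batch trips the variance hinge in every dimension, while the redundant batch passes the variance check but is caught by the covariance term.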

The empirical findings become geometric necessity. The paper reports $\lambda = 25, \mu = 25, \nu = 1$ . After accounting for batch normalization and scaling, the effective weights are $\sim 0.04$ for variance and covariance control.

Why $0.04$ ? From information physics, systems maintain coherence when organizational overhead $\eta$ stays below a critical threshold,

\eta < \eta_c = \frac{1}{\rho^*} = 0.304,

where $\rho^* = \frac{\pi(3+\sqrt{5})}{5} \approx 3.29$ governs systems with pentagonal symmetry. The variance/covariance weights partition this budget: $\rho^*/100 \approx 0.033$ , within 20% of the empirical value.
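The arithmetic behind these constants is easy to reproduce. The snippet below only evaluates the expressions given above and compares $\rho^*/100$ with the empirical weight of 0.04. It does not test the underlying claim.

```python
import math

rho_star = math.pi * (3 + math.sqrt(5)) / 5   # stated constant, about 3.29
eta_c = 1 / rho_star                          # critical overhead threshold
predicted_weight = rho_star / 100
empirical_weight = 0.04                       # effective weight quoted in the text

relative_gap = abs(predicted_weight - empirical_weight) / empirical_weight
print(f"rho* = {rho_star:.4f}, eta_c = {eta_c:.4f}, "
      f"rho*/100 = {predicted_weight:.4f}, gap = {relative_gap:.1%}")
```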

What this reveals: VICReg’s “hyperparameters” emerge from the geometry of coherence maintenance. The method works because it keeps $\eta$ below $\eta_c$ , preventing the dimensional collapse that occurs when organizational overhead exceeds critical thresholds.

The Momentum Path: How DINO Maintains Coherence Across Timescales

DINO takes a different approach. No explicit variance penalties. No covariance terms. Instead, it maintains coherence through temporal structure.

The core mechanism is the momentum teacher,

\theta_{\text{teacher}} \leftarrow m\theta_{\text{teacher}} + (1-m)\theta_{\text{student}},

where the momentum $m \in [0.996, 0.9995]$ makes the teacher an exponential moving average of the student weights.

This creates timescale separation. The student network adapts quickly to each batch, tracking rapid variations in the data. The teacher evolves slowly, filtering out high-frequency noise. The system minimizes

\mathcal{L} = -\sum_{x \in \{x_1, x_2\}} \sum_{i=1}^C P_t^{(i)}(x) \log P_s^{(i)}(x'),

where $P_t$ is the teacher’s softmax output and $P_s$ is the student’s prediction on a different augmentation.

The momentum parameter determines the teacher’s time constant,

\tau_{\text{teacher}} = \frac{1}{1-m}.

For $m = 0.996$ , this gives $\tau = 250$ update steps. The teacher integrates information over 250 batches before significantly changing its predictions.
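A small simulation makes the time constant concrete. In this sketch the student is held fixed, which is an assumption made purely to isolate the teacher's response. After $\tau$ steps the teacher has closed roughly $1 - 1/e$ of the gap.

```python
import numpy as np

def ema_update(teacher, student, m):
    """One momentum step: teacher <- m * teacher + (1 - m) * student."""
    return m * teacher + (1.0 - m) * student

def time_constant(m):
    return 1.0 / (1.0 - m)

def fraction_closed(m, steps):
    """Fraction of the teacher-student gap closed after `steps` updates."""
    teacher = np.zeros(4)
    student = np.ones(4)  # held fixed for this illustration
    for _ in range(steps):
        teacher = ema_update(teacher, student, m)
    return float(teacher.mean())

slow = fraction_closed(0.996, 250)
fast = fraction_closed(0.99, 250)
print(f"tau(0.996) = {time_constant(0.996):.0f}, gap closed after 250 steps: {slow:.3f}")
print(f"tau(0.99)  = {time_constant(0.99):.0f}, gap closed after 250 steps: {fast:.3f}")
```

With $m = 0.99$ the teacher has almost fully caught up with the student over the same 250 steps, so it retains far less of the older history.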

Why does this prevent collapse? The slow-moving teacher acts as a coherence anchor. When the student tries to collapse representations, the teacher still maintains diversity from hundreds of previous updates. The cross-entropy loss pulls the student toward the teacher’s stable distribution, preventing runaway collapse dynamics.

The centering operation reinforces this,

c \leftarrow \alpha c + (1-\alpha) \frac{1}{B}\sum_{i=1}^B z_i,

z_{\text{centered}} = z - c.

This maintains a variance floor implicitly by preventing all embeddings from drifting toward a single point. DINO never computes $\mathcal{L}_{\text{var}}$ , but the exponential moving average of the center keeps the quantity that term measures bounded without explicit regularization.

The sharpening temperatures, $\tau_t = 0.04$ for the teacher versus $\tau_s = 0.1$ for the student, create an asymmetry,

P^{(i)}(x) = \frac{\exp(z \cdot w_i / \tau)}{\sum_k \exp(z \cdot w_k / \tau)}.

Lower temperature sharpens the distribution, forcing the teacher to make confident predictions. Higher temperature softens the student’s distribution, allowing it to explore. This asymmetry generates a training signal that pulls the student toward confident, stable representations.
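The pieces above (centering, a sharper teacher, a softer student, cross-entropy) can be combined in a short sketch. It uses random logits in place of real network outputs. The center momentum `alpha=0.9` and the batch and output sizes are assumptions for illustration, with the teacher given the lower temperature as described.

```python
import numpy as np

def softmax(logits, temperature):
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def entropy(p):
    """Mean Shannon entropy of a batch of distributions."""
    return float(-np.sum(p * np.log(p + 1e-12), axis=-1).mean())

def update_center(center, teacher_logits, alpha=0.9):
    """Exponential moving average of the batch mean of teacher outputs."""
    return alpha * center + (1.0 - alpha) * teacher_logits.mean(axis=0)

def dino_loss(student_logits, teacher_logits, center, tau_student=0.1, tau_teacher=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    p_teacher = softmax(teacher_logits - center, tau_teacher)
    log_p_student = np.log(softmax(student_logits, tau_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 16))   # stand-ins for teacher outputs
student_logits = rng.normal(size=(4, 16))   # stand-ins for student outputs
center = np.zeros(16)

loss = dino_loss(student_logits, teacher_logits, center)
new_center = update_center(center, teacher_logits)
print("teacher entropy:", entropy(softmax(teacher_logits, 0.04)))
print("student entropy:", entropy(softmax(student_logits, 0.1)))
print("loss:", loss)
```

In a real training loop the teacher branch carries no gradient, which plain NumPy cannot express. Only the forward computation is shown here.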

What this reveals: DINO implements the same constraint geometry as VICReg through temporal rather than spatial mechanisms. The momentum parameter $m = 0.996$ sets a timescale $\tau = 250$ that keeps coherence deviation $\mathrm{CD}(t)$ bounded. When $m$ is too small ( $m = 0.99, \tau = 100$ ), the teacher changes too quickly and loses its anchoring function. The system crosses into the unstable regime where $\mathrm{CD}(t)$ grows faster than gradient descent can correct it.

The Contrastive Path: How SimCLR Tiles Representational Space

SimCLR abandons both explicit variance terms and momentum teachers. Instead, it uses contrastive learning with massive batches.

The contrastive loss for a positive pair $(i,j)$ is

\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)},

where $N$ is the batch size (so $2N$ total augmented views) and $\tau$ is the temperature parameter.

The denominator is critical. Each positive pair sees $2N-2$ negative samples. These negatives tile the representational space, creating repulsive forces that prevent collapse. When embeddings try to cluster at a single point, the contrastive loss pushes them apart to distinguish positives from negatives.
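The loss above can be written compactly in NumPy. This is a sketch that assumes cosine similarity for $\text{sim}$ and a layout in which rows $i$ and $i+N$ hold the two views of the same sample. The synthetic embeddings are invented for the demo.

```python
import numpy as np

def nt_xent(z, temperature=0.07):
    """Contrastive loss averaged over all 2N views; rows i and i+N are a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # the indicator k != i
    two_n = z.shape[0]
    n = two_n // 2
    positives = np.concatenate([np.arange(n, two_n), np.arange(0, n)])
    # Log of the denominator, computed with the usual max-shift for stability.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(two_n), positives]))

rng = np.random.default_rng(0)
view_a = rng.normal(size=(64, 32))
aligned = np.concatenate([view_a, view_a + 0.01 * rng.normal(size=(64, 32))])
unrelated = np.concatenate([view_a, rng.normal(size=(64, 32))])

aligned_loss = nt_xent(aligned)
random_loss = nt_xent(unrelated)
print(f"aligned views: {aligned_loss:.3f}, unrelated views: {random_loss:.3f}")
```

Each row's denominator contains its positive plus the $2N-2$ negatives, so the loss is low only when the positive pair stands out against every other view in the batch.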

But how many negatives are needed? This is a geometric percolation problem. For representations to span a $d$ -dimensional manifold, they must form a connected graph where nearby points can be distinguished.

The percolation threshold for random graphs is

p_c = \frac{1}{\langle k \rangle},

where $\langle k \rangle$ is the average degree. For SSL, the effective dimensionality is $d_{\text{eff}} = d\tau$ where $d$ is the embedding dimension and $\tau$ is the temperature parameter.

For SimCLR’s configuration ( $d = 128, \tau = 0.07$ ),

d_{\text{eff}} = 128 \times 0.07 \approx 9.

The critical batch size for manifold percolation becomes

N_{\text{crit}} = \exp(d_{\text{eff}}/\rho^*) \approx \exp(9/3.29) \approx 15.

But this is the theoretical minimum. Real optimization dynamics, gradient noise, and finite sampling push the practical requirement much higher. The empirical finding: batch size 4096 provides sufficient negative samples to stabilize training.
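The numbers in this estimate can be reproduced directly. The sketch below evaluates the stated formulas for SimCLR's configuration and reports the ratio between the empirical batch size of 4096 and the theoretical minimum. The function name is hypothetical.

```python
import math

rho_star = math.pi * (3 + math.sqrt(5)) / 5   # about 3.29

def critical_batch_size(d, tau):
    """N_crit = exp(d_eff / rho*), with d_eff = d * tau as defined in the text."""
    return math.exp(d * tau / rho_star)

d_eff = 128 * 0.07
n_crit = critical_batch_size(128, 0.07)
implied_factor = 4096 / n_crit   # gap between theory and the empirical batch size

print(f"d_eff = {d_eff:.2f}, N_crit = {n_crit:.1f}, 4096 / N_crit = {implied_factor:.0f}")
```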

The temperature parameter $\tau = 0.07$ controls how the contrastive geometry spreads representations,

d_{\text{eff}} = d\tau.

Lower temperature sharpens the softmax, making the loss focus on the hardest negatives. This creates stronger repulsive forces but requires more negatives to cover the space. Higher temperature softens the distribution, reducing the need for negatives but providing weaker training signal.

What this reveals: SimCLR shapes $\kappa$ through negative sample density. The batch size defines $\mathcal{M}_{\text{allowed}}$ directly. With insufficient negatives, the constrained optimum $p^*$ sits too far above the ideal $F^*$ , making $\kappa$ too large for gradient descent to reach. The requirement for 4096 samples marks the point where contrastive geometry achieves sufficient manifold coverage to keep $\kappa$ bounded.

The Predictive Path: How BYOL and JEPA Cross Recursive Thresholds

BYOL removes contrastive losses entirely. No negative samples. No massive batches. Yet it works.

The architecture adds a predictor network $q_\theta$ that maps the online network’s output toward the target network’s output,

\mathcal{L} = \|q_\theta(f_\theta(x)) - f_\xi(x')\|_2^2,

where $\theta$ are online network parameters, $\xi$ are target (momentum) network parameters, and $x, x'$ are different augmentations.

The predictor is the key. It adds a directional mapping that increases effective dimensionality,

d_{\text{eff}} = d_{\text{encoder}} + d_{\text{predictor}}.

This reduces $\kappa$ by expanding the constraint manifold. The system can now represent transformations between augmentations, not just the augmentations themselves. The predictor learns $\Delta z = z' - z$ , capturing the structure of the augmentation space.

Combined with momentum ( $m = 0.996$ ), this creates stability without negatives. The predictor shapes the manifold while momentum anchors coherence across time.
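A toy version of the BYOL objective shows what the predictor contributes. In this sketch the two views differ by a fixed offset, which is an assumption standing in for the effect of augmentation. A predictor that has learned that offset drives the loss to zero, while an identity predictor cannot.

```python
import numpy as np

def byol_loss(online_out, target_out, predictor):
    """Mean over the batch of ||q(f_theta(x)) - f_xi(x')||^2."""
    prediction = predictor(online_out)
    return float(np.mean(np.sum((prediction - target_out) ** 2, axis=1)))

rng = np.random.default_rng(0)
online_out = rng.normal(size=(8, 16))   # stands in for f_theta(x)
shift = 0.5 * np.ones(16)               # hypothetical augmentation offset, Delta z
target_out = online_out + shift         # stands in for f_xi(x')

def identity(z):
    return z

def learned_shift(z):
    return z + shift                    # a predictor that has captured Delta z

print("identity predictor:", byol_loss(online_out, target_out, identity))
print("shift predictor:   ", byol_loss(online_out, target_out, learned_shift))
```

The published method also normalizes both vectors and stops gradients through the target branch. Those details are omitted here.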

JEPA extends this to multi-step prediction,

\mathcal{L} = \sum_{t=1}^{T} \|s_\psi(s_\theta(x_t)) - s_\theta(x_{t+k})\|_2^2,

where $s_\theta$ is the encoder and $s_\psi$ is the predictor that forecasts $k$ steps ahead.
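The multi-step objective can be illustrated on a synthetic trajectory. The sketch below assumes the embeddings follow a simple planar rotation, a stand-in for real encoder outputs, and compares a predictor that knows the $k$-step dynamics with one that does not.

```python
import numpy as np

def k_step_prediction_loss(embeddings, predictor, k):
    """Sum over t of ||predictor(z_t) - z_{t+k}||^2 for a sequence of embeddings."""
    current = embeddings[:-k]
    future = embeddings[k:]
    return float(np.sum((predictor(current) - future) ** 2))

# Synthetic latent trajectory: each step rotates the state by a fixed angle.
theta = 0.1
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(39):
    states.append(rotation @ states[-1])
embeddings = np.stack(states)           # shape (40, 2)

k = 10
rotation_k = np.linalg.matrix_power(rotation, k)

def exact_predictor(z):
    return z @ rotation_k.T             # applies k rotation steps to each row

def identity_predictor(z):
    return z                            # ignores the dynamics entirely

exact_loss = k_step_prediction_loss(embeddings, exact_predictor, k)
naive_loss = k_step_prediction_loss(embeddings, identity_predictor, k)
print(f"exact predictor: {exact_loss:.2e}, identity predictor: {naive_loss:.2f}")
```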

The remarkable finding: stability emerges at $k = 10$ steps. Shorter horizons ( $k < 10$ ) lead to unstable training. Why ten?

From recursive closure theory, systems achieve stable self-modeling when they can represent their own dynamics over sufficient horizons. The mathematical manifestation appears in the Leibniz series for $\pi$ ,

\frac{\pi}{4} = \sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots

Write $\text{Leibniz}(N) = 4\sum_{n=0}^{N-1} \frac{(-1)^n}{2n+1}$ for the $N$-term approximation to $\pi$ . Its efficiency per term is

\text{efficiency}(N) = \frac{\text{Leibniz}(N)}{N}.

At $N = 10$ , this reaches

\text{efficiency}(10) \approx 0.304 \approx \eta_c = \frac{1}{\rho^*}.

This is the recursive closure threshold—the point where the organizational overhead required to maintain coherent predictions crosses into the stable regime.
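The quoted value can be checked numerically. The sketch assumes $\text{Leibniz}(N)$ means four times the $N$-term partial sum, that is, the running approximation to $\pi$ . That reading is the one that reproduces 0.304 at $N = 10$ .

```python
import math

def leibniz_pi(n_terms):
    """Four times the n-term partial sum of the Leibniz series."""
    return 4.0 * sum((-1) ** n / (2 * n + 1) for n in range(n_terms))

def efficiency(n_terms):
    return leibniz_pi(n_terms) / n_terms

rho_star = math.pi * (3 + math.sqrt(5)) / 5
eta_c = 1 / rho_star

for n_terms in range(8, 13):
    print(n_terms, round(efficiency(n_terms), 4))
print("eta_c =", round(eta_c, 4))
```

Printing the neighbouring values of $N$ shows how quickly the ratio moves away from $\eta_c$ on either side of ten.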

JEPA’s 10-step horizon isn’t chosen for semantic reasons (“10 steps is meaningful”). It’s geometric necessity. Below 10 steps, the system lacks sufficient recursive depth to form closed predictive loops. The representations can’t stabilize because they can’t model their own evolution over adequate horizons.

Transformers show the same threshold. Coherent reasoning emerges around depth $D = 10\text{--}12$ layers. Below this, the recursive operations can’t compose into stable self-modeling dynamics.

What this reveals: Predictive methods solve the constraint problem by expanding representational capacity (predictor networks) and ensuring recursive closure (sufficient prediction horizon). The decade threshold appears in JEPA’s 10-step prediction, transformers’ 10-12 layer emergence, and the Leibniz efficiency curve because they’re measuring the same geometric property: the minimum depth required for systems to achieve stable self-representation.

Where the Paths Converge

Four completely different approaches. VICReg explicitly regularizes variance and covariance. DINO uses momentum and centering. SimCLR tiles space with contrastive negatives. BYOL and JEPA predict forward through time. Yet they all solve the same equation,

F[q_t] - F^* = \kappa + \mathrm{CD}(t).

Each method controls different terms:

VICReg raises $\kappa$ through variance/covariance penalties to prevent collapse, but keeps $\kappa$ small enough that gradient descent can reach $p^*$ . The weights $\sim 0.04 \approx \rho^*/100$ partition the organizational overhead to stay below $\eta_c$ .
DINO controls $\mathrm{CD}(t)$ through momentum timescales. The teacher’s $\tau = 250$ step integration window bounds how far the student can drift from coherent representations. Centering maintains implicit variance floors.
SimCLR reduces $\kappa$ by ensuring the constraint manifold has sufficient geometric coverage. Batch size 4096 provides the negative sample density needed for manifold percolation in $d_{\text{eff}} \approx 9$ effective dimensions.
BYOL reduces $\kappa$ through predictor networks that expand representational capacity. Momentum controls $\mathrm{CD}(t)$ .
JEPA achieves recursive closure at 10-step prediction horizon, crossing the threshold where $\eta < \eta_c$ becomes sustainable for self-modeling representations.

The constraint geometry forces all stable methods toward the same structure. The underlying mathematics is identical across implementations.

When systems must minimize free energy under architectural constraints, they can only succeed by:

Keeping structural cost $\kappa$ bounded
Preventing coherence deviation $\mathrm{CD}(t)$ from growing too fast
Maintaining organizational overhead $\eta < \eta_c$

Every SSL method that works does exactly this, whether the designers knew it or not. The constraint geometry forced the convergence.

The Organizational Overhead and Collapse Dynamics

The convergence reveals a deeper constraint: organizational overhead must stay below critical thresholds.

From information physics, any system maintaining structure while processing information carries an organizational charge $\eta$ —the fraction of capacity devoted to maintaining coherence rather than processing new information.

Physical systems show a consistent pattern:

Particles: $\eta \sim 10^{-6}$
Atoms: $\eta \sim 10^{-3}$
Molecules: $\eta \sim 10^{-2}$
Biological systems: $\eta \sim 10^{-1}$
Event horizons: $\eta = 1$

The progression follows a renormalization flow,

\beta(\eta) = -\eta(1-\eta)\left[\rho^* + \frac{d-2}{2}\ln\phi\right],

where $\phi = \frac{1+\sqrt{5}}{2}$ is the golden ratio and $\rho^* \approx 3.29$ .

The critical point appears at

\eta_c = \frac{1}{\rho^*} = 0.304.

Systems operating below $\eta_c$ maintain coherence. Systems crossing this threshold collapse—dimensional reduction near black holes, seizure activity in neural circuits, representational collapse in SSL.

The collapse modes in SSL become interpretable:

Insufficient variance regularization ( $\mathcal{L}_{\text{var}}$ too small): Dimensions die, concentrating overhead in fewer active dimensions. This increases $\eta$ until it crosses $\eta_c$ and the system collapses to a point.
Insufficient momentum ( $m$ too small in DINO): The teacher changes too rapidly, losing its coherence-anchoring function. $\mathrm{CD}(t)$ grows faster than gradient descent can correct, and representations become unstable.
Insufficient negatives (batch size too small in SimCLR): The contrastive geometry fails to cover the manifold. $\kappa$ becomes too large—the constrained optimum sits too far from the ideal. Gradient descent can’t reach it.
Insufficient prediction horizon ( $k < 10$ in JEPA): The system lacks recursive closure. It cannot form stable self-models, leading to $\eta > \eta_c$ in the prediction pathway.

These are the same failures viewed through different lenses. The system violates the constraint geometry and $\eta$ crosses $\eta_c$ in each case.

What Collapse Actually Looks Like

Abstract mathematics becomes concrete when you watch representations die. The constraint geometry predicts both that systems collapse and how they collapse—the temporal signatures, spectral patterns, and geometric transformations that mark the transition from stable learning to catastrophic failure.

Dimensional Death Cascade (Insufficient Variance Control)

Training proceeds normally for 100-200 steps. Loss decreases smoothly. Then the first dimension dies—its variance drops below the noise floor. Within 10-20 steps, a second dimension collapses. Then a third. The cascade accelerates exponentially:

Visual signature: Eigenvalue spectrum develops a sharp cliff. The largest eigenvalue grows while smaller eigenvalues collapse toward zero. Plot the eigenvalue ratio $\lambda_1 / \lambda_{10}$ —healthy training keeps this $< 10$ . In dimensional death cascade, it crosses 100 within 50 steps, then 1000 within another 50 steps.
Loss signature: Training loss continues decreasing (the model can still fit the data with fewer dimensions), but validation metrics diverge. The gap between training and validation loss grows super-linearly. Downstream task performance drops 10-20% even as SSL loss improves.
Representation signature: Embeddings collapse toward a lower-dimensional subspace. Computing the participation ratio $PR = (\sum \lambda_i)^2 / \sum \lambda_i^2$ shows the effective dimensionality. Healthy SSL maintains $PR \geq 0.5 \times d$ where $d$ is nominal dimension. Dimensional death cascade shows $PR$ dropping below $0.1 \times d$ in 100-200 steps.
Recovery: Impossible past 50% dimensional loss. Early intervention (steps 1-30 of cascade) can rescue training by increasing variance regularization 2-5×. Late intervention (steps 50+) requires restart from earlier checkpoint.
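The two spectral quantities named above are straightforward to compute from a batch of embeddings. The sketch below uses synthetic data: an isotropic Gaussian batch as the "healthy" case and a rank-2 projection as the "collapsed" case. Both are assumptions for illustration, not measurements from a real run.

```python
import numpy as np

def covariance_spectrum(z):
    """Eigenvalues of the embedding covariance matrix, sorted in descending order."""
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / (z.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    return np.clip(eigvals, 0.0, None)   # remove tiny negative round-off

def participation_ratio(eigvals):
    """PR = (sum lambda_i)^2 / sum lambda_i^2."""
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))

def eigenvalue_ratio(eigvals, index=10, floor=1e-12):
    """lambda_1 / lambda_index, with a floor to avoid division by zero."""
    return float(eigvals[0] / max(eigvals[index - 1], floor))

rng = np.random.default_rng(0)
d = 32
healthy = rng.normal(size=(4096, d))
collapsed = healthy[:, :2] @ rng.normal(size=(2, d))   # only two live directions

for name, z in [("healthy", healthy), ("collapsed", collapsed)]:
    spectrum = covariance_spectrum(z)
    print(name, "PR =", round(participation_ratio(spectrum), 2),
          "lambda_1/lambda_10 =", f"{eigenvalue_ratio(spectrum):.3g}")
```

Logging both values every few dozen steps is cheap, since the covariance matrix is only $d \times d$ .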

This cascade pattern reveals the fragility of unconstrained dimensional compression. When variance regularization fails to maintain the geometric floor, the system enters a runaway collapse where each lost dimension accelerates the death of remaining dimensions. The next failure mode shows a different geometric pathology—not dimensional death but temporal instability.

Teacher-Student Oscillation (Insufficient Momentum)

Training shows characteristic periodic instability. Representations swing between over-fitting recent batches and over-smoothing historical information. The period matches $2\tau$ where $\tau = 1/(1-m)$ is the momentum timescale:

Visual signature: Plot cosine similarity between teacher and student embeddings over time. Healthy training shows slow drift (linear increase from 0.7 to 0.9 over thousands of steps). Oscillation mode shows periodic swings with amplitude 0.1-0.2 and period 100-200 steps when $m = 0.99$ .
Loss signature: Training loss oscillates with the same period as teacher-student similarity. Each cycle: loss decreases for $\tau$ steps, then suddenly jumps 10-30%, then decreases again. The oscillation amplitude grows over time—early training shows 5% swings, late training shows 30%+ swings.
Representation signature: PCA of embeddings over time reveals periodic geometric rotation. The first two principal components trace elliptical paths rather than staying fixed. The ellipse expands over time as oscillation amplitude grows.
Recovery: Increase momentum from $m = 0.99$ to $m = 0.996$ or higher. This increases $\tau$ from 100 to 250 steps, slowing teacher evolution and damping oscillations. Recovery is possible at any point but requires 2-3× the oscillation period to stabilize.

Unlike dimensional collapse which eliminates information capacity permanently, oscillation failure preserves capacity but prevents stable convergence. The system has enough dimensions but lacks the temporal anchoring to settle into coherent geometry. The next mode combines both pathologies—preserved but fragmented structure.

Manifold Fragmentation (Insufficient Negatives)

Representations form disconnected clusters rather than a continuous manifold. The number of clusters scales as $\sqrt{B}$ where $B$ is batch size. For batch size 256, expect 16 clusters. For batch size 64, expect 8 clusters:

Visual signature: t-SNE or UMAP visualization shows distinct islands rather than continuous structure. Clustering coefficient $C = (\text{triangles}) / (\text{connected triples})$ quantifies fragmentation. Healthy contrastive learning shows $C \geq 0.3$ . Fragmentation shows $C < 0.1$ .
Loss signature: Training loss plateaus prematurely. The contrastive objective can’t push clusters apart further without more negative samples. Loss reaches 1.5-2.0 and stops improving, while healthy training would reach 0.5-1.0.
Representation signature: Within-cluster similarity is very high (0.9+) but between-cluster similarity is near zero (< 0.1). This bimodal similarity distribution indicates disconnected manifold components. Healthy training shows unimodal distribution centered around 0.3-0.5.
Recovery: Increase batch size 4× (doubling batch size only improves by $\sqrt{2}$ ). Alternatively, use momentum queues (MoCo-style) to increase effective negatives without memory constraints. Or switch to non-contrastive methods (VICReg, DINO) that don’t require explicit negatives.
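The clustering coefficient used in the visual signature can be computed from thresholded cosine similarities. This sketch assumes the standard global clustering coefficient (transitivity), in which each triangle contributes three closed triples. The 0.5 threshold and the tiny hand-built examples are illustrative choices.

```python
import numpy as np

def clustering_coefficient(embeddings, threshold=0.5):
    """Transitivity of the graph linking embeddings whose cosine similarity exceeds threshold."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adjacency = (z @ z.T > threshold).astype(float)
    np.fill_diagonal(adjacency, 0.0)                 # no self-loops
    degree = adjacency.sum(axis=1)
    triples = np.sum(degree * (degree - 1.0))        # ordered connected triples
    if triples == 0:
        return 0.0
    closed = np.trace(adjacency @ adjacency @ adjacency)  # ordered closed triples
    return float(closed / triples)

def unit_vectors(angles_deg):
    angles = np.deg2rad(np.asarray(angles_deg, dtype=float))
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Three mutually similar points form a triangle; a chain of three does not.
triangle = unit_vectors([0, 10, 20])
chain = unit_vectors([0, 50, 100])   # neighbours are similar, the end points are not

print("triangle:", clustering_coefficient(triangle))
print("chain:   ", clustering_coefficient(chain))
```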

Fragmentation represents a subtle failure—local structure remains coherent but global connectivity breaks. Each cluster learns valid representations, but the manifold loses its ability to interpolate between distant regions. The final failure mode operates at yet another scale—not spatial but temporal frequency.

High-Frequency Jitter (Insufficient Prediction Horizon)

Representations oscillate at characteristic frequency $f \approx 1/k$ where $k$ is prediction steps. For 5-step prediction, expect oscillations every 5 steps. For 3-step prediction, every 3 steps:

Visual signature: Fourier transform of embedding norms over time shows sharp peak at $f = 1/k$ . Healthy long-horizon prediction shows flat spectrum. Short-horizon prediction shows spectral peak 10-100× above noise floor.
Loss signature: Prediction loss oscillates with period $k$ . Plot moving average with window size $k$ —healthy training shows smooth decrease. Insufficient horizon shows oscillation amplitude comparable to the trend (signal-to-noise ratio near 1).
Representation signature: Consecutive checkpoints (saved every $k$ steps) show high variability. Computing $\Delta d(t) = \|q_t - q_{t-k}\|$ for consecutive checkpoints gives high values (> 0.5 relative distance). Healthy training shows $\Delta d < 0.2$ .
Recovery: Increase prediction horizon $k \geq 10$ steps. This crosses the recursive closure threshold where organizational overhead drops below $\eta_c$ . The high-frequency oscillations disappear within 50-100 steps of adjustment.
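The spectral check described in the visual signature can be sketched with NumPy's FFT. The signal below is synthetic: a period-5 oscillation plus noise, standing in for logged embedding norms. Using the median of the power spectrum as the noise floor is also an assumption.

```python
import numpy as np

def dominant_frequency(signal):
    """Return (peak frequency in cycles per step, peak power / median power)."""
    centered = np.asarray(signal, dtype=float) - np.mean(signal)
    power = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(centered), d=1.0)
    peak = int(np.argmax(power[1:])) + 1        # skip the zero-frequency bin
    noise_floor = float(np.median(power[1:]))
    return float(freqs[peak]), float(power[peak] / noise_floor)

rng = np.random.default_rng(0)
steps = np.arange(500)
k = 5  # hypothetical prediction horizon
norms = 1.0 + 0.2 * np.sin(2 * np.pi * steps / k) + 0.02 * rng.normal(size=steps.size)

peak_freq, peak_over_floor = dominant_frequency(norms)
print(f"peak at f = {peak_freq:.3f} (1/k = {1 / k:.3f}), {peak_over_floor:.0f}x above the floor")
```

The same function applies to a loss curve. A peak near $1/k$ is the signature to look for, while a flat spectrum gives a ratio close to the noise level.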

These four failure modes—dimensional collapse, temporal oscillation, spatial fragmentation, and frequency jitter—exhaust the ways constraint geometry can break. Each corresponds to violating a different geometric requirement: variance floors, momentum anchoring, manifold connectivity, or recursive closure. The diagnostic protocol below provides systematic tools to detect which constraint is failing.

Common Diagnostic Protocol

When SSL training shows instability, run this diagnostic sequence:

Eigenvalue spectrum: Plot sorted eigenvalues $\lambda_i$ . Check if $\lambda_1/\lambda_{10} > 50$ (dimensional collapse) or if spectrum shows exponential decay $\lambda_i \sim e^{-i/\tau}$ with $\tau < 5$ (too much compression).
Teacher-student similarity: For momentum methods, plot cosine similarity over training. Check for oscillations with period $\sim 100\text{--}200$ steps (momentum too low) or drift $> 0.05$ per 100 steps (momentum too high or learning rate too high).
Participation ratio: Compute $PR = (\sum \lambda_i)^2 / \sum \lambda_i^2$ every 50 steps. If $PR$ drops below $0.3 \times d$ , variance regularization is insufficient.
Clustering coefficient: Sample 1000 embeddings, compute pairwise similarities, threshold at 0.5, and calculate clustering coefficient. If $C < 0.15$ , increase batch size or switch methods.
Spectral frequency analysis: FFT of embedding norms or losses. Sharp peaks indicate characteristic timescales. Match peak frequency to method parameters (momentum timescale, prediction horizon, batch processing period).

This diagnostic protocol translates abstract geometric constraints into concrete monitoring tools. The mathematics predicts the signatures. The signatures predict the failures. The failures guide the interventions.

Design Principles That Follow

The convergence enables principled design rather than empirical search.

For Variance-Based Methods (VICReg-style)

Target effective weights near $\rho^*/100 \approx 0.033$ for variance and covariance terms. This balances collapse prevention against training rigidity. After accounting for batch normalization and scaling, aim for regularization strengths in the range

\lambda_{\text{eff}}, \mu_{\text{eff}} \in [0.025, 0.050].

This range provides sufficient collapse prevention without making training rigid.

For Momentum Methods (DINO-style)

Set momentum to achieve integration timescales around

\tau = \frac{1}{1-m} \approx 250 \text{ steps}.

For $m \in [0.996, 0.9995]$ , this provides sufficient temporal anchoring. Too small ( $m < 0.99, \tau < 100$ ), and the teacher loses coherence. Too large ( $m > 0.9995, \tau > 2000$ ), and adaptation becomes too slow. This timescale balance maintains coherence without sacrificing adaptability.

For Contrastive Methods (SimCLR-style)

Calculate the required batch size from the effective dimensionality,

N_{\text{min}} = \exp\left(\frac{d \tau}{\rho^*}\right) \times \text{safety factor},

where safety factors of 100-300× account for optimization dynamics. For $d = 128, \tau = 0.07$ ,

N_{\text{min}} \approx 15 \times (100\text{--}300) \approx 1500\text{--}4500.

These batch sizes ensure sufficient negative sample density for stable manifold coverage.

For Predictive Methods (JEPA-style)

Ensure prediction horizon $k \geq 10$ steps for recursive closure. Shorter horizons fail to stabilize. Use predictor networks to expand effective dimensionality,

d_{\text{eff}} = d_{\text{encoder}} + d_{\text{predictor}}.

Combine with momentum ( $m \approx 0.996$ ) to anchor coherence. This dual mechanism expands capacity while maintaining stability.

Monitoring System Health

Track the alignment between current representations and recent history,

\Delta d(t) = \|q_t - q_{t-\tau}\|,

where $\tau$ is the relevant timescale (e.g., momentum integration window). Rising $\Delta d$ indicates increasing $\mathrm{CD}(t)$ —the system is drifting from coherent structure.

Monitor effective dimensionality through the eigenspectrum of the embedding covariance matrix. Rapid eigenvalue decay signals dimensional collapse, which can be summarized in a single number,

\eta_{\text{eff}} = 1 - \frac{\text{effective rank}}{\text{total dimensions}}.

When $\eta_{\text{eff}}$ approaches $0.3$ , the system nears the collapse threshold. These monitoring tools provide early warning signals before catastrophic failure.
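Both monitoring quantities can be sketched as follows. The text does not define "effective rank", so this example assumes the entropy-based definition (the exponential of the Shannon entropy of the normalized eigenvalues). Other definitions would shift the numbers. The synthetic batches are likewise illustrative.

```python
import numpy as np

def effective_rank(z):
    """exp(entropy) of the normalized covariance eigenvalues (an assumed definition)."""
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / (z.shape[0] - 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]                                   # treat 0 * ln 0 as 0
    return float(np.exp(-np.sum(p * np.log(p))))

def eta_eff(z):
    """1 - effective rank / total dimensions."""
    return 1.0 - effective_rank(z) / z.shape[1]

def drift(q_now, q_past):
    """Delta d = ||q_t - q_{t - tau}|| between two snapshots of the same embeddings."""
    return float(np.linalg.norm(q_now - q_past))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(4096, 32))
collapsed = healthy[:, :2] @ rng.normal(size=(2, 32))   # rank-2 embeddings

print("eta_eff healthy:  ", round(eta_eff(healthy), 3))
print("eta_eff collapsed:", round(eta_eff(collapsed), 3))
print("drift:", round(drift(healthy, healthy + 0.01), 3))
```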

Why This Matters for Representation Learning

The constraint geometry framework transforms SSL from empirical art to principled engineering.

New methods don’t require exhaustive hyperparameter search. Start from the constraint equations. Choose how to shape $\kappa$ (variance terms, momentum, negatives, prediction). Ensure $\mathrm{CD}(t)$ dynamics stay bounded. Keep $\eta < \eta_c$ . The “magic numbers” follow from the geometry.

Failure modes become diagnosable. Representations collapse? Check if $\eta_{\text{eff}} > 0.3$ . Training unstable? Measure $\Delta d(t)$ to quantify coherence drift. Insufficient performance? Calculate if $\kappa$ is too large given your architectural constraints.

Cross-domain insights become possible. The same constraint geometry appears in biological neural networks (synaptic homeostasis maintaining variance floors), physical systems (black hole dimensional reduction when $\eta \to 1$ ), and engineered systems (transformer emergence at depth 10-12). The mathematics connects domains that seemed unrelated.

Architectural choices gain theoretical grounding. Why do transformers need 10+ layers for reasoning? Recursive closure. Why does momentum ~0.996 work across so many methods? Timescale separation at $\tau \approx 250$ . Why do contrastive methods need huge batches? Manifold percolation in $d_{\text{eff}}$ dimensions.

The convergence of VICReg, DINO, SimCLR, BYOL, and JEPA wasn’t historical accident. The constraint geometry forced it. The methods work because they obey the mathematics of coherence maintenance under representational constraints—whether the designers knew it or not.

When independent approaches built from completely different intuitions all arrive at the same “magic numbers,” they reveal structure that was there all along, waiting to be recognized. The geometry shaped the methods, not the other way around.

What the Constraint Geometry Predicts

If the convergence reveals genuine geometric necessity rather than historical accident, the constraint equations should make falsifiable predictions about systems not yet built, architectures not yet tested, and failure modes not yet encountered.

Architectural Predictions

Vision Transformers will require depth $D \in [10, 12]$ for stable reasoning. The recursive closure threshold appears at $N = 10$ in the Leibniz efficiency curve. Below this depth, transformers can’t form stable self-modeling representations. Above depth 12, additional layers provide diminishing returns because $\eta$ approaches $\eta_c$ and organizational overhead dominates. Widely used base-size models are consistent with this: GPT-2 small (12 layers), BERT base (12 layers), and ViT base (12 layers). The decade threshold repeats because it measures the minimum recursive depth for coherent abstraction.

Contrastive methods will plateau at batch sizes $N \approx 2^{12} = 4096$. For typical SSL configurations (embedding dimension $d = 128$, temperature $\tau = 0.07$), effective dimensionality $d_{\text{eff}} = 9$ requires $N_{\text{crit}} = \exp(9/3.29) \approx 15$ for manifold percolation. Safety factors of 100-300× push this to 1500-4500. SimCLR found 4096. CLIP trained with a batch size of 32,768 but gained only marginal improvements above 8192. MoCo used momentum queues, effectively expanding batch size to 65,536, but showed diminishing returns. The constraint geometry predicts the ceiling.
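The arithmetic in this paragraph is easy to reproduce. The sketch below evaluates $N_{\text{crit}} = \exp(d_{\text{eff}}/\rho^*)$ with the article's values and applies the quoted 100-300× safety factors. The function and constant names are illustrative, not taken from any library.

```python
import math

RHO_STAR = 3.29  # the article's constant rho*

def critical_batch_size(d_eff: float, rho_star: float = RHO_STAR) -> float:
    """N_crit = exp(d_eff / rho*), the percolation threshold claimed in the text."""
    return math.exp(d_eff / rho_star)

n_crit = critical_batch_size(9)            # roughly 15.4, which the text rounds to 15
low, high = 100 * n_crit, 300 * n_crit     # quoted safety-factor band
print(f"N_crit ~ {n_crit:.1f}, practical band ~ {low:.0f} to {high:.0f}")
```

Without rounding $N_{\text{crit}}$ down to 15, the band comes out slightly wider than the 1500-4500 quoted above, and SimCLR's 4096 still falls inside it.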

New SSL methods will converge on effective regularization $\eta_{\text{eff}} \approx 0.03\text{--}0.04$ regardless of implementation. Whether through explicit variance terms (VICReg), implicit centering (DINO), negative sample density (SimCLR), or predictor networks (BYOL), any stable method must maintain organizational overhead near $\rho^*/100 \approx 0.033$. Methods with $\eta_{\text{eff}} < 0.02$ will underfit (insufficient constraint). Methods with $\eta_{\text{eff}} > 0.05$ will be too rigid (excessive constraint). This 0.03-0.04 band is geometric necessity.

Multimodal models will require cross-modal alignment losses scaled by $1/\sqrt{M}$, where $M$ is the number of modalities. Each modality adds dimensional constraints. Under this rule, CLIP (2 modalities: vision + language) keeps a larger per-loss alignment weight than ImageBind (6 modalities: image, text, audio, depth, thermal, IMU). The $1/\sqrt{M}$ scaling keeps total organizational overhead $\eta$ bounded as modalities increase. Without this scaling, $\eta$ grows linearly with $M$ and crosses $\eta_c$ around $M = 3\text{--}4$ modalities, causing collapse.
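To make the $1/\sqrt{M}$ rule concrete, here is a minimal sketch of how such a weight could be computed. The helper `alignment_weight` and the base weight of 1.0 are assumptions for illustration. Neither CLIP nor ImageBind is known to expose such a parameter.

```python
import math

def alignment_weight(num_modalities: int, base_weight: float = 1.0) -> float:
    """Scale a cross-modal alignment loss weight by 1/sqrt(M)."""
    if num_modalities < 1:
        raise ValueError("num_modalities must be at least 1")
    return base_weight / math.sqrt(num_modalities)

two_modalities = alignment_weight(2)   # a CLIP-like setup, about 0.71
six_modalities = alignment_weight(6)   # an ImageBind-like setup, about 0.41
print(two_modalities, six_modalities)
```

The weight shrinks as modalities are added, so a six-modality model applies each alignment term at a bit over half the strength of a two-modality one.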

Training Dynamics Predictions

Learning rate warm-up duration must match momentum timescales. For momentum $m = 0.996$ giving $\tau = 250$ steps, warm-up should span $\sim 250\text{--}500$ steps. A shorter warm-up shocks the teacher-student dynamics before temporal anchoring is established. A longer warm-up wastes compute in suboptimal regions. Empirically, DINO uses a 10-epoch warm-up on ImageNet (5000 steps) and SimCLR uses 10% of training as warm-up; both align with $\tau$ timescales.
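The link between $m$ and $\tau$ used here is the usual time constant of an exponential moving average, $\tau = 1/(1 - m)$, which reproduces the article's $m = 0.996 \Rightarrow \tau = 250$. The sketch below computes it along with the suggested one-to-two-$\tau$ warm-up window. The function names are hypothetical.

```python
def ema_timescale(momentum: float) -> float:
    """Time constant (in steps) of an EMA update: tau = 1 / (1 - m)."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must be in [0, 1)")
    return 1.0 / (1.0 - momentum)

def warmup_window(momentum: float) -> tuple[int, int]:
    """Warm-up range of one to two timescales, following the text's 250-500 step example."""
    tau = ema_timescale(momentum)
    return round(tau), round(2 * tau)

for m in (0.99, 0.996, 0.999):
    print(m, warmup_window(m))
```

Under this rule, raising the momentum from 0.99 to 0.999 stretches the suggested warm-up by a factor of ten.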

Representational collapse will occur when the eigenvalue spectrum develops a power-law tail with exponent $\alpha > 2$. Healthy representations show eigenvalue decay $\lambda_k \sim k^{-\alpha}$ with $\alpha \in [1.5, 2]$. When $\alpha > 2$, effective dimensionality drops rapidly and $\eta_{\text{eff}}$ approaches 0.3. This provides early warning 50-100 steps before visible collapse in loss curves.
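One simple way to track the exponent is a straight-line fit of $\log \lambda_k$ against $\log k$. The sketch below assumes that estimator, which is crude (it weights all ranks equally and is sensitive to the chosen rank range), and the function name `spectrum_exponent` is made up for this example.

```python
import numpy as np

def spectrum_exponent(eigvals, k_min=1, k_max=None):
    """Estimate alpha in lambda_k ~ k^(-alpha) by least squares in log-log space.

    k_min and k_max (1-indexed, inclusive) restrict the fit to a rank range.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
    lam = lam[lam > 0]                                     # log needs positive values
    ranks = np.arange(1, len(lam) + 1)
    window = slice(k_min - 1, k_max)
    slope, _intercept = np.polyfit(np.log(ranks[window]), np.log(lam[window]), 1)
    return -slope

k = np.arange(1, 65, dtype=float)
healthy_alpha = spectrum_exponent(k ** -1.8)   # synthetic spectrum inside the healthy band
warning_alpha = spectrum_exponent(k ** -2.5)   # synthetic spectrum past the alpha > 2 threshold
print(healthy_alpha, warning_alpha)
```

On real covariance spectra the tail is noisy, so it is probably more informative to watch the trend of the estimate over training than any single value.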

Optimal checkpoint selection occurs when $\Delta d(t)$ reaches a local minimum. Tracking representation drift $\Delta d(t) = \|q_t - q_{t-\tau}\|$ over the momentum timescale $\tau$ reveals when the system settles into coherent basins. Local minima in $\Delta d(t)$ correspond to stable geometric configurations, which makes them better checkpoint candidates than loss-based selection.
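The drift signal can be sketched as follows, assuming $q_t$ is some fixed-size summary of the representation logged at every step (for example, the mean embedding of a fixed probe batch; this choice is an assumption, not something the text specifies). Both helper names are hypothetical.

```python
import numpy as np

def representation_drift(snapshots: np.ndarray, tau: int) -> np.ndarray:
    """Return ||q_t - q_{t-tau}|| for t = tau, ..., T-1 given a (T, dim) log of q_t."""
    if tau < 1 or tau >= len(snapshots):
        raise ValueError("tau must satisfy 1 <= tau < number of snapshots")
    return np.linalg.norm(snapshots[tau:] - snapshots[:-tau], axis=1)

def local_minima(drift: np.ndarray) -> np.ndarray:
    """Indices where drift is strictly lower than both neighbours."""
    interior = (drift[1:-1] < drift[:-2]) & (drift[1:-1] < drift[2:])
    return np.flatnonzero(interior) + 1

drift = np.array([3.0, 2.0, 1.0, 2.0, 3.0, 0.5, 4.0])
candidates = local_minima(drift)   # positions within the drift array
print(candidates)
```

Index `i` in the drift array corresponds to training step `i + tau`, so the offset has to be added back when mapping a candidate to a saved checkpoint.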

Failure Mode Predictions

Methods violating $\eta < \eta_c$ will collapse in predictable patterns:

Insufficient variance control ($\mathcal{L}_{\text{var}}$ too weak): Dimensional death cascade where dimensions collapse sequentially rather than simultaneously. First dimension dies → overhead concentrates in remaining dimensions → second dimension dies → cascade accelerates. Time to full collapse: $T_{\text{collapse}} \approx d \cdot \tau$, where $d$ is embedding dimension and $\tau$ is adaptation timescale.
Insufficient momentum ($m < 0.99$): Teacher-student oscillations with period $\sim 2\tau$. Representations swing between over-fitting to recent batches and over-averaging historical information. Coherence deviation $\mathrm{CD}(t)$ grows as $\sqrt{t}$ rather than staying bounded.
Insufficient negatives (batch size $< N_{\text{crit}}$): Manifold fragmentation where representations form disconnected clusters. Number of clusters scales as $1/\sqrt{B}$, where $B$ is batch size. Doubling batch size reduces fragmentation by a factor of $\sqrt{2}$, explaining why improvements are sublinear in batch size.
Insufficient prediction horizon ($k < 10$ in JEPA-style methods): Prediction instability with characteristic frequency $f \approx 1/k$. Short horizons can’t filter noise at timescales longer than the prediction window. The system oscillates at frequencies just above $1/k$, creating high-frequency jitter in learned representations.

Cross-Domain Predictions

Biological neural networks should show $\eta \approx 0.1$ at circuit level. Synaptic homeostasis mechanisms (scaling, metaplasticity) function as variance regularization. The constraint geometry predicts biological systems operate closer to $\eta_c$ than artificial systems (0.1 vs 0.03) because they face stronger computational constraints. Measurements of metabolic overhead in cortical circuits show 10-15% of neural activity devoted to homeostatic regulation, matching the predicted $\eta \approx 0.1$.

Emergent abilities in language models will appear at depth $D \geq 10$ and model scale where organizational overhead per parameter drops below $\eta_c$. Smaller models can’t achieve $\eta < 0.3$ because parameter-sharing forces higher overhead. The sharp “emergence” represents geometric threshold crossing. Chain-of-thought reasoning emerged in models ≥ 10B parameters with depth ≥ 12 layers precisely because this configuration first achieves sustained $\eta < \eta_c$.

Information bottleneck methods will find optimal compression ratio $\beta \approx 3.29 = \rho^*$. The mutual information objective $I(X;Z) - \beta I(Z;Y)$ balances compression against prediction. The constraint geometry predicts $\beta \approx \rho^*$ because this ratio maintains coherence at the edge of the constraint manifold. Empirical studies finding $\beta \in [3, 4]$ for optimal generalization align with geometric prediction.

These predictions are falsifiable. If constraint geometry genuinely governs SSL, these patterns should appear across architectures, datasets, and training regimes with 10-20% precision. If the convergence was historical accident or domain-specific, these predictions will fail. The geometry requires these outcomes.

What the Convergence Reveals

Five methods. Five different intuitions. Variance regularization. Momentum teachers. Contrastive negatives. Predictive networks. Multi-step horizons. Yet they all converge on:

Effective regularization strengths $\sim 0.03\text{--}0.04 \approx \rho^*/100$
Momentum parameters $\sim 0.996$ giving timescale $\tau \approx 250$
Batch sizes scaling as $\exp(d_{\text{eff}}/\rho^*)$
Prediction horizons $\geq 10$ steps
Organizational overhead $\eta < 0.304$

The pattern repeats across architectures, datasets, and training regimes with geometric requirements precise within 10-20%. This precision emerges from constraint geometry, not curve fitting.

The deeper insight: self-supervised learning works by discovering representational manifolds that minimize free energy while respecting architectural constraints. The constraint geometry determines what manifolds are possible. Systems that violate the geometry collapse. Systems that respect it discover stable, coherent representations.

This is what the convergence shows. When you push representational systems to learn from unlabeled data while respecting finite capacity, finite compute, and finite depth, the mathematics forces them toward specific solutions. The constraint geometry requires these solutions.

The methods converged because the geometry of representation space doesn’t allow anything else.