SGD in Nature:
Close Analogues to AI Algorithms

Evidence that biological & physical processes already exploit SGD‑like dynamics at an abstract level

Stochastic gradient descent (SGD) shares the same drift‑plus‑diffusion stochastic form that underlies evolution, neural adaptation, and thermodynamic relaxation.

These natural algorithms showcase how noisy, local updates can reliably discover globally robust solutions in high‑dimensional landscapes.

1 ◆ Evolutionary Search ≈ "Fitness‑SGD"

  • Gradient Source – Relative fitness selects allele changes that, on average, ascend the fitness gradient [5,6].
  • Noise – Mutation & genetic drift maintain exploration; population size plays the role of batch size (small populations ↔ small, noisy batches).
  • Flat Peaks – Broad adaptive plateaus dominate many empirically characterized genotype spaces, mirroring the flat minima that generalize well in DNNs [7].
  • Formal link – Adaptive‑walk dynamics converge to the SDE dθ = ∇F·dt + √Σ·dWₜ, the same drift‑plus‑diffusion form as SGD (see the sketch below) [8].
Schematic illustration of evolutionary landscape and adaptive walk
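
To make the formal link concrete, here is a minimal Python (NumPy) sketch that integrates the adaptive‑walk SDE dθ = ∇F·dt + √Σ·dWₜ with Euler–Maruyama steps; each update is term‑for‑term a noisy gradient‑ascent step. The quadratic fitness function, noise scale, and step count are illustrative assumptions, not values from [8].

    # Euler–Maruyama integration of dθ = ∇F·dt + √Σ·dWₜ (drift = selection, diffusion = mutation/drift),
    # the same drift-plus-diffusion structure as a stochastic gradient step. Toy values only.
    import numpy as np

    rng = np.random.default_rng(0)

    def grad_fitness(theta):
        # Gradient of a toy quadratic fitness peak F(θ) = -||θ||²/2 centred at the origin.
        return -theta

    dt, sigma, steps = 0.05, 0.3, 2000      # dt ~ learning rate, sigma ~ mutation/drift noise scale
    theta = rng.normal(size=10)             # initial "genotype" (parameter vector)

    for _ in range(steps):
        drift = grad_fitness(theta) * dt                                 # selection ascends fitness on average
        diffusion = sigma * np.sqrt(dt) * rng.normal(size=theta.shape)   # mutation + genetic drift
        theta += drift + diffusion

    print("distance from fitness peak:", np.linalg.norm(theta))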

2 ◆ Neural Plasticity ≈ Local Gradient Descent

  • Gradient Source – Hebbian + STDP rules can be framed as gradient descent on variational free energy under predictive coding [3].
  • Experimental Evidence – In‑vitro mouse cortical cultures follow predicted gradient flow when exposed to structured stimuli [9].
  • Global Signal – Dopamine reward‑prediction error ≈ a global scalar loss signal that modulates local weight updates (RL‑style SGD; sketched below) [10].
  • Credit Assignment – Approximated through dendritic segregation & feedback pathways – biological "backprop" surrogates [11, 4].
Schematic of a neuron synapse illustrating plasticity
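
As a rough sketch of how the "Global Signal" and "Credit Assignment" bullets could combine, the following Python (NumPy) snippet implements a node‑perturbation‑style three‑factor rule: a local Hebbian term (injected output noise × presynaptic activity) gated by a global dopamine‑like reward‑prediction error, which on average follows the reward gradient. The toy task, layer sizes, and constants are assumptions for illustration, not a model taken from [9]–[11].

    # Three-factor plasticity sketch: ΔW ∝ (global reward-prediction error) × (output noise × presynaptic activity).
    # Node perturbation of this kind approximates stochastic gradient ascent on expected reward. Toy values only.
    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_out, eta = 20, 5, 0.01
    W = rng.normal(scale=0.1, size=(n_out, n_in))   # synaptic weights being learned
    W_target = rng.normal(size=(n_out, n_in))       # hidden mapping that defines the reward
    baseline = 0.0                                  # running reward estimate (the "prediction")

    for trial in range(5000):
        x = rng.normal(size=n_in)                   # presynaptic activity
        noise = 0.1 * rng.normal(size=n_out)        # exploratory postsynaptic perturbation
        y = W @ x + noise                           # noisy postsynaptic activity
        reward = -np.mean((y - W_target @ x) ** 2)  # scalar reward for this trial
        rpe = reward - baseline                     # global dopamine-like reward-prediction error
        baseline += 0.05 * rpe                      # slowly adapting reward prediction
        W += eta * rpe * np.outer(noise, x)         # local (noise × pre) term gated by the global RPE

    print("final trial reward ≈", round(reward, 3))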

3 ◆ Protein Folding & Over‑Damped Langevin Dynamics

  • Gradient Source – Conformational search obeys the over‑damped Langevin SDE dX = −∇U·dt + √(2β⁻¹)·dWₜ (β = 1/kT) – the same form as the SGD SDE, with the learning rate η playing the role of β⁻¹ (see the sketch below) [12].
  • Landscape – Funnel‑shaped energy landscapes explain fast folding despite an astronomical conformational space – the same empirical picture reported for over‑parameterized networks [13, 16].
  • Temperature Scheduling – Simulated annealing & cyclical learning rates both exploit temperature‑like schedules to cross barriers.
  • Experiment – Single‑molecule folding trajectories exhibit barrier hopping rates matching SGD escape statistics [14].
Illustration of a protein folding energy funnel
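
A brief Python (NumPy) sketch of the over‑damped Langevin dynamics above: Euler–Maruyama integration of dX = −∇U·dt + √(2β⁻¹)·dWₜ on a toy double‑well potential with a simulated‑annealing temperature schedule; replacing the temperature with a decaying learning rate gives the SGD analogue. The potential, schedule, and step counts are illustrative assumptions.

    # Over-damped Langevin dynamics on U(x) = (x² − 1)² with an annealed temperature;
    # noise lets the walker hop the barrier at x = 0 early on, then it settles as T drops.
    import numpy as np

    rng = np.random.default_rng(2)

    def grad_U(x):
        # Gradient of the double-well potential U(x) = (x² − 1)².
        return 4.0 * x * (x**2 - 1.0)

    dt, steps = 1e-3, 200_000
    x, crossings, side = 1.0, 0, 1.0

    for k in range(steps):
        T = (1.0 - k / steps) + 0.01                                # anneal temperature from ~1.0 to 0.01
        x += -grad_U(x) * dt + np.sqrt(2.0 * T * dt) * rng.normal()
        if np.sign(x) != 0 and np.sign(x) != side:                  # hopped over the barrier
            crossings += 1
            side = np.sign(x)

    print(f"barrier crossings: {crossings}, final position x ≈ {x:+.2f}")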

◆ Nonequilibrium Thermodynamics of SGD

  • Entropy Flow – At steady state, SGD produces entropy at a rate ⟨ΔS⟩ ≈ η·Var[g] and obeys fluctuation theorems [15].
  • Effective Temperature – T_eff ∝ η·(B_full/B_mini) links learning rate and batch size to the exploration radius (illustrated below).
  • Flatness – Flatter minima correlate with lower stationary entropy production, predicting better generalization [1,2].
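
The two scalings above can be illustrated numerically. The hedged Python (NumPy) sketch below estimates the minibatch gradient‑noise variance Var[g] for a toy least‑squares loss and prints the proxies η·(B_full/B_mini) and η·Var[g]; the toy data, η, and the dropped proportionality constants are assumptions, so only the trends with batch size are meaningful.

    # Estimate gradient-noise variance across minibatches of a toy least-squares loss and
    # print the effective-temperature proxy η·(B_full/B) and the entropy-production proxy η·Var[g].
    import numpy as np

    rng = np.random.default_rng(3)
    n_full, dim, eta = 4096, 16, 0.1
    X = rng.normal(size=(n_full, dim))
    y = X @ rng.normal(size=dim) + 0.5 * rng.normal(size=n_full)
    w = np.zeros(dim)                                   # point at which gradient noise is measured

    def minibatch_grad(batch_size):
        idx = rng.choice(n_full, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of the mean squared error

    for B in (32, 256, 2048):
        grads = np.stack([minibatch_grad(B) for _ in range(500)])
        var_g = grads.var(axis=0).sum()                 # total gradient-noise variance at w
        print(f"B={B:5d}   η·(B_full/B) = {eta * n_full / B:7.2f}   η·Var[g] ≈ {eta * var_g:8.3f}")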

✖ Non‑Analogues (Evidence Against)

  • Hamiltonian Mechanics – Conservative and time‑reversible ⇒ no descent; adding friction recovers gradient‑flow‑like dynamics.
  • Closed Quantum Evolution – Unitary, preserves entropy; requires decoherence or baths to approximate SGD.
  • Pure Random Walk – Lacks a drift term ⇒ progress is diffusive (~√t) rather than ballistic, making search dramatically slower than SGD's biased walk (see the sketch below).
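
The last point is easy to see numerically: without a drift term a walk only spreads diffusively (typical displacement ~√t), while even a small constant drift gives ~t progress. A short Python (NumPy) sketch with illustrative step counts and drift size:

    # Unbiased random walk vs. walk with a small constant drift: diffusive √t spread vs. ballistic t progress.
    import numpy as np

    rng = np.random.default_rng(4)
    steps, drift = 10_000, 0.05

    noise = rng.normal(size=steps)
    pure_walk = np.cumsum(noise)            # no drift: typical displacement ~ √t
    biased_walk = np.cumsum(drift + noise)  # small drift: displacement ~ drift·t

    print(f"after {steps} steps:")
    print(f"  pure random walk displacement: {pure_walk[-1]:8.1f}   (√t ≈ {np.sqrt(steps):.0f})")
    print(f"  biased walk displacement:      {biased_walk[-1]:8.1f}   (drift·t = {drift * steps:.0f})")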

◆ Design Lessons Transferred to Machine Learning

  • Temperature Control – Schedule η & batch size during training, as in simulated annealing.
  • Population Diversity – Ensembles, population‑based training (PBT) & hyper‑parameter search exploit evolutionary breadth.
  • Local+Global Signals – Merge Hebbian locality with global error modulators (e.g. feedback alignment).
  • Barrier Hopping – Inject noise bursts / sharpness‑aware steps to exit narrow valleys.
  • Energy‑Based Regularization – Entropy‑SGD and SAM explicitly penalize sharp minima (see the sketch below), echoing natural robustness.
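
As a hedged illustration of the sharpness‑aware idea in the last bullet, the Python (NumPy) sketch below evaluates the gradient at a point perturbed uphill by radius ρ and applies that gradient at the original weights, biasing descent toward flatter minima. The least‑squares loss, data, η, and ρ are illustrative assumptions; this is a toy sketch of the idea, not the reference SAM implementation.

    # Sharpness-aware step sketch: compute the gradient at w + ρ·g/||g|| (an uphill perturbation)
    # and apply it at w, which discourages convergence into sharp minima. Toy problem and values only.
    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(256, 8))
    y = X @ rng.normal(size=8)
    w = rng.normal(size=8)
    eta, rho = 0.05, 0.05

    def grad(w):
        return 2.0 * X.T @ (X @ w - y) / len(y)          # gradient of the mean squared error

    for _ in range(200):
        g = grad(w)
        eps = rho * g / (np.linalg.norm(g) + 1e-12)      # step toward the locally sharper nearby point
        w = w - eta * grad(w + eps)                      # descend using the perturbed-point gradient

    print("final loss:", float(np.mean((X @ w - y) ** 2)))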

Conclusion ▶

The core mathematical structure of SGD—a biased stochastic process with drift and diffusion—has natural analogues across biology, neuroscience, and statistical physics. These domains exploit noise to improve search, ensure robustness, and generalize across uncertain landscapes—mirroring why SGD works so well in machine learning. This convergence offers a framework for new algorithm design based on nature's time-tested principles.

[1] Cohen J. et al., "Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability", ICLR 2021.

[2] Chaudhari P. et al., "Entropy‑SGD: Biasing Gradient Descent Into Wide Valleys", ICLR 2017.

[3] Friston K., "The Free‑Energy Principle: A Unified Brain Theory?", Nat. Rev. Neurosci. 2010.

[4] Lillicrap T. et al., "Backpropagation and the Brain", Nat. Rev. Neurosci. 2020.

[5] Orr H.A., "The Genetic Theory of Adaptation: A Brief History", Nat. Rev. Genet. 2005.

[6] Kryazhimskiy S., "Global Epistasis Makes Adaptation Predictable Despite Sequence‑Level Stochasticity", Science 2014.

[7] Izmailov P. et al., "Averaging Weights Leads to Wider Optima and Better Generalization", UAI 2018.

[8] Mustonen V., Lässig M., "From Fitness Landscapes to Seascapes: Non‑Equilibrium Dynamics of Selection and Adaptation", Trends Genet. 2009.

[9] Isomura T., Friston K., "In Vitro Neural Networks Minimise Variational Free Energy", Sci. Rep. 2018.

[10] Schultz W., "Dopamine Reward Prediction Error Coding", Annu. Rev. Neurosci. 2016.

[11] Guerguiev J. et al., "Towards Deep Learning with Segregated Dendrites", eLife 2017.

[12] Zwanzig R., "Diffusion in a Rough Potential", PNAS 1988.

[13] Bryngelson & Wolynes, "Funnels, Pathways and the Energy Landscape of Protein Folding", Proc. Natl. Acad. Sci. 1995.

[14] Neupane K. et al., "Protein Folding Trajectories Can Be Described Quantitatively by 1‑D Diffusion…", Nature Physics 2016.

[15] Sohl‑Dickstein J. et al., "Deep Unsupervised Learning Using Nonequilibrium Thermodynamics", ICML 2015.

[16] Li H., Xu Z., Taylor G., Studer C., Goldstein T., "Visualizing the Loss Landscape of Neural Nets", NeurIPS 2018.

Information derived from discussions with OpenAI's GPT-o3, Deep Research, and Google Gemini 2.5-pro AIs: https://chatgpt.com/share/681af0f3-364c-8003-8f01-720cd41c61a0