SGD in Nature:
Close Analogues to AI Algorithms
Evidence that biological & physical processes already exploit SGD‑like dynamics at an abstract level
Stochastic gradient descent (SGD) shares the same drift‑plus‑diffusion form that underlies evolution, neural adaptation, and thermodynamic relaxation.
These natural algorithms showcase how noisy, local updates can reliably discover globally robust solutions in high‑dimensional landscapes.
1 ◆ Evolutionary Search ≈ "Fitness‑SGD"
- Gradient Source – Relative fitness selects allele changes that ascend the fitness gradient on average [5,6].
- Noise – Mutation & genetic drift maintain exploration; population size plays the role of batch size (smaller populations ↔ smaller batches ↔ more stochasticity).
- Flat Peaks – Broad adaptive plateaus dominate rugged genotype spaces in many empirical systems, mirroring the flat minima that generalize well in DNNs [7].
- Formal Link – Adaptive‑walk dynamics converge to the SDE dθ = ∇F(θ)·dt + √Σ·dW_t, the same drift‑plus‑diffusion form as SGD [8].
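A minimal simulation sketch of that formal link, assuming an invented two‑dimensional fitness peak and isotropic mutation noise (the fitness function, step size, and noise scale are illustrative choices, not parameters from [8]):

```python
# Euler-Maruyama integration of dθ = ∇F(θ)·dt + √Σ·dW_t for a toy fitness
# function F(θ) = -||θ - peak||²; the noise term stands in for mutation/drift.
import numpy as np

def grad_fitness(theta, peak=np.array([2.0, -1.0])):
    """Gradient of the toy fitness F(θ) = -||θ - peak||²."""
    return -2.0 * (theta - peak)

def adaptive_walk(steps=5_000, dt=0.01, noise_scale=0.3, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                                   # start far from the peak
    for _ in range(steps):
        drift = grad_fitness(theta) * dt                  # selection: climb fitness
        diffusion = noise_scale * np.sqrt(dt) * rng.standard_normal(2)  # mutation/drift
        theta = theta + drift + diffusion
    return theta

print(adaptive_walk())   # settles near the peak [2, -1], jittering around it
```

Shrinking noise_scale (larger populations, bigger batches) tightens the jitter around the peak; enlarging it widens exploration.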
2 ◆ Neural Plasticity ≈ Local Gradient Descent
- Gradient Source – Hebbian & STDP rules can be framed as gradient descent on variational free energy under predictive coding [3] (a toy sketch follows this list).
- Experimental Evidence – In‑vitro cortical cultures follow the predicted gradient flow when exposed to structured stimuli [9].
- Global Signal – Dopamine reward prediction error ≈ global scalar loss informing weight updates (RL‑style SGD) [10].
- Credit Assignment – Approximated through dendritic segregation & feedback pathways – biological "backprop" surrogates [11, 4].
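To make the "local rule × global scalar" idea concrete, here is a toy node‑perturbation sketch, one of several proposed biological surrogates rather than a claim about the brain's actual algorithm; the target mapping W_star, the noise scale, and the learning rate are invented for illustration:

```python
# Three-factor toy rule: a local activity fluctuation (xi) paired with presynaptic
# input (x) is gated by a global scalar asking "did the fluctuation raise reward?".
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 8, 2
W = rng.normal(scale=0.1, size=(n_out, n_in))      # synaptic weights being learned
W_star = rng.normal(size=(n_out, n_in))            # hidden "correct" mapping (toy task)
eta, sigma = 0.01, 0.1                             # learning rate, exploration noise

for step in range(5_000):
    x = rng.normal(size=n_in)                      # presynaptic activity
    y0 = W @ x                                     # expected postsynaptic activity
    xi = sigma * rng.standard_normal(n_out)        # exploratory fluctuation
    r_base = -np.sum((y0 - W_star @ x) ** 2)       # reward without the fluctuation
    r_pert = -np.sum((y0 + xi - W_star @ x) ** 2)  # reward with it
    rpe = r_pert - r_base                          # dopamine-like "was that better?" scalar
    W += (eta / sigma**2) * rpe * np.outer(xi, x)  # pre x post-fluctuation x global scalar

print(float(np.mean((W - W_star) ** 2)))           # shrinks toward ~0 over training
```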
3 ◆ Protein Folding & Over‑Damped Langevin Dynamics
- Gradient Source – Conformational search obeys dX = −∇U(X)·dt + √(2β⁻¹)·dW_t with β = 1/k_BT, the same SDE form as SGD with η ↔ β⁻¹ [12] (see the double‑well simulation after this list).
- Landscape – Funnel‑shaped energy landscapes explain fast convergence despite an astronomically large state space, the same empirical picture reported for over‑parameterized nets [13, 16].
- Temperature Scheduling – Simulated annealing & cyclical learning‑rates both exploit temperature schedules to cross barriers.
- Experiment – Single‑molecule folding trajectories show barrier‑hopping rates described by 1‑D diffusive (Kramers‑type) kinetics [14], the same framework used to model SGD's escape from sharp minima.
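A toy illustration of the Langevin picture, assuming a one‑dimensional double‑well potential U(x) = (x² − 1)² chosen for convenience (none of these parameters come from [14]):

```python
# Overdamped Langevin dynamics, dX = -∇U·dt + √(2β⁻¹)·dW_t, in a double well.
# The noise lets the particle hop the barrier at x = 0, the same escape
# mechanism invoked for SGD noise leaving sharp minima.
import numpy as np

def grad_U(x):
    return 4.0 * x * (x**2 - 1.0)        # dU/dx for U(x) = (x² - 1)²

def count_hops(steps=200_000, dt=1e-3, beta=3.0, seed=0):
    rng = np.random.default_rng(seed)
    x, well, hops = -1.0, -1, 0          # start committed to the left well
    for _ in range(steps):
        x += -grad_U(x) * dt + np.sqrt(2.0 * dt / beta) * rng.standard_normal()
        if x > 0.5 and well == -1:       # committed to the right well
            well, hops = 1, hops + 1
        elif x < -0.5 and well == 1:     # back to the left well
            well, hops = -1, hops + 1
    return hops

print(count_hops())   # a handful of hops at β = 3; raise β (lower T) and they become exponentially rarer
```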
◆ Nonequilibrium Thermodynamics of SGD
- Entropy Flow – SGD maintains a steady‑state entropy production ⟨ΔS⟩ ≈ η·Var[g] and obeys fluctuation theorems [15].
- Effective Temperature – T_eff ∝ η·(B_full/B_mini) links learning rate and batch size to the exploration radius (checked numerically below).
- Flatness – Flatter minima occupy wider volumes of weight space (higher local entropy), which correlates with better generalization [1,2].
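A quick numerical sanity check of the effective‑temperature heuristic, using a made‑up one‑parameter quadratic loss (the dataset, sizes, and evaluation point w = 0 are arbitrary):

```python
# Mini-batch gradient noise for a toy quadratic loss L(w) = mean_i (w - x_i)² / 2:
# the variance of the mini-batch gradient around the full-batch gradient scales ≈ 1/B_mini.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                                    # B_full: full dataset size
data = rng.normal(loc=3.0, scale=2.0, size=N)
w = 0.0                                       # arbitrary evaluation point

def minibatch_grad(batch_size):
    batch = rng.choice(data, size=batch_size, replace=False)
    return w - batch.mean()                   # dL/dw on the mini-batch

full_grad = w - data.mean()
for B in (16, 64, 256, 1024):
    grads = np.array([minibatch_grad(B) for _ in range(2_000)])
    var = np.mean((grads - full_grad) ** 2)
    print(f"B_mini={B:5d}   Var[g]≈{var:.5f}   B_mini·Var[g]≈{B * var:.3f}")
# B_mini·Var[g] stays roughly constant, i.e. Var[g] ∝ 1/B_mini; this is the
# scaling behind the T_eff ∝ η·(B_full/B_mini) heuristic above.
```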
✖ Non‑Analogues (Evidence Against)
- Hamiltonian Mechanics – Conservative, time‑reversible ⇒ no descent (unless friction is added, leading to gradient flow).
- Closed Quantum Evolution – Unitary, preserves entropy; requires decoherence or baths to approximate SGD.
- Pure Random Walk – Lacks a drift term, so it spreads only ∝ √t; reaching a solution at distance D takes O(D²) steps vs. O(D) for SGD's biased walk.
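A tiny numerical illustration of that gap (the step distribution and drift value are arbitrary):

```python
# With a drift term the walk covers ground ∝ t; a pure random walk spreads only ∝ √t.
import numpy as np

rng = np.random.default_rng(0)
steps, trials = 10_000, 500
noise = rng.standard_normal((trials, steps))        # unit-variance step noise

for drift in (0.0, 0.1):
    final = (noise + drift).sum(axis=1)             # displacement after all steps
    rms = np.sqrt(np.mean(final ** 2))
    print(f"drift={drift:.1f}: RMS displacement after {steps} steps ≈ {rms:,.0f}")
# drift 0.0 gives ≈ √10000 = 100; drift 0.1 gives ≈ 0.1·10000 = 1000, so covering a
# distance D costs O(D²) undirected steps but only O(D) with even a weak bias.
```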
◆ Design Lessons Transferred to Machine Learning
- Temperature Control – Adapt η & batch size the way annealing schedules adapt temperature (see the schedule sketch after this list).
- Population Diversity – Ensembles, population‑based training (PBT), and hyperparameter search exploit evolutionary breadth.
- Local+Global Signals – Merge Hebbian locality with global error modulators (e.g. feedback alignment).
- Barrier Hopping – Inject noise bursts / sharpness‑aware steps to exit narrow valleys.
- Energy‑Based Regularization – Entropy‑SGD and SAM explicitly penalize sharp minima, echoing natural robustness.
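As one concrete instance of the temperature‑control lesson, a cosine‑annealing learning‑rate schedule with warm restarts (values only, no optimizer attached; the cycle length and bounds are arbitrary choices):

```python
# Repeated cosine anneals: "hot" at each cycle start, "cool" at its end; the
# restart acts like reheating, giving the optimizer a chance to hop barriers.
import math

def cosine_with_restarts(step, eta_max=0.1, eta_min=1e-4, cycle_len=1_000):
    """Learning rate at `step` for repeated cosine anneals of length `cycle_len`."""
    t = step % cycle_len
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / cycle_len))

for s in (0, 250, 500, 999, 1000):   # cools within a cycle, then reheats at the restart
    print(s, round(cosine_with_restarts(s), 5))
```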
Conclusion ▶
The core mathematical structure of SGD—a biased stochastic process with drift and diffusion—has natural analogues across biology, neuroscience, and statistical physics. These domains exploit noise to improve search, ensure robustness, and generalize across uncertain landscapes—mirroring why SGD works so well in machine learning. This convergence offers a framework for new algorithm design based on nature's time-tested principles.
[1] Cohen J. et al., "Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability", ICLR 2021.
[2] Chaudhari P. et al., "Entropy‑SGD: Biasing Gradient Descent Into Wide Valleys", ICLR 2017.
[3] Friston K., "The Free‑Energy Principle: A Unified Brain Theory?", Nat. Rev. Neurosci. 2010.
[4] Lillicrap T. et al., "Backpropagation and the Brain", Nat. Rev. Neurosci. 2020.
[5] Orr H.A., "The Genetic Theory of Adaptation: A Brief History", Nat. Rev. Genet. 2005.
[6] Kryazhimskiy S. et al., "Global Epistasis Makes Adaptation Predictable Despite Sequence‑Level Stochasticity", Science 2014.
[7] Izmailov P. et al., "Averaging Weights Leads to Wider Optima and Better Generalization", UAI 2018.
[8] Mustonen V., Lässig M., "From Fitness Landscapes to Seascapes: Non‑Equilibrium Dynamics of Selection and Adaptation", Trends Genet. 2009.
[9] Isomura T., Friston K., "In Vitro Neural Networks Minimise Variational Free Energy", Sci. Rep. 2018.
[10] Schultz W., "Dopamine Reward Prediction Error Coding", Annu. Rev. Neurosci. 2016.
[11] Guerguiev J. et al., "Towards Deep Learning with Segregated Dendrites", eLife 2017.
[12] Zwanzig R., "Diffusion in a Rough Potential", PNAS 1988.
[13] Bryngelson J.D. et al., "Funnels, Pathways, and the Energy Landscape of Protein Folding: A Synthesis", Proteins 1995.
[14] Neupane K. et al., "Protein Folding Trajectories Can Be Described Quantitatively by 1‑D Diffusion…", Nature Physics 2016.
[15] Sohl‑Dickstein J. et al., "Deep Unsupervised Learning Using Nonequilibrium Thermodynamics", ICML 2015.
[16] Li H., Xu Z., Taylor G., Studer C., Goldstein T., "Visualizing the Loss Landscape of Neural Nets", NeurIPS 2018.
Information derived from discussions with OpenAI's GPT-o3, Deep Research, and Google Gemini 2.5-pro AIs: https://chatgpt.com/share/681af0f3-364c-8003-8f01-720cd41c61a0
SGD in Nature:
Close Analogues to AI Algorithms
Big idea: the way we train A.I. looks a lot like how nature learns and settles down.
What is SGD? ▶ A simple "guess, check, tweak" loop that keeps nudging numbers until mistakes get smaller.
Surprise — evolution, brains, and even folding proteins follow very similar guess‑and‑tweak patterns!
1 ◆ Evolution: Nature's Long Game
- How it learns — Random gene changes happen; the useful ones help creatures survive and spread.
- Why noise matters — Mutations are nature's way of "trying new ideas."
- Big picture — Over time, species end up on wide, safe "fitness hills," just like A.I. finds wide, safe solutions.
2 ◆ Brains: Tiny Tweaks Between Neurons
- How it learns — Neurons that "fire together, wire together." Stronger wires = better future guesses.
- Good job signal — A dopamine squirt says "Yes, that was right!" and locks in the change.
- Like A.I. — This is a natural version of sending an error signal back through a network.
3 ◆ Proteins: Wiggle Until They Fit
- How it learns — A floppy chain wiggles randomly, sliding downhill in energy, until it snaps into a snug shape.
- Why it works — The energy landscape is shaped like a funnel, guiding the chain home.
- Same vibe — Our algorithm also wiggles (adds little random nudges) to avoid getting stuck in bad spots.
◆ Jiggly Physics Behind the Scenes
- Always moving — SGD never fully sits still; tiny jostles keep it exploring.
- Hot vs. cold — Big learning rate = hot & adventurous, small learning rate = cool & careful.
- Gentle valleys — Settling in a wide valley means the solution still works if things change a bit.
✖ Where the Analogy Breaks
- Perfect pendulums — A frictionless swing never stops, so no learning there.
- Pure chance — Wandering aimlessly without feedback is super slow.
◆ Tips We Borrow for Better A.I.
- Start hot, cool down — Like cooling metal, we use big steps then tiny steps.
- Keep a crowd — Training several models at once (a "population") helps find stronger ideas.
- Mix signals — Local tweaks plus an occasional big thumbs‑up signal work great.
Wrap‑up ▶
Whether it's genes, brain cells, or proteins, nature wins by "try, test, and tweak." A.I.'s SGD is basically the same trick — meaning we can learn plenty from the world around us.