# Information Geometry
Once upon a time a student lay in a garden under an apple tree reflecting on the difference between Einstein's and Newton's views about gravity. He was startled by the fall of an apple nearby. As he looked at the apple, he noticed ants beginning to run along its surface. His curiosity aroused, he thought to investigate the principles of navigation followed by an ant. With his magnifying glass, he noted one track carefully, and, taking his knife, made a cut in the apple skin one mm above the track and another cut one mm below it. He peeled off the resulting little highway of skin and laid it out on the face of his book. The track ran as straight as a laser beam along this highway. No more economical path could the ant have found to cover the ten cm from start to end of that strip of skin. Any zigs and zags or even any smooth bend in the path on its way along the apple peel from starting point to end point would have increased its length.
_What a beautiful geodesic,_ the student commented. ![[what_a_beautiful_geodesic.png]]
> Misner, C. W., Thorne, K. S., & Wheeler, J. A. (2017). _Gravitation_. Princeton University Press.
> [ISBN 9780691177793](https://www.amazon.es/dp/0691177791)
## Information as Geometry - a Motivation
Leaning on [[The Cathedral of Mathematical Structure|the Cathedral of Mathematical Structure]]
```
[2025-12-03, 15:14:33] Nati Shenkute: exact
[2025-12-03, 16:46:04] Martin Balodis: have you been to morocco?
[2025-12-03, 16:46:09] Nati Shenkute: yea man
[2025-12-03, 16:46:15] Nati Shenkute: did a roadtrip there, cool place
[2025-12-04, 16:18:38] Nati Shenkute: DUDEDDE
[2025-12-04, 16:18:49] Nati Shenkute: I've completely converted my new model project into geometry
```
You know, sometimes I wonder if there is an intuitive way of solving intractable problems, the messy ones. And that wonder has exposed me to many ways of thinking, but my favorite has to be the information theoretic approach.
In statistics, we are taught to think of statistical concepts such as variance as properties of this object we call data. The concepts seem distinct and, to me, rather arbitrary. For example, why does the average weight every observation by 1/N? I never really knew why until I properly learnt how the Normal distribution is defined.
I think there's a lot of knowledge hidden in the structure of data. I think, first and foremost, the shape and structure of data encode intuitive information, allowing us to navigate it with our highest creative ability: imagination. It was during my work that I realised we can imagine datasets and shape them in our minds as a way to generate novel forms of algorithmic navigation.
And I want to share one example with you.
---
## Data as Curved Spacetime
### Signals as Vectors in Lorentzian Space
Firstly, you gotta represent each discrete event or signal as a vector in some coordinate system. Why a vector? Because you're interested in representing it accurately, precisely, and completely: a vector has a magnitude and a direction θ (or, equivalently, a complex number), just as a signal has a certain strength and a direction indicating what it points toward.
And then you gotta place that signal on a coordinate system so that you can track what it does and understand it further. But then you wonder ...
#### What is a Coordinate System?
A coordinate system is simply a framework for assigning positions to points in space. Think of it as a grid where each point can be uniquely identified.
```
Standard Cartesian Coordinate System
====================================
x²
↑
│
│ • P(x¹, x²)
│ ╱
│ ╱
│ ╱
│╱________→ x¹
O
```
In a standard Cartesian system, the distance between two points follows the Pythagorean theorem:
```
d² = (Δx¹)² + (Δx²)²
```
But here's the thing: **we can define the distance between points however we want**. The choice of how to measure distance—the metric—encodes information about what matters in our space.
In standard geometry, all directions are treated equally. But what if some dimensions carry more information than others? What if movement along one axis tells us more about outcomes than movement along another?
Imagine further. What if you specify the coordinate system itself in such a way that its geometry is informative? That would be very cool, right? But then how would you even do that?
#### Example: The Gaussian Distribution [[Manifold]]
Let's see this coordinate transformation principle in action with something from statistics. Every Gaussian (normal) distribution is defined by two parameters:
```
p(x; μ, σ²) = (1/√(2πσ²)) · exp(-(x-μ)²/(2σ²))
```
where μ is the mean and σ² is the variance.
So the set of all Gaussian distributions forms a 2D manifold, where each point represents a different probability distribution:
```
The Gaussian Manifold
====================
σ ↑
│ ● p(x; μ=2, σ=1)
│ ●
│ ● p(x; μ=0, σ=2)
└──────────→ μ
Each point = a complete probability distribution
```
Now here's the cool part: we can use different coordinates for this same manifold!
**Coordinate System 1: (μ, σ²)**
- μ = mean (center of the distribution)
- σ² = variance (spread)
- Natural interpretation: "center and spread"
**Coordinate System 2: (m₁, m₂)** - using **moments**
- m₁ = E[x] = μ (first moment: the average)
- m₂ = E[x²] = μ² + σ² (second moment: mean of squared values)
These are **different coordinate systems on the same manifold**!
**What are moments?** They're different "snapshots" of a distribution:
- **1st moment** (m₁): E[x] = where is the center?
- **2nd moment** (m₂): E[x²] = how much total "mass" from zero?
- **3rd moment** (m₃): E[x³] = is it symmetric or lopsided?
- **4th moment** (m₄): E[x⁴] = how heavy are the tails?
The transformation between coordinate systems:
```
(μ, σ²) → (m₁, m₂):
m₁ = μ
m₂ = μ² + σ²
(m₁, m₂) → (μ, σ²):
μ = m₁
σ² = m₂ - m₁²
```
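To make the chart change concrete, here is a minimal Python sketch (the helper names `to_moments` and `to_mean_variance` are my own, not from any reference) that converts a point between the (μ, σ²) chart and the moment chart (m₁, m₂) and checks that the round trip lands back on the same distribution:
```python
# Two charts on the Gaussian manifold: (mu, sigma^2) and the moments (m1, m2).
# A minimal sketch of the transformation above; helper names are hypothetical.

def to_moments(mu, var):
    """(mu, sigma^2) -> (m1, m2), where m1 = E[x] and m2 = E[x^2]."""
    return mu, mu**2 + var

def to_mean_variance(m1, m2):
    """(m1, m2) -> (mu, sigma^2): the inverse chart change."""
    return m1, m2 - m1**2

mu, var = 2.0, 1.5
m1, m2 = to_moments(mu, var)                   # (2.0, 5.5)
assert to_mean_variance(m1, m2) == (mu, var)   # round trip recovers the same point
```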
Just like Cartesian ↔ Polar coordinates for a plane, these are two ways to navigate the same geometric space. The choice of coordinates affects how we measure "distance" between distributions!
This is the first hint that **statistics is secretly geometry** - probability distributions live on manifolds, and we can choose different coordinate systems to navigate them. The moments are coordinates that emerge naturally from the distribution's structure itself.
> Amari, S. (2016). _Information Geometry and Its Applications_. Springer.
> [ISBN 978-4431559771](https://www.amazon.es/Shun-ichi-Amari/dp/4431559779)
#### The Lorentzian?
First you gotta impose a rule of reality: everything propagates forward in time steps that are causally linked. For that, you take your coordinate system and impose your first restriction: the Lorentzian signature.
In Lorentzian geometry, the metric signature incorporates time with a negative sign, creating light cones and causality—past events can influence the future, but not vice versa. Earlier states inform later outcomes, with no retroactive influence.
The line element is:
```
ds² = -c² dτ² + g_ij dx^i dx^j
```
- τ: Proper time (progression through process stages)
- c: "Speed" parameter (rate of state evolution)
- g_ij: Spatial metric (information-theoretic distances between states)
```
Lorentzian Light Cone (Causal Structure)
========================================
Future (τ > 0: Possible Outcomes)
╱│╲
╱ │ ╲
╱ │ ╲ ← Reachable states
╱ │ ╲
/ │ \
───────────── (Event: Signal at τ=0)
\ │ /
╲ │ ╱
╲ │ ╱ ← Unreachable (cannot influence past)
╲ │ ╱
╲│╱
Past (τ < 0: Historical Context)
```
> Veritasium. Something Strange Happens When You Follow Einstein's Math, YouTube
> [watch?v=6akmv1bsz1M](https://www.youtube.com/watch?v=6akmv1bsz1M)
Events inside the future cone can still reach the terminal states; events outside it are causally disconnected and cannot influence them.
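As a toy illustration of this causal structure, the sketch below (my own helper, assuming a diagonal spatial metric) evaluates the line element for a displacement and classifies the separation as timelike, null, or spacelike, i.e. inside, on, or outside the cone:
```python
# Classify a separation using ds^2 = -c^2 dtau^2 + sum_i g_ii (dx^i)^2.
# A toy sketch assuming a diagonal spatial metric, not a full implementation.

def interval(dtau, dx, g_diag, c=1.0):
    ds2 = -(c**2) * dtau**2 + sum(g * d**2 for g, d in zip(g_diag, dx))
    if ds2 < 0:
        return ds2, "timelike (inside the cone: causally reachable)"
    if ds2 == 0:
        return ds2, "null (on the cone boundary)"
    return ds2, "spacelike (outside the cone: causally disconnected)"

print(interval(dtau=2.0, dx=[0.5, 0.5], g_diag=[1.0, 1.0]))  # ds^2 = -3.5, timelike
print(interval(dtau=0.1, dx=[3.0, 0.0], g_diag=[1.0, 1.0]))  # ds^2 =  8.99, spacelike
```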
### The Signal Vector and Its Abode
Each signal s at timestamp t is a vector:
```
V_s = (magnitude, direction, timestamp)
- magnitude: signal strength/relevance
- direction: signal characteristics in (e.g.) feature space
- timestamp: position in temporal reference frame
```
#### Invariance and Mutual Information
Here's a crucial property: vectors carry **invariance** through their mutual information with outcomes. What does this mean?
Mutual information I(X;Y) measures how much knowing X tells you about Y:
```
I(X;Y) = Σ Σ p(x,y) log₂(p(x,y)/(p(x)p(y)))
```
When I(X;Y) = 0, X and Y are independent—knowing X tells you nothing about Y. When I(X;Y) is high, X strongly predicts Y.
The key insight: **mutual information is invariant under invertible transformations of the variables**, that is, under any reparameterization that preserves the underlying relationship. Even as we transform our coordinate system or change reference frames, the fundamental information content (how much a signal tells us about the outcome) remains constant.
This is analogous to how in relativity, certain quantities (like proper time) remain invariant even as coordinates change between reference frames. The information content of a signal is a fundamental property, independent of how we choose to represent it.
```
Invariance of Information Content
=================================
Frame 1: Frame 2:
Signal representation Signal representation'
↓ ↓
V = [0.8, 45°, t₁] V' = [0.5, 30°, t₂]
↓ ↓
I(V; outcome) = 0.28 ←──────→ I(V'; outcome) = 0.28
(Preserved)
```
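Here is a minimal, standard-library sketch of the mutual information formula above, together with a check that relabeling X through an invertible map leaves I(X;Y) unchanged; the toy samples and the helper name are my own:
```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits from paired samples, via joint/marginal frequencies."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

xs = [0, 0, 1, 1, 0, 1, 0, 1]
ys = [0, 0, 1, 1, 0, 1, 1, 0]                    # mostly follows xs, with some noise
relabeled = [{0: "a", 1: "b"}[x] for x in xs]    # invertible relabeling of X

print(mutual_information(xs, ys))         # ~0.19 bits
print(mutual_information(relabeled, ys))  # identical: invariant under the relabeling
```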
### Vector Compression to Geodesic
Now here's where things get interesting. Multiple signals compress into a single net vector, forming a geodesic path through the state manifold.
But why does compression happen? And what indicates that we're dealing with a manifold in the first place?
#### Heteroscedasticity as Evidence of Higher Dimensions
When you observe variance in your data that isn't uniform—heteroscedasticity—it's often a signal that your flat representation is actually a projection of something higher-dimensional: a curved manifold.
Think of it this way: imagine you're looking at the shadow of a sphere on a wall. Near the edges, small movements of the sphere create large changes in the shadow. Near the center, the same movements create smaller changes. The non-uniform variance in the shadow reveals that it's a projection of a curved 3D object.
```
Heteroscedasticity as Manifold Signature
========================================
Flat view (what you observe):
│ scatter │ tight │ scatter │
│ ∴ ∴ ∴ │ ∴∴ │ ∴ ∴ ∴ │
│ ∴ ∴ ∴ │ ∴∴ │ ∴ ∴ ∴ ∴ │
└───────────┴───────┴───────────────┘
Region A Region B Region C
Curved reality (underlying manifold):
╱─────╲
╱ B ╲ ← Low variance: you're looking
│ ∴∴ │ straight at the surface
A╲ ∴∴ ╱C
∴ ╲─────╱ ∴ ← High variance: oblique view
∴ ∴ ∴ ∴ of curved surface
```
Similarly, in your data, when different combinations of signals lead to different amounts of outcome variance, it suggests you're observing projections of paths on a curved information manifold.
#### Geodesics: Natural Paths on Curved Surfaces
A geodesic is the straightest possible path on a curved surface—the path that extremizes distance (usually minimizes it). On a flat plane, geodesics are straight lines. On a sphere, they're great circles. On a curved information manifold, they're the natural trajectories that data follows.
When you compress multiple signals into a net vector, you're finding the geodesic that best represents the overall trajectory through your information space.
Compression via weighted sum:
```
V_net = Σ w_i · V_i
w_i ∝ I(V_i; outcome) (mutual information with terminal state)
```
The weights are proportional to information content—signals that tell you more about outcomes contribute more to the geodesic.
```
Signal Vector Compression
=========================
Mini-Vectors (Raw Signals)
→ [Signal A] (weak relevance, w=0.1)
↗ [Signal B] (medium relevance, w=0.3)
→ [Signal C] (strong relevance, w=0.6)
↓ (Compression: MI-weighted sum)
Net Vector (Geodesic Input)
───↗ [V_net] (integrated magnitude/direction)
↓ (Initiates worldline through manifold)
```
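A toy sketch of that MI-weighted sum, assuming each signal is a small 2D vector in feature space and that I(V_i; outcome) has already been estimated; the vectors, MI values, and names below are illustrative only:
```python
# Compress several signal vectors into one net vector, V_net = sum_i w_i * V_i,
# with w_i proportional to each signal's (pre-estimated) mutual information
# with the outcome. All numbers are toy values for illustration.

signals = [
    {"name": "Signal A", "vec": (0.2, 0.1), "mi": 0.1},   # weak relevance
    {"name": "Signal B", "vec": (0.5, 0.4), "mi": 0.3},   # medium relevance
    {"name": "Signal C", "vec": (0.9, 0.7), "mi": 0.6},   # strong relevance
]

total_mi = sum(s["mi"] for s in signals)
weights = [s["mi"] / total_mi for s in signals]            # normalize: weights sum to 1

v_net = tuple(sum(w * s["vec"][k] for w, s in zip(weights, signals)) for k in range(2))
print(weights)   # [0.1, 0.3, 0.6]
print(v_net)     # ~(0.71, 0.55): the compressed net vector that seeds the geodesic
```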
### Phase Transitions: From Signals to States
Now we encounter a phase shift—a transformation from one type of entity to another. In our framework, this is the transition from raw signals to opportunity states, from potential to actualized.
A phase transition in physics occurs when a system changes its fundamental character—water to ice, for example. Here, we have an analogous transition: when accumulated signals cross a threshold, they crystallize into a new state with its own dynamics.
```
Phase Transition: Signal → Opportunity
======================================
Signal Space (τ < 0): Opportunity Space (τ ≥ 0):
Diffuse potential Crystallized state
∴ ∴ ∴ ┌─────┐
∴ ∴ ∴ ∴ → │ ● │
∴ ∴ ∴ └─────┘
(Many weak signals) (Single opportunity)
V₁, V₂, V₃, ... V_net → O
```
This transition is marked by τ = 0 in our coordinate system. Before this point, we have a cloud of potential signals. After it, we have a definite state on the manifold that will follow its own geodesic toward terminal attractors.
The same pattern repeats at each stage: signals accumulate, cross a threshold, phase-transition into a new state, which then generates new signals that accumulate toward the next transition.
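A minimal sketch of that accumulate-and-threshold loop; the threshold value and the signal stream are made up for illustration:
```python
# Signals accumulate; once the accumulated magnitude crosses a threshold, the
# cloud of signals "crystallizes" into a single opportunity state (tau = 0).

THRESHOLD = 1.0                        # illustrative value, not from the text
stream = [0.2, 0.15, 0.3, 0.1, 0.35, 0.4]

accumulated, opportunities = 0.0, []
for magnitude in stream:
    accumulated += magnitude
    if accumulated >= THRESHOLD:       # phase transition: signal space -> opportunity space
        opportunities.append(accumulated)
        accumulated = 0.0              # the new state starts gathering its own signals

print(opportunities)   # ~[1.1]: one opportunity crystallized; the trailing 0.4 still accumulates
```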
### The Process as Worldline
The complete sequential process forms a geodesic worldline—a trajectory through information spacetime from initial state to terminal outcome:
```
○ → □ → □ → □ → ◇ → ⊕
- ○ : Signal compression point (phase transition)
- □ : Process stages (manifold unfolding)
- ◇ : Bifurcation point (branch to attractors)
- ⊕ : Terminal attractor
```
This worldline satisfies causality: no loops backward in τ.
```
Process Worldline (Geodesic Trajectory)
=======================================
τ=0: Signal Compression → Phase Transition
○──────────┐
│ (Natural flow along geodesic)
τ=1: Stage 1
□──────────┐
│ (Branch possible)
τ=2: Stage 2
□──────────┐
│ (Continued evolution)
τ=3: Stage 3
□──────┐
│ (Approaching attractor)
τ=4: Terminal
◇─────⊕ [Outcome A]
├─────⊕ [Outcome B]
└─────⊕ [Outcome C]
```
At each stage, the same geometric principles apply: the system follows the geodesic defined by the manifold's curvature (which encodes information structure) toward attractors (which represent stable outcomes).
## The Coordinate System
### Proper Time τ: Stage Transitions
Proper time τ measures progression, invariant across frames:
```
τ = 0: Initial signals → Phase transition
τ = 1: Stage 1
τ = 2: Stage 2
τ = 3: Stage 3
τ = 4: Terminal state
```
Each increment is a causal step—higher τ cannot precede lower.
### Spatial Coordinates: Information Content
Spatial dimensions emerge from data via mutual information ranking. The coordinates are not chosen arbitrarily but discovered through the data's natural information structure.
For features X and outcome Y:
```
I(X;Y) = Σ Σ p(x,y) log₂(p(x,y)/(p(x)p(y)))
```
Ranked basis:
```
x¹ = argmax_X I(X;Y)
x² = argmax_X I(X;Y)   (excluding the feature already chosen as x¹)
...
```
```
Coordinate Emergence via MI
==========================
Feature Pool MI Ranking Natural Basis
----------- ---------- -------------
[Feature A] I=0.42 → x¹ = Feature A
[Feature B] I=0.28 → x² = Feature B
[Feature C] I=0.21 → x³ = Feature C
[Feature D] I=0.12 → x⁴ = Feature D
... ... (lower I features compressed)
Result: Data-chosen coordinates, not human-imposed
```
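The ranking can be sketched as a greedy loop: repeatedly pick the remaining feature with the highest mutual information with the outcome. The MI estimator is the same empirical one as in the earlier sketch, and the feature columns here are toy data:
```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits, as in the earlier sketch."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mi_ranked_basis(features, outcome):
    """Order feature names x^1, x^2, ... by decreasing I(feature; outcome)."""
    remaining, basis = dict(features), []
    while remaining:
        best = max(remaining, key=lambda name: mutual_information(remaining[name], outcome))
        basis.append(best)
        remaining.pop(best)            # exclude axes already chosen, as in the ranking above
    return basis

outcome  = [0, 1, 0, 1, 0, 1, 0, 1]
features = {
    "Feature A": [0, 1, 0, 1, 0, 1, 0, 1],   # perfectly informative
    "Feature B": [0, 1, 0, 1, 1, 1, 0, 0],   # partially informative
    "Feature C": [0, 0, 1, 1, 0, 1, 1, 0],   # uninformative (independent of outcome)
}
print(mi_ranked_basis(features, outcome))    # ['Feature A', 'Feature B', 'Feature C']
```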
### Metric Tensor: Anisotropic Information Space
The metric weighs dimensions by information content:
```
ds² = Σ g_ii (dx^i)²
g_ii ∝ I(x^i; Y)
```
High-MI axes are "stretched"—movement along them carries more geometric weight.
```
Anisotropic Metric Structure
============================
High-MI Axis (x¹: Stretched)
═══════════════════════════════ (g_{11} large)
Medium-MI Axis (x²: Moderate)
═══════════════ (g_{22} medium)
Low-MI Axis (x³: Compressed)
═══════ (g_{33} small)
Level Set (Equal ds²):
═══════
║ ║
║ ● ║ ← Ellipsoid: Distance non-uniform
║ ║
═══════
```
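A toy sketch of the anisotropic metric: the diagonal entries g_ii are set proportional to each axis's mutual information with the outcome (the MI values are illustrative), so the same unit displacement is "longer" along a high-MI axis than along a low-MI one:
```python
import math

# Diagonal metric with g_ii proportional to I(x^i; Y); the MI values are
# illustrative and the proportionality constant is set to 1.
g = {"x1": 0.42, "x2": 0.28, "x3": 0.12}

def ds(displacement):
    """Line element length: ds = sqrt(sum_i g_ii * (dx^i)^2)."""
    return math.sqrt(sum(g[axis] * d**2 for axis, d in displacement.items()))

# The same unit step costs more distance along the informative axis.
print(ds({"x1": 1.0, "x2": 0.0, "x3": 0.0}))   # ~0.65
print(ds({"x1": 0.0, "x2": 0.0, "x3": 1.0}))   # ~0.35
```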
## Attractor Structure
### Terminal Attractors
Attractors are stable points detected by the geometry—terminal states where trajectories naturally converge. In any sequential decision process, these represent the possible final outcomes.
These aren't arbitrary endpoints we've chosen; they emerge from the data's structure. Regions where many trajectories converge, where the gradient field has stable equilibria, naturally become attractors.
### Gradient Fields
Each attractor generates a potential:
```
Φ_k(x) = -w_k / ||x - x_k||
```
Total field curves space:
```
F^μ = -∇Φ_total
```
```
Attractor Potential Field
=========================
[Attractor A] [Attractor B] [Attractor C]
● ● ●
╱│╲ ╱│╲ ╱│╲
╱ │ ╲ ╱ │ ╲ ╱ │ ╲
╱ │ ╲ ╱ │ ╲ ╱ │ ╲
╱ ↓ ╲ ╱ ↓ ╲ ╱ ↓ ╲
/ │ \ / │ \ / │ \
───────────── ───────────── ─────────────
(Deep basin) (Saddle pt) (Shallow sink)
Gradient: Steep toward A, unstable at B, gradual to C
```
## Geodesic Dynamics and Optimization
### The Geodesic Equation
Trajectories follow:
```
d²x^μ/dτ² + Γ^μ_αβ (dx^α/dτ)(dx^β/dτ) = F^μ
```
- Γ from metric derivatives (curvature encoding information structure)
- F as interventions (deflecting natural paths)
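To see the pieces fit together numerically, here is a toy sketch that builds F = -∇Φ_total from the attractor potentials of the previous section and steps a trajectory forward in τ with the Christoffel terms dropped (a flat-manifold simplification; the attractor positions, weights, initial state, and step size are all made up):
```python
import math

# Toy integration of d^2x/dtau^2 = F, with the Gamma terms dropped (flat-manifold
# simplification). F = -grad(Phi_total), Phi_k(x) = -w_k / ||x - x_k|| as above.

attractors = [((4.0, 0.0), 1.0), ((0.0, 4.0), 0.4)]   # (position, weight w_k), illustrative

def force(x):
    fx, fy = 0.0, 0.0
    for (ax, ay), w in attractors:
        dx, dy = ax - x[0], ay - x[1]
        r = math.hypot(dx, dy) + 1e-9                  # avoid division by zero at an attractor
        fx += w * dx / r**3                            # -grad of (-w/r) points toward the attractor
        fy += w * dy / r**3
    return fx, fy

x, v = [0.0, 0.0], [0.1, 0.0]                          # initial state and "velocity" dx/dtau
dtau = 0.05
for _ in range(40):                                    # simple Euler steps in proper time
    f = force(x)
    v = [v[0] + f[0] * dtau, v[1] + f[1] * dtau]
    x = [x[0] + v[0] * dtau, x[1] + v[1] * dtau]

print(x)   # the worldline bends toward the heavier attractor at (4, 0)
```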
### Optimizing for Stable Geodesics
Here's the ultimate goal: find the optimal, stable geodesic that leads to your desired attractor.
A stable geodesic is one that:
1. Minimizes the information-theoretic "distance" traveled
2. Remains robust to perturbations (small changes don't drastically alter the trajectory)
3. Naturally flows toward the desired attractor
#### Natural vs Forced Paths
When F=0, you have a pure geodesic—the path data naturally follows given the curvature of information space. This is what happens without intervention.
When F≠0, you apply force—interventions that bend the trajectory toward a desired attractor. But here's the key: you want to find interventions that create stable geodesics, not ones that require constant correction.
```
Geodesic Optimization
====================
Natural Path (F=0):
●───────●───────●───────● ──→ [Attractor B/C]
(Suboptimal)
Heavy Intervention (large F):
●───────●
↑↑↑ F (Constant force required)
│││
●───●───● ──→ [Attractor A]
(Unstable: requires continuous input)
Optimal Intervention (minimal F):
●───────●─────────┐
↑ F (Single well-placed nudge)
│
●───────●───────● ──→ [Attractor A]
(Stable: geodesic naturally continues)
```
The optimal strategy is to find where a small force creates maximum curvature change—where you can nudge the trajectory onto a geodesic that naturally flows to your desired attractor.
This is geometric optimization: instead of fighting against the natural flow of information, you understand the manifold's structure and make minimal interventions at maximum leverage points.
## Relativistic Effects: Context Evolution as Reference Frames
When the underlying system changes over time (e.g., product evolution, market shifts, policy changes), signals from past "epochs" must be transformed to the current reference frame.
Transformation:
```
V' (F₂, t₂) = Λ(F₁→F₂) · V (F₁, t₁)
Λ_ij ∝ Δsystem_state / Δt (evolution rate as "velocity")
```
Invariant: I(V; outcome) preserved across frames.
```
Relativistic Frame Shift
========================
Frame F1 (System State: Early)
[Signal X] → Magnitude=0.8, Direction=θ=45°
↓ Λ (High velocity: Rapid evolution)
Frame F2 (System State: Mature)
[Signal X]' → Magnitude'=0.5 (dilated), θ'=30° (contracted)
Invariant: I = 0.28 bits (info content unchanged)
```
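A toy sketch of the frame shift: the Λ below (a 0.625 scaling combined with a -15° rotation, my own choice of numbers) maps the signal of the diagram, magnitude 0.8 at 45°, to magnitude 0.5 at 30° in the current frame. Because Λ is invertible, I(V; outcome) is unchanged, as in the invariance sketch earlier; only the representation changes:
```python
import math

# Map a signal from an earlier frame F1 into the current frame F2 via V' = Λ · V.
# The scale and rotation angle are chosen to reproduce the shift sketched above.
s, theta = 0.625, math.radians(-15)
LAMBDA = [[s * math.cos(theta), -s * math.sin(theta)],
          [s * math.sin(theta),  s * math.cos(theta)]]

def to_current_frame(v):
    return (LAMBDA[0][0] * v[0] + LAMBDA[0][1] * v[1],
            LAMBDA[1][0] * v[0] + LAMBDA[1][1] * v[1])

v_f1 = (0.8 * math.cos(math.radians(45)), 0.8 * math.sin(math.radians(45)))  # 0.8 at 45°
vx, vy = to_current_frame(v_f1)
print(math.hypot(vx, vy), math.degrees(math.atan2(vy, vx)))   # ~0.5 and ~30.0
# Λ is invertible, so I(V'; outcome) = I(V; outcome): the information content is preserved.
```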
The geometry ensures causal, frame-consistent propagation.
---