The Geometry of Information

All of statistical inference boils down to looking at data through the lens of a model and trying to learn something about the world from it. But how much can the data actually reveal about the latent structural parameters that govern the data-generating process? This is crucial, since estimation of cause and effect often relies on uncovering the value of these parameters.

Fisher information provides a precise answer to this question. It quantifies the amount of information that data carry about a quantity of interest, like a parameter in a model. In doing so, it reveals the very geometry underlying the process of inference: how clearly the data can distinguish between nearby hypotheses.

If an observation can be explained equally well by Hypothesis 1 or Hypothesis 2, the observation is uninformative about which hypothesis is correct. In parametric models, hypotheses can be formulated simply as different numerical values of the parameters, and hence the question becomes whether the observation carries any meaningful information about the parameters.

Inference in a Simple World

Consider a simple model where $y$ denotes a noisy measurement of a scalar state $\theta \in \mathbb{R}$, given by:

\[y=\theta + \epsilon, \quad \epsilon \sim N(0,\sigma^2)\]

Under Gaussian noise, the log-likelihood of observing $y$ given $\theta$ is given by: \(\log p(y|\theta) = -\frac{(y-\theta)^2}{2\sigma^2} + \text{const.}\)

The Fisher information matrix of the observation, conditional on the model, is given by:

\[I(\theta) = -\mathbb{E}_{y} \left[ \frac{\partial^2 \log p(y|\theta)}{\partial \theta^2} \right]\]

The Fisher information matrix represents how much information an observation is expected to carry about the parameter or parameters of interest ($\theta$ in this case). Under standard regularity conditions, it is also equal to the expected outer product of the score function (the gradient of the log-likelihood) with itself. When $\theta$ is a scalar, as here, the matrix reduces to a single number.

In this simple model, the Fisher information matrix is simply equal to: \(I(\theta) = \frac{1}{\sigma^2}\)

Intuitively, this model shows that as the noise variance increases, the information in the data decreases — the observations become less able to distinguish between nearby parameter values. The observation would not reduce uncertainty by much.
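The result $I(\theta) = 1/\sigma^2$ can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names, the seed, and the values of $\theta$ and $\sigma$ are illustrative choices rather than part of the model. It estimates the expected squared score by Monte Carlo and compares it with the closed form.

```python
import numpy as np

def score(y, theta, sigma):
    """Derivative of log p(y | theta) with respect to theta for y = theta + eps."""
    return (y - theta) / sigma**2

def estimate_fisher_information(theta, sigma, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E[score^2] under the Gaussian model."""
    rng = np.random.default_rng(seed)
    y = theta + sigma * rng.standard_normal(n_draws)  # simulated observations
    return float(np.mean(score(y, theta, sigma) ** 2))

theta_true, sigma = 1.5, 2.0          # illustrative values
info_mc = estimate_fisher_information(theta_true, sigma)
info_exact = 1.0 / sigma**2           # closed form from the text
print(f"Monte Carlo: {info_mc:.4f}, exact: {info_exact:.4f}")
```

Increasing `sigma` in this sketch shrinks both numbers, which is the sense in which noisier observations carry less information.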

Suppose instead that the observation model is: \(y=a\theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\)

where $a>0$ is a constant scalar mapping from the true state to the signal.

The information matrix would then be: \(I(\theta) = \frac{a^2}{\sigma^2}\)

The larger the magnitude of $a$, the larger the signal-to-noise ratio in the data, and the more effective the data are at reducing our uncertainty about $\theta$. Note also that in the linear Gaussian case the information matrix does not depend on the state $\theta$.
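To illustrate the state-independence in the linear Gaussian case, here is a small sketch in plain Python. It computes the negative second derivative of the log-likelihood by a central finite difference at a few arbitrary $(\theta, y)$ pairs. The helper names, the step size `h`, and the test points are assumptions made for illustration.

```python
def log_likelihood(theta, y, a, sigma):
    """Gaussian log-likelihood for y = a * theta + eps, up to an additive constant."""
    return -((y - a * theta) ** 2) / (2.0 * sigma**2)

def observed_information(theta, y, a, sigma, h=1e-3):
    """Negative second derivative in theta via a central finite difference."""
    second = (
        log_likelihood(theta + h, y, a, sigma)
        - 2.0 * log_likelihood(theta, y, a, sigma)
        + log_likelihood(theta - h, y, a, sigma)
    ) / h**2
    return -second

a, sigma = 3.0, 2.0  # illustrative values
for theta, y in [(0.0, 0.5), (1.0, -2.0), (10.0, 31.0)]:
    info = observed_information(theta, y, a, sigma)
    print(f"theta={theta:5.1f}  curvature={info:.6f}  a^2/sigma^2={a**2 / sigma**2:.6f}")
```

Because the log-likelihood is quadratic in $\theta$, the curvature is the same at every point and for every realization of $y$, so no expectation is needed here.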

Example 1: Consider a noisy nonlinear signal $y$, about a fundamental $x$: \(y = \sin(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\) The sensitivity of $y$ to changes in $x$ is captured by the derivative $\frac{dy}{dx} = \cos(x)$. The Fisher information with respect to $x$ is therefore \(\mathcal{I}(x) = \frac{1}{\sigma^2}\left(\frac{d\sin(x)}{dx}\right)^2 = \frac{\cos^2(x)}{\sigma^2}.\)

This means that information is not uniform across the state space:

  • Around $x=0$, where $\cos(x)=1$, the signal is very sensitive to small changes in $x$; information is high.
  • Around $x=\pi/2$, where $\cos(x)=0$, the signal flattens out; changes in $x$ barely affect $y$, so information is low.

The geometry of information here reflects how the nonlinear transformation $\sin(x)$ stretches or compresses the space of distinguishable signals. Some regions of the state space carry sharp, detailed information; others are nearly flat and uninformative.
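As a quick illustration of how the information varies over the state space, the sketch below evaluates $\cos^2(x)/\sigma^2$ at a few points. It assumes NumPy; the function name and the value of $\sigma$ are illustrative.

```python
import numpy as np

def fisher_information_sin(x, sigma):
    """Fisher information about x for the signal y = sin(x) + eps."""
    return np.cos(x) ** 2 / sigma**2

sigma = 0.5  # illustrative noise level
for x in [0.0, np.pi / 4, np.pi / 2]:
    print(f"x = {x:.3f}  information = {fisher_information_sin(x, sigma):.4f}")
```

With this choice of `sigma`, the information falls from its maximum $1/\sigma^2$ at $x=0$ to numerically zero at $x=\pi/2$.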

Example 2: Consider a signal that is a nonlinear function of two state variables $x>0$ and $y>0$:

\(s = f(x, y) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),\) where \(f(x, y) = \frac{x^2}{x + y}.\) Here, the numerator grows quadratically in $x$ while the denominator introduces a coupling between $x$ and $y$. This simple form already produces rich geometry: sensitivity depends sharply on the relative balance between the two state variables.

The gradient (Jacobian) of $f$ with respect to $(x, y)$ is \(\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \; \frac{\partial f}{\partial y} \right) = \left( \frac{x(x + 2y)}{(x + y)^2}, \; -\frac{x^2}{(x + y)^2} \right).\)

The first component shows how the signal responds to small changes in $x$, while the second shows the (always negative) sensitivity to $y$.
Because $f$ couples the two variables through the ratio $x/(x+y)$, these partial derivatives are highly state-dependent. Their signs, however, are fixed on the positive quadrant: the first is always positive and the second always negative.

For additive Gaussian noise, the Fisher Information Matrix is \(\mathcal{I}(x, y) = \frac{1}{\sigma^2} \nabla f^\top \nabla f,\) which yields \(\mathcal{I}(x, y) = \frac{1}{\sigma^2 (x + y)^4} \begin{bmatrix} x^2 (x + 2y)^2 & -x^3(x + 2y)\\ -x^3(x + 2y) & x^4 \end{bmatrix}.\) This $2\times2$ matrix is symmetric and has rank one, because it is the outer product of a vector with itself.
Hence, there is only one independent direction in the $(x, y)$ state space that affects the signal—information is concentrated along this single direction, and orthogonal directions are locally unobservable.
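The rank-one structure can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names, the evaluation point $(x, y) = (1, 2)$, and $\sigma = 1$ are illustrative. It builds the matrix as the outer product of the gradient with itself, then confirms that a perturbation orthogonal to the gradient carries no information.

```python
import numpy as np

def grad_f(x, y):
    """Gradient of f(x, y) = x^2 / (x + y) with respect to (x, y)."""
    denom = (x + y) ** 2
    return np.array([x * (x + 2.0 * y) / denom, -(x**2) / denom])

def fisher_matrix(x, y, sigma):
    """Fisher information for s = f(x, y) + eps with Gaussian noise."""
    g = grad_f(x, y)
    return np.outer(g, g) / sigma**2

x0, y0, sigma = 1.0, 2.0, 1.0                   # illustrative point and noise level
info = fisher_matrix(x0, y0, sigma)
g = grad_f(x0, y0)
null_direction = np.array([-g[1], g[0]])        # orthogonal to the gradient

print(info)
print("rank:", np.linalg.matrix_rank(info, tol=1e-10))
print("info @ null_direction:", info @ null_direction)
```

The explicit `tol` is passed to `matrix_rank` so that floating-point rounding in the tiny second singular value does not affect the reported rank.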

The geometry of information for this signal is highly anisotropic:

  • The information depends only on the ratio $x/y$, not on the overall scale $(x + y)$, because both partial derivatives are unchanged when $x$ and $y$ are multiplied by a common factor. The signal is most informative when $x \gg y$, where the gradient approaches $(1, -1)$; it becomes information-poor when $y \gg x$, since the sensitivity of $f$ to either state vanishes.
  • The cross-terms capture statistical coupling between $x$ and $y$: changes in one state cannot be fully disentangled from changes in the other, so their uncertainties are coupled.

Geometrically, the information manifold looks like a narrow ridge in the $(x, y)$ plane. Along the ridge, small state perturbations lead to sharp, distinguishable changes in the signal. In directions orthogonal to it, the signal remains almost constant, and the system is locally unobservable.

Rank of the Fisher Matrix and Dimensionality of Information

The rank of the Fisher Information Matrix tells us the dimension of the locally identifiable subspace. In the previous example: \(\text{rank}(\mathcal{I}) = 1.\) Only one combination of $(x, y)$ is statistically distinguishable from the data.
In filtering terms, the observation provides information about a single mode of the state, while uncertainty in the orthogonal (uninformative) direction must evolve under the system dynamics alone. If we had two signals whose gradients were linearly independent, the Fisher information would have full rank. If the gradients are linearly dependent, the signals are redundant. Redundant signals can still reduce uncertainty by reinforcing each other along the direction they share, but they carry no information about any new direction.

Intuitively, think of trying to find the location of a post office in a city. You ask two people for directions:

  • Redundant signals: Two people point along the same direction: “the post office is somewhere along this road.” Hearing it twice makes you more certain you’re on the right road, but the second piece of information doesn’t tell you where along the road to look.
  • Independent signals: One person says “it’s north of the Capitol,” the other says “it’s east of the river.” Now you can triangulate, because you’ve gained information along two different axes.
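The distinction between redundant and independent signals can be made concrete with a small sketch. It assumes NumPy and, as a simplification not stated in the text, that each signal has independent Gaussian noise with the same variance, so that the information from several signals is the sum of their individual outer products. The gradient vectors `g_road` and `g_cross` are hypothetical.

```python
import numpy as np

def stacked_information(jacobian, sigma):
    """Fisher information for several signals with independent, equal-variance noise.

    Each row of `jacobian` is the gradient of one signal with respect to the state.
    """
    J = np.atleast_2d(jacobian)
    return J.T @ J / sigma**2

g_road = np.array([1.0, 0.0])    # hypothetical signal sensitive to the first state only
g_cross = np.array([0.0, 1.0])   # hypothetical signal sensitive to the second state only

redundant = stacked_information(np.vstack([g_road, g_road]), sigma=1.0)
independent = stacked_information(np.vstack([g_road, g_cross]), sigma=1.0)

print("redundant:\n", redundant, "rank", np.linalg.matrix_rank(redundant))
print("independent:\n", independent, "rank", np.linalg.matrix_rank(independent))
```

In this sketch the redundant pair doubles the information along the shared direction but leaves the rank at one, while the independent pair gives a full-rank matrix.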