The Geometry of Information
The Pitch
All of statistical inference boils down to looking at data through the lens of a model and trying to learn something about the world from it. But how much can the data actually tell us? How do we know whether our observations contain meaningful information about the particular parameter we care about — the one tied to the effect we’re trying to estimate?
Fisher information provides a precise answer to these questions. It quantifies the amount of information that data carry about a quantity of interest, like a parameter in a model. In doing so, it reveals the very geometry underlying the process of inference: how clearly the data can distinguish between nearby hypotheses.
If an observation can just as well be explained by Hypothesis 1 or Hypothesis 2, it is uninformative about which hypothesis is correct. In parametric models, hypotheses can be formulated simply as different numerical values of the parameters, and the question becomes whether the observation carries any meaningful information about those parameters.
Basic Idea
Consider a simple model where $y$ denotes a noisy measurement of a state $\theta$:
\[y=\theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\]
The log-likelihood of observing $y$, if the underlying state happened to be $\theta$, is given by:
\[\log p(y|\theta) = -\frac{(y-\theta)^2}{2\sigma^2} + \text{const.}\]
The Fisher information matrix of the observation, conditional on the model, is given by:
\[I(\theta) = \mathbb{E} [ \nabla_\theta \log p(y|\theta)\nabla_\theta \log p(y|\theta)^T]\]
The Fisher information matrix captures how much information an observation carries about the parameters of interest (the vector $\theta$). Mathematically, it is the expected outer product of the score function (the gradient of the log-likelihood).
In this simple model, the Fisher information matrix is simply equal to: \(I(\theta) = \frac{1}{\sigma^2}\)
Intuitively, this shows that as the noise variance increases, the information in the data decreases — the observations become less able to distinguish between nearby parameter values. The observation would not reduce uncertainty by much.
If instead the observation model were: \(y=a\theta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\)
where $a>0$ is a known constant.
The information matrix would be: \(I(\theta) = \frac{a^2}{\sigma^2}\)
The larger the magnitude of $a$, the larger the signal-to-noise ratio in the data, and the more effective the data are at reducing our uncertainty about $\theta$. In this linear case, the information matrix is independent of the state: it is constant.
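As a quick sanity check, here is a minimal Monte Carlo sketch in Python (the values of $a$, $\sigma$, and $\theta$ below are arbitrary illustrative choices, not from the text): the sample average of the squared score should approach the closed-form information $a^2/\sigma^2$.

```python
import numpy as np

# Monte Carlo check of the Fisher information of the linear-Gaussian model
# y = a*theta + eps. The values of a, sigma, theta are illustrative choices.
rng = np.random.default_rng(0)
a, sigma, theta = 2.0, 0.5, 1.3
n_samples = 200_000

# Simulate observations under the true parameter value.
y = a * theta + sigma * rng.standard_normal(n_samples)

# Score: d/dtheta log p(y | theta) = a * (y - a*theta) / sigma^2
score = a * (y - a * theta) / sigma**2

print("Monte Carlo estimate of E[score^2]:", np.mean(score**2))  # ~ 16
print("Closed form a^2 / sigma^2:         ", a**2 / sigma**2)    # = 16
```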
Example 1: Consider a noisy nonlinear signal \(y = \sin(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\) The sensitivity of $y$ to changes in $x$ is determined by the derivative of the signal, $\frac{dy}{dx} = \cos(x)$. The Fisher information with respect to $x$ is therefore \(\mathcal{I}(x) = \frac{1}{\sigma^2}\left(\frac{d\sin(x)}{dx}\right)^2 = \frac{\cos^2(x)}{\sigma^2}.\)
This means that information is not uniform across the state space:
- Around $x=0$, where $\cos(x)=1$, the signal is very sensitive to small changes in $x$; information is high.
- Around $x=\pi/2$, where $\cos(x)=0$, the signal flattens out; changes in $x$ barely affect $y$, so information is low.
The geometry of information here reflects how the nonlinear transformation $\sin(x)$ stretches or compresses the space of distinguishable signals. Some regions of the state space carry sharp, detailed information; others are nearly flat and uninformative.
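A small numerical sketch (with an assumed noise level and sample size) makes this visible: the Monte Carlo estimate of the expected squared score tracks $\cos^2(x)/\sigma^2$, dropping to essentially zero near $x = \pi/2$.

```python
import numpy as np

# Sketch: estimate the Fisher information of y = sin(x) + eps at a few states
# and compare with the analytic cos^2(x)/sigma^2. sigma and the sample size
# are assumed illustrative values.
rng = np.random.default_rng(1)
sigma, n_samples = 0.3, 200_000

def fisher_mc(x):
    """Monte Carlo estimate of E[score^2] for y = sin(x) + eps at state x."""
    y = np.sin(x) + sigma * rng.standard_normal(n_samples)
    score = (y - np.sin(x)) * np.cos(x) / sigma**2  # d/dx log p(y | x)
    return np.mean(score**2)

for x in (0.0, np.pi / 4, np.pi / 2):
    print(f"x = {x:5.3f}: MC ~ {fisher_mc(x):7.3f}, analytic = {np.cos(x)**2 / sigma**2:7.3f}")
```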
Example 2: Consider a signal that is a nonlinear function of two state variables $x>0$ and $y>0$:
\(s = f(x, y) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),\) where \(f(x, y) = \frac{x^2}{x + y}.\) Here, the numerator grows quadratically in $x$ while the denominator introduces a coupling between $x$ and $y$. This simple form already produces rich geometry: sensitivity depends sharply on the relative balance between the two state variables.
The gradient (Jacobian) of $f$ with respect to $(x, y)$ is \(\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \; \frac{\partial f}{\partial y} \right) = \left( \frac{x(x + 2y)}{(x + y)^2}, \; -\frac{x^2}{(x + y)^2} \right).\)
The first component shows how the signal responds to small changes in $x$,
while the second shows the (always negative) sensitivity to $y$.
Because $f$ depends on both variables in the ratio $x/(x+y)$, these partial derivatives are highly state-dependent and can even change sign.
For additive Gaussian noise, the Fisher Information Matrix is
\(\mathcal{I}(x, y) = \frac{1}{\sigma^2} \nabla f^\top \nabla f,\)
which yields
\(\mathcal{I}(x, y) = \frac{1}{\sigma^2 (x + y)^4} \begin{bmatrix} x^2 (x + 2y)^2 & -x^3(x + 2y)\\ -x^3(x + 2y) & x^4 \end{bmatrix}.\)
This $2\times2$ matrix has **rank one**, because it is the outer product of a single gradient vector with itself.
Hence, there is only one independent direction in the $(x, y)$ state space that affects the signal—information is concentrated along this single direction, and orthogonal directions are locally unobservable.
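A minimal sketch, assuming an arbitrary evaluation point and noise level, that builds $\mathcal{I}(x, y)$ from the analytic gradient above and confirms numerically that it has rank one:

```python
import numpy as np

# Sketch: build the Fisher information matrix of s = f(x, y) + eps for
# f(x, y) = x^2 / (x + y) from the analytic gradient, then confirm it is
# rank one. The evaluation point and sigma are arbitrary illustrative choices.
sigma = 0.1
x, y = 1.0, 2.0

grad_f = np.array([
    x * (x + 2 * y) / (x + y) ** 2,  # df/dx
    -x**2 / (x + y) ** 2,            # df/dy
])

fim = np.outer(grad_f, grad_f) / sigma**2

print("Fisher information matrix:\n", fim)
print("rank:", np.linalg.matrix_rank(fim))      # -> 1
print("eigenvalues:", np.linalg.eigvalsh(fim))  # one ~0, one positive
# The informative direction is grad_f / ||grad_f||; perturbations orthogonal
# to it leave the signal locally unchanged.
```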
The geometry of information for this signal is highly anisotropic:
- The signal is most informative when the denominator $(x + y)$ is small, where small changes in state cause large changes in $f(x, y)$. Conversely, the signal becomes information-poor when $y \gg x$, since the sensitivity of $f$ to either state diminishes.
- The cross-terms capture statistical coupling between $x$ and $y$: changes in one state cannot be fully disentangled from changes in the other, so their uncertainties are coupled.
Geometrically, the information manifold looks like a narrow ridge in the $(x, y)$ plane. Along the ridge, small state perturbations lead to sharp, distinguishable changes in the signal. In directions orthogonal to it, the signal remains almost constant, and the system is locally unobservable.
Rank of the Fisher Matrix and Dimensionality of Information
The rank of the Fisher Information Matrix tells us the dimension of the locally identifiable subspace. In the previous example:
\(\text{rank}(\mathcal{I}) = 1.\)
Only one combination of $(x, y)$ is statistically distinguishable from the data.
In filtering terms, the observation provides information about a single mode of the state, while uncertainty in the orthogonal (uninformative) direction must evolve under the system dynamics alone. If we had two linearly independent signals, the Fisher information would have full rank. If the signals are not linearly independent, there is redundancy: redundant signals can still reduce uncertainty by reinforcing each other, but they do not carry qualitatively new information.
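A short sketch with made-up gradient vectors illustrates the distinction: when two signals are observed in independent noise, their Fisher contributions add; a redundant (parallel) second signal leaves the rank at one, while an independent one lifts it to two.

```python
import numpy as np

# Sketch: with several signals observed in independent Gaussian noise, the
# total Fisher information is the sum of one rank-one term per signal.
# The gradient vectors below are made up for illustration.
sigma = 0.5

def total_fim(*gradients):
    """Sum of rank-one Fisher contributions, one per signal."""
    return sum(np.outer(g, g) for g in gradients) / sigma**2

g1 = np.array([1.0, 0.5])

# Redundant: the second signal's gradient is parallel to the first.
redundant = total_fim(g1, 3.0 * g1)
# Independent: the second signal is sensitive to a genuinely new direction.
independent = total_fim(g1, np.array([-0.5, 1.0]))

print("redundant rank:  ", np.linalg.matrix_rank(redundant))    # 1 (larger eigenvalue, same direction)
print("independent rank:", np.linalg.matrix_rank(independent))  # 2 (full rank)
```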
Think of trying to find a location on a map:
- Redundant Signals: Two people give you the same line — “the target is somewhere along this road.” Hearing it twice makes you more sure you’re on the right road, but it doesn’t tell you where along it the target is.
- Independent signals: One person says “it’s north of here,” the other says “it’s east of the river.” Now you can triangulate — you’ve gained genuinely new information.