Three Underrated Theorems in Probability
Probability theory has a few results that receive most of the attention: the central limit theorem, Bayes’ rule, the law of large numbers. These are foundational and rightly so. But there are other theorems that quietly shape how statisticians and machine learning researchers think about convergence, uncertainty, and estimation, yet rarely appear in standard courses. Here are three that I have found particularly useful.
The Law of the Iterated Logarithm
The central limit theorem tells us that the sample mean converges to the true mean at rate $O(1/\sqrt{n})$. What it does not tell us is how large the fluctuations along a single realization get as $n$ grows. The law of the iterated logarithm closes this gap with a remarkably precise statement.
For i.i.d. random variables $X_1, X_2, \ldots$ with mean $\mu$ and finite variance $\sigma^2$:
\[\limsup_{n \to \infty} \frac{\sum_{i=1}^n (X_i - \mu)}{\sigma \sqrt{2n \log \log n}} = 1 \quad \text{almost surely}\]The denominator $\sigma \sqrt{2n \log \log n}$ grows slightly faster than $\sqrt{n}$ but much slower than $n$. It pins down the exact scale of the largest deviations of the partial sums from $n\mu$: deviations of essentially this size recur infinitely often, while substantially larger ones occur only finitely often. For example, with $n = 10^6$ and $\sigma = 1$ (using natural logarithms), the LIL predicts maximum fluctuations of order:
\[\sigma \sqrt{2n \log \log n} = \sqrt{2 \cdot 10^6 \cdot \log \log 10^6} \approx \sqrt{2 \cdot 10^6 \cdot 2.6} \approx 2.3 \cdot 10^3\]so the deviation of the sample mean, $\frac{1}{n}\sum_{i=1}^n (X_i - \mu)$, reaches roughly $2.3 \cdot 10^{-3}$ at this scale, and deviations of that relative size recur infinitely often as $n$ grows. In practice, this means that if you are monitoring an online learning algorithm or a streaming estimator, you should expect occasional deviations that look large relative to the standard error but are entirely normal. Misinterpreting these as model failure leads to unnecessary parameter tuning or premature abandonment of sound methods.
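To see the envelope in action, here is a minimal simulation sketch, assuming NumPy (the sequence length and seed are arbitrary choices of mine): it tracks the deviation of the partial sums of standard normal draws and compares it to $\sigma \sqrt{2n \log \log n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_max = 1_000_000
sigma = 1.0

# Partial sums of centered i.i.d. draws (mean 0, variance 1).
x = rng.standard_normal(n_max)
partial_sums = np.cumsum(x)

# LIL envelope sigma * sqrt(2 n log log n), defined for n >= 3.
n = np.arange(3, n_max + 1)
envelope = sigma * np.sqrt(2.0 * n * np.log(np.log(n)))

# Ratio of the running deviation to the envelope; the LIL says its
# limsup is 1 almost surely.
ratio = np.abs(partial_sums[2:]) / envelope
print(f"largest ratio over this run: {ratio.max():.3f}")
print(f"sample-mean deviation at n = {n_max}: {abs(partial_sums[-1]) / n_max:.2e}")
```

The LIL is an asymptotic statement, so any finite run only gives a snapshot of a ratio whose limsup is 1.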
The Dvoretzky–Kiefer–Wolfowitz Inequality
The Glivenko-Cantelli theorem states that the empirical CDF converges uniformly to the true CDF, but it does not specify the rate. The Dvoretzky–Kiefer–Wolfowitz inequality closes this gap with a distribution-free bound:
\[P\left(\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \epsilon\right) \leq 2e^{-2n\epsilon^2}\]Here $F_n$ is the empirical distribution function from $n$ samples and $F$ is the true CDF. The bound holds for any distribution, which makes it unusually powerful. If you want a confidence band around your empirical CDF, the DKW inequality gives you one without assuming anything about the underlying distribution beyond independence.
To make this concrete, suppose we want a $95\%$ confidence band, so the right-hand side should be at most $0.05$. Solving $2e^{-2n\epsilon^2} = 0.05$ gives:
\[\epsilon = \sqrt{\frac{\log(40)}{2n}} \approx \frac{1.36}{\sqrt{n}}\]For $n = 1000$, the band half-width is approximately $0.043$.
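Here is a minimal sketch of the band construction, assuming NumPy and SciPy (the exponential distribution below is only an illustrative stand-in for an unknown data-generating process): it computes the DKW half-width at level $0.05$ and checks that the true CDF stays inside the band at the sample points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha = 1000, 0.05

# Samples from an arbitrary distribution; the band itself is distribution-free.
samples = np.sort(rng.exponential(scale=2.0, size=n))

# DKW half-width: solve 2 * exp(-2 n eps^2) = alpha for eps.
eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))

# Empirical CDF evaluated at the sorted sample points.
ecdf = np.arange(1, n + 1) / n
lower = np.clip(ecdf - eps, 0.0, 1.0)
upper = np.clip(ecdf + eps, 0.0, 1.0)

# Coverage check against the true CDF, which we know here only because
# this is a simulation; in practice the band requires no such knowledge.
true_cdf = stats.expon(scale=2.0).cdf(samples)
print(f"half-width eps = {eps:.4f}")
print("true CDF inside the band at every sample point:",
      bool(np.all((true_cdf >= lower) & (true_cdf <= upper))))
```

Repeating this check over many simulated datasets should show the band failing at most about $5\%$ of the time, no matter which distribution generates the data.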
This result underlies generalization bounds for classification, confidence intervals for quantile estimates, and the theoretical justification for the Kolmogorov-Smirnov test. The distribution-free nature is the key: it means the bound applies even when you know almost nothing about the data-generating process.
Stein’s Paradox
In 1956, Charles Stein proved that for estimating the mean of a multivariate Gaussian, the sample mean is admissible in one and two dimensions, but inadmissible in three or more. There exists a shrinkage estimator that dominates it everywhere. The James-Stein estimator, which achieves this, shrinks the sample mean toward a common point:
\[\hat{\theta}_{JS} = \left(1 - \frac{(d-2)\sigma^2}{\|\mathbf{X}\|^2}\right) \mathbf{X}\]where $d \geq 3$ is the dimension and $\mathbf{X}$ is the observed vector. The shrinkage factor pulls the estimate toward zero when the data is noisy relative to its magnitude.
The risk of the James-Stein estimator is uniformly smaller than that of the sample mean. For $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\theta}, \sigma^2 I_d)$, its risk under squared-error loss is:
\[R(\hat{\theta}_{JS}, \boldsymbol{\theta}) = d\sigma^2 - (d-2)^2 \sigma^4 \, \mathbb{E}\left[\frac{1}{\|\mathbf{X}\|^2}\right]\]which is strictly less than $d\sigma^2$, the risk of the sample mean $\mathbf{X}$, for all $d \geq 3$. The improvement is largest when $\|\boldsymbol{\theta}\|$ is small relative to $\sigma$.
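A minimal Monte Carlo sketch of this comparison, assuming NumPy (the dimension, true mean, and trial count are arbitrary illustrative choices), with a single observation per trial playing the role of the sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, n_trials = 10, 1.0, 100_000

# True mean vector; the gain is largest when its norm is small relative to sigma.
theta = np.full(d, 0.5)

# One observation X ~ N(theta, sigma^2 I_d) per trial.
X = theta + sigma * rng.standard_normal((n_trials, d))

# James-Stein: shrink each observation toward the origin.
shrink = 1.0 - (d - 2) * sigma**2 / np.sum(X**2, axis=1, keepdims=True)
js = shrink * X

# Monte Carlo estimates of the risk E||estimate - theta||^2.
risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(f"risk of the sample mean: {risk_mle:.3f}  (theory: {d * sigma**2:.1f})")
print(f"risk of James-Stein:     {risk_js:.3f}")
```

Moving $\boldsymbol{\theta}$ farther from the origin narrows the gap between the two risks, but the James-Stein risk never exceeds $d\sigma^2$.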
This result breaks the intuition that unbiased estimators are automatically good. In high dimensions, introducing a small bias to reduce variance can improve overall risk. This insight directly motivates ridge regression, empirical Bayes methods, and modern shrinkage techniques. The lesson generalizes: when the dimension grows, structured assumptions and deliberate bias often outperform naive unbiasedness.
Each of these theorems reveals a different facet of how probability behaves at scale: the law of the iterated logarithm gives fluctuation bounds, the DKW inequality gives distribution-free concentration, and Stein’s paradox shows that bias can be a feature rather than a bug. They are not always the first tools to reach for, but knowing they exist changes how you think about convergence, uncertainty, and estimation in high dimensions.