Inverse Transform Sampling
Often in the course of writing some piece of code for data analysis, or in making a simulation of a system, like a virus spreading through a population, gene expression in a cell, or the dynamics of the stock market, we'll want to sample random draws from a probability distribution. The problem is that most languages come equipped only with simple random number generators, capable of drawing \(\hat{u}\) uniformly from the real unit interval \(\left[0,1\right]\), or from a given range of integers. The uniform distribution is fine when we need to simulate something simple like rolling dice, but it is inadequate when more sophisticated probabilities are crucial; we commonly require non-uniform distributions like the Gaussian and Poisson, and even exotic choices like the Gumbel distribution. Can we find a way to sample from arbitrary probability distributions using simple random number generators?
Before we begin, let's look at an example of the impact of using the wrong probability distribution in a simulation.
Consider the two social networks simulated below. The network on the right was generated using the so-called Bernoulli distribution to determine the number of friends each individual has, while the one on the left was generated using the somewhat more realistic Barabási-Albert distribution.

For a fair comparison, the networks are constrained to have the same number of individuals (\(5\times 10^3\)) and the same total number of friendships (\(10^4\)).

There are some clear differences between the networks. The Bernoulli network has a large number (\(\sim 20\)) of highly clustered subnetworks containing roughly equal numbers of nodes, while the Barabási-Albert network has only a handful of highly connected subnetworks, followed by a hierarchy of less populated, but still highly connected, subnetworks.

We can also clearly see the imprint of the generating distributions: the Barabási-Albert network exhibits greater variety in the number of friends per node than the Bernoulli network, which appears fairly homogeneous. This is because the Bernoulli distribution falls off very quickly in the tails.
There are many other interesting differences in the structure of these networks that are beyond the scope of this example, but we can see already that employing the correct probability distribution in simulation is quite important.
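As a sketch of how such networks can be generated, here is one way to do it with the networkx library. The node and edge counts match the constraint above; the choice of \(m = 2\) attachments per node in the Barabási-Albert model is an assumption made to yield roughly \(10^4\) friendships.

import networkx as nx

n_nodes, n_edges = 5000, 10000

# "Bernoulli"-style network: friendships placed uniformly at random,
# with the total edge count fixed
bernoulli_net = nx.gnm_random_graph(n_nodes, n_edges)

# Barabási-Albert network: preferential attachment, each new node
# attaching to m = 2 existing nodes (~2 * n_nodes edges in total)
ba_net = nx.barabasi_albert_graph(n_nodes, m=2)

print(bernoulli_net.number_of_edges(), ba_net.number_of_edges())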
Sampling Transform
We wish to sample from a non-uniform distribution using a random number generator that draws from the uniform distribution. So we need some way to transform samples from the uniform distribution into samples from an arbitrary distribution \(f\).

Imagine having a representative sample \(\hat{f} = \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_N\}\) of the distribution \(f\). If we rank-order this list and draw randomly on the index \(i\), each value \(\hat{f}_i\) has an equal chance of being drawn, which gives us the kind of sampling we're after. In the limit of large \(N\), the set \(\hat{f}\) is increasingly representative of the true distribution \(f\).

For example, suppose we have the following representation of the binary distribution, which has an equal chance of generating 1 or 0. If we sample randomly on the index, it is easy to see that we transform random draws from the integers \(\{1,\ldots,N\}\) into draws from the binary distribution:
\[\begin{array}{c|c} \text{index} & \text{value} \\ \hline 1 & 0 \\ 2 & 0 \\ \vdots & \vdots \\ N/2 & 0 \\ N/2+1 & 1 \\ \vdots & \vdots \\ N & 1 \end{array}\]
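A minimal sketch of this index trick, assuming we hold the table in memory as a numpy array (here with \(N = 10^6\)):

import numpy as np

N = 1000000
# table representation of the binary distribution: half zeros, half ones
table = np.array([0] * (N // 2) + [1] * (N // 2))

# drawing uniformly on the index transforms integer draws into binary draws
indices = np.random.randint(0, N, size=10)
draws = table[indices]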
Now suppose we have a similar table for a representative sample from the arbitrary distribution \(f\). Let us collapse the sample into a table of pairs where the first entry is the value, and the second is the number of times it appears, and sort the table entries by value:
\[\begin{array}{c|c} \text{value} & \text{frequency} \\ \hline v_1 & n_1 \\ \vdots & \vdots \\ v_M & n_M \end{array}\]
If we want to sample randomly from \(f\), we can simply draw a random integer \(x\) between 1 and \(N\), then do a lookup in the table. For some \(j\), we'll have \(N_{j-1} < x \le N_j\), where \(N_j = \sum\limits_{i=1}^j n_i\) (with \(N_0 = 0\)), and so we map \(x\rightarrow v_j\).

If we take the limit of large samples \(N\rightarrow\infty\) and normalize \(n_i\rightarrow p_i = n_i/N\), then we can instead draw a random real number \(\hat{u}\) between zero and one, and map it to \(v_j\) by finding the approximate solution of \(\hat{u} = P_j = \sum\limits_{i=1}^j p_i\).
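In code, this lookup amounts to a search over the cumulative probabilities \(P_j\). A minimal sketch, where the values \(v_j\) and probabilities \(p_j\) are assumed for illustration:

import numpy as np

# assumed values v_j and normalized frequencies p_j
values = np.array([-1.0, 0.0, 2.5])
probs = np.array([0.2, 0.5, 0.3])

# cumulative probabilities P_j
cum_probs = np.cumsum(probs)

# for each uniform draw u, find the smallest j with u <= P_j and map to v_j
u = np.random.random(10)
draws = values[np.searchsorted(cum_probs, u)]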
In the continuous limit, we have \(\displaystyle\sum\limits_{i=1}^j p_i \rightarrow \int\limits_{-\infty}^{\hat{x}} p(x) dx\), and thus our sampling problem amounts to solving the integral equation
\[\begin{align*} \hat{u} &= \int\limits_{-\infty}^{\hat{x}} p(x^\prime) dx^\prime \\ &= \textrm{cdf}(\hat{x}) \end{align*}\]
Thus, if we wish to sample from an arbitrary distribution \(p(x)\), we simply sample \(\hat{u}\) from the flat distribution and map to the solution of \(\hat{u} = \textrm{cdf}(\hat{x})\), i.e.
\[\boxed{\hat{x} = \textrm{cdf}^{-1}(\hat{u})}\]
For a visual illustration of how this works, consider the bimodal Gaussian shown below. We expect draws from this distribution to be centered around the two peaks, and for very few draws to come from the region between, or beside them.
As we showed with the table analogy, drawing randomly on the indices of the table representation of \(\hat{f}\) is the same as sampling the distribution \(f\). In the continuous case, drawing randomly from the range of the cdf of \(f\) (and mapping to the associated \(\hat{x}\) value) is equivalent to drawing randomly from \(f\) itself. Thus, if we draw a horizontal line at a random height, we expect a high probability of mapping to the two Gaussians, and a very low probability of mapping to the regions outside them.

We see that the Gaussians occupy most of the vertical space in the range of the cdf, while the inter-Gaussian region occupies only a sliver. We can place our horizontal line almost anywhere and have a good chance of mapping to one of the two Gaussians. Indeed, to map to the low-probability regions, we'd have to randomly place our horizontal line precisely on the plateaus of the cdf (gold box above).
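When the cdf has no closed-form inverse, as with this bimodal Gaussian, the inversion can be done numerically. A minimal sketch that tabulates the cdf on a grid and inverts it by interpolation; the mixture parameters are assumptions chosen for illustration:

import numpy as np

# assumed equal-weight mixture of two Gaussians centered at -2 and +2
x = np.linspace(-6, 6, 10001)
pdf = np.exp(-2 * (x + 2)**2) + np.exp(-2 * (x - 2)**2)
pdf /= np.trapz(pdf, x)                # normalize numerically

cdf = np.cumsum(pdf) * (x[1] - x[0])   # tabulated cdf on the grid

# invert the cdf: for each uniform draw u, interpolate the x with cdf(x) = u
u = np.random.random(100000)
samples = np.interp(u, cdf, x)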
Examples
Exponential distribution
The exponential distribution has density \(p(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\) (and zero otherwise), so its cdf is given by

\[\begin{align*} \textrm{cdf}(\hat{x}) &= \int\limits_{-\infty}^{\hat{x}} p(x) dx \\ &= \lambda \int\limits_0^{\hat{x}} e^{-\lambda x} dx \\ &= 1-e^{-\lambda \hat{x}} \end{align*}\]
Solving for \(\hat{x}\), we find \(\boxed{\hat{x} = \frac{1}{\lambda}\log\frac{1}{1-\hat{u}}}\)
Now let's show that it works. It would be straightforward to write our code completely from scratch; however, it is faster, more convenient, and above all, more reliable to use an open source project to handle some common operations. We'll use the numerical Python library numpy to generate some random numbers:

import numpy as np
import matplotlib.pyplot as plt

# draw 10^6 samples uniformly from [0, 1)
uniform_rands = np.random.random(1000000)
For a sanity check, let's see what we have at this point. If we plot a histogram, we should see a flat distribution of real numbers between 0 and 1.
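A quick plot of the raw sample (density=True normalizes the histogram to unit area):

plt.hist(uniform_rands, bins=500, density=True)
plt.show()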
Now we map our sample from the flat distribution according to the cdf transform we found above.
param = 1.8  # rate parameter lambda of the exponential distribution

# apply the inverse cdf: x = (1/lambda) * log(1 / (1 - u))
transformed_rands = np.log(1 / (1 - uniform_rands)) / param

# density=True normalizes the histogram (the normed= keyword is deprecated)
plt.hist(transformed_rands, bins=500, density=True)
A co-plot of the desired distribution alongside a normalized histogram of the data shows that our transformed sample matches the desired distribution very closely. In the limit of more points, the agreement becomes exact.
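One way to produce such a co-plot is to overlay the analytic density \(p(x) = \lambda e^{-\lambda x}\) on the normalized histogram:

xs = np.linspace(0, 6, 1000)
plt.hist(transformed_rands, bins=500, density=True)
plt.plot(xs, param * np.exp(-param * xs))
plt.show()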
Fat-tailed distributions
Seemingly insignificant points of disagreement between a true distribution and an approximating distribution can lead to harmful consequences. A compelling example is the so-called long tail of a distribution, which describes the probability of rare events. Sometimes, mathematically convenient distributions are used to stand in for empirically determined distributions; for instance, Gaussian distributions and their close relatives are used to describe some kinds of behavior in financial markets.
In many cases, the distribution that is used is a very close approximation to the true distribution, differing only in the long tails. Though disagreement in the tails may seem like an innocuous discrepancy (it is a difference in an already sparse region of the distribution), it is actually very significant, and for purposes of risk management, perhaps the most significant.
By definition, these events happen very rarely, which makes them both psychologically convenient to ignore and technically difficult to properly characterize. It is this combination that makes them so insidious. As a simple illustration of the danger of misappraising tail probability, consider the two distributions below, one of which is a Gaussian distribution and the other a reverse Gumbel distribution.

In the view on the left, the distributions look very similar: they have the same mean, nearly the same profile out to 90% or more of the cumulative area of the two distributions, and both fall quickly to small values outside the region of agreement. In the logarithmic view, however, we can see clearly that the Gaussian distribution believes the risk of a 12-fold loss to be almost 10 orders of magnitude less likely than does the reverse Gumbel distribution.

In real terms, this means that the analyst who uses the Gaussian believes a 12-fold drawdown will occur in roughly 1 in every \(10^{11}\) trades, while the analyst using the reverse Gumbel expects it to happen roughly once every \(2\times10^3\) trades. This divergence in belief will lead to completely different behavior as concerns trading strategies and risk management, and could eventually lead to the Gaussian firm suffering a blowup that it can't cover.
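A sketch of how such a tail comparison can be drawn with scipy.stats; the zero-location, unit-scale parameters here are assumptions for illustration, not the ones behind the figures above:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-12, 4, 1000)

# gumbel_l is scipy's left-skewed ("reverse") Gumbel distribution
gauss_pdf = stats.norm.pdf(x)
gumbel_pdf = stats.gumbel_l.pdf(x)

fig, (linear, log) = plt.subplots(1, 2)
# linear view: the densities look nearly interchangeable
linear.plot(x, gauss_pdf)
linear.plot(x, gumbel_pdf)
# logarithmic view: the left tails diverge by many orders of magnitude
log.semilogy(x, gauss_pdf)
log.semilogy(x, gumbel_pdf)
plt.show()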
Though the full details of the situation are undoubtedly more complex than this simple example, just such use of close, but not exact, stand-in distributions contributed to the famed blowup of Long-Term Capital Management, an arbitrage fund backed by the academic might of two recipients of the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel. After four years of unmatched return on investment, the fund collapsed spectacularly over the course of a summer, in part due to gross misestimation of the correlations between different events.
Trajectory of the value of LTCM's assets under management (March 1994 \(\rightarrow\) October 1998)