Introduction to data as random variables

This post demonstrates how to compute on a numerical array as if it were a statistical random variable.

Numeric data generated by some process can be interpreted as a random variable. More specifically, we can understand it as a discrete random variable approximating the underlying probability distribution. If we have access to all the data, then the approximation can be made exact.

In a previous post I wrote about how much I was starting to enjoy scipy.stats. After becoming a little more familiar with the statistical methods, I wanted to know how to calculate the equivalent concepts in Python without knowing the distribution. If the data you’re working with is discrete, then it’s straightforward to create your own distribution using stats.rv_discrete. Here’s a function that makes it easy.

def rv_discrete_factory(a):
    '''
    Creates a discrete random variable from scipy.stats

    >>> rv = rv_discrete_factory((0, 0.5, 0.5, 1))
    >>> rv.cdf(0.9)
    0.75

    '''
    vals, counts = np.unique(a, return_counts=True)
    probs = counts / sum(counts)
    return stats.rv_discrete(values=(vals, probs))

Note that the parameter for np.unique is new as of Numpy 1.9, so if you have an earlier version don’t expect it to work. The returned object includes all the bells and whistles that come with the scipy.stats distributions. Cool. But what if we feed it too much data?

In [51]: million_pts = np.random.randn(int(1e6))

In [52]: million_pts.shape
Out[52]: (1000000,)

In [53]: %timeit rv_discrete_factory(million_pts)
1 loops, best of 3: 5.24 s per loop

It took 5 seconds to instantiate, and the methods on it are also slow. This is not a flaw; rather, it’s not how the function was designed to be used. With one million or more unique points, it would be better to first do some preprocessing and discretization if the goal was to model using a discrete random variable. But this data still fits comfortably in memory, which means we can do all the same statistical operations in Numpy. In the next post we’ll explore those techniques.