# Statistics

**Statistics** is the science of dealing with data. Collecting and understanding real-world data require building models to describe the underlying probabilistic processes.

## Random variables

The language of statistics is rooted in **probability theory**, which provides a means for dealing with **random variables** and **stochastic processes**.

Given a random event, one might consider the **sample space** of possible *events* or *outcomes*. For instance, the sample space of a fair six-sided die has six possible events, one for each side. One could denote the sample space by the set \( \{1, 2, 3, 4, 5, 6 \} \). A *random variable* can then be thought of as a *function* on the sample space, where the probability of taking on any given outcome is weighted accordingly. In the case of the fair die, each outcome is equally likely, although one could certainly come up with a *loaded die* in which all six events were not equally likely.

The sample space of a random variable can be discrete or continuous. While the result of a rolled die is discrete, imagine shooting an arrow at a target. The possible points where the arrow could hit form a continuous sample space.

To describe a random variable completely, one must specify both the sample space (usually as a set) and the probability of all events in the sample space. For a discrete random variable \( X \), the latter is often accomplished by writing \( P(X = A), \) where \( A \) is some event in the sample space. For instance, for a fair six-sided die whose roll is the random variable \( Y \), one could write

\[ P(Y = j) = \frac{1}{6} \]

for \( j = 1, 2, 3, 4, 5, 6. \)
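The specification of the fair die can be sketched in code as a sample space together with a probability for each outcome. The following is a minimal illustration, not a canonical implementation: the names `sample_space`, `pmf`, and `roll` are chosen here for the example, and the seed and number of rolls are arbitrary.

```python
from fractions import Fraction
import random

# Sample space of a fair six-sided die and the probability of each outcome.
sample_space = [1, 2, 3, 4, 5, 6]
pmf = {j: Fraction(1, 6) for j in sample_space}

def roll(rng):
    """Draw one outcome from the sample space, weighted by the pmf."""
    weights = [float(pmf[j]) for j in sample_space]
    return rng.choices(sample_space, weights=weights)[0]

rng = random.Random(0)  # fixed seed (arbitrary) so the run is reproducible
rolls = [roll(rng) for _ in range(60_000)]

# Empirical frequency of each face; each should be close to 1/6.
empirical = {j: rolls.count(j) / len(rolls) for j in sample_space}
print(empirical)
```

A loaded die could be modeled the same way by replacing the values in `pmf` with unequal probabilities that still sum to 1.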

For a continuous random variable \( X \) whose sample space is a subset of the real numbers, generally one gives the **probability density function** \( p \) defined by

\[ P(a \leq X \leq b) = \int_a^b p(x) \, dx, \]

where \( P(a \leq X \leq b) \) is the probability that \( X \) lies between \( a \) and \( b, \) inclusive. One can view \( p(x) \, dx \) as the infinitesimal probability of obtaining a value within a small interval \( dx \) around \( x \). Integrating \( p(x) \) across a finite interval \( [a, b] \) thus provides the probability that the outcome will lie in that region.
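To make the integral concrete, one can approximate it numerically. The sketch below assumes a hypothetical density \( p(x) = 2x \) on \( [0, 1] \) (chosen only because its integral is easy to check by hand) and uses the midpoint rule; the function names `integrate` and `density` are illustrative.

```python
def integrate(p, a, b, steps=10_000):
    """Approximate the integral of p over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(p(a + (i + 0.5) * width) for i in range(steps)) * width

def density(x):
    """Hypothetical density p(x) = 2x on [0, 1], and zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

# P(0.25 <= X <= 0.75); by hand this is 0.75**2 - 0.25**2 = 0.5.
prob = integrate(density, 0.25, 0.75)
print(prob)
```

Any other density could be passed to `integrate` in place of `density`, provided it is nonnegative and integrates to 1 over the sample space.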

Several continuous distributions arise frequently in statistics:

- The **normal distribution** has density

\[ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \]

where \( \mu \) and \( \sigma \) are (fixed) *parameters* of the distribution.

- The **chi-squared distribution** with \( k \) *degrees of freedom* is the sum of \( k \) squared normal random variables:

\[ X = \sum_{i=1}^k Z_i^2, \]

where the \( Z_i \) are independent normal random variables, each with \( \mu = 0 \) and \( \sigma = 1. \)
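The definition of the chi-squared distribution translates directly into a simulation: draw \( k \) independent normal values with \( \mu = 0 \) and \( \sigma = 1 \), square them, and add. The sketch below is one way to do this with Python's standard library; the helper name `chi_squared_sample`, the choice \( k = 3 \), the seed, and the number of draws are all assumptions made for illustration. Since each \( Z_i^2 \) has expected value \( \sigma^2 = 1 \), the average of many draws should be close to \( k \).

```python
import random
import statistics

def chi_squared_sample(k, rng):
    """One draw of X = Z_1^2 + ... + Z_k^2 with independent standard normal Z_i."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

rng = random.Random(42)  # fixed seed (arbitrary) for reproducibility
k = 3
draws = [chi_squared_sample(k, rng) for _ in range(50_000)]

# The sample mean of the draws should be near k.
sample_mean = statistics.fmean(draws)
print(sample_mean)
```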

One fundamental property of a random variable is that the probability of obtaining *some* outcome in the sample space must be equal to \( 1 \). In other words, if \( S \) denotes the sample space,

\[ P(X \in S) = 1. \]
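This normalization can be checked for the examples above. The sketch below sums the probabilities of the fair die exactly, and numerically integrates the normal density over a wide interval around \( \mu \). The parameter values, the integration range of ten standard deviations on each side, and the helper names are assumptions made for the illustration; the numerical result is only approximately 1.

```python
import math
from fractions import Fraction

# Discrete case: the six outcomes of a fair die each have probability 1/6.
die_total = sum(Fraction(1, 6) for _ in range(6))

def normal_density(x, mu, sigma):
    """Density of the normal distribution with parameters mu and sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def integrate(p, a, b, steps=20_000):
    """Approximate the integral of p over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(p(a + (i + 0.5) * width) for i in range(steps)) * width

# Continuous case: integrate over [mu - 10 sigma, mu + 10 sigma], which
# stands in for the whole real line because the tails beyond it are negligible.
mu, sigma = 1.5, 2.0  # arbitrary example parameters
normal_total = integrate(lambda x: normal_density(x, mu, sigma),
                         mu - 10 * sigma, mu + 10 * sigma)
print(die_total, normal_total)
```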

## Estimation

Suppose one is given a set of numerical data \( X_1, X_2, \ldots, X_n \), all of which are real numbers. How might one describe the data? Perhaps of interest is the (arithmetic) **mean**, denoted by \( \mu \), the sum of all values divided by the number of values:

\[ \mu = (X_1 + X_2 + \cdots + X_n)/n. \]

In many cases, the mean might be taken to represent a "typical" or "average" value.

Or, one might be interested in how far the values spread about the mean. The average squared difference from the mean is called the *variance*, and its square root is the **standard deviation**, denoted by

\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2}. \]

The standard deviation may represent the average "fluctuation" in the values.
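Both quantities are straightforward to compute. The following sketch takes the standard deviation to be the square root of the average squared difference from the mean, dividing by \( n \); the data values are made up for the example, and the function names are illustrative.

```python
import math

def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

def standard_deviation(values):
    """Square root of the average squared difference from the mean."""
    mu = mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up example data
print(mean(data))                # 5.0
print(standard_deviation(data))  # 2.0
```

Python's standard library offers the same computations as `statistics.fmean` and `statistics.pstdev`.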

In any case, given a set of random variables \( X_1, X_2, \ldots, X_n \), one can represent the data using one or more **statistics**. A statistic, such as the mean or standard deviation, is simply a function \( f(X_1, X_2, \ldots, X_n) \) of the random variables.

It would be nice to know the value of every possible piece of data, but in most practical cases, obtaining them all is neither feasible nor even desirable. Instead, one must *estimate* a desired statistic of a **population** based on the statistics of a **sample** of values drawn from the population.

Suppose one is given the weights of a small sample of gold bars. How might one use that to determine the mean weight of *all* of the gold bars minted on the same day, of which only a small number were chosen in the sample? Intuitively, it might seem that a good estimate to represent the mean of all of the gold bars (the *population mean*) is simply the mean of the sample. But a number of other statistics (that is, other functions of the values chosen in the sample) might be of relevance, such as the standard deviation or the variance (the square of the standard deviation). How does one choose a good estimate of the population value for these other statistics?
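The gold bar scenario can be simulated to see how the sample mean behaves as an estimate of the population mean. Everything numerical in the sketch below is hypothetical: the population of 10,000 bars, the weights drawn from a normal distribution centered at 400 (in arbitrary units) with spread 0.5, the sample size of 25, and the seed.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed (arbitrary) for reproducibility

# Hypothetical population: the weights of every bar minted that day.
population = [rng.gauss(400.0, 0.5) for _ in range(10_000)]
population_mean = statistics.fmean(population)

# Only a small sample is actually weighed, drawn without replacement.
sample = rng.sample(population, 25)
sample_mean = statistics.fmean(sample)

# How far the estimate from the sample lies from the population value.
error = abs(sample_mean - population_mean)
print(population_mean, sample_mean, error)
```

Rerunning with a different seed gives a different sample and hence a different estimate, which reflects the fact that a statistic computed from a random sample is itself random.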

An **estimator** is some function \( \delta(X_1, X_2, \ldots, X_n) \) of the sample random variables. It is itself a random variable. The estimator evaluated for given values of the sample random variables is called an **estimate**. When a particular