Statistics is the science of dealing with data. This involves understanding data that already exists, using it to make predictions about the future and assessing the uncertainty of those predictions. Doing these things often requires using the language of probability to capture the randomness involved. On the whole, statistics can be a very powerful tool for understanding the world and has applications in fields as varied as finance and physics.
Given a random event, one might consider the sample space of possible events or outcomes. For instance, the sample space of a fair six-sided die has six possible events, one for each side. One could denote the sample space by the set $\{1, 2, 3, 4, 5, 6\}$. A random variable can then be thought of as a function on the sample space, where each outcome is weighted by its probability. In the case of the fair die, each outcome is equally likely, although one could certainly come up with a loaded die in which the six outcomes are not all equally likely.
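As a small illustration of the fair and loaded dice described above, the following Python sketch stores each die as a table of probabilities over the sample space. The weights chosen for the loaded die are invented for the example, and `prob` is a hypothetical helper name.

```python
from fractions import Fraction

# Sample space of a six-sided die.
sample_space = [1, 2, 3, 4, 5, 6]

# Fair die: every outcome carries the same weight.
fair_pmf = {face: Fraction(1, 6) for face in sample_space}

# Hypothetical loaded die (weights invented for illustration):
# face 6 comes up half the time, the other faces share the rest equally.
loaded_pmf = {face: Fraction(1, 10) for face in sample_space[:-1]}
loaded_pmf[6] = Fraction(1, 2)


def prob(pmf, event):
    """Probability that the roll lands in `event`, a set of faces."""
    return sum(pmf[face] for face in event)


evens = {2, 4, 6}
print(prob(fair_pmf, evens))    # 1/2
print(prob(loaded_pmf, evens))  # 7/10
```

Exact fractions are used instead of floating-point numbers so that the probabilities can be compared without rounding error.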
The sample space of a random variable can be discrete or continuous. The result of a rolled die is discrete. By contrast, imagine shooting an arrow at a target: the possible points where the arrow could hit form a continuous sample space.
To describe a random variable completely, one must specify both the sample space (usually as a set) and the probability of each event in the sample space. For a discrete random variable $X$, the latter is often accomplished by writing $P(X = x)$, where $x$ is some event in the sample space. For instance, for a fair six-sided die whose roll is the random variable $X$, one could write
$$P(X = k) = \frac{1}{6} \quad \text{for } k = 1, 2, \ldots, 6.$$
For a continuous random variable $X$ whose sample space is a subset of the real numbers, one generally gives the probability density function $f(x)$, defined by
$$P(a \le X \le b) = \int_a^b f(x)\,dx,$$
where $P(a \le X \le b)$ is the probability that $X$ lies between $a$ and $b$ inclusive. One can view $f(x)\,dx$ as the infinitesimal probability of obtaining a value within a small interval of width $dx$ around $x$. Integrating $f$ across a finite interval thus gives the probability that the outcome will lie in that interval.
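To make the link between a density and a probability concrete, here is a minimal Python sketch that integrates a density numerically. The density $f(x) = 2x$ on $[0, 1]$ is an assumption chosen only because its integrals are easy to check by hand, and `probability` is a hypothetical helper using the midpoint rule.

```python
def density(x):
    """Illustrative density f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def probability(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Probability that the outcome lies in [0.5, 1]; by hand, 1 - 0.25 = 0.75.
print(probability(density, 0.5, 1.0))

# Integrating over the whole sample space should give a value close to 1.
print(probability(density, 0.0, 1.0))
```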
Several continuous distributions arise frequently in statistics:
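The normal distribution has density
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}},$$
where $\mu$ (the mean) and $\sigma$ (the standard deviation) are (fixed) parameters of the distribution.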
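The normal density can be evaluated and integrated directly. The sketch below assumes the standard form of the density with mean `mu` and standard deviation `sigma`, picks arbitrary parameter values, and compares a numerical integral over one standard deviation either side of the mean with the closed-form value obtained from the error function.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


mu, sigma = 10.0, 2.0  # arbitrary illustrative parameters

# Probability of landing within one standard deviation of the mean.
within_one_sigma = integrate(lambda x: normal_pdf(x, mu, sigma), mu - sigma, mu + sigma)

# Closed form for the same probability: erf(1 / sqrt(2)), roughly 0.6827.
exact = math.erf(1.0 / math.sqrt(2.0))

print(within_one_sigma, exact)
```

The result does not depend on the particular values of `mu` and `sigma`: any normal distribution places about 68% of its probability within one standard deviation of its mean.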
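One fundamental property of a random variable is that the total probability of obtaining some outcome must be equal to $1$. In other words, if $S$ denotes the sample space,
$$\sum_{x \in S} P(X = x) = 1$$
for a discrete random variable, and
$$\int_S f(x)\,dx = 1$$
for a continuous random variable with density $f$.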
Suppose one is given a set of numerical data $x_1, x_2, \ldots, x_n$, all of which are real-numbered values. How might one describe the data? Perhaps of interest is the (arithmetic) mean, denoted by $\bar{x}$, the sum of all values divided by the number of values:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
In many cases, the mean might be taken to represent a "typical" or "average" value.
Or, one might be interested in the average squared difference from the mean, which is called the variance and denoted by
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
The variance can be read as the square of a typical "fluctuation" of the values about the mean; its square root is the standard deviation.
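The mean and variance described above take only a few lines to compute. This Python sketch uses a made-up data set and divides by the number of values in both formulas, matching the "average squared difference" definition. The results are cross-checked against `statistics.fmean` and `statistics.pvariance` from the standard library.

```python
import math
import statistics

# Made-up data set, chosen so the results are round numbers.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def variance(values):
    """Average squared difference from the mean (divides by n)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


print(mean(data))                 # 5.0
print(variance(data))             # 4.0
print(math.sqrt(variance(data)))  # 2.0, the standard deviation

# The standard library gives the same answers.
print(statistics.fmean(data), statistics.pvariance(data))
```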
In any case, given a set of random variables $X_1, X_2, \ldots, X_n$, one can represent the data using one or more statistics. A statistic, such as the mean or variance, is simply a function of the random variables:
$$T = T(X_1, X_2, \ldots, X_n).$$
It would be nice to know the value of every possible piece of data, but in most practical cases collecting all of it is neither feasible nor even desirable. Instead, one must estimate a desired statistic of a population based on the statistics of a sample of values drawn from the population.
Suppose one is given the weights of a small sample of gold bars. How might one use them to determine the mean weight of all of the gold bars minted on the same day, of which only a few were chosen for the sample? Intuitively, it might seem that a good estimate of the mean of all of the gold bars (the population mean) is simply the mean of the sample. But a number of other statistics (that is, other functions of the values chosen in the sample) might also be of relevance, such as the standard deviation or the variance (the square of the standard deviation). How does one choose a good estimate of the population value for these other statistics?
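The gold-bar question can be explored with a short simulation. In the Python sketch below everything about the population is invented for illustration: 10,000 bars whose weights are drawn from a normal distribution around 1000 grams. The hypothetical helper `sample_mean_estimate` draws a small sample and uses its mean as the estimate of the population mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: weights in grams of 10,000 bars minted on one day.
population = [rng.gauss(1000.0, 0.5) for _ in range(10_000)]
population_mean = statistics.fmean(population)


def sample_mean_estimate(pop, sample_size, rng):
    """Draw a sample without replacement and return its mean."""
    sample = rng.sample(pop, sample_size)
    return statistics.fmean(sample)


# Each fresh sample of 25 bars yields a slightly different estimate.
estimates = [sample_mean_estimate(population, 25, rng) for _ in range(5)]

print(f"population mean: {population_mean:.3f}")
for i, estimate in enumerate(estimates, start=1):
    print(f"sample {i} estimate: {estimate:.3f}")
```

Running it shows the estimates scattering around the population mean rather than matching it exactly, because each estimate depends on which bars happened to be drawn.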
An estimator is some function of the sample random variables. It is itself a random variable. The estimator evaluated for given values of the sample random variables is called an estimate. When a particular