Maximum likelihood estimation (MLE) is a technique for estimating the parameters of a given distribution from observed data. For example, if a population is known to follow a normal distribution but its mean and variance are unknown, MLE can estimate them from a limited sample of the population by finding the values of the mean and variance under which the observed sample is the most likely result to have occurred.
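Let $X_1, X_2, \ldots, X_n$ be observations from $n$ independent and identically distributed random variables drawn from a probability distribution $f_0$, where $f_0$ is known to be from a family of distributions $f$ that depend on some parameters $\theta$. For example, $f_0$ could be known to be from the family of normal distributions $f$, which depend on parameters $\sigma$ (standard deviation) and $\mu$ (mean), and $X_1, X_2, \ldots, X_n$ would be observations from $f_0$.
The goal of MLE is to maximize the likelihood function:
$$L(\theta) = f(X_1, X_2, \ldots, X_n \mid \theta) = f(X_1 \mid \theta) \times f(X_2 \mid \theta) \times \cdots \times f(X_n \mid \theta).$$
Often, the average log-likelihood function is easier to work with:
$$\hat{\ell}(\theta) = \frac{1}{n} \log L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f(X_i \mid \theta).$$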
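To make the two functions concrete, here is a minimal Python sketch for the normal family. The sample values and the two candidate parameter settings are made up for illustration, and the function names are hypothetical rather than taken from any library.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(sample, mu, sigma):
    """Product of the densities of all observations (independence assumed)."""
    return math.prod(normal_pdf(x, mu, sigma) for x in sample)

def avg_log_likelihood(sample, mu, sigma):
    """Mean of the log-densities, which equals log(likelihood) / n."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in sample) / len(sample)

# Hypothetical observations, made up for this illustration.
sample = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0]

# Compare a mean close to the data with one far from it (same sigma).
near = avg_log_likelihood(sample, mu=5.1, sigma=0.3)
far = avg_log_likelihood(sample, mu=7.0, sigma=0.3)
print(near, far)  # the first value is the larger of the two
```

Because the logarithm is increasing, whichever parameters maximize the average log-likelihood also maximize the likelihood itself; the sum form is simply less prone to numerical underflow than a long product of small densities.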
There are several ways an MLE computation can turn out: it may yield the parameters in closed form in terms of the given observations, it may yield multiple parameter values that maximize the likelihood function, it may show that no maximum exists, or it may show that the maximum has no closed form, so that numerical analysis is necessary to find an MLE.
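As an illustration of the last case, the sketch below estimates the location parameter of a logistic distribution (scale fixed at 1), whose MLE generally has no closed-form expression. The choice of the logistic family, the sample values, and the use of ternary search are all assumptions made for this example; ternary search is appropriate here only because this particular log-likelihood is concave in the location parameter.

```python
import math

def log_likelihood(mu, sample):
    """Log-likelihood of a logistic distribution with location mu and scale 1."""
    total = 0.0
    for x in sample:
        z = abs(x - mu)  # the logistic density is symmetric about mu
        # log f = -z - 2*log(1 + exp(-z)), stable because z >= 0
        total += -z - 2.0 * math.log1p(math.exp(-z))
    return total

def maximize_concave(func, lo, hi, tol=1e-9):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            lo = m1  # the maximum lies to the right of m1
        else:
            hi = m2  # the maximum lies to the left of m2
    return (lo + hi) / 2.0

# Hypothetical observations, made up for this illustration.
sample = [1.2, 0.4, 2.9, 1.7, 0.8, 2.1, 1.5]

# The maximizer must lie between the smallest and largest observation.
mu_hat = maximize_concave(lambda mu: log_likelihood(mu, sample),
                          min(sample), max(sample))

# Derivative of the log-likelihood at mu_hat; it should be close to zero.
score = sum(math.tanh((x - mu_hat) / 2.0) for x in sample)
print(mu_hat, score)
```

For likelihoods that are not concave, a search like this can settle on a local maximum, so a more careful numerical method or several starting intervals would be needed.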
Though MLEs are not necessarily optimal (in the sense that other estimation algorithms can achieve better results), they have several attractive properties, the most important of which is consistency: a sequence of MLEs (on an increasing number of observations) will converge to the true value of the parameters. The following is an example where the MLE might give a slightly poorer result than other estimation algorithms:
The simplest case is when both the distribution and the parameter space (the possible values of the parameters) are discrete, meaning that there are a finite number of possibilities for each. In this case, the MLE can be determined by explicitly trying all possibilities.
A (possibly unfair) coin is flipped 100 times, and 61 heads are observed. The coin either has probability , or of flipping a head each time it is flipped. Which of the three is the MLE?
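Here, the distribution in question is the binomial distribution, with one parameter $p$. Thus, for each of the three candidate values of $p$,
$$P(61 \text{ heads} \mid p) = \binom{100}{61} p^{61} (1-p)^{39},$$
hence the MLE is the candidate for which this probability is largest.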
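The "try every possibility" approach can be sketched in a few lines of Python. The three candidate probabilities below (1/3, 1/2, 2/3) are assumed purely for illustration, since the candidate values are not shown in the text above; the function name is likewise hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_likelihood(p, heads, flips):
    """Probability of observing `heads` heads in `flips` flips when P(head) = p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Assumed candidate values, chosen only to illustrate the method.
candidates = [Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)]

# Evaluate the likelihood of the observation (61 heads in 100 flips) for each one.
likelihoods = {p: binomial_likelihood(p, 61, 100) for p in candidates}

# The MLE is the candidate with the largest likelihood.
mle = max(likelihoods, key=likelihoods.get)

for p, value in likelihoods.items():
    print(p, float(value))
print("MLE among candidates:", mle)
```

Using `Fraction` keeps the arithmetic exact, so the comparison between candidates is not affected by floating-point rounding.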
Unfortunately, the parameter space is rarely discrete, and calculus is often necessary for a continuous parameter space. For instance,
A (possibly unfair) coin is flipped 100 times, and 61 heads are observed. What is the MLE when nothing is previously known about the coin?
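Again, the binomial distribution is the model to be worked with, with a single parameter $p$. The likelihood function is thus
$$L(p) = \binom{100}{61} p^{61} (1-p)^{39},$$
to be maximized over $0 \le p \le 1$. This can be achieved by analyzing the critical points of this function, which occur when
$$\frac{dL}{dp} = \binom{100}{61} \left( 61 p^{60} (1-p)^{39} - 39 p^{61} (1-p)^{38} \right) = \binom{100}{61} p^{60} (1-p)^{38} (61 - 100p) = 0,$$
so either $p = 0$, $\frac{61}{100}$, or $1$. Thus $\hat{p} = \frac{61}{100}$ is the MLE, as otherwise the likelihood function is 0.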
This logic is easily generalized: if $k$ of $n$ binomial trials result in a head, then the MLE is given by $\frac{k}{n}$.
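The closed-form result can be sanity-checked numerically. The sketch below is an illustrative check rather than a proof: it evaluates the binomial log-likelihood on a grid of values of the head probability and reports the grid point where it is largest. The grid resolution and the function names are assumptions made for this example.

```python
import math

def binomial_log_likelihood(p, k, n):
    """Log-likelihood of k heads in n flips, dropping the constant log C(n, k).

    Requires 0 < p < 1 so that both logarithms are defined.
    """
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def grid_mle(k, n, steps=1000):
    """Grid point in (0, 1) with the largest log-likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: binomial_log_likelihood(p, k, n))

print(grid_mle(61, 100))  # agrees with 61/100
print(grid_mle(3, 8))     # agrees with 3/8
```

The constant binomial coefficient is omitted because it does not depend on the parameter and so does not change where the maximum occurs.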