Hidden Markov Models
A hidden Markov model is a type of graphical model often used to model temporal data. Unlike traditional Markov models, hidden Markov models (HMMs) assume that the observed data are not the actual states of the model but are instead generated by underlying hidden states (the H in HMM). While this would normally make inference difficult, the Markov property (the first M in HMM) of HMMs makes inference efficient.
Because of their flexibility and computational efficiency, hidden Markov models have found wide application in many different fields. They are best known for their use in temporal pattern recognition and generation, such as speech recognition, handwriting recognition, and speech synthesis.
Overview
Often, when a person makes an observation of a system, what is being observed is not the state of the system but instead some tokens or data generated by the system's underlying hidden states. For example, when a person hears someone else speak, the sounds entering their ears are not the state of the system generating the sound. The true state of that system would be the collection of parameters directly determining which sounds are generated, such as the shape of the speaker's mouth, the frequency of their vocal cords, and the semantic meaning behind the sounds.
However, it is impossible to know the exact state of the underlying system from the observations alone, since many underlying states may correspond to the same observation. That is, the function mapping states to observations, \(f(s)=o\), where \(s\) is a state variable and \(o\) is the observation generated by applying \(f\) to \(s\), can have two different states \(s_1 \ne s_2\) that generate the same observation, \(f(s_1) = f(s_2)\). Such a function is known as many-to-one, and in general it is impossible to invert perfectly, since \(f^{-1}(o)\) could equal either \(s_1\) or \(s_2\).
Consider the many-to-one function \(y=x^2\). Given only \(y\), one cannot recover \(x\) exactly since both \(-x\) and \(x\) correspond to \(y\) when squared. An HMM is similar in that, given an observation sequence, one can only infer a distribution over the states at any point in time, since many different sequences of underlying states can give rise to the same observation sequence.
In the case of speech recognition, knowing the underlying semantic meaning which generated the sounds would make the problem easy. This is because the whole problem of speech recognition is discovering the model states (semantic meaning) behind the observations (sounds). Thus, being able to model and infer the underlying states of an HMM given the observations provides a powerful technique for many kinds of problems.
Time Evolution
Hidden Markov models evolve according to two rules:
The first rule is that the model moves from the current state to the next state, which may be the same state, according to some probability distribution that depends only on the current state, i.e. \(p(s_t|s_{t-1})=p(s_t|s_{t-1}, \dots, s_0)\). This is known as the Markov property. Intuitively, this rule states that the system evolves without regard to past states of the system and only depends on the current state.
The second rule is that after each transition, the model emits an observation whose distribution depends only on the current state, i.e. \(p(o_t|s_t)=p(o_t|s_t, o_{t-1}, s_{t-1}, \dots, o_0, s_0)\). Since the observer sees only the emitted observations, and never the states that generated them, those states are called hidden states.
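Taken together, these two rules mean that the joint probability of a hidden state sequence and its observation sequence factorizes into a product of initial, transition, and emission terms:

\[
p(s_0, \dots, s_T, o_0, \dots, o_T) = p(s_0)\,p(o_0|s_0)\prod_{t=1}^{T} p(s_t|s_{t-1})\,p(o_t|s_t).
\]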
Specification
A hidden Markov model is fully specified by the following parameters:
1) State Transition Probabilities
The probability of transition from state \(s_i\) to state \(s_j\) is \(a_{ij}\).
2) Observation Emission Probabilities
The probability of emitting observation \(o_t\) while in state \(s_i\) is \(P(o_t|s_i)\). If the set of observations is discrete, then the probability of emitting token \(o_j\) from state \(s_i\) is \(b_{ij}\).
3) State Initialization Probability
The probability of the HMM starting in state \(s_i\) is \(\pi_i\).
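As a concrete illustration, here is a minimal sketch of these three parameter sets in Python (using NumPy) for a hypothetical two-state HMM with three possible observation tokens; the specific numbers are made up purely for illustration:

```python
import numpy as np

# State initialization probabilities: pi[i] is the probability of starting in state i.
pi = np.array([0.6, 0.4])

# State transition probabilities: A[i, j] is a_ij, the probability of moving from state i to state j.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Observation emission probabilities: B[i, j] is b_ij, the probability of emitting token j from state i.
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

# Each row of A and B is a probability distribution, so every row (and pi) must sum to 1.
assert np.isclose(pi.sum(), 1)
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```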
Knowing the above parameters lets us generate observation sequences very easily. We start by picking an initial hidden state \(s_0\) according to the distribution \(\pi\). Then, we pick an observation \(o_0\) given \(s_0\). Next, the model transitions to a new hidden state \(s_1\) according to the state transition probability for state \(s_0\), after which the model emits a new observation \(o_1\). This continues until the desired number of observations is generated.
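This generation procedure translates directly into code. Below is a rough sketch that samples a state sequence and observation sequence, assuming the toy `pi`, `A`, and `B` arrays defined above:

```python
import numpy as np

def sample_hmm(pi, A, B, length, seed=None):
    """Sample (states, observations) of the given length by following the two evolution rules."""
    rng = np.random.default_rng(seed)
    states, observations = [], []
    state = rng.choice(len(pi), p=pi)  # pick the initial hidden state s_0 according to pi
    for _ in range(length):
        states.append(state)
        observations.append(rng.choice(B.shape[1], p=B[state]))  # emit o_t given the current state
        state = rng.choice(len(pi), p=A[state])                  # transition to the next hidden state
    return states, observations

# For example: states, obs = sample_hmm(pi, A, B, length=10, seed=0)
```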
Observation Probability
Since an HMM defines a probability distribution over sequences of hidden states and observations, it is natural to ask what the likelihood of a particular observation sequence \(O\) is. Calculating this probability involves a sum over every hidden state sequence that could have produced \(O\), which makes the naive calculation intractably slow for all but the smallest models and shortest sequences. Luckily, applying techniques from dynamic programming makes this calculation tractable.
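The standard dynamic programming approach is the forward algorithm, which computes the likelihood in time linear in the sequence length by tracking, at each step, the probability of the observations seen so far jointly with each possible current state. Here is a minimal sketch for discrete observations, using the same `pi`, `A`, `B` parameterization as above (with scaling to avoid numerical underflow on long sequences):

```python
import numpy as np

def observation_log_likelihood(pi, A, B, obs):
    """Forward algorithm: returns log p(o_0, ..., o_T) for a discrete-observation HMM."""
    alpha = pi * B[:, obs[0]]       # alpha[i] = p(o_0, s_0 = i)
    log_prob = 0.0
    for o in obs[1:]:
        scale = alpha.sum()         # normalize to keep probabilities from underflowing
        log_prob += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, o]   # propagate one step, then weight by the emission
    return log_prob + np.log(alpha.sum())

# For example: observation_log_likelihood(pi, A, B, [0, 2, 1, 1])
```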
Imagine two speakers, Bob and Alice. If each has an HMM trained to model their voice, then we can calculate the likelihood of any sound sequence under each of their trained HMMs. If we calculate the likelihood of Bob's voice under his HMM, we would expect it to be higher than the likelihood of his voice under Alice's HMM. Intuitively, this is because observation sequences generated by Bob's HMM should sound like Bob, or at least more like Bob than like Alice, since the model was trained on Bob's voice. Likewise, Alice's voice should have a higher likelihood under her HMM than under Bob's. Knowing this, we can perform some basic speaker recognition just by calculating the likelihood of an observation sequence under each known HMM: the most likely speaker is the one whose HMM assigns the highest likelihood to the observed voice data, as in the sketch below.
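A rough sketch of that comparison, assuming hypothetical trained parameter sets for each speaker and the `observation_log_likelihood` function sketched above:

```python
# recorded_obs: a hypothetical sequence of discrete acoustic tokens from an unknown speaker.
ll_bob = observation_log_likelihood(pi_bob, A_bob, B_bob, recorded_obs)
ll_alice = observation_log_likelihood(pi_alice, A_alice, B_alice, recorded_obs)

# Pick whichever model assigns the higher likelihood to the recording.
speaker = "Bob" if ll_bob > ll_alice else "Alice"
```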