Hidden Markov Models (HMMs) are a technique for modeling sequence data.
HMMs are designed for sequential data, but can also be applied to continuous data by discretizing the time component between features; for example, the frequency within a window can serve as an observation.
The order of the model is the number of previous states each node depends on (a first-order HMM conditions $z_t$ only on $z_{t-1}$).
$\underline{x}$ is a $d$-dimensional vector from $D=\{\underline{x}_1,\dots,\underline{x}_T\}$, but each observation depends on a state defined within a discrete state space of size $K$. The state is hidden. We assume that $\underline{x}_t$ is conditionally independent of all other variables given its state. We also assume the states form a Markov chain.
$z_t$ is the state at time $t$, with $z_t \in \{1,\dots,K\}$.
States are assumed to have some type of transition model between them, and each state also requires an observation model.
The full joint is then $p(x_{1\dots T},z_{1\dots T}) = \prod_{t=1}^T p(x_t|z_t)\, p(z_t|z_{t-1})$, with the convention $p(z_1|z_0) = p(z_1)$.
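The factorization above supports ancestral sampling: draw $z_1$, then alternate emitting $x_t$ given $z_t$ and transitioning to $z_{t+1}$ given $z_t$. A minimal sketch for discrete emissions, where the parameter names (`pi0`, `A`, `B`) are illustrative, not from the notes:

```python
import random

# Ancestral sampling from the HMM joint p(x, z) = prod p(x_t|z_t) p(z_t|z_{t-1}).
# pi0[j]  : p(z_1 = j); A[i][j] : p(z_t = j | z_{t-1} = i); B[j][x] : p(x_t = x | z_t = j).

def sample_hmm(T, pi0, A, B, rng=None):
    rng = rng or random.Random(0)
    zs, xs = [], []
    z = rng.choices(range(len(pi0)), weights=pi0)[0]        # z_1 ~ p(z_1)
    for t in range(T):
        if t > 0:                                           # z_t | z_{t-1}
            z = rng.choices(range(len(A)), weights=A[z])[0]
        x = rng.choices(range(len(B[z])), weights=B[z])[0]  # x_t | z_t
        zs.append(z)
        xs.append(x)
    return zs, xs
```

With a sticky transition matrix (large diagonal entries), the sampled state sequence shows the long runs characteristic of HMM-generated data.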
The distributions $p(x_t|z_t=k,\theta_k)$ are called emission densities (or emission distributions). How to select $K$? It depends on the problem; there may not be a "true" $K$ at all.
HMMs have been the industry standard in speech recognition for decades, and remain close to the state of the art there. They are also used in text processing and bioinformatics.
Given $\theta$ = $K$ emission density parameters and a transition matrix $A$, the likelihood is <latex>\begin{align*}
p(x_{1\dots T}|\theta) &= \sum_{z_{1\dots T}} p(x_{1\dots T}, z_{1\dots T}|\theta) & \text{by LTP}
\end{align*}</latex>
This sum is intractable to compute directly: it has $K^T$ terms. Instead, use the graphical model factorization.
Define $\alpha_t(j) = p(z_t = j, x_{1\dots t})$. Then <latex>\begin{align*}
\alpha_T(j) = p(z_T=j, x_{1\dots T}) &= \sum_{i} p(z_{T}=j,z_{T-1}=i,x_{1\dots T}) & \text{by LTP}\\
&= \sum_i p(x_T|z_T = j)\, p(z_T = j|z_{T-1} = i)\, p(z_{T-1}=i, x_{1\dots T-1}) & \text{by factorization}\\
&= \sum_i p(x_T|z_T = j)\, p(z_T = j|z_{T-1} = i)\, \alpha_{T-1}(i)
\end{align*}</latex>
Given $\alpha_{T-1}(i)$ for $i=1\dots K$, we can compute $\alpha_T(j)$ for all $j$ in time $O\left(K^2 + K f(d)\right)$, where $f(d)$ is the cost of evaluating one $d$-dimensional emission density.
This is a nice recursive form that allows computing the full likelihood $p(x_{1\dots T}) = \sum_j \alpha_T(j)$ in time $O\left(TK^2 + TK f(d)\right)$. This is known as the Forward Algorithm, the first step in the Forward-Backward Algorithm.
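The recursion can be sketched for a discrete-emission HMM, where each $p(x_t|z_t=j)$ is a table lookup so $f(d)$ is constant. The parameter names (`pi0`, `A`, `B`) are illustrative, not from the notes:

```python
# Forward Algorithm sketch: alphas[t][j] = p(z_{t+1} = j, x_{1..t+1}) (0-indexed t).
# pi0[j]  : initial probabilities p(z_1 = j)
# A[i][j] : transition probabilities p(z_t = j | z_{t-1} = i)
# B[j][x] : emission probabilities p(x_t = x | z_t = j)

def forward(obs, pi0, A, B):
    K = len(pi0)
    alphas = [[pi0[j] * B[j][obs[0]] for j in range(K)]]
    for t in range(1, len(obs)):
        prev = alphas[-1]
        alphas.append([
            B[j][obs[t]] * sum(A[i][j] * prev[i] for i in range(K))
            for j in range(K)
        ])
    return alphas, sum(alphas[-1])  # p(x_{1..T}) = sum_j alpha_T(j)
```

Each step is a $K \times K$ sum plus $K$ emission lookups, matching the $O(TK^2 + TKf(d))$ total above, versus the $K^T$ terms of the naive sum.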
<latex>\begin{align*}
p(z_t = j | x_{1\dots T}) &\propto p(z_t =j,x_{1\dots t},x_{t+1\dots T})\\
&= p(x_{t+1\dots T}|z_t =j,x_{1\dots t})\, p(z_t=j , x_{1\dots t}) & \text{by factorization}\\
&= p(x_{t+1\dots T}|z_t =j)\, p(z_t=j , x_{1\dots t}) & \text{by C.I.}\\
&= p(x_{t+1\dots T}|z_t =j)\, \alpha_t(j)
\end{align*}</latex>
Define $\beta_t(j) = p(x_{t+1 \dots T}|z_t = j)$, so that $p(z_t = j | x_{1 \dots T}) \propto \alpha_t(j)\, \beta_t(j)$. The $\beta$s can also be computed recursively in time $O\left(TK^2 + TK f(d)\right)$.
To compute $p(z_t=j|x_{1\dots T})$ for all $t,j$, run the forward and backward recursions once each and combine the results. (This is similar to the E-step of EM.)
$\alpha_T(d) = p(z_T=d, x_{1 \dots T})$. We can compute $\alpha_T$ from $\alpha_{T-1}$ in time $O\left(K^2 + Kf(d)\right)$, and then $p(z_T = d | x_{1 \dots T}) = \frac{\alpha_T(d)}{\sum_{k=1}^K \alpha_T(k)}$.
$w_t(d) = p(z_t = d | x_{1 \dots T}) \propto \alpha_t(d) \beta_t(d)$
In total, to compute $\alpha_t,\beta_t$ for $t= 1\dots T$ is in $O(TK^2 + TKf(d))$. This is the forward-backward algorithm.
The typical method for doing inference with an HMM is the Forward-Backward Algorithm.
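A sketch of the backward recursion and the combination step, again for a discrete-emission HMM with illustrative parameters `A` (transitions) and `B` (emissions); `alphas` is the table $\alpha_t(j)$ from a forward pass:

```python
# betas[t][j] = p(x_{t+1..T} | z_t = j), with beta_T(j) = 1 by convention.

def backward(obs, A, B):
    K, T = len(A), len(obs)
    betas = [[1.0] * K]
    for t in range(T - 2, -1, -1):
        nxt = betas[0]  # beta_{t+1}
        betas.insert(0, [
            sum(A[j][k] * B[k][obs[t + 1]] * nxt[k] for k in range(K))
            for j in range(K)
        ])
    return betas

# w_t(j) = p(z_t = j | x_{1..T}): normalize alpha_t(j) * beta_t(j) over j.

def posteriors(alphas, betas):
    out = []
    for a, b in zip(alphas, betas):
        un = [ai * bi for ai, bi in zip(a, b)]
        s = sum(un)
        out.append([u / s for u in un])
    return out
```

Note that at every $t$, $\sum_j \alpha_t(j)\beta_t(j) = p(x_{1\dots T})$, so each row normalizes by the same likelihood.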
$p(z_{t} = j | x_{1\dots T}) \propto p(z_{t}=j,x_{1\dots T})$ $ = p(z_{t}=j, x_{1\dots t}, x_{t+1 \dots T})$ $ = p(x_{t+1 \dots T}| z_{t}=j, x_{1 \dots t})\, p(z_{t}=j, x_{1 \dots t})$. We want to compute these two terms.
Consider the second term: <latex>\begin{align*}
p(z_t = j, x_{1\dots t}) &= \sum_{z_{t-1}} p(x_{t} | z_{t}=j)\, p(z_{t}=j | z_{t-1})\, p(z_{t-1} , x_{1 \dots t-1})
\end{align*}</latex>
Then you can recurse by using the result of the previous time-step, the transition matrix, and the current evidence
to compute the term for the current step.
This is $O(K^2)$ work to compute the term for all $K$ values of $j$. This is known as the forward step in the forward-backward algorithm.
Consider the first term: <latex>\begin{align*}
p(x_{t+1 \dots T} | z_t = j) &= \sum_{z_{t+1}} p(x_{t+1 \dots T} | z_{t+1}, z_{t}=j)\, p(z_{t+1} | z_{t}=j)\\
&= \sum_{z_{t+1}} p(x_{t+1 \dots T} | z_{t+1})\, p(z_{t+1} | z_{t}=j) & \text{by C.I.}\\
&= \sum_{z_{t+1}} p(x_{t+2 \dots T} | z_{t+1})\, p(x_{t+1} | z_{t+1})\, p(z_{t+1} | z_{t}=j)
\end{align*}</latex>
Now this is another recursive solution which can be computed in $O(K)$ for fixed $j$, $O(K^2)$ for $j=1\dots K$.
This is known as the backward step in the forward-backward algorithm.
Gibbs sampling may also be used, but EM is generally faster; sampling is sometimes used as an approximation to the forward-backward algorithm.
In the E-step, we want to compute $p(z_t=d|x_{1\dots T}) = w_t(d)$, a $T \times K$ matrix of membership weights. Do this using the F-B algorithm with the current parameters $\theta$. We also need the pairwise marginals:
<latex>\begin{align*}
p(z_t=i,z_{t+1}=j|x_{1\dots T}) &\propto p(z_t=i,z_{t+1}=j,x_{1\dots T})\\
&= p(x_{t+2\dots T}|z_{t+1}=j)\, p(x_{t+1}|z_{t+1}=j)\, p(z_{t+1}=j|z_t=i)\, p(z_t=i,x_{1\dots t})\\
&= \beta_{t+1}(j)\, p(x_{t+1}|z_{t+1}=j)\, A_{ij}\, \alpha_t(i)\\
&\equiv \Phi_t(i,j)\\
p(z_t=i,z_{t+1}=j|x_{1\dots T}) &= \frac{\Phi_t(i,j)}{\sum_{k_1,k_2} \Phi_t(k_1,k_2)}
\end{align*}
</latex>
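Computing $\Phi_t$ from the $\alpha$ and $\beta$ tables can be sketched as follows for discrete emissions; `alphas` and `betas` are the tables from forward and backward passes, and `A`, `B` are illustrative names for the transition and emission tables:

```python
# xis[t][i][j] = p(z_t = i, z_{t+1} = j | x_{1..T}), obtained by normalizing
# Phi_t(i, j) = alpha_t(i) * A[i][j] * B[j][x_{t+1}] * beta_{t+1}(j) over (i, j).

def pairwise_weights(obs, alphas, betas, A, B):
    K = len(A)
    xis = []
    for t in range(len(obs) - 1):
        phi = [[alphas[t][i] * A[i][j] * B[j][obs[t + 1]] * betas[t + 1][j]
                for j in range(K)] for i in range(K)]
        s = sum(sum(row) for row in phi)
        xis.append([[v / s for v in row] for row in phi])
    return xis
```

The normalizer $\sum_{k_1,k_2}\Phi_t(k_1,k_2)$ equals $p(x_{1\dots T})$ at every $t$, so each slice is normalized by the same likelihood.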
Now in the M-step, re-estimate the parameters: the transition entries via $A_{ij} \propto \sum_{t=1}^{T-1} p(z_t=i,z_{t+1}=j|x_{1\dots T})$, and each emission density's parameters $\theta_k$ by weighted maximum likelihood with weights $w_t(k)$, just as with membership weights in a mixture model.
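A hedged M-step sketch for the discrete-emission special case (the notes treat general emission densities $\theta_k$; a categorical emission table `B` is used here for concreteness, and all names are illustrative):

```python
# M-step for a discrete-emission HMM.
# w[t][j]     : p(z_t = j | x_{1..T})          (singleton posteriors)
# xi[t][i][j] : p(z_t = i, z_{t+1} = j | x_{1..T})  (pairwise posteriors)
# V           : number of discrete observation symbols

def m_step(obs, w, xi, K, V):
    pi0 = list(w[0])  # initial distribution from the first posterior
    A = []
    for i in range(K):
        counts = [sum(x[i][j] for x in xi) for j in range(K)]
        total = sum(counts)
        # fall back to uniform if state i has no expected visits
        A.append([c / total if total else 1.0 / K for c in counts])
    B = []
    for j in range(K):
        counts = [sum(w[t][j] for t in range(len(obs)) if obs[t] == v)
                  for v in range(V)]
        total = sum(counts)
        B.append([c / total if total else 1.0 / V for c in counts])
    return pi0, A, B
```

Each row is a normalized expected count, the weighted-MLE analogue of the hard counts used when the states are observed.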
If $z_t$ is known for the training data, but not at prediction time: Learning: fit the transition matrix and emission densities directly by maximum likelihood, since the states are observed. Prediction: compute $p(z_t = j | x_{1\dots T})$ using the forward-backward algorithm.
If $z_t$ is unknown during training (only the $x_t$s are known): Learning: use the Expectation-Maximization algorithm.
#* Get this from the F-B algorithm
#* Directly analogous to membership weights of a mixture model
#* $O(K^2 T)$ per iteration
Autoregressive HMMs additionally contain direct dependencies between successive observed variables ($x_t$ depends on $x_{t-1}$ as well as on $z_t$). The model is a bit more complicated, but has been found useful in time-series analysis.
In a Markov model, run lengths are geometric: each state has a fixed self-transition probability, so the distribution over how long a state persists is geometric. A semi-Markov extension can create non-geometric run lengths: when a state transition occurs, a run length is drawn from a duration distribution. The disadvantage is that the model is no longer Markov, and the forward-backward algorithm becomes $O(T K^2 L_{max})$, where $L_{max}$ can be as large as $T$.
Conditional Random Fields: Model $p(z_t | x_{1\dots T}, y)$ rather than $p(z_t , x_{1\dots T}, y)$, where $y$ represents some set of non-local, non-time-dependent features. Widely used in language modeling, where e.g. $y$ is sentence length. These are like Markov random fields in two dimensions. Labeled data is usually needed for training.
HMM structures with continuous state variables, e.g. the vector $z_t$ now includes velocity values in addition to position. When the transition and emission models are linear with Gaussian noise, this is known as a linear dynamical system; the analogue of the forward algorithm is the Kalman filter (and of forward-backward, the Kalman smoother).
In computer vision, there is typically a 2-dimensional array of hidden states representing a scene. These are referred to as Markov Random Fields.
It is common to have underflow or overflow problems when computing the chain of likelihoods. Log-space computation is typically used, and a running scale constant may also be carefully maintained to avoid these numerical problems.
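The scaling trick can be sketched as follows: normalize $\alpha_t$ at each step and accumulate the log of the normalizers, whose product equals $p(x_{1\dots T})$. Discrete-emission parameters (`pi0`, `A`, `B`) are illustrative:

```python
import math

# Scaled forward pass: returns log p(x_{1..T}) without underflow, since
# alpha is renormalized at every step and only log-scale factors accumulate.

def log_likelihood(obs, pi0, A, B):
    K = len(pi0)
    alpha = [pi0[j] * B[j][obs[0]] for j in range(K)]
    loglik = 0.0
    for t in range(1, len(obs) + 1):
        c = sum(alpha)          # scale factor c_t; prod of all c_t = p(x_{1..T})
        loglik += math.log(c)
        alpha = [a / c for a in alpha]
        if t < len(obs):        # unnormalized alpha_{t+1} from normalized alpha_t
            alpha = [B[j][obs[t]] * sum(A[i][j] * alpha[i] for i in range(K))
                     for j in range(K)]
    return loglik
```

For long sequences, the unscaled $\alpha_T$ would underflow to zero in floating point, while the scaled version stays well-conditioned.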