Cory's Wiki

See General derivation of the EM Algorithm: pages 404-406 in Barber, pages 363-369 in Murphy

Expectation Maximization is an iterative unsupervised learning algorithm for maximizing likelihood, typically in the presence of missing data (such as component memberships in finite mixture models).

In the “E-step”, you take the expectation over the missing data given the current parameters. The “M-step” then maximizes the expected log-likelihood with respect to the parameters, before the missing-data expectations are updated again.
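As a concrete illustration of the two steps (a toy example, not from these notes: the flip counts and starting biases are made up), consider estimating the biases of two coins when we never observe which coin produced each batch of flips. The E-step computes each batch's expected coin membership; the M-step re-estimates both biases from those soft assignments.

```python
# Toy EM for two coins with hidden coin identity (made-up data).
flips = 10
heads = [5, 9, 8, 4, 7]              # heads per batch of 10 flips
theta_a, theta_b = 0.6, 0.5          # initial guesses for the coin biases

def binom_lik(h, n, p):
    # Binomial likelihood up to the n-choose-h constant, which cancels
    # when the membership weights are normalized.
    return (p ** h) * ((1 - p) ** (n - h))

for _ in range(50):
    # E-step: expected membership of each batch for coin A.
    w_a = []
    for h in heads:
        la = binom_lik(h, flips, theta_a)
        lb = binom_lik(h, flips, theta_b)
        w_a.append(la / (la + lb))
    # M-step: weighted maximum-likelihood update of each bias.
    heads_a = sum(w * h for w, h in zip(w_a, heads))
    total_a = sum(w * flips for w in w_a)
    heads_b = sum((1 - w) * h for w, h in zip(w_a, heads))
    total_b = sum((1 - w) * flips for w in w_a)
    theta_a, theta_b = heads_a / total_a, heads_b / total_b
```

With this data and initialization the estimates separate, with coin A converging to the higher bias.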

Guaranteed to find at least a local maximum, but not guaranteed to converge to a global maximum. The log-likelihood increases monotonically across iterations. It is a good idea to run from multiple starting points and keep the solution with the highest log-likelihood. EM can be thought of as a type of local hill climbing where the direction in $\theta$-space and the step size are chosen automatically. The rate of convergence depends on the amount of missing information.

Can be combined with other optimization techniques.

Intractable Models

If the E-step is intractable, the true membership weights can be approximated by sampling, e.g. with Monte Carlo methods.

If the M-step is intractable, may require an iterative optimization algorithm which is run for every M-step.

  • Or use GEM (Generalized EM), where the M-step moves uphill in $\log L(\theta)$-space but doesn't necessarily find a maximum.
Online EM

Standard EM is a batch algorithm, in that it considers all data points during an iterative update. Alternatively, the online version can update parameters based on random subsets of points. An extreme case is an update after just a single new data point.

For online EM, convergence is still guaranteed so long as the update method is modified correctly. It can converge faster and even achieve better results than the batch method.

See pp. 365-366 in the Murphy text.

Generalized EM

In the M-step, instead of maximizing the expected log-likelihood over $\theta$, just move uphill.

Still has the same overall convergence properties, but may be useful when the normal M-step can't be computed in closed form.
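A sketch of the uphill move (toy values, not from these notes): a single 1-D Gaussian component with known unit variance and fixed membership weights $w_i$ from some E-step. Instead of jumping to the closed-form weighted mean, the M-step takes small gradient steps on the expected log-likelihood $Q(\mu) = -\tfrac{1}{2}\sum_i w_i (x_i-\mu)^2 + \text{const}$.

```python
# GEM-style M-step sketch (made-up data and learning rate): move uphill
# on Q(mu) by gradient ascent rather than solving for the maximum
# mu* = sum_i w_i x_i / sum_i w_i in closed form.
x = [1.0, 2.0, 3.0, 10.0]          # observed data
w = [1.0, 1.0, 1.0, 0.1]           # fixed membership weights from an E-step
mu, lr = 0.0, 0.05                 # initial mean and step size

for _ in range(200):
    grad = sum(wi * (xi - mu) for wi, xi in zip(w, x))   # dQ/dmu
    mu += lr * grad                # each small step increases Q
```

Each single step only improves $Q$; repeated steps approach the closed-form weighted mean, which is why GEM keeps the same monotone-likelihood guarantee.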

As an algorithm for learning mixture models

Data $D = \{ (x_i, z_i) \}$, where $z_i$ is a hidden indicator vector of which component generated $x_i$.

  • Assume a mixture model with known $K$ and the functional form $p_k(x|z,\theta_k)$.
  • Don't know $\theta$, $\alpha$, or $z$.
  • Want to learn $\theta$ and $\alpha$.
  • These can be used with the mixture model to learn a distribution for $z$.
  1. Initialize parameters at $\theta^0$
  2. An iteration:
    1. E-step: Compute $E \left[ z_{ik} | x_i, \theta^{current} \right]$ with respect to $p(z_{ik}|x_i,\theta^{current})$, where $i \in 1 \dots N$ and $k \in 1 \dots K$

    • <latex>\begin{align*}
      p(z_{ik}=1|x_i,\theta^{current}) &\propto p(x_i|z_{ik}=1,\theta^{current})\, p(z_{ik}=1|\theta^{current})
      \end{align*}</latex> normalized over $k$ for each $i$.

    • The result is a distribution over classes for each data point.

  1. M-step: update parameters with maximum-likelihood given the results of the E-step

    • This is a maximization of the expected log-likelihood, because we are working with a distribution over the $z$ variables.

  1. Check for convergence

    • Can check changes in the parameters after the M-step, but may not know the scale of the parameters
    • Can check log-likelihood changes in the E-step

EM for Gaussian Mixtures

Calculations: E-step (given $\theta$ model parameters)

  1. $w_{ik} = p(z_{ik}=1|x_i,\theta^{old}) = p_k(x_i|z_{ik}=1,\theta^{old})\, p(z_{ik}=1|\theta^{old}) / C$
    • $= p_k(x_i|z_{ik}=1,\theta^{old})\, \alpha_k^{old} / C$ = (Gaussian density for the kth component) × $\alpha_k^{old}$ / $C$
  2. Normalizing constant: $C = p(x_i|\theta^{old}) = \sum_{m=1}^K p_m(x_i|z_{im}=1,\theta_m^{old})\, \alpha_m^{old}$

Complexity: $O(NKd^2 + Kd^3)$

M-step (given $w_{ik}$ membership weights): let $N_k = \sum_{i=1}^N w_{ik}$ = expected number of data points assigned to component $k$ under the current weights.

  1. $\alpha_k^{new} = \frac{N_k}{N}$
  2. $\mu_k^{new} = \frac{1}{N_k} \sum_{i=1}^N w_{ik} x_i$ = weighted average of the data, weighting each $x_i$ by its membership weight for component $k$
  3. $\Sigma_k^{new} = \frac{1}{N_k} \sum_{i=1}^N w_{ik} (x_i - \mu_k^{new}) (x_i - \mu_k^{new})^t$


Log-likelihood (to monitor convergence): $\log L(\Theta) = \sum_{i=1}^N \log p(x_i|\Theta) = \sum_{i=1}^N \left( \log \sum_{k=1}^K \alpha_k p_k(x_i|z_{ik}=1,\theta_k)\right)$

Hacks: There is a risk that one cluster will become responsible for just a single data point. To avoid this spurious solution, the diagonal entries of each cluster's covariance can be forced to stay away from zero.

A better method to fix this is to use MAP instead of ML on the M-step. See Murphy text p355-356.
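A minimal sketch of the whole loop for a 1-D, two-component Gaussian mixture (synthetic data; the variance floor implements the simple hack above rather than the MAP fix):

```python
import math
import random

# 1-D, two-component Gaussian mixture EM on synthetic data.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)] + \
       [random.gauss(5.0, 1.0) for _ in range(100)]

alpha = [0.5, 0.5]
mu = [min(data), max(data)]      # crude initialization at the extremes
var = [1.0, 1.0]
VAR_FLOOR = 1e-3                 # keeps a cluster from collapsing to a point

def gauss_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for _ in range(30):
    # E-step: membership weights w_ik, normalized over k for each i.
    w = []
    for x in data:
        num = [alpha[k] * gauss_pdf(x, mu[k], var[k]) for k in range(2)]
        c = sum(num)
        w.append([nk / c for nk in num])
    # M-step: weighted ML updates; N_k = expected count per component.
    for k in range(2):
        n_k = sum(wi[k] for wi in w)
        alpha[k] = n_k / len(data)
        mu[k] = sum(wi[k] * x for wi, x in zip(w, data)) / n_k
        var[k] = sum(wi[k] * (x - mu[k]) ** 2 for wi, x in zip(w, data)) / n_k
        var[k] = max(var[k], VAR_FLOOR)   # the variance-floor hack
```

On this well-separated data the means converge near the true cluster centers 0 and 5.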

EM for Binary Data Mixtures


  • <latex>\begin{align*}
    p(x_i) &= \sum_{k=1}^K \alpha_k\, p(x_i|\theta_k,z_{ik}=1) \\
    \theta_{jk} &= p(x_{ij}=1|z_{ik}=1)
    \end{align*}</latex>

E-step: $p(x_i|z_{ik}=1,\theta^{current}) = \prod_{j=1}^D \theta_{jk}^{x_{ij}}(1-\theta_{jk})^{1-x_{ij}}$

M-step: $\alpha_k^{new} = \frac{N_k}{N}$, $\theta_{jk}^{new} = \frac{1}{N_k}\sum_{i=1}^N w_{ik} x_{ij}$
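A sketch of these E- and M-steps on a small binary data set (the data, $K=2$, $D=3$, and initial parameters are all made up):

```python
# EM for a mixture of Bernoullis on made-up binary data.
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]   # N=4 points, D=3 bits
alpha = [0.5, 0.5]
theta = [[0.7, 0.6, 0.2], [0.2, 0.3, 0.8]]  # theta[k][j] = p(x_j=1 | z_k=1)

for _ in range(20):
    # E-step: w_ik proportional to alpha_k * prod_j Bernoulli(x_ij; theta_jk).
    w = []
    for x in X:
        lik = []
        for k in range(2):
            p = alpha[k]
            for j in range(3):
                p *= theta[k][j] if x[j] == 1 else (1 - theta[k][j])
            lik.append(p)
        c = sum(lik)
        w.append([l / c for l in lik])
    # M-step: alpha_k = N_k / N; theta_jk = weighted fraction of 1-bits.
    for k in range(2):
        n_k = sum(wi[k] for wi in w)
        alpha[k] = n_k / len(X)
        for j in range(3):
            theta[k][j] = sum(wi[k] * x[j] for wi, x in zip(w, X)) / n_k
```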

EM For Markov Clusters

Where a cluster is modeled as a Markov Chain.

E-step: For each sequence $s_i$ and each cluster $k \in 1 \dots K$, estimate $p(c = k | s_i)$.

M-step: given $p(c = k | s_i)$, estimate $A_k$ and $P_{k0}$ for each cluster: the transition matrix and starting-state distribution of each cluster's Markov chain.

Complexity: See class slides

EM for Semisupervised Learning

In a semisupervised learning environment, there are labels for some data observations, but not all. An EM approach treats the missing labels as missing data: $p(z_{ik}=1|x_i,\theta)$ is fixed for the known labels, but learned for unlabeled data.

The benefit to using this unlabeled data and semisupervised approach depends highly on the correctness of the model.
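A sketch of the semisupervised E-step (all values made up): membership weights are computed as usual for unlabeled points, then clamped to the known one-hot label for labeled points.

```python
# Clamp E-step membership weights for the labeled points (made-up values).
labels = {0: 1, 3: 0}      # data index -> known component label
w = [[0.6, 0.4], [0.3, 0.7], [0.5, 0.5], [0.2, 0.8]]   # E-step output

for i, k in labels.items():
    # Labeled points get a fixed one-hot distribution over components;
    # unlabeled rows keep their learned soft weights.
    w[i] = [1.0 if j == k else 0.0 for j in range(2)]
```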

Convergence Issues

The rate of convergence is a function of the amount of missing information. More missing information leads to larger variances, which in turn leads to greater overlap of cluster models.

Singular Solutions

Especially when $N$ is small relative to $K$, may have $\sigma^2 \to 0$ and $\log L(\theta) \to \infty$. One solution is to enforce a lower bound on $\sigma^2$ in the M-step. A more principled approach is to put a prior $p(\theta)$ on the parameters and do MAP estimation in the M-step.

See Murphy text 11.4 about Wishart priors: $\Sigma_k^{MAP} = \alpha \Sigma_k^{ML} + (1-\alpha)\Sigma_0$
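A sketch of that shrinkage update on made-up numbers (plain nested lists standing in for matrices; $\alpha$ here is the shrinkage weight from the prior, not a mixture weight):

```python
# MAP covariance as a convex combination of the ML estimate and a prior
# covariance Sigma_0 (all values made up for illustration).
a = 0.8                                # shrinkage weight toward the ML estimate
sigma_ml = [[2.0, 0.9], [0.9, 1.5]]    # ML covariance for one component
sigma_0  = [[1.0, 0.0], [0.0, 1.0]]    # prior covariance (identity here)

sigma_map = [[a * sigma_ml[r][c] + (1 - a) * sigma_0[r][c]
              for c in range(2)] for r in range(2)]
```

Shrinking toward a well-conditioned $\Sigma_0$ keeps the MAP covariance away from the singular ML solutions above.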