Cory's Wiki

Mixture models are a classic modeling approach with roots in biology research.

Notation

$x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$

$p(x) = \sum_{k=1}^K p_k(x|z_k,\theta_k) \, p(z_k)$

$p_k$ is called the $k$th mixture component. $z$ can be viewed as a binary indicator vector in which exactly one entry $z_k$ equals one. Let $\alpha$ be the mixture distribution over $z$, with $\alpha_k = p(z_k)$.

example: a mixture of arbitrary densities, where $k=1$ indicates a Gaussian, $k=2$ an exponential, and $k=3$ a Gamma.

example: mixture of Gaussians. Here $\theta_k = \{\mu_k,\Sigma_k\}$ and $p_k()$ is a Gaussian density with parameters $\theta_k$. For mixture weights $\alpha = \{0.6, 0.4\}$, $p(x) = 0.6\, p_1(x|\dots) + 0.4\, p_2(x|\dots)$.
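As a rough sketch of how this example evaluates in code, assuming 1-D components with made-up means and standard deviations (not values from this page):

<code python>
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture of two Gaussians with weights alpha = {0.6, 0.4}.
# The means and standard deviations below are illustration values only.
alpha = np.array([0.6, 0.4])
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 2.0])

def mixture_density(x):
    """p(x) = 0.6 * N(x; mu_1, sigma_1^2) + 0.4 * N(x; mu_2, sigma_2^2)."""
    return sum(a * norm.pdf(x, loc=m, scale=s) for a, m, s in zip(alpha, mu, sigma))

print(mixture_density(1.0))
</code>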

example: mixture of conditionally independent Bernoulli trials. $p_k(x|z_k,\theta_k) = \prod_{j = 1}^d p_k(x_j=1)^{x_j} \, (1 - p_k(x_j=1))^{1-x_j}$
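A minimal sketch of this component density, assuming a binary vector $x$ and a hypothetical vector of per-dimension probabilities for component $k$:

<code python>
import numpy as np

def bernoulli_component_density(x, p_k):
    """p_k(x | z_k, theta_k) = prod_j p_k(x_j=1)^{x_j} * (1 - p_k(x_j=1))^{1-x_j}.

    x   : binary vector of length d
    p_k : per-dimension success probabilities for component k (assumed given)
    """
    x = np.asarray(x, dtype=float)
    p_k = np.asarray(p_k, dtype=float)
    return np.prod(p_k ** x * (1.0 - p_k) ** (1.0 - x))

# Example with made-up parameters for a d = 3 binary vector.
print(bernoulli_component_density([1, 0, 1], [0.9, 0.2, 0.7]))
</code>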

Applications of Mixture Models

Mixture models are a very flexible approach to density estimation: they allow writing a complex density as a combination of simpler ones. They are appropriate when you want to model systems whose components correspond to real physical phenomena.

Learning

Assume that each data point is generated from only a single component.

  • Generative model (a code sketch follows this list):
  • for i = 1 to N:
  • $k^*$ ← sample a component for the ith data point ~ $p(z=k)$
  • $x_i$ ← sample a data point from component $k^*$ ~ $p_k(x|\theta_k, z=k^*)$
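A minimal sketch of this sampling loop, assuming a hypothetical 1-D Gaussian mixture (the parameters are illustration values, not part of this page):

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D Gaussian mixture used only to illustrate the sampling loop.
alpha = np.array([0.6, 0.4])      # mixture weights, alpha_k = p(z = k)
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 2.0])

N = 10
samples = np.empty(N)
for i in range(N):
    k_star = rng.choice(len(alpha), p=alpha)             # sample a component ~ p(z = k)
    samples[i] = rng.normal(mu[k_star], sigma[k_star])   # sample x_i ~ p_k(x | theta_k, z = k*)
print(samples)
</code>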

Maximum Likelihood

$\underline{\theta} = \{\underline{\theta}_1, \dots, \underline{\theta}_K, \alpha_1, \dots, \alpha_K\}$

  • $\underline{\theta}_k$ are component parameters
  • $\alpha_k = p(z=k)$

<latex>\begin{align*} L(\theta) &= \log p(D|\theta) \\
&= \sum_{i=1}^N \log p(x_i|\theta) \\
&= \sum_{i = 1}^N \log \left( \sum_{k = 1}^K p_k(x_i|z_k,\theta_k) \, p(z_k) \right) \end{align*}</latex>

The problem with this approach is that the summation over the unknown $z$ values sits inside the logarithm, so the log-likelihood does not decompose and direct maximization is intractable even in simple cases.
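Evaluating $L(\theta)$ for fixed parameters is straightforward; the difficulty is maximizing it. A minimal sketch of the evaluation, using log-sum-exp for numerical stability and the same hypothetical 1-D Gaussian mixture as above:

<code python>
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_log_likelihood(X, alpha, mu, sigma):
    """L(theta) = sum_i log( sum_k alpha_k * N(x_i; mu_k, sigma_k^2) )."""
    # log_pk[i, k] = log alpha_k + log p_k(x_i | theta_k)
    log_pk = np.log(alpha)[None, :] + norm.logpdf(
        np.asarray(X)[:, None], loc=mu[None, :], scale=sigma[None, :]
    )
    return logsumexp(log_pk, axis=1).sum()

# Made-up data and parameters, just to show the call.
X = np.array([0.3, -1.2, 4.8, 5.5])
print(mixture_log_likelihood(X, np.array([0.6, 0.4]),
                             np.array([0.0, 5.0]), np.array([1.0, 2.0])))
</code>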

K-Means

K-Means can be viewed as a non-probabilistic version of EM: it makes hard assignments of points to components instead of computing posterior probabilities.
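A minimal K-Means sketch (hard assignments, Euclidean distance), assuming the data are given as an $N \times d$ array; this is one common variant, not a canonical reference implementation:

<code python>
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-Means: alternate hard assignment and mean update. X is (N, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest center for each point (the "hard" analogue of the E-step).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
</code>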

Expectation Maximization

Expectation Maximization is typically used to solve these problems: it alternates between computing the posterior over $z$ for each data point (E-step) and updating $\underline{\theta}$ and $\alpha$ given those posteriors (M-step).
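A minimal sketch of EM for a 1-D mixture of Gaussians, with made-up initialization choices; it is meant only to illustrate the alternation between responsibilities (E-step) and parameter updates (M-step):

<code python>
import numpy as np
from scipy.stats import norm

def em_gmm_1d(X, K, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture. X is a 1-D array of data points."""
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0 / K)                 # mixture weights
    mu = rng.choice(X, size=K, replace=False)   # crude initialization from the data
    sigma = np.full(K, X.std())
    for _ in range(n_iters):
        # E-step: responsibilities r[i, k] = p(z = k | x_i, theta) via Bayes' rule.
        r = alpha[None, :] * norm.pdf(X[:, None], loc=mu[None, :], scale=sigma[None, :])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates for alpha, mu, sigma.
        Nk = r.sum(axis=0)
        alpha = Nk / len(X)
        mu = (r * X[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (X[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk)
    return alpha, mu, sigma
</code>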

Kernel Density Estimation

Kernel Density Estimation can work well in low dimensions but does not scale well to high-dimensional data.
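A minimal 1-D kernel density estimation sketch with a Gaussian kernel; the bandwidth value is an arbitrary illustration choice:

<code python>
import numpy as np
from scipy.stats import norm

def kde(x_query, X, h=0.5):
    """Gaussian KDE: p_hat(x) = (1 / (N * h)) * sum_i K((x - x_i) / h)."""
    X = np.asarray(X)
    return norm.pdf((x_query - X) / h).sum() / (len(X) * h)

# Made-up sample; evaluates the estimated density at one point.
X = np.array([0.3, -1.2, 0.8, 1.5, 0.1])
print(kde(0.0, X))
</code>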