Mixture models are a classic modeling approach with roots in biology research.
Each data point $x$ is a $d$-dimensional vector:
$x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$
$p(x) = \sum_{k=1}^K p_k(x|z_k,\theta_k)\, p(z_k)$, where $p_k$ is called the $k$th mixture component. $z$ is a binary indicator vector in which exactly one $z_k$ equals one. Let $\alpha$ be the mixture distribution over $z$, so $\alpha_k = p(z_k)$.
example: mixture of arbitrary distribution families, where $k=1$ indicates a Gaussian, $k=2$ an exponential, and $k=3$ a Gamma
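A minimal sketch of this generative view in Python; the weights and component parameters below are made-up placeholders, not values from these notes. Sampling draws the indicator $z$ from $\alpha$, then draws $x$ from the selected family.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed mixture weights alpha_k = p(z_k); illustrative values only.
alpha = np.array([0.5, 0.3, 0.2])

def sample_mixture(n):
    """Draw n points: pick the component via the one-hot indicator z, then sample x."""
    xs = np.empty(n)
    for i in range(n):
        k = rng.choice(3, p=alpha)           # index where z_k = 1
        if k == 0:
            xs[i] = rng.normal(0.0, 1.0)     # k=1: Gaussian (assumed parameters)
        elif k == 1:
            xs[i] = rng.exponential(2.0)     # k=2: exponential (assumed scale)
        else:
            xs[i] = rng.gamma(2.0, 1.0)      # k=3: Gamma (assumed shape, scale)
    return xs

x = sample_mixture(1000)
```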
example: mixture of Gaussians. $\theta_k = \{\mu_k,\Sigma_k\}$, and $p_k(\cdot)$ is a Gaussian density with parameters $\theta_k$. For mixture weights $\alpha = \{0.6, 0.4\}$, $p(x) = 0.6\, p_1(x|\dots) + 0.4\, p_2(x|\dots)$
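A small sketch of evaluating such a two-component Gaussian mixture density at a single point, in 1-D; the means and standard deviations are assumed placeholders.

```python
import numpy as np
from scipy.stats import norm

# Assumed 1-D parameters theta_k = {mu_k, sigma_k}; all values are placeholders.
alpha = np.array([0.6, 0.4])
mu    = np.array([0.0, 3.0])
sigma = np.array([1.0, 0.5])

def mixture_pdf(x):
    """p(x) = 0.6 * N(x | mu_1, sigma_1^2) + 0.4 * N(x | mu_2, sigma_2^2) for a scalar x."""
    return np.sum(alpha * norm.pdf(x, loc=mu, scale=sigma))

print(mixture_pdf(0.0))   # density at one point
```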
example: mixture of conditionally independent Bernoulli trials. $p_k(x|z_k,\theta_k) = \prod_{j=1}^d p_k(x_j=1)^{x_j}\, (1 - p_k(x_j=1))^{1-x_j}$
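A sketch of evaluating this component density for one binary vector; the per-dimension probabilities $p_k(x_j=1)$ are assumed placeholder values.

```python
import numpy as np

# Assumed per-dimension probabilities p_k(x_j = 1) for one component k (d = 4 here).
p_k = np.array([0.9, 0.2, 0.7, 0.5])

def bernoulli_component(x, p):
    """prod_j p_j^{x_j} * (1 - p_j)^{1 - x_j} for a binary vector x."""
    x = np.asarray(x)
    return np.prod(p**x * (1.0 - p)**(1 - x))

print(bernoulli_component([1, 0, 1, 1], p_k))
```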
Mixture models are a very flexible approach to density estimation: they allow a complex density to be written as a combination of simpler ones. They are especially appropriate when the system being modeled has real physical component phenomena.
Assume that each data point is generated from only a single component.
The full set of parameters to estimate is $\underline{\theta} = \{\underline{\theta}_1, \dots, \underline{\theta}_K, \alpha_1, \dots, \alpha_K\}$.
<latex>\begin{align*}
L(\theta) &= \log p(D|\theta) \\
&= \sum_{i=1}^N \log p(x_i|\theta) \\
&= \sum_{i=1}^N \log \left( \sum_{k=1}^K p_k(x_i|z_k,\theta_k)\, p(z_k) \right)
\end{align*}</latex>
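As a numerical sketch, this log-likelihood can be evaluated for a Gaussian mixture using a log-sum-exp over components for stability; all parameter values here are assumed placeholders.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# Assumed parameters of a 1-D two-component Gaussian mixture (placeholders).
alpha = np.array([0.6, 0.4])
mu    = np.array([0.0, 3.0])
sigma = np.array([1.0, 0.5])

def log_likelihood(data):
    """L(theta) = sum_i log sum_k alpha_k p_k(x_i | theta_k), computed via log-sum-exp."""
    data = np.asarray(data, dtype=float)[:, None]                       # shape (N, 1)
    log_terms = np.log(alpha) + norm.logpdf(data, loc=mu, scale=sigma)  # shape (N, K)
    return logsumexp(log_terms, axis=1).sum()

print(log_likelihood([0.1, 2.8, -1.3, 3.2]))
```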
The problem with this approach is the sum over the unknown $z$ inside the logarithm: direct maximization of this likelihood has no closed-form solution even in simple cases.
K-Means can be viewed as the non-probabilistic (hard-assignment) version of EM for Gaussian mixtures.
Expectation Maximization (EM) is typically used to solve these problems.
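A minimal EM sketch for a 1-D two-component Gaussian mixture, under assumptions not stated in these notes (K fixed at 2, crude initialization, a fixed number of iterations, no convergence check):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM (toy sketch)."""
    x = np.asarray(x, dtype=float)
    # crude initialization
    alpha = np.array([0.5, 0.5])
    mu    = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()]) + 1e-6

    for _ in range(n_iter):
        # E-step: responsibilities r_ik = p(z_k | x_i, theta)
        dens = alpha * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (N, 2)
        r = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and standard deviations
        n_k   = r.sum(axis=0)
        alpha = n_k / len(x)
        mu    = (r * x[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k) + 1e-6

    return alpha, mu, sigma
```

The E-step computes soft responsibilities in place of the unobserved indicators $z$, and the M-step re-estimates $\alpha$, $\mu$, and $\sigma$ from them.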
Kernel Density Estimation can work well in low dimensions but doesn't scale well to higher dimensions.