Michael's Wiki

Maximum Likelihood Estimation

MLE is a method for learning model parameters from data by maximizing the likelihood function. It can be done in closed form for simple models, but in general requires numerical optimization (e.g. gradient methods).

What does it have to do with Gibbs Sampling?


Consider a model $M$ with parameters $\theta$ and a likelihood term of the form $P(\text{data}|\theta)$, with $N$ observations in data $D = \{x_1,x_2,\dots x_N\}$, where each $x$ is a binary variable, $x \in \{0,1\}$.

$\text{Likelihood}(\theta;D) = L(\theta) = p(D|\theta) = p(x_1,x_2,\dots x_N|\theta)$

The likelihood is the probability of the observed data, viewed as a function of the parameters. It can be written $L(\theta)$ or $L(\theta|D)$.

The basic assumption of MLE is that when $L(\theta_1) > L(\theta_2)$, the model defined by $\theta_1$ is probably more accurate. Choosing the parameter value that maximizes the likelihood gives the maximum likelihood estimate.

Because the likelihood is a product of $N$ probabilities, it tends to be very small. For numerical and analytical reasons, it is common to work with the log-likelihood. Since $\log$ is monotonic, the max/min of the log-likelihood occurs at the same parameter values as the max/min of the likelihood.
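A quick illustration of the numerical point (a sketch with made-up per-sample probabilities, not part of the original notes): multiplying a few thousand probabilities underflows 64-bit floats, while summing their logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=2000)  # 2000 hypothetical per-sample probabilities

likelihood = np.prod(p)             # underflows to exactly 0.0 in float64
log_likelihood = np.sum(np.log(p))  # finite, and has the same maximizer

print(likelihood)
print(log_likelihood)
```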

Data Notation

D : data set $\{x_1,x_2,\dots x_n\}$ where each $x$ can be a binary, categorical, or real-valued random variable. Each $x_i$ is of dimension $d$, so the data form an $n \times d$ matrix of $n$ samples of length $d$.

  • How are the samples related? Independence?
  • How are the data features related?

$\hat{\theta}_{ML}$ is the maximum likelihood estimate of $\theta$.

Derivation Examples

Coin Tossing

$D = \{x_1,x_2,\dots x_n\},\ x_i \in \{0,1\}$. Assume a conditional independence model, with parameter $\theta = p(X_i=1)$.

  • The $x_i$ are mutually independent conditioned on $\theta$.

$$p(D|\theta) = p(x_1,x_2,\dots x_N|\theta) = \prod_{i=1}^N p(x_i|\theta) = \prod_{i=1}^N \theta^{x_i} (1-\theta)^{1-x_i} = \left( \prod_{i:x_i=1} \theta \right) \left( \prod_{i:x_i=0} (1-\theta) \right) = \theta^r (1-\theta)^{N-r}$$

where $r$ is the number of observations with $x_i = 1$.

$\log L(\theta) = r \log(\theta) + (N-r) \log(1-\theta)$

$\frac{d}{d\theta} \log L(\theta) = \frac{r}{\theta} - \frac{N-r}{1-\theta} = 0 \text{ at } \hat{\theta}_{ML}$

$ \frac{r}{\hat{\theta}_{ML}} = \frac{N-r}{1-\hat{\theta}_{ML}} $

$ \hat{\theta}_{ML} = \frac{r}{N} $
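A quick numerical check of this result (a sketch; the simulated coin bias and the grid search are my own additions, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.7                 # hypothetical coin bias
x = rng.binomial(1, theta_true, size=1000)
N, r = len(x), int(x.sum())      # N tosses, r heads

theta_hat = r / N                # closed-form MLE derived above

# Numerical check: maximize the log-likelihood over a grid of theta values
grid = np.linspace(0.001, 0.999, 999)
log_lik = r * np.log(grid) + (N - r) * np.log(1 - grid)
theta_grid = grid[np.argmax(log_lik)]
```

The grid maximizer agrees with $r/N$ up to the grid resolution.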

Univariate Gaussian Example

Univariate Gaussian distribution, parameterized by $\mu,\sigma^2$.

Known Variance

Say $\sigma^2$ is known, so the only free parameter is $\mu$: $$\hat{\mu}_{ML} = \arg\max_\mu L(\mu) = \arg\max_\mu \left[ \prod_{i=1}^N \mathcal{N}(x_i; \mu,\sigma^2) \right]$$

Take logs and drop the constant $\frac{1}{\sqrt{2\pi\sigma^2}}$ terms that don't depend on $\mu$. Maximizing the log-likelihood is, in this case, equivalent to minimizing the squared error of the model over the data, and the maximum likelihood estimate of $\mu$ is the sample mean.
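This can be checked numerically; the following sketch (simulated data, with hypothetical values for $\mu$ and $\sigma$) compares a grid search over $\mu$ with the sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0                       # known standard deviation (hypothetical)
x = rng.normal(5.0, sigma, size=500)

# Log-likelihood as a function of mu, with mu-independent constants dropped
def log_lik(mu):
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2

grid = np.linspace(4.0, 6.0, 2001)
mu_grid = grid[np.argmax([log_lik(m) for m in grid])]

# mu_grid agrees with x.mean() up to the grid resolution
```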

Unknown Variance

Say $\sigma^2$ is unknown. Now $\theta$ includes both the mean and the variance. More generally, let $D = \{x_1,x_2,\dots x_n\}$ with each $x_i$ a $d$-dimensional vector, and assume $p(x_i) = \mathcal{N}(\mu,\Sigma)$, where $\mu$ and $\Sigma$ are unknown.

  • Data samples are independent.
  • Features are correlated according to $\Sigma$.

$$p(D|\theta) = p(x_1,x_2,\dots x_n|\theta)= \prod_{i=1}^N p(x_i|\theta)$$

In this case it is easier to work with the log-likelihood. In the univariate case:


<latex> \begin{align*}
L(\theta) = p(D|\theta) &= \prod_{i=1}^N p(x_i|\theta) \\
\log L &= \sum_{i=1}^N \log p(x_i|\theta) \\
&= \sum_{i=1}^N \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right] \\
&= \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \frac{1}{2}\log \sigma^2 - \frac{(x_i - \mu)^2}{2\sigma^2} \right] \\
\frac{d \log L}{d\sigma^2} &= \sum_{i=1}^N \left[ -\frac{1}{2\sigma^2} + \frac{(x_i - \mu)^2}{2\sigma^4} \right] = 0 \\
\Rightarrow \hat{\sigma}^2_{ML} &= \frac{1}{N} \sum_{i=1}^N (x_i - \hat{\mu}_{ML})^2 \end{align*}</latex>
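The derivative above leads to the standard result $\hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^N (x_i - \hat{\mu})^2$, the biased sample variance (dividing by $N$ rather than $N-1$). A quick sketch with simulated data (the true parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 3.0, size=2000)   # hypothetical data, true sigma^2 = 9
mu_hat = x.mean()

# Full Gaussian log-likelihood as a function of sigma^2 (mu fixed at mu_hat)
def log_lik(s2):
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - (x - mu_hat) ** 2 / (2 * s2))

grid = np.linspace(5.0, 15.0, 10001)
s2_grid = grid[np.argmax([log_lik(s) for s in grid])]

# Closed form: the biased sample variance (divide by N, not N - 1)
s2_hat = np.mean((x - mu_hat) ** 2)
```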

Multinomial Example

Multinomial model:

  • $p(X=k) = \theta_k$ for $k = 1,\dots K$,
  • where $\theta = \{\theta_1,\dots \theta_K\}$
  • and $\sum_{k=1}^K \theta_k = 1$.

Sufficient statistics: $n_k$ = the number of times result $k$ appears in the data.

<latex> \begin{align*}
p(D|\theta) &= \prod_{k=1}^K \theta_k^{n_k} \\
\log p(D|\theta) &= \log \prod_{k=1}^K \theta_k^{n_k} \\
&= \sum_{k=1}^K n_k \log \theta_k & \text{Add a Lagrange term for the constraint } \sum_{j=1}^K \theta_j = 1 \\
&\Rightarrow \sum_{k=1}^K n_k \log \theta_k - \lambda \left[\sum_{j=1}^K \theta_j - 1 \right] \\
\frac{d}{d \theta_k} &= \frac{d}{d \theta_k} \left( \sum_{j=1}^K n_j \log \theta_j - \lambda \left[\sum_{j=1}^K \theta_j - 1 \right] \right) \\
&= \frac{n_k}{\theta_k} - \lambda = 0 \\
\theta_k &= \frac{n_k}{\lambda} \end{align*} </latex> Plug this value of $\theta_k$ into the constraint $\sum_{k=1}^K \theta_k = 1$. <latex> \begin{align*} \sum_{k=1}^K \frac{n_k}{\lambda} &= 1 \\
\lambda &= \sum_{k=1}^K n_k & \text{Now use this value of $\lambda$ in the estimate above} \\
\hat{\theta}_{k,ML} &= \frac{n_k}{\sum_{j=1}^K n_j} \end{align*} </latex>
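In code, the multinomial MLE is just normalized counts; a small sketch with simulated categorical data (the true $\theta$ values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical categorical model
x = rng.choice(len(theta_true), size=10000, p=theta_true)

n_k = np.bincount(x, minlength=len(theta_true))  # sufficient statistics n_k
theta_hat = n_k / n_k.sum()                      # MLE: n_k / sum_j n_j
```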

Uniform Example

$X$ is uniformly distributed with lower limit $a$ and upper limit $b$, where $b > a$: $$p(x) = \frac{1}{b-a} \text{ for } a \le x \le b$$ and $p(x) = 0$ for $x < a$ or $x > b$.

Given data set $D = \{x_1, \dots , x_N\}$, we know that $a \leq \min(D)$ and $b \geq \max(D)$.

$$\text{Likelihood}(D|a,b) = P(D|a,b) = \prod_{i=1}^N \frac{1}{b-a} = (b-a)^{-N}$$

This increases as $a$ increases and as $b$ decreases. Because $a$ is bounded above by $\min(D)$ and $b$ below by $\max(D)$, the maximum likelihood estimates are $\hat{a}_{ML} = \min(D)$ and $\hat{b}_{ML} = \max(D)$.
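A brief numerical sketch with simulated uniform data (the true limits are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
a_true, b_true = 2.0, 7.0                 # hypothetical true limits
x = rng.uniform(a_true, b_true, size=1000)

# MLE: the tightest interval that still contains all the data
a_hat, b_hat = x.min(), x.max()
```

Note the estimated interval always lies inside the true one, so the fitted model assigns zero probability to future samples that fall just outside the observed range.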

Naive Bayes MLE Classification Example

$x$ = $d$-dimensional feature vector; $c$ = class variable (categorical), $c \in \{1,\dots,K\}$.

Training data: $D = \{(x_1,c_1),(x_2,c_2),\dots (x_N,c_N)\}$ where $x_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{id} \end{pmatrix}$ and $x_{ij} \in \{0,1\}$, for $i = 1\dots N$ and $j = 1\dots d$.

Classification problem: learn or estimate $P(c = k|x)$, $k = 1,\dots K$, for classification of new $x$.

Assume that the $(x_i,c_i)$ pairs are conditionally independent given model parameters $\{\theta, \phi\}$: <latex> \begin{align*}
L(\theta,\phi) &= \prod_{i = 1}^N \left[ p(x_i, c_i | \theta,\phi) \right] \\
&= \prod_{i = 1}^N \left[ p(x_i| c_i, \theta) \, p(c_i |\phi) \right] \\
&= \prod_{i = 1}^N \left[ p(x_i| c_i, \theta) \right] \prod_{i = 1}^N \left[ p(c_i |\phi) \right] \end{align*} </latex>

By separating the likelihood into multiple terms, each depending on only one unknown parameter, each term can be maximized independently.

$\prod_{i = 1}^N \left[ p(x_i| c_i, \theta) \right] = \prod_{k = 1}^K \left( \prod_{x_i:c_i = k} \left[ p(x_i| c_i=k, \theta_k) \right] \right)$

The solution for maximizing $\sum_{i=1}^N \log p(c_i|\phi)$ will be of the form $\hat{\phi}_{ML,k} = \frac{N_k}{N}$, where $N_k = \sum_{i=1}^N I(c_i=k)$.

Generalizations: for $x_i \in \mathbb{R}^d$, Naive Bayes assumes $p(x_i|c_i=k) = \prod_{j=1}^d p(x_{ij}|c_i=k)$, i.e. the class-conditional joint density is a product of $d$ one-dimensional density functions; e.g. all $d$ densities could be Gaussian.
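The counting estimates above can be sketched in code. The generative parameters below are hypothetical, chosen just to produce labeled data with binary features:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, K = 5000, 3, 2

# Hypothetical generative model: class prior phi, per-class Bernoulli theta[k][j]
phi_true = np.array([0.3, 0.7])
theta_true = np.array([[0.2, 0.5, 0.8],
                       [0.9, 0.4, 0.1]])
c = rng.choice(K, size=N, p=phi_true)                  # class labels
x = (rng.random((N, d)) < theta_true[c]).astype(int)   # binary features

# MLE by counting, following the factorized likelihood above
phi_hat = np.bincount(c, minlength=K) / N              # N_k / N
theta_hat = np.array([x[c == k].mean(axis=0) for k in range(K)])
```

Because the likelihood factorizes, each $\hat{\theta}_{kj}$ is just the fraction of class-$k$ samples with feature $j$ on.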

Theoretical Properties of Maximum Likelihood

Let $q(x)$ be the data-generating distribution (a.k.a. data-generating function), and let the data $D$ be independent and identically distributed observations from it. Let $p_m(x|\theta)$ be our model, which might not include $q$.

The [http://en.wikipedia.org/wiki/Kullback-Leibler_Distance Kullback-Leibler Divergence] is $KL(q,p) = \int_{-\infty}^{\infty} q(x) \log \frac{q(x)}{p(x)} dx \ge 0$, with equality iff $p(x) = q(x)$ for all $x$.

Let $E = E_{q(x)} \left[ f(x)\right] = \int q(x) f(x) dx$. Given a random (IID) sample $x_1,\dots x_N$ from $q(x)$, define $\hat E = \frac{1}{N} \sum_{i=1}^N f(x_i)$. By the law of large numbers, $\lim_{N \to \infty} \hat E = E_{q(x)} [f(x)]$.

<latex> \begin{align*}
\frac{1}{N} \log L(\theta|D) &= \frac{1}{N} \sum_{i = 1}^N \log p_m(x_i|\theta) & \text{take the limit} \\
\lim_{N \to \infty} \left[ \frac{1}{N} \log L(\theta|D) \right] &= \lim_{N \to \infty} \left[ \frac{1}{N} \sum_{i = 1}^N \log p_m(x_i|\theta) \right] \\
&= E_{q(x)} \left[ \log p_m(x|\theta) \right] \\
&= \int q(x) \log p_m(x|\theta) dx \\
&= - \int q(x) \log \frac{q(x)}{p_m(x|\theta)} dx - \int q(x) \log \frac{1}{q(x)} dx & \text{The second term is independent of $\theta$} \\
&\Rightarrow - \int q(x) \log \frac{q(x)}{p_m(x|\theta)} dx \\
&= - KL \left(q(x),p_m(x|\theta)\right) \\
&\Rightarrow \text{maximizing $\frac{1}{N} \log L(\theta|D)$ is equivalent to minimizing the KL divergence} \end{align*} </latex>

In general, if we have a “model mis-specification”, $p_m(x|\theta)$ gets as close as possible in a KL sense to $q(x)$ as $N \to \infty$.
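A sketch of this mis-specification behavior (assumed setup, not from the original notes: Laplace-distributed data fit with a Gaussian model; among Gaussians, the KL-closest member to $q$ matches $q$'s mean and variance):

```python
import numpy as np

rng = np.random.default_rng(7)
b = 1.5
x = rng.laplace(0.0, b, size=200000)   # q(x): Laplace data; the Gaussian model is mis-specified

# Gaussian MLEs: sample mean and biased sample variance
mu_hat = x.mean()
s2_hat = np.mean((x - mu_hat) ** 2)

# The KL-closest Gaussian to a Laplace(0, b) has mean 0 and
# variance 2 * b**2, which is what the MLE converges to
```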


Max-likelihood is a point estimate given observed data. It puts very high emphasis on the data by dismissing priors. This can be a bad representation when the data is unreliable, and (as in the uniform example) it may assign zero probability to future observations.