Michael's Wiki

Maximum Likelihood Estimation

MLE is a method for learning model parameters from data by maximizing the likelihood function. It can be done in closed form for simple models, but in general requires numerical optimization (e.g. gradient methods).

What does it have to do with Gibbs Sampling?


Consider a model $M$ with parameters $\theta$ and a likelihood term of the form $P(\text{data}|\theta)$, with $N$ observations in data $D = \{x_1,x_2,\dots x_N\}$, where each $x$ is a binary variable, $x \in \{0,1\}$.

$\text{Likelihood}(\theta;D) = L(\theta) = p(D|\theta) = p(x_1,x_2,\dots x_N|\theta)$

The likelihood is the probability of the observed data, viewed as a function of the parameters. It can be written $L(\theta)$ or $L(\theta|D)$.

The basic assumption of MLE is that when $L(\theta_1) > L(\theta_2)$, the model defined by $\theta_1$ is probably more accurate. Choosing the parameter value that maximizes the likelihood gives the maximum likelihood estimate.

Because the likelihood is a product of $N$ probabilities, it tends to be very small. For numerical and analytical reasons, it is common to work with the log-likelihood. Since $\log$ is monotonic, the max/min of the log-likelihood occurs at the same parameter values as the max/min of the likelihood.
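A quick illustration of the numerical point (a sketch with made-up per-sample probabilities, not part of the original notes): multiplying a few thousand probabilities underflows 64-bit floats, while summing their logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=2000)  # 2000 hypothetical per-sample probabilities

likelihood = np.prod(p)             # underflows to exactly 0.0 in float64
log_likelihood = np.sum(np.log(p))  # finite, and has the same maximizer

print(likelihood)
print(log_likelihood)
```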

Data Notation

D : data set $\{x_1,x_2,\dots x_n\}$ where each $x$ can be a binary, categorical, or real-valued random variable. Each $x_i$ is of dimension $d$, so the data form an $n \times d$ matrix of $n$ samples of length $d$.

  • How are the samples related? Independence?
  • How are the data features related?

$\hat{\theta}_{ML}$ is the maximum likelihood estimate of $\theta$.

Derivation Examples

Coin Tossing

$D = \{x_1,x_2,\dots x_n\},\ x_i \in \{0,1\}$. Assume a conditional independence model, with parameter $\theta = p(X_i=1)$.

  • The $x_i$ are mutually independent conditioned on $\theta$.

$$p(D|\theta) = p(x_1,x_2,\dots x_N|\theta) = \prod_{i=1}^N p(x_i|\theta) = \prod_{i=1}^N \theta^{x_i} (1-\theta)^{1-x_i} = \left( \prod_{i:x_i=1} \theta \right) \left( \prod_{i:x_i=0} (1-\theta) \right) = \theta^r (1-\theta)^{N-r}$$

where $r$ is the number of observations with $x_i = 1$.

$\log L(\theta) = r \log(\theta) + (N-r) \log(1-\theta)$

$\frac{d}{d\theta} \log L(\theta) = \frac{r}{\theta} - \frac{N-r}{1-\theta} = 0 \text{ at } \hat{\theta}_{ML}$

$ \frac{r}{\hat{\theta}_{ML}} = \frac{N-r}{1-\hat{\theta}_{ML}} $

$ \hat{\theta}_{ML} = \frac{r}{N} $
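A quick numerical check of this result (a sketch; the simulated coin bias and the grid search are my own additions, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.7                 # hypothetical coin bias
x = rng.binomial(1, theta_true, size=1000)
N, r = len(x), int(x.sum())      # N tosses, r heads

theta_hat = r / N                # closed-form MLE derived above

# Numerical check: maximize the log-likelihood over a grid of theta values
grid = np.linspace(0.001, 0.999, 999)
log_lik = r * np.log(grid) + (N - r) * np.log(1 - grid)
theta_grid = grid[np.argmax(log_lik)]
```

The grid maximizer agrees with $r/N$ up to the grid resolution.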

Univariate Gaussian Example

Univariate Gaussian distribution, parameterized by $\mu,\sigma^2$.

Known Variance

Say $\sigma^2$ is known, so the only free parameter is $\mu$: $$\hat{\mu}_{ML} = \arg\max_\mu L(\mu) = \arg\max_\mu \left[ \prod_{i=1}^N \mathcal{N}(x_i; \mu,\sigma^2) \right]$$

Take logs and drop the constant $\frac{1}{\sqrt{2\pi\sigma^2}}$ terms that don't depend on $\mu$. Maximizing the log-likelihood is, in this case, equivalent to minimizing the squared error of the model over the data, and the maximum likelihood estimate of $\mu$ is the sample mean.
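This can be checked numerically; the following sketch (simulated data, with hypothetical values for $\mu$ and $\sigma$) compares a grid search over $\mu$ with the sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0                       # known standard deviation (hypothetical)
x = rng.normal(5.0, sigma, size=500)

# Log-likelihood as a function of mu, with mu-independent constants dropped
def log_lik(mu):
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2

grid = np.linspace(4.0, 6.0, 2001)
mu_grid = grid[np.argmax([log_lik(m) for m in grid])]

# mu_grid agrees with x.mean() up to the grid resolution
```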

Unknown Variance

Say $\sigma^2$ is unknown. Now $\theta$ includes both the mean and the variance. More generally, let $D = \{x_1,x_2,\dots x_n\}$ with each $x_i$ a $d$-dimensional vector, and assume $p(x_i) = \mathcal{N}(\mu,\Sigma)$, where $\mu$ and $\Sigma$ are unknown.

  • Data samples are independent.
  • Features are correlated according to $\Sigma$.

$$p(D|\theta) = p(x_1,x_2,\dots x_n|\theta)= \prod_{i=1}^N p(x_i|\theta)$$

In this case it is easier to work with the log-likelihood. In the univariate case:


<latex> \begin{align*}
L(\theta) = p(D|\theta) &= \prod_{i=1}^N p(x_i|\theta) \\
\log L &= \sum_{i=1}^N \log p(x_i|\theta) \\
&= \sum_{i=1}^N \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right] \\
&= \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \frac{1}{2}\log \sigma^2 - \frac{(x_i - \mu)^2}{2\sigma^2} \right] \\
\frac{d \log L}{d\sigma^2} &= \sum_{i=1}^N \left[ -\frac{1}{2\sigma^2} + \frac{(x_i - \mu)^2}{2\sigma^4} \right] = 0 \\
\Rightarrow \hat{\sigma}^2_{ML} &= \frac{1}{N} \sum_{i=1}^N (x_i - \hat{\mu}_{ML})^2 \end{align*}</latex>
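The derivative above leads to the standard result $\hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^N (x_i - \hat{\mu})^2$, the biased sample variance (dividing by $N$ rather than $N-1$). A quick sketch with simulated data (the true parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 3.0, size=2000)   # hypothetical data, true sigma^2 = 9
mu_hat = x.mean()

# Full Gaussian log-likelihood as a function of sigma^2 (mu fixed at mu_hat)
def log_lik(s2):
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - (x - mu_hat) ** 2 / (2 * s2))

grid = np.linspace(5.0, 15.0, 10001)
s2_grid = grid[np.argmax([log_lik(s) for s in grid])]

# Closed form: the biased sample variance (divide by N, not N - 1)
s2_hat = np.mean((x - mu_hat) ** 2)
```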

Multinomial Example

Multinomial model:

  • $p(X=k) = \theta_k$ for $k = 1,\dots K$,
  • where $\theta = \{\theta_1,\dots \theta_K\}$
  • and $\sum_{k=1}^K \theta_k = 1$.

Sufficient statistics: $n_k$ = the number of times result $k$ appears in the data.

<latex> \begin{align*}
p(D|\theta) &= \prod_{k=1}^K \theta_k^{n_k} \\
\log p(D|\theta) &= \log \prod_{k=1}^K \theta_k^{n_k} \\
&= \sum_{k=1}^K n_k \log \theta_k & \text{Add a Lagrange term for the constraint } \sum_{j=1}^K \theta_j = 1 \\
&\Rightarrow \sum_{k=1}^K n_k \log \theta_k - \lambda \left[\sum_{j=1}^K \theta_j - 1 \right] \\
\frac{d}{d \theta_k} &= \frac{d}{d \theta_k} \left( \sum_{j=1}^K n_j \log \theta_j - \lambda \left[\sum_{j=1}^K \theta_j - 1 \right] \right) \\
&= \frac{n_k}{\theta_k} - \lambda = 0 \\
\theta_k &= \frac{n_k}{\lambda} \end{align*} </latex> Plug this value of $\theta_k$ into the constraint $\sum_{k=1}^K \theta_k = 1$. <latex> \begin{align*} \sum_{k=1}^K \frac{n_k}{\lambda} &= 1 \\
\lambda &= \sum_{k=1}^K n_k & \text{Now use this value of $\lambda$ in the estimate above} \\
\hat{\theta}_{k,ML} &= \frac{n_k}{\sum_{j=1}^K n_j} \end{align*} </latex>
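In code, the multinomial MLE is just normalized counts; a small sketch with simulated categorical data (the true $\theta$ values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical categorical model
x = rng.choice(len(theta_true), size=10000, p=theta_true)

n_k = np.bincount(x, minlength=len(theta_true))  # sufficient statistics n_k
theta_hat = n_k / n_k.sum()                      # MLE: n_k / sum_j n_j
```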

Uniform Example

$X$ is uniformly distributed with lower limit $a$ and upper limit $b$, where $b > a$: $$p(x) = \frac{1}{b-a} \text{ for } a \le x \le b$$ and $p(x) = 0$ for $x < a$ or $x > b$.

Given data set $D = \{x_1, \dots , x_N\}$, we know that $a \leq \min(D)$ and $b \geq \max(D)$.

$$\text{Likelihood}(D|a,b) = P(D|a,b) = \prod_{i=1}^N \frac{1}{b-a} = (b-a)^{-N}$$

This increases as $a$ increases and as $b$ decreases. Because $a$ is bounded above by $\min(D)$ and $b$ below by $\max(D)$, the maximum likelihood estimates are $\hat{a}_{ML} = \min(D)$ and $\hat{b}_{ML} = \max(D)$.
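A brief numerical sketch with simulated uniform data (the true limits are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
a_true, b_true = 2.0, 7.0                 # hypothetical true limits
x = rng.uniform(a_true, b_true, size=1000)

# MLE: the tightest interval that still contains all the data
a_hat, b_hat = x.min(), x.max()
```

Note the estimated interval always lies inside the true one, so the fitted model assigns zero probability to future samples that fall just outside the observed range.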

Naive Bayes MLE Classification Example

$x$ = $d$-dimensional feature vector; $c$ = class variable (categorical), $c \in \{1,\dots,K\}$.

Training data: $D = \{(x_1,c_1),(x_2,c_2),\dots (x_N,c_N)\}$ where $x_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{id} \end{pmatrix}$ and $x_{ij} \in \{0,1\}$, for $i = 1\dots N$ and $j = 1\dots d$.

Classification problem: learn or estimate $P(c = k|x)$, $k = 1,\dots K$, for classification of new $x$.

Assume that the $(x_i,c_i)$ pairs are conditionally independent given model parameters $\{\theta, \phi\}$: <latex> \begin{align*}
L(\theta,\phi) &= \prod_{i = 1}^N \left[ p(x_i, c_i | \theta,\phi) \right] \\
&= \prod_{i = 1}^N \left[ p(x_i| c_i, \theta) \, p(c_i |\phi) \right] \\
&= \prod_{i = 1}^N \left[ p(x_i| c_i, \theta) \right] \prod_{i = 1}^N \left[ p(c_i |\phi) \right] \end{align*} </latex>

By separating the likelihood into multiple terms, each depending on only one unknown parameter, each term can be maximized independently.

$\prod_{i = 1}^N \left[ p(x_i| c_i, \theta) \right] = \prod_{k = 1}^K \left( \prod_{x_i:c_i = k} \left[ p(x_i| c_i=k, \theta_k) \right] \right)$

The solution for maximizing $\sum_{i=1}^N \log p(c_i|\phi)$ will be of the form $\hat{\phi}_{ML,k} = \frac{N_k}{N}$, where $N_k = \sum_{i=1}^N I(c_i=k)$.

Generalizations: for $x_i \in \mathbb{R}^d$, Naive Bayes assumes $p(x_i|c_i=k) = \prod_{j=1}^d p(x_{ij}|c_i=k)$, i.e. the class-conditional joint density is a product of $d$ one-dimensional density functions; e.g. all $d$ densities could be Gaussian.
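The counting estimates above can be sketched in code. The generative parameters below are hypothetical, chosen just to produce labeled data with binary features:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, K = 5000, 3, 2

# Hypothetical generative model: class prior phi, per-class Bernoulli theta[k][j]
phi_true = np.array([0.3, 0.7])
theta_true = np.array([[0.2, 0.5, 0.8],
                       [0.9, 0.4, 0.1]])
c = rng.choice(K, size=N, p=phi_true)                  # class labels
x = (rng.random((N, d)) < theta_true[c]).astype(int)   # binary features

# MLE by counting, following the factorized likelihood above
phi_hat = np.bincount(c, minlength=K) / N              # N_k / N
theta_hat = np.array([x[c == k].mean(axis=0) for k in range(K)])
```

Because the likelihood factorizes, each $\hat{\theta}_{kj}$ is just the fraction of class-$k$ samples with feature $j$ on.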

Theoretical Properties of Maximum Likelihood

Let $q(x)$ be the data-generating distribution (a.k.a. data-generating function), and let the data $D$ be independent and identically distributed observations from it. Let $p_m(x|\theta)$ be our model, which might not include $q$.

The [http://en.wikipedia.org/wiki/Kullback-Leibler_Distance Kullback-Leibler Divergence] is $KL(q,p) = \int_{-\infty}^{\infty} q(x) \log \frac{q(x)}{p(x)} dx \ge 0$, with equality iff $p(x) = q(x)$ for all $x$.

Let $E = E_{q(x)} \left[ f(x)\right] = \int q(x) f(x) dx$. Given a random (IID) sample $x_1,\dots x_N$ from $q(x)$, define $\hat E = \frac{1}{N} \sum_{i=1}^N f(x_i)$. By the law of large numbers, $\lim_{N \to \infty} \hat E = E_{q(x)} [f(x)]$.

<latex> \begin{align*}
\frac{1}{N} \log L(\theta|D) &= \frac{1}{N} \sum_{i = 1}^N \log p_m(x_i|\theta) & \text{take the limit} \\
\lim_{N \to \infty} \left[ \frac{1}{N} \log L(\theta|D) \right] &= \lim_{N \to \infty} \left[ \frac{1}{N} \sum_{i = 1}^N \log p_m(x_i|\theta) \right] \\
&= E_{q(x)} \left[ \log p_m(x|\theta) \right] \\
&= \int q(x) \log p_m(x|\theta) dx \\
&= - \int q(x) \log \frac{q(x)}{p_m(x|\theta)} dx - \int q(x) \log \frac{1}{q(x)} dx & \text{The second term is independent of $\theta$} \\
&\Rightarrow - \int q(x) \log \frac{q(x)}{p_m(x|\theta)} dx \\
&= - KL \left(q(x),p_m(x|\theta)\right) \\
&\Rightarrow \text{maximizing $\frac{1}{N} \log L(\theta|D)$ is equivalent to minimizing the KL divergence} \end{align*} </latex>

In general, if we have a “model mis-specification”, $p_m(x|\theta)$ gets as close as possible in a KL sense to $q(x)$ as $N \to \infty$.
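A sketch of this mis-specification behavior (assumed setup, not from the original notes: Laplace-distributed data fit with a Gaussian model; among Gaussians, the KL-closest member to $q$ matches $q$'s mean and variance):

```python
import numpy as np

rng = np.random.default_rng(7)
b = 1.5
x = rng.laplace(0.0, b, size=200000)   # q(x): Laplace data; the Gaussian model is mis-specified

# Gaussian MLEs: sample mean and biased sample variance
mu_hat = x.mean()
s2_hat = np.mean((x - mu_hat) ** 2)

# The KL-closest Gaussian to a Laplace(0, b) has mean 0 and
# variance 2 * b**2, which is what the MLE converges to
```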


Max-likelihood is a point estimate given observed data. It puts very high emphasis on the data by dismissing priors. This can be a bad representation when the data is unreliable, and (as in the uniform example) it may assign zero probability to future observations.