Machine Learning is a class of techniques for learning a model of data, frequently used in data mining and business intelligence.
Learning refers to learning the parameters of a model. It is also referred to as inference in Bayesian methods. Once a model has been learned, it might be used for predictive modeling or exploratory data analysis.
In a generative model, there is also usually a way to “go back” and generate or simulate data by fixing the parameters.
From a Bayesian framework, we would want to estimate something like $P(\theta|D) \propto P(D|\theta) P(\theta)$. The prior probability over $\theta$ can be controversial, so, ignoring it or using a flat prior, the likelihood alone is typically used to carry out the parameter estimation.
$D = \{(x_i,c_i)\}$ for $i=1\dots N$, with $x_i \in \mathbb{R}^d$ and $c_i \in \{0,1\}$. We can write the likelihood as $\prod_{i=1}^N p(x_i,c_i)$ using conditional independence.
<latex>\begin{align*}
L(\theta) &= \prod_{i=1}^N p(c_i|x_i,\theta) & \text{assuming CI}
\end{align*}</latex>
This is a conditional model, but we haven't defined the actual probabilities yet. The function most commonly used in machine learning is the logistic function: $$ p(c_i=1|x_i) = \frac{1}{1 + e^{-(w^T x_i + w_0)}} $$ or, written component-wise with parameters $\alpha$, $$ p(c=1|x,\alpha) = \frac{1}{1 + e^{-(\alpha_0 + \sum_{j=1}^d \alpha_j x_j)}}$$
It defines a linear decision boundary. This is logistic regression, an example of a generalized linear model.
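As a minimal sketch of the prediction side of this model (the function names `logistic` and `p_class1` are illustrative, not from the notes), the logistic function squashes the linear score $w^T x + w_0$ into a probability:

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_class1(x, w, w0):
    """p(c=1 | x) for logistic regression with weight vector w and
    intercept w0. The decision boundary w^T x + w0 = 0 is linear in x."""
    z = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return logistic(z)
```

On the boundary ($w^T x + w_0 = 0$) the model outputs exactly $0.5$; far from it, the probability saturates toward 0 or 1.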
Learn a model for the joint distribution $p(x_i,c_i)$ to predict the class label of a new vector $x^*$. We can compute $p(c=k|x^* , \hat\theta_{ML} , \hat \alpha_{ML})$ using Bayes' rule.
$D = \{(\underline{x}_i,c_i)\}$.
<latex>\begin{align*}
L(\theta) = p(D|\theta) &= \prod_{i=1}^N p(x_i|c_i,\theta)\,p(c_i) \\
&= \prod_{k=1}^K \left[ \prod_{i:\,c_i=k} p(x_i|c_i=k,\theta_k)\,p(c_i=k) \right]
\end{align*}</latex>
Key Points
$p(x|c=k,\theta_k) = \mathcal{N}(\mu_k,\Sigma_k)$. We need to learn $K$ sets of parameters $\mu_k,\Sigma_k$, for $k=1,\dots,K$. There is sensitivity to the Gaussian assumption. With $O(d^2)$ parameters per class, this can scale poorly as $d$ increases; in practice, in high-dimensional problems you can assume that the $\Sigma_k$s are diagonal.
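A minimal one-dimensional sketch of this class-conditional Gaussian classifier (the helper names and the toy data are illustrative, not from the notes): fit $\hat\mu_k, \hat\sigma_k^2$ and a class prior per class by maximum likelihood, then apply Bayes' rule.

```python
import math

def gaussian_pdf(x, mu, var):
    """N(x | mu, var) density for scalar x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_class_gaussians(xs, cs):
    """ML estimates of (mu_k, var_k, prior_k) for each class label in cs."""
    params = {}
    for k in set(cs):
        xk = [x for x, c in zip(xs, cs) if c == k]
        mu = sum(xk) / len(xk)
        var = sum((x - mu) ** 2 for x in xk) / len(xk)
        params[k] = (mu, var, len(xk) / len(xs))
    return params

def posterior(x, params):
    """p(c=k | x) via Bayes' rule: proportional to p(x|c=k) p(c=k)."""
    joint = {k: gaussian_pdf(x, mu, var) * pk
             for k, (mu, var, pk) in params.items()}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}
```

In $d$ dimensions the same recipe applies with full (or diagonal) covariance matrices in place of the scalar variances.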
In Naive Bayes classification, you model $p(x|c=k,\theta_k) = \prod_{j=1}^d p(x_j|c=k,\theta_k)$, treating the components of $x$ as conditionally independent given the class.
With sequence data, we can do classification by learning a Markov model for each class: $p(s_i|c_i=k,\theta_k) = \prod_{j=2}^{L_i} p(s_{ij}|s_{i,j-1},\theta_k)$, where $L_i$ is the length of sequence $s_i$.
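A minimal sketch of this per-class Markov-model classifier (the transition matrices, the example alphabet, and the `classify` helper are illustrative assumptions, not from the notes):

```python
import math

def markov_log_likelihood(seq, trans):
    """log prod_{j=2}^{L} p(s_j | s_{j-1}, theta_k) for one sequence,
    given a transition table trans[prev][cur]. The first symbol s_1 is
    conditioned on, as in the formula above."""
    return sum(math.log(trans[prev][cur]) for prev, cur in zip(seq, seq[1:]))

def classify(seq, models, priors):
    """Pick the class k maximizing log p(s | c=k, theta_k) + log p(c=k)."""
    return max(models, key=lambda k: markov_log_likelihood(seq, models[k])
                                     + math.log(priors[k]))
```

For example, a class whose transition matrix favors self-transitions assigns high likelihood to runs like `"aaaa"`, while a class favoring alternation prefers `"abab"`.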
Treat the parameters $\theta$ as random variables. In particular, before observing any data, we have a prior density $p(\theta)$ for $\theta$.
As more data is gathered, the role of the prior is reduced. It is more influential when data is limited.
<latex> \begin{align*}
p(\theta|D) &\propto p(D|\theta)\,p(\theta) \\
&\propto \text{Likelihood} \times \text{Prior}
\end{align*}</latex>
$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}$ is known as the posterior density. In comparison to maximum likelihood estimation:
<latex>
\begin{tabular}{l c l @{\ \ $\Rightarrow$\ \ } l}
Maximum Likelihood Estimation & ML & $\hat \theta_{ML}$ & $\arg\max_{\theta}\ p(D|\theta)$ \\
Posterior Mode & MAP (maximum a posteriori) & $\hat \theta_{MAP}$ & $\arg\max_{\theta}\ p(\theta|D)$ \\
Posterior Mean & MPE (mean posterior estimate) & $\hat \theta_{MPE}$ & $E_{p(\theta|D)} [\theta]$ \\
Full posterior density & & & $p(\theta|D)$ \\
\end{tabular}
</latex>
$\theta$ = mean weight of fish in a lake, assuming a Gaussian likelihood. $$ p(\theta) = \mathcal{N}(\mu, \sigma^2)$$ Here $\mu$ is the mean of the prior and $\sigma^2$ is the variance of the prior.
$D = \{x_1, \dots , x_N\}$ , $x_i \in \{0,1\}$ , $p(x_i=1) = \theta$.
$L(\theta) = \theta^r (1 - \theta)^{N-r}$ , $r = \sum_{i=1}^N I(x_i=1)$
A common choice for a prior on $\theta$ is the Beta density: <latex>\begin{align*}
p(\theta) = \text{Beta}(\theta|\alpha,\beta) &\propto \theta^{\alpha-1} (1-\theta)^{\beta-1}
\end{align*}</latex>
<latex> \begin{align*}
E[\theta] &= \frac{\alpha}{\alpha+\beta} \\
\text{mode}[\theta] &= \frac{\alpha-1}{\alpha+\beta-2} \\
\text{Var}[\theta] &= \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \\
&\propto \frac{1}{2 \alpha + 1} \propto \frac{1}{\alpha} & \text{for } \alpha = \beta
\end{align*}</latex>
The posterior is also in a Beta form due to conjugacy.
<latex> \begin{align*}
E_{p(\theta|D)}[\theta] &= \frac{r + \alpha}{N + \alpha + \beta}
\end{align*}</latex>
The MPE effectively smooths the maximum likelihood estimate; the larger $\alpha$ and $\beta$ are, the more smoothing. $\alpha$ and $\beta$ are referred to as pseudo-counts because they act like counts from previous trials: $\alpha$ is the pseudo-count for the number of successes, and $\alpha+\beta$ is the pseudo-count for the prior number of trials.
As $N \to \infty$, MPE for $\theta \to \frac{r}{N} = \hat \theta_{ML}$. Variance of $p(\theta|D) \to 0$ as $N\to\infty$.
The posterior density is the same form as the prior when using a conjugate prior. In this case, the Beta is a conjugate prior to the Bernoulli.
<latex> \begin{align*}
p(\theta|D) &\propto p(D|\theta)\,p(\theta) \\
&= \theta^r (1-\theta)^{N-r}\, \theta^{\alpha-1} (1-\theta)^{\beta-1} \\
&= \theta^{r+\alpha-1} (1-\theta)^{N-r+\beta-1} \\
&= \theta^{\alpha'-1} (1-\theta)^{\beta'-1} & \text{where } \alpha' = r+\alpha,\ \beta' = N-r+\beta
\end{align*}</latex>
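The conjugate update above amounts to adding pseudo-counts to the observed counts. A minimal sketch (the function names are illustrative, not from the notes):

```python
def beta_posterior(r, N, alpha, beta):
    """Conjugate update: a Beta(alpha, beta) prior plus r successes in
    N Bernoulli trials gives a Beta(r + alpha, N - r + beta) posterior."""
    return r + alpha, N - r + beta

def posterior_mean(alpha_post, beta_post):
    """MPE estimate E[theta | D] = alpha' / (alpha' + beta')."""
    return alpha_post / (alpha_post + beta_post)
```

With a flat Beta(1, 1) prior, 7 successes in 10 trials give the smoothed estimate $8/12 \approx 0.67$ rather than the ML estimate $0.7$; as $N$ grows, the two converge.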
$D = \{x_1, \dots , x_N\}$ , $x_i \in \{1, \dots , K\}$, $K>2$, $\theta_k = p(x_i=k)$ with $\sum_{k=1}^K \theta_k = 1$. E.g. $x_i$ = occurrence of a word in a document, $K$ = number of unique words. $r_k$ = number of $x_i$'s taking value $k$ in $D$, so $\sum_{k=1}^K r_k = N$.
<latex> \begin{align*}
L(\theta) &= \prod_{k=1}^K \theta_k^{r_k} \\
\hat \theta_{ML,k} &= \frac{r_k}{N}
\end{align*}</latex>
The maximum likelihood estimate in this case may require smoothing if we don't want it to assign zero probability to values not seen in $D$.
A conjugate prior to the multinomial is the Dirichlet distribution. This is a generalization of the Beta to higher dimensions.
<latex>\begin{align*}
p(\theta) &\propto \prod_{k=1}^K \theta_k^{\alpha_k-1} & \alpha_k > 0
\end{align*}</latex>
The $\alpha_k$ parameters are directly analogous to the parameters of the Beta prior in the Beta-Binomial case; e.g. for text, the $\alpha_k$s could be proportional to the frequency of words in English text.
The posterior density will have the form $p(\theta|D) = Dirichlet(\alpha')$ where $\alpha_k' = \alpha_k + r_k$.
The prior mean is $E_{p(\theta)} [\theta_k] = \frac{\alpha_k}{\sum_j \alpha_j}$. The posterior mean is $E_{p(\theta|D)} [\theta_k] = \frac{r_k+\alpha_k}{N + \sum_j \alpha_j}$.
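This posterior mean is exactly additive smoothing of the counts. A minimal sketch (the function name is illustrative, not from the notes):

```python
def dirichlet_posterior_mean(counts, alphas):
    """Posterior mean E[theta_k | D] = (r_k + alpha_k) / (N + sum_j alpha_j)
    under a Dirichlet(alpha) prior on multinomial parameters theta."""
    N = sum(counts)
    A = sum(alphas)
    return [(r + a) / (N + A) for r, a in zip(counts, alphas)]
```

A category with zero observed count still gets nonzero probability $\alpha_k / (N + \sum_j \alpha_j)$, which is the smoothing behavior motivated above.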
This setup is common in tracking problems, where we assume that the movement of the object has some Gaussian noise to it.
$D = \{x_1, \dots , x_N\}$ , $x_i \in \mathbb{R}$, assuming $x_i \sim \mathcal{N}(\mu,\sigma^2)$ with $\sigma^2$ known, and the $x_i$s are conditionally independent given $\theta$.
$L(\mu) \propto \prod_{i=1}^N e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
The conjugate prior is Gaussian.
$p(\mu) = \mathcal{N}(\mu_0,s^2)$ where $\mu_0$ is the prior mean, and $s^2$ represents our prior uncertainty about $\mu$.
Posterior: <latex>\begin{align*}
p(\mu|D) &\propto \left( \prod_{i=1}^N e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \right) e^{-\frac{(\mu-\mu_0)^2}{2 s^2}} \\
&\propto e^{-\frac{1}{2} \left[ \sum_{i=1}^N \left( \frac{x_i-\mu}{\sigma}\right)^2 + \left( \frac{\mu-\mu_0}{s}\right)^2 \right]} \\
&\propto e^{-\frac{1}{2} (a\mu^2 + b\mu + c) } & \text{for some } a,b,c \\
&\propto e^{-\frac{1}{2} \left( \frac{\mu-\mu_N}{\sigma_N}\right)^2} & \text{where } \mu_N &= \gamma \hat \mu_{ML} + (1-\gamma)\mu_0 \\
& & \hat \mu_{ML} &= \frac{1}{N} \sum_{i=1}^N x_i \\
& & \gamma &= \frac{N s^2}{Ns^2 + \sigma^2} \\
\frac{1}{\sigma_N^2} &= \frac{N}{\sigma^2} + \frac{1}{s^2}
\end{align*}</latex>
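The closed-form update above can be sketched directly (the function name is illustrative, not from the notes): the posterior mean shrinks the ML estimate toward the prior mean $\mu_0$, and the posterior precision is the sum of the data and prior precisions.

```python
def gaussian_posterior(xs, sigma2, mu0, s2):
    """Posterior N(mu_N, sigma_N^2) over the mean of a Gaussian with known
    variance sigma2, given a N(mu0, s2) prior and observations xs."""
    N = len(xs)
    mu_ml = sum(xs) / N
    gamma = N * s2 / (N * s2 + sigma2)          # weight on the data
    mu_N = gamma * mu_ml + (1 - gamma) * mu0    # shrinkage toward mu0
    sigma2_N = 1.0 / (N / sigma2 + 1.0 / s2)    # precisions add
    return mu_N, sigma2_N
```

As $N \to \infty$, $\gamma \to 1$ and $\sigma_N^2 \to 0$, recovering the ML estimate with vanishing posterior uncertainty, consistent with the earlier discussion of the prior's diminishing role.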