Given some observed features, we want to make predictions about another variable. When the target variable is discrete, this is a form of classification. When the target variable is continuous, it is a regression problem.
$D = \{(\underline{x}_i, y_i)\}_{i=1}^N$, where each input is a $d$-dimensional vector $\underline{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix}$
Want to predict the $y$s based on some $x$s. Learn a model $f(\underline{x} ; \underline{\theta})$, $\underline{\theta} = \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_d \end{pmatrix}$
Linear models with squared error have historically been widely used; examples include Linear Regression and Logistic Regression.
Generally, linear models, which are linear in the parameters $\underline{\theta}$ and whose complexity grows only linearly in the number of parameters as the data increases, can still be non-linear in the $x$s.
e.g. $ f(\underline{x};\underline{\theta}) = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 $ is non-linear in $\underline{x}$, but equals $\underline{\theta}^T \underline{x}'$ with the expanded feature vector $\underline{x}' = (x_1, x_2, x_1 x_2)^T$, so it is still linear in $\underline{\theta}$.
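A minimal numerical sketch of this point (the two-feature dataset, coefficients, and noise level below are made up for illustration): because the model is linear in $\underline{\theta}$, building the expanded feature vector $\underline{x}' = (x_1, x_2, x_1 x_2)$ reduces the fit to ordinary least squares.
<pre>
import numpy as np

# Made-up two-feature data whose target includes an x1*x2 interaction term.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.7 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)

# Expanded feature vector x' = (x1, x2, x1*x2): non-linear in x, linear in theta,
# so ordinary least squares recovers theta directly.
X_expanded = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
theta_hat, *_ = np.linalg.lstsq(X_expanded, y, rcond=None)
print(theta_hat)  # should be close to (1.5, -2.0, 0.7)
</pre>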
If $f(x;\theta)$ is non-linear in $\theta$, you get a set of non-linear equations to solve. Gradient techniques are often used here to find the $\theta$ that minimizes mean-squared error.
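As a sketch of the gradient approach (the exponential toy model, step size, and iteration count here are arbitrary choices for illustration): for a model such as $f(x;\theta) = \theta_1 e^{\theta_2 x}$, which is non-linear in $\theta$, we can follow the gradient of the MSE directly.
<pre>
import numpy as np

# Toy model f(x; theta) = theta1 * exp(theta2 * x), non-linear in theta.
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=200)
y = 2.0 * np.exp(0.5 * x) + 0.1 * rng.normal(size=200)

theta = np.array([1.0, 0.0])   # initial guess
lr = 0.01                      # step size
for _ in range(20000):
    f = theta[0] * np.exp(theta[1] * x)
    resid = f - y
    # Gradient of MSE(theta) = mean((f - y)^2) with respect to (theta1, theta2).
    grad = np.array([
        2 * np.mean(resid * np.exp(theta[1] * x)),
        2 * np.mean(resid * theta[0] * x * np.exp(theta[1] * x)),
    ])
    theta -= lr * grad
print(theta)  # should move toward roughly (2.0, 0.5)
</pre>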
There is usually an implicit assumption that the training observations are IID samples from an underlying distribution $p(x,y)$, so the underlying problem involves a joint distribution that factors as $p(x,y) = p(y|x)p(x)$. This factorization highlights two sources of variability to account for: variation in the observed $x$s, described by $p(x)$, and variation in $y$ at a given $x$, described by $p(y|x)$. The typical modeling framework reflects this: $y_x = E[y|x] + e$, where $e$ is zero-mean noise around the conditional mean.
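A tiny simulation of this framework (the distributions and conditional mean below are made up for illustration): draw $x$ from $p(x)$, then add zero-mean noise around an assumed $E[y|x]$.
<pre>
import numpy as np

# Hypothetical data-generating process matching y_x = E[y|x] + e:
# variability in x comes from p(x); variability in y at a given x comes from p(y|x).
rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=1000)    # samples from an assumed p(x)
E_y_given_x = np.sin(2 * x)          # an assumed conditional mean E[y|x]
e = 0.3 * rng.normal(size=1000)      # zero-mean noise from p(y|x)
y = E_y_given_x + e
print(np.mean(y - E_y_given_x))      # residual mean is near zero by construction
</pre>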
Assuming that the $y_i$s are conditionally independent given $\underline{\theta}$ and the $x_i$s, the conditional likelihood is <latex>\begin{align*}
L(\underline{\theta}) = p(D_y|D_x,\underline{\theta}) &= \prod_{i=1}^N p(y_i|x_i,\underline{\theta})
\end{align*}</latex>
Say we assume a Gaussian noise model: $p(y|x,\theta) = \mathcal{N}(f(x;\theta),\sigma^2)$. $f(x;\theta)$ is $E[y|x]$ and can be any linear or non-linear function with parameters $\theta$. $\sigma^2$ could also be an unknown parameter, but for now assume it is known.
$p(y_i|x_i,\underline{\theta}) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{1}{2\sigma^2} \left( y_i - f(x_i;\theta) \right)^2} $
Maximizing the likelihood is the same as minimizing the MSE: taking logs, <latex>\begin{align*}
\log L(\underline{\theta}) &= \sum_{i=1}^N \log p(y_i|x_i,\underline{\theta}) \\
&= -\frac{N}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N \left( y_i - f(x_i;\theta) \right)^2 \\
&\propto -\text{MSE}(\theta)
\end{align*}</latex>
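A quick numerical check of this equivalence (the one-parameter model $f(x;\theta)=\theta x$ and the constants below are arbitrary): over a grid of $\theta$ values, the negative log-likelihood and the MSE are minimized at the same $\theta$.
<pre>
import numpy as np

# Under the Gaussian noise model, the argmin of the negative log-likelihood
# coincides with the argmin of the MSE.
rng = np.random.default_rng(2)
sigma = 0.5
x = rng.normal(size=100)
y = 3.0 * x + sigma * rng.normal(size=100)   # data from f(x; theta) = theta * x

thetas = np.linspace(0.0, 6.0, 601)
mse = np.array([np.mean((y - t * x) ** 2) for t in thetas])
nll = np.array([
    -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - t * x) ** 2 / (2 * sigma**2))
    for t in thetas
])
print(thetas[np.argmin(mse)], thetas[np.argmin(nll)])   # same minimizer
</pre>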
$\sigma^2$ may be modeled as a function of $x$. May use non-Gaussian noise models, non-IID models, or Bayesian with priors on the parameters $\theta$. This goes beyond minimizing MSE.
For example, with a zero-mean Gaussian prior on each $\theta_j$: <latex>\begin{align*}
p(\theta_j) &= \mathcal{N}(0,\sigma_\theta^2), & p(\underline{\theta}) &= \prod_{j=1}^d p(\theta_j) \\
-\log p(\underline{\theta}|D_x,D_y) &\propto -\log p(D_y|D_x,\underline{\theta}) - \log p(\underline{\theta}) \\
&= \frac{1}{2 \sigma^2} \sum_{i=1}^N \left(y_i - f(x_i;\theta)\right)^2 + \frac{1}{2 \sigma_\theta^2} \sum_{j=1}^d \theta_j^2 \\
&= \frac{N}{2 \sigma^2} \text{MSE}(\theta) + \frac{1}{2 \sigma_\theta^2} \sum_{j=1}^d \theta_j^2
\end{align*}</latex>
The first term measures goodness of fit; the second is the regularization term. $\hat \theta_{MAP}$ is the minimizer of this expression.
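For the special case where $f(\underline{x};\underline{\theta}) = \underline{\theta}^T\underline{x}$ is linear in $\underline{\theta}$, this objective can be minimized in closed form (ridge regression). A sketch, writing $X$ for the $N \times d$ matrix whose rows are the $\underline{x}_i$s and $\underline{y}$ for the vector of targets:
<latex>\begin{align*}
\hat{\underline{\theta}}_{MAP} &= \arg\min_{\underline{\theta}} \; \sum_{i=1}^N \left(y_i - \underline{\theta}^T\underline{x}_i\right)^2 + \lambda \sum_{j=1}^d \theta_j^2, & \lambda &= \frac{\sigma^2}{\sigma_\theta^2} \\
&= \left(X^T X + \lambda I\right)^{-1} X^T \underline{y}
\end{align*}</latex>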
Since $\underline{\theta}$ parameterizes only the conditional $y|x$, the posterior we want to learn is <latex>\begin{align*}
p(\underline{\theta}|D_x,D_y) &\propto p(D_x,D_y|\underline{\theta})\,p(\underline{\theta}) \\
&= p(D_y|D_x,\underline{\theta})\,p(D_x|\underline{\theta})\,p(\underline{\theta}) \\
&\propto p(D_y|D_x,\underline{\theta})\,p(\underline{\theta})
\end{align*}</latex> since $p(D_x|\underline{\theta}) = p(D_x)$ does not depend on $\underline{\theta}$.
So there is no modeling of $p(D_x)$.
Gaussian error model for the conditional likelihood: $L(\theta) = p(D_y|D_x,\theta) \propto \prod_{i=1}^N e^{-\frac{1}{2}\left(\frac{y_i-f(x_i;\theta)}{\sigma}\right)^2}$. This models the $y_i$s as conditionally independent given the $x_i$s and $\theta$: $$p(y_i|x_i,\theta) = \mathcal{N}(f(x_i;\theta),\sigma^2)$$ It is common to have independent priors on the $\theta_j$s: $p(\underline{\theta}) = \prod_{j=1}^d p(\theta_j)$
A conjugate prior for $\theta_j$ is another Gaussian. $p(\underline{\theta}) = \prod_{j=1}^d \mathcal{N}(0,\sigma_0^2)$
<latex>\begin{align*}
\log p(\underline{\theta}|D_x,D_y) &\propto - \frac{1}{2\sigma^2} \sum_{i=1}^N \left(y_i-f(x_i;\underline{\theta})\right)^2 - \frac{1}{2\sigma_0^2} \sum_{j=1}^d \theta_j^2 \\
- \log p(\underline{\theta}|D_x,D_y) &\propto \sum_{i=1}^N \left(y_i-f(x_i;\underline{\theta})\right)^2 + \lambda \sum_{j=1}^d \theta_j^2 \\
&= \text{squared error} + \lambda \, (\text{sum of squared weights}) & \text{where } \lambda=\frac{\sigma^2}{\sigma_0^2}
\end{align*}</latex>
$\hat \theta_{MAP}$ is the minimizer of this expression (equivalently, the maximizer of the log-posterior), and the minimization is typically done with gradient techniques.
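A sketch of this in code (assuming NumPy, a linear $f$, and arbitrary choices of $\sigma$, $\sigma_0$, and step size), comparing gradient descent on the penalized objective against the closed-form ridge solution:
<pre>
import numpy as np

# Gradient descent on: squared error + lambda * (sum of squared weights),
# with lambda = sigma^2 / sigma_0^2 and a linear model f(x; theta) = x . theta.
rng = np.random.default_rng(3)
N, d = 200, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
sigma, sigma_0 = 0.5, 1.0
y = X @ theta_true + sigma * rng.normal(size=N)

lam = sigma**2 / sigma_0**2
theta = np.zeros(d)
lr = 1e-3
for _ in range(5000):
    grad = 2 * X.T @ (X @ theta - y) + 2 * lam * theta   # gradient of the objective
    theta -= lr * grad

# The gradient solution should agree with the closed-form ridge estimate.
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.max(np.abs(theta - theta_closed)))   # should be very small
</pre>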
A less common prior is the Laplacian, which corresponds to an absolute-value penalty on the weights. This is known as the [http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method Lasso method].
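Concretely (a sketch, writing the Laplacian prior with scale parameter $b$, i.e. $p(\theta_j) \propto e^{-|\theta_j|/b}$), the same steps as above give
<latex>\begin{align*}
- \log p(\underline{\theta}|D_x,D_y) &\propto \sum_{i=1}^N \left(y_i-f(x_i;\underline{\theta})\right)^2 + \lambda \sum_{j=1}^d |\theta_j| & \text{where } \lambda = \frac{2\sigma^2}{b}
\end{align*}</latex>
so the sum of squared weights is replaced by a sum of absolute weights.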
$$\min MSE(\theta) = \min_{\theta} \frac{1}{N}\sum_{i=1}^N \left(y_i - f(x_i;\theta)\right)^2$$ Minimizing something like this on training data, with the intent of using the fitted model for later predictions, assumes that the $(x_i, y_i)$ pairs are random samples from a fixed underlying distribution $p(x,y)=p(y|x)p(x)$. This assumption should be kept in mind.
$$\lim_{N \to \infty} MSE(\theta) = \int \int \left(y - f(x;\theta)\right)^2 p(x,y)\, dx\, dy = E_{p(x,y)} \left[ \left(y - f(x;\theta)\right)^2 \right]$$ This is the average theoretical squared error of $f$ as a predictor, with respect to $p(x,y)$. It holds as long as the $(x_i, y_i)$ pairs are samples from $p(x,y)$. Ideally, we would like to find $f$ and $\theta$ to minimize this quantity.
Can rewrite this as $\int \left[ \int \left(y - f(x;\theta)\right)^2 p(y|x)\, dy \right] p(x)\, dx $.
$= \int MSE_x \, p(x)\, dx$, a weighted average of $MSE_x$ with respect to $p(x)$. This is relevant in practical problems where $p(x)$ differs between training and test data; changes in $p(x)$ are sometimes called covariate shift.
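A small illustration of this point (the conditional mean, the predictor, and the two input distributions below are all made up): for a fixed predictor $f$, the overall error is an average of $MSE_x$ weighted by $p(x)$, so shifting $p(x)$ changes the error even though $f$ and $p(y|x)$ are unchanged.
<pre>
import numpy as np

# Weighted-MSE view of covariate shift: same predictor f and same p(y|x),
# but a different p(x) gives a different average error.
rng = np.random.default_rng(6)
sigma_yx = 0.2

def E_y_given_x(x):              # an assumed true conditional mean
    return np.sin(2 * x)

def f(x):                        # a fixed, imperfect predictor
    return 0.9 * x

def weighted_mse(x_samples):
    y = E_y_given_x(x_samples) + sigma_yx * rng.normal(size=x_samples.size)
    return np.mean((y - f(x_samples)) ** 2)

x_train = rng.normal(0.0, 0.5, size=200_000)   # training-time p(x)
x_test = rng.normal(1.5, 0.5, size=200_000)    # shifted test-time p(x)
print(weighted_mse(x_train), weighted_mse(x_test))
</pre>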
$$ MSE_x = \int \left(y - f(x;\theta)\right)^2 p(y|x)\, dy $$ $$ = \int \left(y - E[y_x] + E[y_x] - f(x;\theta) \right)^2 p(y|x)\, dy $$ The cross term vanishes because $\int \left(y - E[y_x]\right) p(y|x)\, dy = 0$, leaving $$ = \int \left(y - E[y_x] \right)^2 p(y|x)\, dy + \left(E[y_x] - f(x;\theta) \right)^2 $$ The first term is $\sigma_{y|x}^2$, the variation in $y$ at a particular $x$; we have no control over this. We can work to minimize the second term by choosing $\theta$: $$ MSE_x = \sigma_{y|x}^2 + \left(E[y_x] - f(x;\theta) \right)^2 \geq \sigma_{y|x}^2 $$ The lower bound is achieved by the optimal predictor $f(x;\theta) = E[y_x]$.
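A Monte Carlo check of this decomposition at a single fixed $x$ (the conditional mean, noise level, and prediction below are arbitrary):
<pre>
import numpy as np

# At a fixed x: MSE_x = sigma_{y|x}^2 + (E[y_x] - f(x; theta))^2.
rng = np.random.default_rng(4)
E_y_x = 3.6        # assumed conditional mean E[y_x] at this x
sigma_yx = 0.7     # assumed conditional standard deviation
f_x = 3.2          # prediction of some imperfect model at this x

y_samples = E_y_x + sigma_yx * rng.normal(size=1_000_000)
mse_x = np.mean((y_samples - f_x) ** 2)
print(mse_x, sigma_yx**2 + (E_y_x - f_x) ** 2)   # agree up to Monte Carlo error
</pre>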
In theory, we just set $f(x;\theta) = E[y_x]$ and we have the optimal predictor. In practice, we have to learn $E[y_x]$ and are limited by the Bias-Variance Tradeoff.