Gradient methods are commonly used for expectation maximization, logistic regression, and other types of supervised learning.
Given a parameterized error function, we want to learn the parameter setting that minimizes the error.
First-order methods use only the first derivative (gradient) of the error function. When the number of parameters is large, second-order methods tend to be slower in practice than first-order methods, because each iteration is much more expensive.
$\underline{\theta}^{new} = \underline{\theta}^{current} - \alpha \nabla MSE(\underline{\theta}^{current})$, where $\alpha$ is the step size and $\nabla$ denotes the gradient.
Example data: $Y_i \mid x_i$ for $i = 1\dots n$.
Batch gradient descent: compute the gradient $\nabla_\theta E$ over all data points, then update the parameters.
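As a concrete sketch (not from the original notes), here is batch gradient descent for a linear model under MSE in NumPy; the function name <code>fit_batch_gd</code> and the defaults for <code>alpha</code> and <code>n_iters</code> are illustrative choices.
<pre>
import numpy as np

def fit_batch_gd(X, y, alpha=0.01, n_iters=1000):
    """Minimize MSE(theta) = (1/n) * ||X @ theta - y||^2 by batch gradient descent."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        residual = X @ theta - y              # uses all n data points
        grad = (2.0 / n) * (X.T @ residual)   # gradient of the MSE
        theta = theta - alpha * grad          # theta_new = theta_current - alpha * gradient
    return theta
</pre>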
Basic idea: compute the gradient on a small number of data points (mini-batch) at a time, or one at a time.
This gives cheap, noisy updates, but for large $N$ it often converges faster in practice than batch gradient descent, sometimes before even one full pass through the data.
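A minimal mini-batch SGD sketch under the same assumptions (linear model, MSE); setting <code>batch_size=1</code> gives plain stochastic gradient descent. All names and defaults are illustrative.
<pre>
import numpy as np

def fit_minibatch_sgd(X, y, alpha=0.01, n_epochs=5, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]            # small random subset (mini-batch)
            residual = X[idx] @ theta - y[idx]
            grad = (2.0 / len(idx)) * (X[idx].T @ residual)  # cheap, noisy gradient estimate
            theta = theta - alpha * grad
    return theta
</pre>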
With sparse data, an update only needs to touch the features that are present in the example. So if observations have on average $f$ non-zero features out of $d$ features total, an update costs $O(f)$ instead of $O(d)$.
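A sketch of such a sparse update, assuming each example is stored as a dict of its non-zero features and a squared-error loss (both illustrative assumptions): only the $O(f)$ weights touched by the example are read or written.
<pre>
def sparse_sgd_step(w, x_sparse, y, alpha=0.01):
    """One SGD step for a linear model on a sparse example {feature_index: value}."""
    pred = sum(w[j] * v for j, v in x_sparse.items())   # prediction uses only non-zero features
    err = pred - y
    # d(squared error)/dw_j = 2 * err * x_j, which is zero wherever x_j == 0
    for j, v in x_sparse.items():
        w[j] -= alpha * 2.0 * err * v
    return w
</pre>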
Criterion for “best”: the convention is to use the least-squares method, minimizing the sum of squared errors.
$SSE(\beta_0,\beta_1) = \sum_{i = 1}^n \left( Y_i - (\beta_0 + \beta_1 x_i)\right)^2 $
$(\hat \beta_0,\hat \beta_1) = \arg\min_{(\beta_0,\beta_1)} SSE(\beta_0,\beta_1)$
$(Y \mid x) \sim \mathcal{N}\left(\mu(x), \sigma^2(x)\right)$, with $\mu(x_i) = \beta_0 + \beta_1 x_i$ and constant variance $\sigma^2(x_i) = \sigma^2$.
The point estimator for $\mu(x_1)$ is $\hat\mu(x_1) = \hat\beta_0 + \hat\beta_1 x_1$.
The actual estimate plugs in the least-squares estimates $\hat\beta_0, \hat\beta_1$ computed from the observed data.
Equivalently, the model with additive noise: $Y_i = \mu(x_i) + \epsilon_i = \beta_0 + \beta_1 x_i + \epsilon_i$
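As a sketch, the closed-form least-squares estimates for this simple linear model in NumPy (function name illustrative).
<pre>
import numpy as np

def fit_simple_least_squares(x, y):
    """Closed-form (beta0_hat, beta1_hat) minimizing SSE for y ~ beta0 + beta1 * x."""
    xbar, ybar = x.mean(), y.mean()
    beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    beta0_hat = ybar - beta1_hat * xbar
    return beta0_hat, beta1_hat

# point estimate of the mean response at a new x:  mu_hat = beta0_hat + beta1_hat * x_new
</pre>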
Newton's method is a general optimization scheme and a classic way to fit logistic regression; it uses second-order derivatives to improve on the first-order iterative update.
Weight update: $w_{new} = w_{old} - H^{-1}_{w} \nabla l(w_{old})$ (derived from a second-order Taylor expansion of $l$ around $w_{old}$)
$H_w$ is the [http://en.wikipedia.org/wiki/Hessian_Matrix Hessian matrix] of second-order partial derivatives of $l$ with respect to the weights, evaluated at $w$: $H_{i,j} = \frac{\partial^2 l(w)}{\partial w_i \, \partial w_j}$
$\nabla l(w)$ is a $d \times 1$ vector of partial derivatives evaluated at $w$: $\nabla_i l(w) = \frac{\partial l(w)}{\partial w_i}$.
Compare with gradient ascent, $w_{new} = w_{old} + \epsilon \nabla l(w)$: instead of a fixed step size, the inverse Hessian scales the step so that we jump straight to the optimum of the local quadratic approximation.
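The derivation in one step: approximate $l$ by its second-order Taylor expansion around $w_{old}$ and set the gradient of the approximation to zero.
<latex>\begin{align*}
l(w) &\approx l(w_{old}) + \nabla l(w_{old})^T (w - w_{old}) + \tfrac{1}{2}(w - w_{old})^T H_{w_{old}} (w - w_{old}) \\
0 &= \nabla l(w_{old}) + H_{w_{old}} (w_{new} - w_{old}) \quad\Longrightarrow\quad w_{new} = w_{old} - H_{w_{old}}^{-1} \nabla l(w_{old})
\end{align*}
</latex>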
$\nabla l(w) = \sum_{i = 1}^N (c_i - f_i) x_i = X^T (c - f)$, where $c_i$ is the observed class label and $f_i$ the model's predicted probability.
<latex>\begin{align*}
H_w &= -X^T R X \\
& \quad \text{where } f_i = f(x_i ; w) \text{ and } R \text{ is a diagonal matrix with } R_{ii} = f_i(1-f_i)
\end{align*}
</latex>
For $d$ dimensions and $N$ data points, each iteration of the update costs $O(Nd^2 + d^3)$: $O(Nd^2)$ to form $X^T R X$ and $O(d^3)$ to invert it.
In high dimensions, $X^T R X$ can be singular or ill-conditioned; this can be mitigated somewhat by regularizing
: use $(X^T R X + \alpha I)^{-1}$.
This is also related to statistical linear regression (a.k.a. weighted least squares, a.k.a. iteratively reweighted least squares): each iteration solves a weighted linear regression problem. $w_{new} = w_{old} - (X^T R X)^{-1} X^T (f - c) = (X^T R X)^{-1} X^T R z$, where $z = X w_{old} - R^{-1}(f-c)$
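A compact NumPy sketch of this Newton/IRLS update, assuming 0/1 class labels in <code>c</code> and including the $\alpha I$ regularizer from above; the function name and default values are illustrative.
<pre>
import numpy as np

def fit_logreg_newton(X, c, n_iters=25, alpha=1e-6):
    """X: (N, d) design matrix, c: (N,) vector of 0/1 class labels."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        f = 1.0 / (1.0 + np.exp(-(X @ w)))   # f_i = f(x_i; w), logistic outputs
        R = f * (1.0 - f)                    # diagonal of R: R_ii = f_i * (1 - f_i)
        XtRX = (X.T * R) @ X                 # X^T R X
        # regularized Newton / IRLS step: w - (X^T R X + alpha I)^{-1} X^T (f - c)
        w = w - np.linalg.solve(XtRX + alpha * np.eye(d), X.T @ (f - c))
    return w
</pre>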
See notes in [http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online the Barber text] and [http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf notes by Charles Elkan]
It can also be useful to normalize features to have mean 0 and standard deviation 1.
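For example, a small standardization helper (names illustrative; <code>eps</code> guards against constant columns).
<pre>
import numpy as np

def standardize(X, eps=1e-12):
    """Rescale each column of X to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
</pre>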