“Where the money is made.” -Padhraic Smyth
Most learning algorithms can be described as a combination of: * Model * Objective Function * Optimization Method
<latex>
\begin{tabular}{c c c c}
Method & Model & Objective Function & Optimization Method \\
\hline
Linear Regression & Weighted sum & Squared error & Linear system of equations \\
Logistic Regression & h(Weighted sum) & Log-likelihood & Iterative, system of equations \\
Neural Network & Weighted sum of logistic regressions & Squared error & Gradient-based \\
Support vector machine & Sparse weighted sum & Margin & Convex optimization \\
Decision tree & Binary tree & Classification error & Greedy search over trees
\end{tabular}
</latex>
Classification * Spam email * Classify sentiment of a product review
Regression * Predict a real-valued number
Ranking * Find most likely candidates from a group
Training data: (features, labels) pairs typically represented in a table.
Want to learn a model to predict a label given features. The model is typically represented as a function $f(x;\alpha)$ where $x$ are input features and $\alpha$ is a vector of parameters.
The quality of a model is evaluated with an error function. * e.g. sum of squared errors: $E_{train}(\alpha) = \sum_i \left[y_i - f(x_i;\alpha)\right]^2$.
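As a concrete sketch (the data and parameter values below are made up), a weighted-sum model $f(x;\alpha)$ and its training squared error might look like:

```python
import numpy as np

# Illustrative only: a linear model f(x; alpha) = x . alpha and its
# sum-of-squared-errors on a tiny synthetic (features, labels) table.
def f(X, alpha):
    """Weighted sum of the input features."""
    return X @ alpha

def train_error(alpha, X, y):
    """E_train(alpha) = sum_i [y_i - f(x_i; alpha)]^2"""
    residuals = y - f(X, alpha)
    return np.sum(residuals ** 2)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])  # feature rows
y = np.array([5.0, 3.0, 7.0])                        # labels
alpha = np.array([1.0, 1.0])                         # candidate parameters
print(train_error(alpha, X, y))
```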
The goal is to minimize the total error on the training data. This is an optimization problem. Occasionally there is a direct solution, e.g. via linear algebra; typically a gradient-based approach of some kind is needed.
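A minimal sketch of the direct-solution case: for a linear model with squared error, the minimizing parameters solve a linear system (the normal equations), which numpy's least-squares routine handles directly. The data here is illustrative.

```python
import numpy as np

# Direct solution of the squared-error objective for a linear model.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 2.5]])
y = np.array([5.0, 3.0, 7.0, 9.5])

# Solves min_alpha ||y - X alpha||^2 via linear algebra.
alpha_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(alpha_hat)  # parameters minimizing the training squared error
```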
Minimizing training error doesn't give the best possible prediction on future data. When test error rises while training error keeps falling, the model is overfitting. Overfitting can be controlled by switching to a simpler model.
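A small illustration of this, assuming a synthetic sine-plus-noise dataset and polynomial models of increasing degree: training error keeps shrinking as the degree grows, while held-out error eventually climbs.

```python
import numpy as np

# Synthetic data: noisy sine, split into alternating train/test points.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]

# Higher-degree polynomials fit the training data ever more closely,
# but test error eventually increases: overfitting.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    mse = lambda xs, ys: np.mean((ys - np.polyval(coeffs, xs)) ** 2)
    print(degree, round(mse(x_tr, y_tr), 3), round(mse(x_te, y_te), 3))
```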
In practice, predictive models are limited by the Bias-Variance Tradeoff.
Common model families: * Linear weighted sums of the input variables (linear regression) * Non-linear functions of linear weighted sums (logistic regression, neural networks, GLMs) * Thresholded functions (decision trees)
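A rough sketch of these three families on a single feature vector; the parameter values and the single-split "stump" are purely illustrative stand-ins.

```python
import numpy as np

def linear(x, w):
    return x @ w                                  # weighted sum (linear regression)

def logistic(x, w):
    return 1.0 / (1.0 + np.exp(-(x @ w)))         # non-linear fn of a weighted sum

def stump(x, feature, threshold):
    return 1 if x[feature] > threshold else 0     # thresholded (one tree split)

x = np.array([0.5, 2.0])
w = np.array([1.0, -0.5])
print(linear(x, w), logistic(x, w), stump(x, 0, 0.3))
```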
To improve a model, measuring its performance matters. Compare to a baseline: relative to the baseline error rate, you can measure the reduction in error provided by switching to the model in question, and you want to establish that this reduction is not due to random chance. e.g. in classification, the simplest baseline is to always predict the most likely class, ignoring $x$. Alternately, examining a confusion matrix can reveal the mistakes or patterns in the classifier.
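A minimal sketch of a majority-class baseline and a confusion matrix, on made-up binary labels and predictions:

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])  # some classifier's output

# Baseline: always predict the most common class, ignoring x.
majority = np.bincount(y_true).argmax()
baseline_acc = np.mean(y_true == majority)
model_acc = np.mean(y_true == y_pred)
print("baseline:", baseline_acc, "model:", model_acc)

# Confusion matrix: rows = true class, columns = predicted class.
confusion = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1
print(confusion)
```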
Regression * Squared Error (L2) * Absolute Error (L1) * Robust loss, log-loss, log-likelihood
Classification * classification error * margin * log-loss, log-likelihood
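Rough implementations of the losses listed above, assuming numpy arrays for targets, predictions, and predicted probabilities (inputs are illustrative):

```python
import numpy as np

def squared_error(y, pred):            # L2
    return np.sum((y - pred) ** 2)

def absolute_error(y, pred):           # L1
    return np.sum(np.abs(y - pred))

def log_loss(y, p, eps=1e-12):         # negative log-likelihood for 0/1 labels
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def classification_error(y, pred_label):
    return np.mean(y != pred_label)
```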
Optimization methods frequently use gradient descent.
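A bare-bones gradient-descent sketch for the squared-error objective from earlier; the step size and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def gradient_descent(X, y, step=0.01, iters=500):
    alpha = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2 * X.T @ (y - X @ alpha)   # gradient of sum of squared errors
        alpha -= step * grad
    return alpha

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 2.5]])
y = np.array([5.0, 3.0, 7.0, 9.5])
print(gradient_descent(X, y))
```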