Linear models are models that are linear in the parameters $\theta$; as a consequence, evaluating a prediction costs time linear in the number of parameters, e.g. $f(x ; \theta) = \sum_{j=1}^d \theta_j x_j = \theta^T x$.
The model can have a non-linear functional form in the inputs $x$ (e.g. via basis functions $\phi_j(x)$) while still being linear in the parameters $\theta$. This is still considered a “linear” model.
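A minimal sketch of this point (the polynomial basis and the numbers below are made up for illustration): the prediction is still a dot product $\theta^T \phi(x)$, so it is linear in $\theta$ even though $\phi$ is non-linear in $x$.

```python
import numpy as np

def poly_features(x, degree=3):
    """Nonlinear basis: phi(x) = [1, x, x^2, ..., x^degree]."""
    return np.array([x**j for j in range(degree + 1)])

def predict(x, theta, degree=3):
    """Prediction theta^T phi(x): non-linear in x, but linear in theta."""
    return theta @ poly_features(x, degree)

theta = np.array([0.5, -1.0, 2.0, 0.3])   # one parameter per basis function
print(predict(1.5, theta))                # evaluation cost grows linearly with len(theta)
```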
Training data: a set of $(x_i, y_i)$ pairs, where $y_i \in \mathbb{R}$. $f(x; \theta)$ is the prediction model.
By using a linear model and a least-squares fitting method, the objective function is convex in the parameters.
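As a quick check of the convexity claim, using the linear form $f(x ; \theta) = \theta^T x$ from above: the gradient of the least-squares objective is $\nabla MSE(\underline{\theta}) = -\frac{2}{N} \sum_{i=1}^N (y_i - \underline{\theta}^T x_i)\, x_i$ and the Hessian is $\nabla^2 MSE(\underline{\theta}) = \frac{2}{N} \sum_{i=1}^N x_i x_i^T$, which is positive semi-definite (for any $v$, $v^T x_i x_i^T v = (x_i^T v)^2 \ge 0$), so the objective is convex.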
A problem with linear models is that they can return outputs that are negative or greater than one, and so are not interpretable as probabilities. A common solution is to pass the linear output through a logistic function, which acts as a link function, and perform Logistic Regression.
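A minimal sketch of this idea (the parameter values below are made up for illustration): the raw linear score $\theta^T x$ can fall outside $[0, 1]$, and the logistic link maps it into $(0, 1)$ so it can be read as a probability.

```python
import numpy as np

def sigmoid(z):
    """Logistic link function: squashes a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta):
    """Logistic regression: P(y = 1 | x) = sigmoid(theta^T x)."""
    return sigmoid(theta @ x)

theta = np.array([0.4, -2.0, 1.1])
x = np.array([1.0, 0.5, 3.0])
print(theta @ x)                 # raw linear score: can be negative or larger than 1
print(predict_proba(x, theta))   # link-function output: always a valid probability
```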
A least-squares error function means $MSE(\underline{\theta})$ is a convex function, so we can use gradient descent to find $\hat \theta_{MSE}$
$\underline{\theta}^{new} = \underline{\theta}^{current} - \alpha \nabla MSE(\underline{\theta}^{current})$, where $\alpha$ is the step size and $\nabla MSE$ is the gradient of the objective
$mse(\underline{\theta}) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i ; \theta))^2 = \frac{1}{N}(Y_D - X_D \underline{\theta})^T(Y_D-X_D\underline{\theta})$, where $X_D$ is the $N \times d$ matrix of inputs and $Y_D$ the vector of targets
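A minimal sketch of the gradient-descent route (the step size, iteration count, and toy data below are assumptions for illustration), using the matrix form of the gradient, $\nabla mse(\underline{\theta}) = -\frac{2}{N} X_D^T (Y_D - X_D \underline{\theta})$:

```python
import numpy as np

def mse(theta, X, Y):
    """mse(theta) = (1/N) * ||Y - X theta||^2 in matrix form."""
    r = Y - X @ theta
    return (r @ r) / len(Y)

def gradient_descent(X, Y, alpha=0.1, n_steps=500):
    """Repeated update: theta_new = theta_current - alpha * grad mse(theta_current)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = -(2.0 / len(Y)) * X.T @ (Y - X @ theta)   # gradient of mse in matrix form
        theta = theta - alpha * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                      # toy design matrix X_D
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)    # toy targets Y_D
theta_hat = gradient_descent(X, Y)
print(theta_hat, mse(theta_hat, X, Y))
```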
Can show that mse is minimized when the normal equations $X_D^T X_D \, \hat{\underline{\theta}} = X_D^T Y_D$ hold, i.e. $\hat{\underline{\theta}} = (X_D^T X_D)^{-1} X_D^T Y_D$ when $X_D^T X_D$ is invertible.
If you have a low-dimensional problem and lots of data, this is “easy”. If it is a high-dimensional problem and some of the features are linearly dependent, then $X_D^T X_D$ can be singular, so the problem is underdetermined and difficult or impossible to solve this way.
Both methods have complexity $O(Nd^2 + d^3)$. The $Nd^2$ is for creating the $X_D^T X_D$ matrix, and the $d^3$ is for solving $d$ linear equations or inverting a $d$-by-$d$ matrix.
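A minimal sketch of the closed-form route on toy data (the data and seed are made up for illustration); `np.linalg.lstsq` is shown alongside as the usual numerically safer alternative that avoids forming $X_D^T X_D$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                                      # X_D: N = 100 samples, d = 3 features
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)    # Y_D

# Normal equations: solve (X^T X) theta = X^T Y.
# Forming X^T X costs O(N d^2); solving the d-by-d system costs O(d^3).
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver that never forms X^T X explicitly (more robust when
# X^T X is nearly singular, as in the underdetermined case above).
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta_solve)
print(theta_lstsq)
```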
Because $MSE(\underline{\theta})$ is a convex function, in principle you can also use gradient descent to find $\hat \theta_{MSE}$ instead of solving the linear system directly.
Assume the $(x_i,y_i)$ are IID samples from some underlying density $p(x,y) = p(y|x)p(x)$. Then as $N \to \infty$ the empirical error converges to the expected squared error:
$\displaystyle \lim_{N \to \infty} mse(\underline{\theta}) = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i ; \theta))^2 = \int \int (y - f(x ; \theta))^2 p(x,y) \, dy \, dx = \int_x \Big[ \int_y (y - f(x ; \theta))^2 p(y|x) \, dy \Big] p(x) \, dx = \int MSE_x(\underline{\theta})\, p(x)\, dx$
where $MSE_x(\underline{\theta}) = \int (y - f(x ; \theta))^2 p(y|x) \, dy$ is the expected squared error at a fixed $x$. Adding and subtracting $E(y|x)$ inside the square and expanding, the cross term vanishes because $\int (y - E(y|x))\, p(y|x)\, dy = 0$, so
$\displaystyle MSE_x(\underline{\theta}) = \int (y - E(y|x) + E(y|x) - f(x ; \theta))^2 p(y|x)\, dy = \int (y - E(y|x))^2 p(y|x)\, dy + \int (E(y|x) - f(x ; \theta))^2 p(y|x)\, dy$
The first term is the irreducible noise in $y$ given $x$; the second term simplifies to $(E(y|x) - f(x ; \theta))^2$ and is the only part affected by the parameters.
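A small numeric check of this decomposition at one fixed $x$ (the Gaussian noise model, the conditional mean, and the prediction value below are assumptions made purely for illustration): the Monte Carlo estimate of $MSE_x(\underline{\theta})$ should match the noise variance plus $(E(y|x) - f(x ; \theta))^2$.

```python
import numpy as np

rng = np.random.default_rng(2)

x = 1.5
E_y_given_x = 2.0 * x + 1.0     # assumed true conditional mean E(y | x)
sigma = 0.5                     # assumed noise std dev, so the irreducible term is sigma^2
f_x = 3.5                       # prediction f(x; theta) of some fitted model at this x

y = E_y_given_x + sigma * rng.normal(size=1_000_000)    # samples from p(y | x)

mse_x = np.mean((y - f_x) ** 2)                          # Monte Carlo estimate of MSE_x(theta)
decomposed = sigma**2 + (E_y_given_x - f_x) ** 2         # noise term + model-error term

print(mse_x, decomposed)   # the two values should agree closely
```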
In practice, regression is also limited by the Bias-Variance Tradeoff.