# Michael's Wiki

The bias-variance tradeoff is a practical issue that arises in most predictive modeling problems, particularly in regression.

There is a fundamental tradeoff between bias and variance. Overcomplete or over-complex models tend to suffer from high variance, while over-simplistic models tend to suffer from high bias.

##### Bias

a.k.a. approximation error. Bias is the expected difference between the model's predictions and the true targets. It can be reduced by making the model more complex and flexible. A high-bias model is one with few parameters, e.g. a linear predictor; a low-bias model is one with many parameters, e.g. a large neural network.

##### Variance

a.k.a. estimation error. Variance is the variability in the model's predictions across different training sets. It can be reduced by increasing the number of observations. For a particular $x$, the variance is how much the fitted prediction $f(x;\theta)$ varies across training sets drawn from the same distribution.
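The claim that more observations reduce variance can be checked with a small simulation. This is a sketch under assumed toy settings (true relationship $y = 2x + \text{noise}$, a linear fit via `numpy.polyfit`; all constants here are illustrative, not from this article):

```python
import numpy as np

rng = np.random.default_rng(0)

def prediction_variance(n_obs, n_datasets=2000, x_query=0.5):
    """Variance of the fitted prediction at x_query across many data sets of size n_obs."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(-1, 1, n_obs)
        y = 2 * x + rng.normal(0, 1, n_obs)      # true slope 2, noise sigma = 1
        slope, intercept = np.polyfit(x, y, 1)   # simple (low-complexity) linear fit
        preds.append(slope * x_query + intercept)
    return np.var(preds)

v_small = prediction_variance(20)
v_large = prediction_variance(200)
print(v_small, v_large)  # variance at x_query shrinks roughly like 1/N
```

With ten times the observations, the spread of the fitted predictions at a fixed $x$ drops by roughly a factor of ten, consistent with the usual $1/N$ behavior of estimation error.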

##### Theoretical Properties

Increases in model complexity tend to increase variance and decrease bias.
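This tendency can be demonstrated empirically by sweeping model complexity. A sketch, assuming a toy target $y = \sin(2\pi x) + \text{noise}$ and polynomial fits of increasing degree (the degrees, noise level, and query point are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)
x_query, n_obs, n_datasets = 0.35, 30, 1000

def bias2_and_variance(degree):
    """Squared bias and variance of a degree-`degree` polynomial fit at x_query."""
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_obs)
        y = true_f(x) + rng.normal(0, 0.3, n_obs)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x_query)
    f_bar = preds.mean()                         # average fit at x_query over data sets
    return (true_f(x_query) - f_bar) ** 2, preds.var()

b1, v1 = bias2_and_variance(1)   # simple model: high bias, low variance
b9, v9 = bias2_and_variance(9)   # flexible model: low bias, high variance
print(b1, v1, b9, v9)
```

The degree-1 fit cannot track the sinusoid (large squared bias, small variance), while the degree-9 fit tracks it closely but swings with each training set (small bias, larger variance).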

$MSE_x(\theta) = \int \left(y - f(x;\theta)\right)^2 p(y|x) \, dy = \sigma_{y|x}^2 + \left(E[y|x] - f(x;\theta)\right)^2$

$MSE = E_{p(x)}[MSE_x]$

$MSE = \sigma_y^2 + \text{Bias}^2 + \text{Variance}$

* $\sigma_y^2$ represents the inherent variability in $y$
* Bias is "how closely we can approximate $E[y|x]$" in theory, with optimal parameters
* Variance is "how sensitive our parameter estimates are to the training data": how much the parameters vary across training sets of a given size

So $MSE_x = \sigma_y^2 + \text{Bias}^2 + \text{Variance}$.

For a data set $D$ of size $N$, we assume each observation is an $(x_i, y_i)$ pair.

$p(D)$ is a distribution over all possible data sets of size $N$ (frequentist view). With respect to $p(D)$, $f(x;\theta)$ is a random quantity because $\theta$ is random with respect to $p(D)$.

Can define <latex>\begin{align*}
\bar f_x &= E_{p(D)}[f(x;\theta)] \\
E_{p(D)}\left[\left(E[y|x] - f(x;\theta)\right)^2\right] &= E_{p(D)}\left[\left( E[y|x] - \bar f_x + \bar f_x - f(x;\theta)\right)^2\right] \\
&= E_{p(D)}\left[\left( E[y|x] - \bar f_x \right)^2\right] + E_{p(D)}\left[\left(\bar f_x - f(x;\theta)\right)^2\right] && \text{cross terms cancel out} \\
&= \text{Bias}^2 + \text{Variance}
\end{align*}</latex>

This is the average error at $x$, now averaged over the possible data sets of size $N$.
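The full decomposition $MSE_x = \sigma_y^2 + \text{Bias}^2 + \text{Variance}$ can be sanity-checked numerically by averaging over many simulated data sets. A sketch under assumed toy settings (true function $y = x^2 + \text{noise}$, a deliberately biased linear fit; the constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n_obs, n_datasets, x_q = 0.5, 50, 4000, 0.7
true_f = lambda x: x ** 2

preds = np.empty(n_datasets)
for i in range(n_datasets):
    x = rng.uniform(0, 1, n_obs)
    y = true_f(x) + rng.normal(0, sigma, n_obs)
    slope, intercept = np.polyfit(x, y, 1)   # theta fit on data set D
    preds[i] = slope * x_q + intercept       # f(x; theta) at the query point

f_bar = preds.mean()                         # \bar f_x: average prediction over data sets
bias2 = (true_f(x_q) - f_bar) ** 2
variance = preds.var()

# Direct MSE at x_q: average over data sets AND over fresh draws of y at x_q
y_new = true_f(x_q) + rng.normal(0, sigma, n_datasets)
mse_x = np.mean((y_new - preds) ** 2)

print(mse_x, sigma**2 + bias2 + variance)    # the two quantities should nearly agree
```

Up to Monte Carlo error, the directly estimated $MSE_x$ matches the sum of the irreducible noise $\sigma_y^2$, the squared bias, and the variance.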