Essentially the same idea as principal component analysis. Factorize the large N-by-M matrix into "skinny" matrices of size N-by-k and k-by-M. The factorization yields a low-rank approximation of the original matrix and can be computed via singular value decomposition (SVD).
Each of the $k$ dimensions is referred to as a factor. Both dimensions of the data (rows and columns) can then be represented in this new latent feature space. Finding the $k$ factors is equivalent to performing SVD, which has complexity $O(NM^2 + M^3)$ for an N-by-M matrix with $N \ge M$.
$ r_{ui} \equiv q_i^t p_u $, $ \min_{q,p} \left\{ \sum_{(u,i) \in R} ( r_{ui} - q_i^t p_u)^2 \right\} $
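As a concrete illustration (not part of the original notes), the sketch below builds a rank-$k$ approximation of a fully observed matrix with numpy's SVD. The matrix sizes, the random data, and the variable names are assumptions; real rating matrices are mostly missing, which is why the objective above sums only over the observed $(u,i)$ pairs in $R$.

```python
# Hypothetical illustration: rank-k approximation of an N-by-M matrix via
# truncated SVD.  P is the "skinny" N-by-k matrix, Q is k-by-M.
import numpy as np

N, M, k = 100, 50, 10
R = np.random.rand(N, M)              # stand-in for a fully observed matrix

U, s, Vt = np.linalg.svd(R, full_matrices=False)
P = U[:, :k] * s[:k]                  # N-by-k (rows live in the k-factor space)
Q = Vt[:k, :]                         # k-by-M (columns live in the same space)
R_approx = P @ Q                      # best rank-k approximation (Frobenius norm)

print(np.linalg.norm(R - R_approx))   # reconstruction error shrinks as k grows
```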
There are systematic effects that have to be taken out: some users rate everything higher or lower than average, and some movies receive higher or lower ratings than average, independent of any user-movie interaction.
These systematic effects can be modeled in baseline predictors: the overall mean rating $\mu$, a user bias $b_u$, and a movie bias $b_i$.
The new prediction model becomes $r_{ui} \equiv \mu + b_u + b_i + \text{user-movie interactions}$.
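A minimal sketch (an assumption, not from the notes) of computing those baseline predictors directly from observed ratings: $\mu$ is the global mean, and $b_u$, $b_i$ are mean deviations from it. The toy (user, item, rating) triples and the helper name `baseline` are hypothetical, and this simple version ignores the interaction between the user and movie biases.

```python
# Hypothetical illustration: baseline predictors computed directly from
# observed ratings.  mu is the global mean; b_u and b_i are mean deviations
# of each user and each movie from mu.
import numpy as np

# toy observed ratings as (user, item, rating) triples -- made-up data
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 2.0)]

mu = np.mean([r for _, _, r in ratings])
users = {u for u, _, _ in ratings}
items = {i for _, i, _ in ratings}
b_u = {u: np.mean([r - mu for uu, _, r in ratings if uu == u]) for u in users}
b_i = {i: np.mean([r - mu for _, ii, r in ratings if ii == i]) for i in items}

def baseline(u, i):
    # prediction mu + b_u + b_i, ignoring user-movie interactions
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)

print(baseline(0, 2))   # baseline guess for a rating we have not observed
```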
Regularization turns out to be very important. It assists learning, and with incomplete data it also shrinks poorly observed users and items toward the center of the latent space. $ \min_{q,p,b} \left\{ \sum_{(u,i) \in R} ( r_{ui} - [\mu + b_u + b_i + q_i^t p_u])^2 + \lambda (\|q_i\|^2 + \|p_u\|^2 + b_u^2 + b_i^2) \right\} $
One method for learning the approximation components is gradient descent. The parameters to learn are the biases $\mu, b_u, b_i$.
The matrix factorization can also be learned with gradient descent. Rather than using the linear algebra SVD methods to do it, the factors are treated as an additional set of $k(N+M)$ parameters to learn. With prediction error $ e_{ui} = r_{ui} - (\mu + b_u + b_i + q_i^t p_u) $ and learning rate $\gamma$, the update equations with regularization are:
$ b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u) $, $ b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i) $
$ q_i \leftarrow q_i + \gamma (e_{ui} p_u - \lambda q_i) $, $ p_u \leftarrow p_u + \gamma (e_{ui} q_i - \lambda p_u) $
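A minimal sketch of those updates as stochastic gradient descent in Python. The function name `sgd_factorize`, the toy rating triples, and the hyperparameter values ($k$, $\gamma$, $\lambda$, number of epochs) are assumptions for illustration, not values from the notes.

```python
# Hypothetical illustration: stochastic gradient descent on the regularized
# objective above.  P holds user factors p_u (N-by-k), Q holds item factors
# q_i (M-by-k); gamma is the learning rate, lam the regularization weight.
import numpy as np

def sgd_factorize(ratings, N, M, k=10, gamma=0.01, lam=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((N, k))
    Q = 0.1 * rng.standard_normal((M, k))
    b_u = np.zeros(N)
    b_i = np.zeros(M)
    mu = np.mean([r for _, _, r in ratings])

    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - (mu + b_u[u] + b_i[i] + Q[i] @ P[u])   # prediction error
            p_old = P[u].copy()                            # keep pre-update p_u for the q_i step
            # one gradient step per parameter, matching the update equations
            b_u[u] += gamma * (e - lam * b_u[u])
            b_i[i] += gamma * (e - lam * b_i[i])
            P[u]   += gamma * (e * Q[i] - lam * P[u])
            Q[i]   += gamma * (e * p_old - lam * Q[i])
    return mu, b_u, b_i, P, Q

# usage on toy triples; predict the unobserved (user 0, item 2) rating
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 2.0)]
mu, b_u, b_i, P, Q = sgd_factorize(ratings, N=3, M=3, k=2)
print(mu + b_u[0] + b_i[2] + Q[2] @ P[0])
```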