Dimensionality reduction is an unsupervised learning technique commonly used to make learning from data more reliable, to remove extraneous or noisy variables, or to make computation more tractable. Map from the original data space $X$ in d dimensions to a new, lower-dimensional data space. Related to the task of compression: how much of the underlying structure or information of the original representation is retained?
a.k.a. Variable selection. Given d variables, want to select k.
a.k.a. greedy approaches. The “best” one to use probably depends on how many irrelevant variables are expected.
As greedy approaches, these are non-optimal. They also may require training the model $O(d^2)$ times, which could become expensive for large d.
Train a model for each variable and compute its error. Select the variable that gives the lowest error E (out of the d candidates). Now evaluate the error given by this variable plus one additional variable (out of the d-1 remaining candidates). Continue adding variables until k have been selected.
Starting with a model that uses all variables, iteratively remove the variable whose removal reduces the error the most (or increases it the least), until k variables remain.
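A minimal sketch of both greedy procedures, assuming a hypothetical `fit_and_score(X, y, features)` helper (not part of these notes) that trains whatever model is in use on the given column subset and returns its validation error:

```python
def forward_selection(X, y, k, fit_and_score):
    """Greedily add the variable that most lowers the validation error."""
    d = X.shape[1]
    selected, remaining = [], list(range(d))
    while len(selected) < k:
        # Try each remaining candidate variable in turn.
        errors = {j: fit_and_score(X, y, selected + [j]) for j in remaining}
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(X, y, k, fit_and_score):
    """Greedily drop the variable whose removal leaves the lowest error."""
    selected = list(range(X.shape[1]))
    while len(selected) > k:
        errors = {j: fit_and_score(X, y, [i for i in selected if i != j])
                  for j in selected}
        selected.remove(min(errors, key=errors.get))
    return selected
```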
These tend to do well when the data looks Gaussian, but can easily throw away important structure when the data is multi-modal.
$x$ is a d-dimensional vector of data measurements, and $a$ is a d-dimensional column weight vector. Assume $a^T a = 1$.
$a^T x$ is a projection of $x$ onto one dimension. Can generalize to use a matrix of multiple $a$ vectors to produce more than one dimension. Let $A$ be this matrix.
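For concreteness, a small NumPy illustration of this projection (the data here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)             # one d-dimensional measurement (d = 5)

a = rng.normal(size=5)
a /= np.linalg.norm(a)             # enforce a^T a = 1

z = a @ x                          # scalar: projection of x onto one dimension

# Generalize: rows of A are unit-norm direction vectors (2 of them here),
# so A @ x gives the coordinates of x in the new, lower-dimensional space.
A = rng.normal(size=(2, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
Z = A @ x                          # shape (2,)
```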
Principal component analysis is used to do this repeatedly and well. From linear algebra, the eigendecomposition of the data's covariance matrix produces eigenvectors and eigenvalues; the eigenvectors, ordered by decreasing eigenvalue, are the principal components.
Overall complexity is $O(Nd^2 + d^3)$ (computing the $d \times d$ covariance matrix, plus its eigendecomposition).
Principal components are frequently nice and interpretable.
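A sketch of PCA along these lines using NumPy's eigendecomposition of the covariance matrix (the data matrix `X`, with the N observations as rows, is assumed):

```python
import numpy as np

def pca(X, k):
    """Project N x d data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = (Xc.T @ Xc) / (len(Xc) - 1)        # d x d covariance: O(N d^2)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition: O(d^3)
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    components = eigvecs[:, order[:k]]       # top-k principal directions (columns)
    return Xc @ components, components       # N x k scores, d x k directions
```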
e.g. Lasso
Want to find k-dimensional coordinates for each of the N objects such that the Euclidean distances in the embedded space match the known dissimilarity matrix as closely as possible. Data is a distance matrix or dissimilarity matrix.
Typically use a “stress” objective function: $$ S = \sum_{i,j} \big(d(i,j) - \delta(i,j)\big)^2 $$ where $d(i,j)$ is the Euclidean distance in the new embedded k-dimensional space, and $\delta(i,j)$ is the given dissimilarity (distance) in the original space.
If the distance matrix represents actual Euclidean distance, then the result tends to be similar to PCA.
Local iterative hill-descending. Non-trivial and can be caught in local minima. Initialization can be random or use a heuristic. $O(N^2k)$ work per iteration.
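One way to picture the hill-descending step is a plain gradient-descent sketch on the stress above (NumPy; the learning rate and iteration count are arbitrary choices, not from these notes):

```python
import numpy as np

def mds(delta, k, n_iter=500, lr=0.01, seed=0):
    """Gradient descent on the stress S = sum_{i,j} (d(i,j) - delta(i,j))^2."""
    N = delta.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(N, k))    # random initialization
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]   # pairwise differences, N x N x k
        d = np.linalg.norm(diff, axis=-1)      # embedded distances d(i,j)
        np.fill_diagonal(d, 1.0)               # avoid divide-by-zero on the diagonal
        grad = ((d - delta) / d)[:, :, None] * diff
        Y -= lr * grad.sum(axis=1)             # O(N^2 k) work per iteration
    return Y
```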
Fast approximate algorithms exist: Fast sparse embedding, Landmark MDS, …
Create a graph of the N data points with each as a node. Create an edge between “close” nodes (e.g. nearest neighbors). Approximate the original manifold (geodesic) distance between points by the shortest-path distance between their nodes in this graph.
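A rough NumPy sketch of this graph construction and shortest-path approximation (the neighbor count is an assumed parameter; the resulting distance matrix could then be fed to the MDS step above):

```python
import numpy as np

def geodesic_distances(X, n_neighbors=5):
    """Approximate manifold distances by shortest paths in a nearest-neighbor graph."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean distances

    # Keep edges only between "close" points: each point's nearest neighbors.
    graph = np.full((N, N), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(N):
        neighbors = np.argsort(D[i])[1:n_neighbors + 1]
        graph[i, neighbors] = D[i, neighbors]
        graph[neighbors, i] = D[i, neighbors]   # keep the graph symmetric

    # Floyd-Warshall shortest paths approximate geodesic distance along the manifold.
    for m in range(N):
        graph = np.minimum(graph, graph[:, [m]] + graph[[m], :])
    return graph
```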