Michael's Wiki

In one sense, a major purpose of exploratory data analysis, or descriptive data mining, is to reduce high-dimensional data to single variables or pairs of variables that are easier to visualize. This includes feature selection and the use of machine learning methods.

Data visualization is a critical component of exploratory analysis, helping the analyst to form and check initial assumptions.

Outlier Detection

An outlier is a data point that deviates markedly from the rest of the data, often because it was measured or recorded in error. It is important to determine whether such a point belongs in the data set or not.

Visualization can help find outliers. Outliers that are not apparent in lower-dimensional plots may show up in higher-dimensional plots.

Domain knowledge and constraints may also help.

The model itself may also help. In regression, for example, data points with large errors (residuals) may be outliers.
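As a rough illustration of the regression idea, the sketch below fits a straight line by least squares and flags points whose residuals are unusually large. The function name `flag_regression_outliers`, the threshold of 3, the use of the median absolute deviation (MAD) as a robust scale, and the sample data are all assumptions made for this example, not a prescribed method.

```python
import statistics


def flag_regression_outliers(xs, ys, threshold=3.0):
    """Return indices of points with unusually large residuals from a fitted line."""
    # Ordinary least-squares fit of y = intercept + slope * x.
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

    # Robust scale estimate: the MAD is far less affected by the outliers
    # themselves than the standard deviation (assumed choice for this sketch).
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return []
    return [i for i, r in enumerate(residuals)
            if abs(r - center) / scale > threshold]


# Hypothetical data: roughly y = 2x + 1, with one corrupted value at index 5.
xs = list(range(10))
ys = [1.1, 2.8, 5.0, 7.15, 8.9, 40.0, 13.05, 14.85, 17.2, 18.95]
print(flag_regression_outliers(xs, ys))
```

Note that the outlier also distorts the fitted line, which is one reason a robust scale is used here instead of the plain standard deviation of the residuals.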

In general, for large data sets you should assume that some outliers will remain in the data, since not all of them will be found and removed.

Analyzing Single Variables

Summary Statistics

These statistics can be misleading oversimplifications, but they are important for sanity-checking the data.

  • Mean
  • Mode
  • Variance
  • Skew
  • Range
  • Median
  • Quantiles
  • The number of unique values
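The statistics listed above can be computed with Python's standard library alone. The following is a minimal sketch: the helper name `summarize`, the moment-based skewness formula, and the sample data are assumptions for illustration.

```python
import statistics


def summarize(values):
    """Compute basic summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population central moments, used here for a simple skewness estimate.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return {
        "mean": mean,
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "skew": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "quartiles": statistics.quantiles(values, n=4),
        "n_unique": len(set(values)),
    }


# Hypothetical data with one large value that pulls the mean above the median.
data = [1, 2, 2, 3, 4, 5, 20]
for name, value in summarize(data).items():
    print(name, value)
```

Comparing the mean with the median in this output shows how a single large value can make one summary statistic misleading when read alone.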

Analyzing Pairs of Variables

In exploratory analysis, we frequently want to learn whether one variable is potentially useful for predicting another.

Linear correlation

Linear correlation and covariance are measures of the linear dependence between variables.

Covariance is a measure of how two variables vary together.

  • $\sigma_{ij}$ = covariance between $x_i$ and $x_j$
    • = $E_{p(x_i,x_j)}[(x_i - \mu_i)(x_j - \mu_j)]$
  • $\Sigma$ is the covariance matrix. It is symmetric: $\Sigma_{ij} = \Sigma_{ji}$

Linear correlation is a scaled covariance that varies between -1 and +1.

  • $\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$, where $\sigma_i$ and $\sigma_j$ are the standard deviations of $x_i$ and $x_j$
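The two definitions above can be estimated from data as in the following sketch. The expectation is replaced by a sample average with an n - 1 denominator, which is an assumption of this example, as are the function names `covariance` and `correlation`.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical data: ys is an exact linear function of xs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(covariance(xs, ys), correlation(xs, ys))
```

Because `statistics.stdev` also uses the n - 1 denominator, the scaling factors cancel and the correlation is the same as it would be with population estimates.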

With high-dimensional data or a small number of observations, there is a high probability that some entirely independent variables will appear dependent. This effect is easy to simulate, and the simulation provides a control set of correlations to compare against those found in the real-world data.
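One way to build such a control set is sketched below: generate independent Gaussian variables of the same shape as the real data and record the largest absolute pairwise correlation. The function names, the Gaussian assumption, and the sizes used are illustrative choices only.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def max_abs_null_correlation(n_obs, n_vars, seed=0):
    """Largest |correlation| among n_vars mutually independent variables."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)]
               for _ in range(n_vars)]
    best = 0.0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            best = max(best, abs(pearson(columns[i], columns[j])))
    return best


# Same number of variables, very different numbers of observations.
print(max_abs_null_correlation(n_obs=10, n_vars=30))
print(max_abs_null_correlation(n_obs=1000, n_vars=30))
```

A correlation observed in the real data is only interesting if it clearly exceeds what this kind of simulation produces for independent variables with the same number of observations and variables.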

A lack of linear dependence does not imply a lack of non-linear relationships.
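A small constructed example makes this concrete. Below, y is completely determined by x through y = x^2, yet the linear correlation between them is zero because x is symmetric around zero. The data and the helper `pearson` are assumptions made for this sketch.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y depends on x exactly, but not linearly

r_raw = pearson(xs, ys)                              # linear view of x vs. y
r_transformed = pearson([x ** 2 for x in xs], ys)    # after transforming x
print(r_raw, r_transformed)
```

The second value shows that the dependence becomes visible to a linear measure once x is transformed, which is one reason to plot pairs of variables and not rely on correlation alone.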

Analyzing Higher Dimensions


Use unsupervised learning methods such as K-Means or hierarchical clustering to look for underlying structure in the data.

Visualize hierarchical clustering results via dendrograms.
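To show what a dendrogram encodes, the sketch below runs a naive single-linkage agglomerative clustering and returns the merge history, which is the information a dendrogram draws. This hand-written version is an assumption for illustration and is far too slow for real data. In practice a library routine would normally be used, for example `linkage` and `dendrogram` from `scipy.cluster.hierarchy`. The tuple layout here loosely mirrors the rows of SciPy's linkage matrix.

```python
import math


def single_linkage(points):
    """Naive single-linkage clustering.

    Returns a list of merges (cluster_a, cluster_b, distance, new_size).
    Original points have ids 0..n-1; each merge creates the next unused id.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d, len(clusters[next_id])))
        next_id += 1
    return merges


# Hypothetical data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 50)]
for merge in single_linkage(points):
    print(merge)
```

Each merge distance corresponds to the height of a join in the dendrogram. A point that only joins at a much larger distance than the rest, like the last point here, stands apart from the underlying structure.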