# Dimensionality and correlation

I’ve been thinking about the difficulties that highly correlated variables pose in a supervised learning context. The supervised learning problem is typically to learn a regressor or classifier that maps a set of predictor variables to a set of observations. If the predictor variables are highly correlated, a learning algorithm that assumes they are independent is at a disadvantage.

This is in part the motivation for regularization techniques such as the lasso, which is designed to handle the case where predictors outnumber observations. It can nonetheless be useful to winnow one’s set of predictor variables to remove highly correlated ones. Doing so can, for instance, result in models that are easier to interpret. Further, since the lasso tends to choose arbitrarily among a set of strongly correlated predictors, reducing or eliminating highly correlated variables from the predictor set can result in more consistent variable selection when building multiple models.
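As a rough illustration of winnowing correlated predictors, here is a minimal base-R sketch of a greedy filter. The helper name `drop_correlated` and the 0.9 cutoff are hypothetical choices of mine, not something from a package, and the sketch assumes every column has nonzero variance (otherwise `cor` returns `NA`).

```r
# Hypothetical helper: greedily drop one variable from each pair whose
# absolute correlation exceeds `cutoff`. Returns the column indices kept.
drop_correlated <- function(X, cutoff = 0.9) {
  keep <- seq_len(ncol(X))
  repeat {
    if (length(keep) < 2) break
    R <- abs(cor(X[, keep, drop = FALSE]))
    diag(R) <- 0                      # ignore self-correlation
    if (max(R) <= cutoff) break
    # Locate the most correlated pair, then drop whichever member has the
    # larger mean absolute correlation with the remaining variables.
    worst <- which(R == max(R), arr.ind = TRUE)[1, ]
    drop <- worst[which.max(rowMeans(R)[worst])]
    keep <- keep[-drop]
  }
  keep
}

set.seed(1)
X <- matrix(rnorm(200 * 5), ncol = 5)
X <- cbind(X, X[, 1] + rnorm(200, sd = 0.05))  # column 6 nearly duplicates column 1
kept <- drop_correlated(X, cutoff = 0.9)
```

Here `kept` should contain columns 2 through 5 plus exactly one of columns 1 and 6. Which of the two survives depends on their incidental correlations with the other columns, which mirrors the arbitrariness described above.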

After running across Stephen Turner’s recent post about visualizing correlations (especially the oh-so-useful `chart.Correlation` in PerformanceAnalytics), I decided to go a step further and see whether fractal dimensionality can expose correlations in one’s data. Using the correlation dimension, I computed the correlation integral (the fraction of point pairs closer than a given distance) over a range of distances and plotted its log against the log of the distance. Roughly speaking, the slope of the plot is a measure of the dimensionality of the data. I expected highly correlated variables to distort this log-log plot in some way, causing unexpected curvature or even discontinuity.
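For readers who want to reproduce this kind of log-log plot, here is a minimal base-R sketch of the correlation integral. It is not necessarily the code used for this post: the function name `correlation_integral` is hypothetical, rows are treated as points, and distances are Euclidean, all of which are assumptions on my part.

```r
# Correlation integral C(r): the fraction of point pairs closer than r.
# Assumption: rows of X are points and distance is Euclidean.
correlation_integral <- function(X, radii) {
  d <- as.vector(dist(X))               # all pairwise distances
  sapply(radii, function(r) mean(d < r))
}

set.seed(42)
# Toy data: points on a line embedded in 3 dimensions, so the intrinsic
# dimension is 1 even though there are 3 (perfectly correlated) columns.
u <- runif(500)
X <- cbind(u, 2 * u, -u)

# Log-spaced radii, chosen by hand for this toy data set.
radii <- exp(seq(log(0.01), log(0.5), length.out = 20))
C <- correlation_integral(X, radii)

# The slope of log C(r) against log r estimates the correlation dimension.
fit <- lm(log(C) ~ log(radii))
slope <- unname(coef(fit)[2])
```

For this toy data the fitted slope should come out close to 1 rather than 3, since the three columns carry only one dimension of information. Calling `plot(log(radii), log(C), type = "b")` draws the corresponding log-log plot.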

Consider these two matrices.

    M1 <- matrix(rnorm(50*100), ncol=50)
    M2 <- as.matrix(M1)
    M2[,1:25] <- 0

And their correlation dimension plot.

The correlation dimension plot for `M1` shows a fairly smooth, likely concave line. That for `M2` (and this appears to be typical of data sets with groups of highly correlated variables) has segments with greater curvature and segments with zero slope. My intuition is that the segments with zero slope occur because correlated variables are close to one another; the reason for the greater curvature is not at all clear to me. Regardless, if there’s research on the use of fractal dimensionality for subset selection, I’d love to hear about it.

Nice post. If your goal is to get rid of some correlated variables, the caret package in R has some nice functionality.

Thanks, Stephen. Caret looks like it can help; I’ll add it to my To Try queue alongside subselect.