In this post, we will talk about principal component analysis and present a few examples of how it is used and what it does.
Principal component analysis, usually called PCA, is a method of identifying correlations between the features of a dataset and reducing it to a dataset of lower dimensionality. It is an essential component in many machine learning pipeline implementations, since it allows us to work with less data, speeding up the learning process and consuming less memory and disk space.
Explaining the mathematical details of the method is outside our scope, but a good understanding of its mathematical foundation is important for applying it effectively.
The Python code we will be using can be found here.
As an introduction, let us suppose that we are working on a machine learning problem where two of our features are temperatures in Fahrenheit and Celsius. Here is the dataset.
We do not need any statistical method to tell us that these two features in fact represent the same information and that one of them is redundant. This is also made obvious by looking at the plot, where the points all seem to lie along a line, implying a very strong linear dependence between them.
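The dataset itself is not listed in this post. A minimal sketch of how comparable data could be generated is shown below; the sample count, the Celsius range, the noise level and the random seed are all illustrative assumptions, not the values behind the plot.

```python
import numpy as np

# Fixed seed so the sketch is repeatable (arbitrary choice).
rng = np.random.default_rng(0)

n_samples = 100  # assumed sample count
celsius = rng.uniform(-10.0, 40.0, size=n_samples)  # assumed temperature range

# Exact conversion plus a little Gaussian noise (noise level is an assumption).
fahrenheit = celsius * 9.0 / 5.0 + 32.0 + rng.normal(0.0, 2.0, size=n_samples)

# One row per sample, one column per feature.
data = np.column_stack([fahrenheit, celsius])

# Pearson correlation between the two columns; a value near 1 signals redundancy.
correlation = np.corrcoef(data, rowvar=False)[0, 1]
print(f"Correlation between the two features: {correlation:.4f}")
```

A correlation this close to 1 is the numerical counterpart of the points lining up in the plot.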
Although, as we said, this is a trivial example, we will still apply the PCA and see how it discovers this linear relationship.
Before applying the PCA, we will have to first normalize our features to give each one zero mean and unit variance. The reason is that PCA, during its process, computes projections of points and calculates distances and variances. If our features are not normalized, a feature measured on a larger scale will dominate these variances simply because of its units, and the figures produced will not be comparable across features.
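The normalization step can be sketched in a few lines of NumPy. The helper name `normalize` and the small example array are hypothetical and only serve the illustration.

```python
import numpy as np

def normalize(data):
    """Return a copy of data where every column has zero mean and unit variance."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

# Hypothetical example: four Fahrenheit/Celsius pairs.
example = np.array([[32.0, 0.0], [50.0, 10.0], [68.0, 20.0], [86.0, 30.0]])
data_normalized = normalize(example)

print("Column means:", data_normalized.mean(axis=0))
print("Column standard deviations:", data_normalized.std(axis=0))
```

After this step both columns are expressed in units of their own standard deviation, so neither can dominate because of its original scale.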
This can be better shown with an extreme example. Let us say that we have a dataset with two features, where the first one consists of random numbers between 1000 and 10000 and the second of random numbers between 0 and 1.0.
Here is a plot of this dataset.
It is the definition of randomness.
Let us apply PCA on this dataset, without normalizing our features. Of course, normally, we would not expect to get any feedback that will allow us to reduce our dimensionality. Here is the output of the PCA (we skip the implementation details for now):
Components variance:[ 6.93911075e+06 8.17523247e-02]
Components variance ratio:[ 9.99999988e-01 1.17813833e-08]
PCA calculated two principal components. The first line of values is the variance along each of the calculated principal components (not the original x and y axes) and the second line is each component's share of the total variance. We see that the first calculated principal component dominates the variance, giving us the green light to go ahead and use only that, reducing our dimensionality to 1.
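For readers who want to reproduce this effect, here is a rough sketch. It assumes that `PCA` is scikit-learn's `sklearn.decomposition.PCA`, that the features are uniformly distributed, and that 1000 samples are drawn; the exact numbers will therefore differ from the output above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # arbitrary seed
n_samples = 1000                # assumed sample count

# Two independent random features on wildly different scales.
data = np.column_stack([
    rng.uniform(1000.0, 10000.0, size=n_samples),
    rng.uniform(0.0, 1.0, size=n_samples),
])

# PCA on the raw, unnormalized data.
pca = PCA()
pca.fit(data)

print("Components variance:", pca.explained_variance_)
print("Components variance ratio:", pca.explained_variance_ratio_)
```

The first ratio comes out practically equal to 1 only because the first feature is measured in much larger numbers, not because the data has any real structure.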
This would have become evident if the axes on the plot above had the same scale, as in the picture below.
Now the x and y axes have the same scale, and the points appear to be on a line, meaning on one dimension. Just like the PCA method told us.
However, if we normalize our features first, we get a totally different picture. And this is the real picture.
Components variance:[ 1.04625948 0.95374052]
Components variance ratio:[ 0.52312974 0.47687026]
We see that the two calculated components contribute almost equally to the total variance. Therefore, there is no dominating component and we cannot reduce our dimensionality. If we plot the normalized data, on axes of the same scale we get again our original plot, which accurately describes the randomness of the dataset.
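The normalized case can be sketched the same way, again assuming scikit-learn's `PCA`, uniformly distributed features and 1000 samples, so the figures will be close to, but not identical with, the ones above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # arbitrary seed
n_samples = 1000                # assumed sample count

data = np.column_stack([
    rng.uniform(1000.0, 10000.0, size=n_samples),
    rng.uniform(0.0, 1.0, size=n_samples),
])

# Normalize each feature to zero mean and unit variance before PCA.
data_normalized = (data - data.mean(axis=0)) / data.std(axis=0)

pca = PCA()
pca.fit(data_normalized)
variance_ratio = pca.explained_variance_ratio_

print("Components variance ratio:", variance_ratio)
```

With both features on the same scale, each component accounts for roughly half of the variance, which is what two unrelated features should give.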
Hoping that this was explanatory enough, let us get back to our temperatures example.
The code to perform PCA is simple:
# PCA is assumed here to be scikit-learn's implementation
from sklearn.decomposition import PCA
pca = PCA()
data_reduced = pca.fit_transform(data_normalized)
where data_reduced is the dataset after projecting it onto the two principal components.
The information provided by PCA on the variances of the principal components is the following
Number of principal components: 2
Components variance:[ 1.9974093 0.0025907]
Components variance ratio:[ 0.99870465 0.00129535]
Principal components: [[ 0.70710678 0.70710678] [ 0.70710678 -0.70710678]]
We see here that the first component is the dominant one, contributing much more to the overall variance of the data set, roughly 770 times more in fact. Hence, we can keep only the first dimension (column) of the projected data set, data_reduced.
data_reduced still has two columns because we have not yet reduced our space. We still got back two components from PCA. PCA will not decide to drop components for us. By default, it calculates as many components as there are dimensions in our original data. It does tell us, though, through the variances above, how important each component is.
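Since the decision is left to us, one common approach is to keep the smallest number of components whose cumulative variance ratio reaches a chosen threshold. The sketch below uses the ratios reported in this post; the helper name `components_needed` and the 99% threshold are illustrative assumptions.

```python
import numpy as np

def components_needed(variance_ratio, threshold=0.99):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(variance_ratio)
    # Index of the first cumulative value that is >= threshold.
    index = int(np.searchsorted(cumulative, threshold))
    # Guard against rounding pushing the index past the last component.
    return min(index + 1, len(cumulative))

# Variance ratios reported for the temperatures dataset.
temperatures = components_needed([0.99870465, 0.00129535])
# Variance ratios reported for the normalized random dataset.
random_data = components_needed([0.52312974, 0.47687026])

print("Temperatures:", temperatures)
print("Random data:", random_data)
```

scikit-learn's PCA can also make this selection itself when n_components is given as a fraction between 0 and 1.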
The principal components calculated are the y=x axis and the y=-x axis, as seen by the Principal components vectors coordinates [ 0.70710678 0.70710678] and [ 0.70710678 -0.70710678] .
So PCA captured the information that our data lie very close to the y=x line.
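The values 0.70710678 are no coincidence. For two features normalized to unit variance with correlation r, the covariance matrix is [[1, r], [r, 1]], and its eigenvectors always point along y=x and y=-x, with eigenvalues 1+r and 1-r. The sketch below checks this with NumPy; the value of r is inferred from the variances reported above, under the assumption that they were computed on unit-variance features.

```python
import numpy as np

# Correlation inferred from the reported variances (1 + r = 1.9974093); an assumption.
r = 0.9974093

# Covariance matrix of two unit-variance features with correlation r.
covariance = np.array([[1.0, r], [r, 1.0]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# Reorder so the largest variance comes first, one component per row.
eigenvalues = eigenvalues[::-1]
components = eigenvectors[:, ::-1].T

print("Variances:", eigenvalues)
# The sign of an eigenvector is arbitrary, so compare absolute values.
print("Components (absolute values):", np.abs(components))
```

The stronger the correlation, the larger the gap between 1+r and 1-r, and the less we lose by dropping the second component.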
We can approximately reconstruct the original (normalized) data either by reapplying PCA and asking for one component, or by dropping the second component produced above along with the second column of data_reduced and multiplying what remains.
import numpy as np
# Manually reconstruct the data set using only the first principal component
data_manually_reconstructed = np.dot(data_reduced[:, 0].reshape(-1, 1), pca.components_[0].reshape(1, -1))
plotDataSet(data_manually_reconstructed)
# Reconstruct original data
# Perform PCA asking for one component
pca = PCA(n_components=1)
data_reduced = pca.fit_transform(data_normalized)
data_reconstructed = pca.inverse_transform(data_reduced)
The mean squared reconstruction error is:
Mean squared reconstruction error: 0.0410804299206
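How this error was computed is not shown here. One common definition is the mean of the squared differences between the normalized data and its reconstruction, sketched below. The data generation (100 samples, Celsius range, noise level, seed) is an illustrative assumption, so the printed value will not match the figure above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # arbitrary seed

# Assumed stand-in for the temperatures dataset.
celsius = rng.uniform(-10.0, 40.0, size=100)
fahrenheit = celsius * 9.0 / 5.0 + 32.0 + rng.normal(0.0, 2.0, size=100)
data = np.column_stack([fahrenheit, celsius])
data_normalized = (data - data.mean(axis=0)) / data.std(axis=0)

# Keep one component, then map the reduced data back to two dimensions.
pca = PCA(n_components=1)
data_reduced = pca.fit_transform(data_normalized)
data_reconstructed = pca.inverse_transform(data_reduced)

# Mean of the squared differences over every entry of the data matrix.
mse = np.mean((data_normalized - data_reconstructed) ** 2)
print("Mean squared reconstruction error:", mse)
```

Under this definition the error equals the variance of the dropped component divided by the number of features, so it grows as more noise is added to the data.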
You can experiment with the noise we added to the features while creating the data set to see how this affects the reconstruction error.
The Python code we have used can be found here.
In order to avoid making this post too long, we will stop here. In our next post, we will work with a dataset of higher dimensionality, where there is no apparent dependence among the features.
Thank you for reading.