Introduction to unsupervised learning

Principal Components Analysis

Principal components analysis (PCA) is a popular unsupervised dimensionality reduction technique in which linear combinations of correlated variables are created to reduce the feature space. PCA decomposes a multivariate dataset into orthogonal components that each explain a maximum amount of the remaining variance between datapoints, and the components are ranked in order of explained variance. The first principal component (PC1) thus explains the most variance in the dataset. The number of PCs to retain can be determined by a cumulative explained-variance cutoff (e.g. the smallest number of PCs that together explain 80% of the variance). Depending on the application, the dataset may have to be normalised first, as variables on the largest scale will otherwise dominate. A drawback is that the components are linear combinations of the original variables and are therefore not directly interpretable.
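The steps described above (normalising, decomposing, ranking components by explained variance and applying a cumulative cutoff) can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `pca`, the parameter `variance_cutoff` and the synthetic dataset are assumptions made for the example, and the decomposition is done via a singular value decomposition of the standardised data, which is one common way to compute PCA.

```python
import numpy as np


def pca(X, variance_cutoff=0.80):
    """Illustrative PCA via SVD (hypothetical helper, not from the original text).

    Returns the projected data (scores), the retained component directions,
    the explained-variance ratio of every component, and the number of
    components needed to reach the cumulative cutoff.
    """
    X = np.asarray(X, dtype=float)

    # Normalise: centre each variable and scale it to unit variance so that
    # variables on a large scale do not dominate.
    # Assumes no variable is constant (a zero standard deviation would divide by zero).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Rows of Vt are the principal directions, already ordered by
    # decreasing singular value (and hence decreasing explained variance).
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)

    # Variance explained by each component, as a fraction of the total.
    explained_variance = S**2 / (Z.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()

    # Smallest number of components whose cumulative ratio reaches the cutoff.
    cumulative = np.cumsum(ratio)
    n_components = int(np.searchsorted(cumulative, variance_cutoff)) + 1
    n_components = min(n_components, len(ratio))

    components = Vt[:n_components]
    scores = Z @ components.T
    return scores, components, ratio, n_components


# Synthetic example: five correlated variables on very different scales,
# all driven by a single latent factor plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
weights = np.array([1.0, 10.0, 100.0, 0.5, 5.0])
X = latent * weights + 0.1 * weights * rng.normal(size=(200, 5))

scores, components, ratio, n_components = pca(X, variance_cutoff=0.80)
print("explained variance ratio:", np.round(ratio, 3))
print("components retained:", n_components)
```

Because the example variables are strongly correlated, most of the variance should be captured by PC1. On real data the ratios are typically more evenly spread, and the 80% cutoff is a convention to adjust per application rather than a fixed rule.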

Applications

Because it is useful for data reduction, PCA is widely applied, for example in:

the clustering and visualisation of correlated genes in gene expression studies ref,
data visualisation/pre-processing in machine learning ref,
and the analysis of protein dynamics ref.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
