Suppose you are given a data set with ten thousand independent variables (also called columns or predictor variables) and ten million records (also called observations or rows).
For now, focus on the ten thousand independent variables, and assume they are all continuous so that you need not worry about things like one-hot encoding.
You now have to apply a classification algorithm, run a clustering method such as k-means, or visualize this data set.
What will you do?
Consider Machine Learning Algorithm
Will you use all the independent variables in the algorithm? If so, there may be many issues. Some of the important ones are:
b. The algorithm will take too much time to run.
c. Your computer's RAM will be insufficient (memory issues).
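To get a feel for the memory issue, here is a rough back-of-the-envelope estimate. It assumes (this is not stated above) that the whole data set is held in memory as a dense array of 64-bit floats, which is a common default but not the only option.

```python
# Rough memory estimate for holding the full data set as a dense array.
# Assumption: every value is stored as a 64-bit float (8 bytes).
n_rows = 10_000_000   # ten million records
n_cols = 10_000       # ten thousand independent variables
bytes_per_value = 8   # float64

total_bytes = n_rows * n_cols * bytes_per_value
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_gb:.0f} GB")  # prints "800 GB"
```

Under that assumption the raw data alone is far larger than the RAM of a typical PC, before the algorithm allocates anything of its own.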
Consider Data Visualization
You would like to visualize the data in 2 or 3 dimensions, but the data is high-dimensional (ten thousand variables). So what would you do?
Consider k-means Clustering
You would like to find the clusters in the data using k-means clustering. Here too you will run into problems, because k-means needs the distances between points, and these points live in a coordinate system with ten thousand axes, not in a 2D or 3D one. In 2D, the distance between two points is sqrt((x1-x2)^2 + (y1-y2)^2), so there are only two squared terms under the square root. With ten thousand axes there are ten thousand such squared terms under the square root, and this has to be computed for every pair of points whose distance is needed. You can imagine the amount of work your computer's processor and RAM would have to do. Again, the issues are:
a. The algorithm will take too much time to run.
b. Your computer's RAM will be insufficient (memory issues).
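The distance calculation described above can be sketched as follows. This is only an illustration: the function name `euclidean_distance` and the random test points are made up for the example, and a real k-means implementation would use vectorized code rather than a plain Python loop.

```python
import math
import random

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # One squared difference per axis, summed, then a single square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 2D: only two squared terms under the square root.
d_2d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(d_2d)  # 5.0

# Ten thousand axes: ten thousand squared terms for a single pair of points.
random.seed(0)
n_axes = 10_000
point_a = [random.random() for _ in range(n_axes)]
point_b = [random.random() for _ in range(n_axes)]
d_high = euclidean_distance(point_a, point_b)
```

The work per pair grows linearly with the number of axes, and k-means repeats it for many pairs on every iteration.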
So what can you do? Dimensionality reduction techniques come to the rescue!
What is Dimensionality Reduction?
Dimensionality reduction is the process of reducing the number of random variables under consideration. It can be divided into feature selection and feature extraction. In many problems the measured data vectors are high-dimensional, but the data may lie close to a lower-dimensional manifold, so we can try to represent it in fewer dimensions.
Types of Dimensionality Reduction Techniques
There are several dimensionality reduction techniques. Some of them are:
1. PCA (Principal Component Analysis)
2. Forward Feature Construction
3. Backward Feature Elimination
What is Principal Component Analysis?
Principal component analysis (PCA) is a very popular technique for dimensionality reduction.
PCA tries to reduce the dimensionality of the data; in simple words, it tries to reduce the number of columns in the data.
It transforms the given variables in the data set into new features or variables that explain as much of the variance in the data as possible; in simple words, it tries to lose as little information as possible. Keeping all of the new variables loses nothing, while keeping only the first few gives up a small part of the variance in exchange for far fewer columns. These newly created feature variables are called principal components.
How PCA components are calculated
The newly created principal components are orthogonal vectors. They are calculated as follows:
1. The covariance or correlation matrix of the data is constructed.
2. The eigenvectors of this matrix are computed.
3. The eigenvector with the largest eigenvalue is the direction along which the data set has the maximum variance. It is also called the first principal component.
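The three steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not a production implementation: the toy data (200 observations, 3 variables) is made up for the example, the covariance matrix is used rather than the correlation matrix, and the final projection onto two components is added only to show the reduction in columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 observations, 3 continuous variables,
# two of which are strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Step 1: center the data and build the covariance matrix (variables in columns).
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: compute the eigenvectors of the (symmetric) covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder to descending.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 3: the eigenvector with the largest eigenvalue is the first principal component.
first_pc = eigenvectors[:, 0]

# Share of the total variance captured by each component.
explained_ratio = eigenvalues / eigenvalues.sum()

# Project onto the first two components: 3 columns become 2.
X_reduced = X_centered @ eigenvectors[:, :2]
print(X_reduced.shape)  # (200, 2)
```

`np.linalg.eigh` is used here because a covariance matrix is symmetric. The columns of `eigenvectors` are orthogonal to each other, which is the orthogonality property mentioned above.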