Introduction
Dimensionality reduction is a key preprocessing step in machine learning and data analysis, particularly when dealing with high-dimensional data. It reduces the number of input variables, making the data easier to visualise and more manageable to model. Two popular dimensionality reduction techniques taught in most Data Scientist Classes are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbour Embedding (t-SNE).
This article will explain these techniques and their applications.
Principal Component Analysis (PCA)
PCA is a statistical technique that transforms a high-dimensional dataset into a lower-dimensional space by finding the directions (principal components) that maximise the variance in the data. These principal components are orthogonal to each other, ensuring that each one captures information the others do not. Most data analysts and scientists prefer to learn PCA and t-SNE together; for this reason, a Data Science Course in Bangalore, for example, would cover both techniques.
How PCA Works
- Standardise the Data: Mean-centre and scale the data to have a mean of zero and a standard deviation of one.
- Compute the Covariance Matrix: Calculate the covariance matrix of the standardised data.
- Calculate Eigenvalues and Eigenvectors: Determine the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the directions of maximum variance, while the eigenvalues represent the magnitude of the variance in these directions.
- Select Principal Components: Choose the top k eigenvectors corresponding to the largest eigenvalues to form the principal components.
- Transform the Data: Project the original data onto the selected principal components to obtain the reduced-dimensional representation.
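These steps map directly onto a few lines of NumPy. The following is a minimal from-scratch sketch (the function name pca_from_scratch and its variables are illustrative, not part of any library); in practice, scikit-learn's PCA, used in the example below, implements the same idea via a more numerically stable singular value decomposition.
import numpy as np

def pca_from_scratch(X, k):
    # Step 1: standardise to zero mean and unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardised data
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigendecomposition (eigh suits symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort eigenvalues in descending order, keep the top k eigenvectors
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # Step 5: project the data onto the principal components
    return X_std @ components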
Advantages of PCA
- Reduces dimensionality while preserving most of the variance in the data.
- Helps in visualising high-dimensional data.
- Reduces computational complexity and noise.
Example of PCA in Python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load dataset (for example, the Iris dataset)
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Plot the PCA results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.show()
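A quick way to see how much information the two components retain is scikit-learn's explained_variance_ratio_ attribute; for the standardised Iris data, the first two components together capture roughly 96% of the total variance.
print(pca.explained_variance_ratio_)
# e.g. [0.73 0.23] -> about 96% of the variance retained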
t-Distributed Stochastic Neighbour Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique particularly well suited to visualising high-dimensional data in two or three dimensions. Unlike PCA, which focuses on capturing the maximum variance, t-SNE emphasises preserving the local structure of the data, making it ideal for visualising clusters. Both PCA and t-SNE are widely used for dimensionality reduction, and an advanced Data Science Course in Bangalore tailored for data scientists will give equal weight to both techniques.
How t-SNE Works
- Compute Pairwise Similarities: Calculate pairwise similarities between data points in the high-dimensional space using a Gaussian distribution.
- Compute Pairwise Similarities in Low-Dimensional Space: Initialise a random distribution in the low-dimensional space and compute pairwise similarities using a t-distribution, which has heavier tails than a Gaussian distribution.
- Minimise the Kullback-Leibler Divergence: Use gradient descent to minimise the Kullback-Leibler divergence between the high-dimensional and low-dimensional pairwise similarities.
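The quantities these steps refer to are easy to write down. The sketch below uses simplified, unconditional similarities with a single global sigma rather than the per-point, perplexity-calibrated probabilities of the full algorithm; all function names are illustrative.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_similarities(X, sigma=1.0):
    # High-dimensional similarities: Gaussian kernel over pairwise distances
    d2 = squareform(pdist(X, 'sqeuclidean'))
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)   # a point is not its own neighbour
    return P / P.sum()         # normalise into a joint probability distribution

def student_t_similarities(Y):
    # Low-dimensional similarities: Student t-distribution with one degree
    # of freedom, whose heavier tails let dissimilar points sit far apart
    d2 = squareform(pdist(Y, 'sqeuclidean'))
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    # The objective that t-SNE minimises by gradient descent
    return np.sum(P * np.log((P + eps) / (Q + eps)))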
Advantages of t-SNE
- Effectively visualises high-dimensional data in 2D or 3D.
- Preserves local structure, making it easier to identify clusters.
- Handles complex non-linear relationships.
Example of t-SNE in Python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Load dataset (for example, the Iris dataset)
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
# Apply t-SNE (random_state fixes the otherwise random initialisation;
# the default 1,000 optimisation steps are usually enough to converge,
# while very low values such as 300 tend to give poor embeddings)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)
# Plot the t-SNE results
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE of Iris Dataset')
plt.show()
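Note that t-SNE embeddings are sensitive to the perplexity setting (roughly, the effective number of neighbours each point considers) and to the random initialisation, so it is worth fitting at a few settings before drawing conclusions. A small sketch, reusing X and TSNE from the example above:
for perplexity in (5, 30, 50):
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=42).fit_transform(X)
    # plot or inspect each embedding; cluster shapes and spacings will vary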
Comparing PCA and t-SNE
Most Data Scientist Classes teach PCA and t-SNE together. Which one to use depends on the context and the goal of the analysis.
PCA
- Linear technique.
- Preserves global structure.
- Computationally less intensive.
- Suitable for reducing dimensions for further modelling.
t-SNE
- Non-linear technique.
- Preserves local structure.
- Computationally intensive (see the timing sketch below).
- Ideal for visualisation and identifying clusters.
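The difference in computational cost is easy to verify. The timing sketch below, on scikit-learn's digits dataset, is illustrative only; exact numbers depend on hardware and library versions, but PCA typically finishes in milliseconds while t-SNE takes seconds.
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_digits = load_digits().data   # 1,797 samples, 64 features

start = time.perf_counter()
PCA(n_components=2).fit_transform(X_digits)
print(f"PCA:   {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
TSNE(n_components=2, random_state=42).fit_transform(X_digits)
print(f"t-SNE: {time.perf_counter() - start:.2f}s")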
Conclusion
PCA and t-SNE are powerful techniques for dimensionality reduction, each with its own strengths and use cases. PCA is ideal for reducing dimensions while retaining the most variance and is useful for subsequent modelling tasks. t-SNE excels at visualising complex, high-dimensional data and uncovering underlying patterns and clusters. Senior data analysts and data scientists are increasingly enrolling in Data Scientist Classes to understand and apply these techniques, which can substantially enhance data analysis and machine learning workflows.
For more details, visit us:
Name: ExcelR – Data Science, Generative AI, Artificial Intelligence Course in Bangalore
Address: Unit No. T-2 4th Floor, Raja Ikon Sy, No.89/1 Munnekolala, Village, Marathahalli – Sarjapur Outer Ring Rd, above Yes Bank, Marathahalli, Bengaluru, Karnataka 560037
Phone: 087929 28623
Email: enquiry@excelr.com