Day52 ML Review - Dimensionality Reduction (3)
Nonlinear Mappings with Kernel Principal Component Analysis
Critical Concepts of Kernel PCA
Kernel Principal Component Analysis (Kernel PCA) is an extension of Principal Component Analysis (PCA) that allows for the transformation of data that is not linearly separable. Unlike traditional PCA, which only handles linear relationships between features, Kernel PCA can capture and extract features in a higher-dimensional space where the data may become linearly separable.
To address complex, nonlinear problems, we employ a technique that involves mapping the data onto a higher-dimensional feature space where the classes can be separated linearly. This process entails transforming the original examples, denoted $x \in \mathbb{R}^d$, onto a new $k$-dimensional subspace, where $k$ is significantly greater than $d$. To achieve this transformation, we utilize a nonlinear mapping function:

$\phi: \mathbb{R}^d \rightarrow \mathbb{R}^k \quad (k \gg d)$
Essentially, $\phi$ is a function that generates nonlinear combinations of the original features, thereby facilitating the translation of the initial $d$-dimensional dataset into a larger, $k$-dimensional feature space.
Put simply, we utilize Kernel Principal Component Analysis (KPCA) to nonlinearly map our data into a higher-dimensional space. In this augmented space, we then employ standard Principal Component Analysis (PCA) to reproject the data onto a lower-dimensional space. This allows us to separate the examples using a linear classifier, assuming they can be separated by density in the input space.
However, one drawback of this method is its high computational cost. To address this issue, we use the kernel trick, which enables us to compute the similarity between two high-dimensional feature vectors in the original feature space.
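As a quick illustration of the kernel trick, here is a minimal NumPy sketch (my own example; the `gamma` parameter stands in for $1/(2\sigma^2)$) that computes the RBF similarity of two points directly from the original feature vectors, without ever materializing the high-dimensional mapping $\phi$:

```python
import numpy as np

def rbf_similarity(x_i, x_j, gamma=1.0):
    """RBF kernel value computed entirely in the original feature space.

    gamma corresponds to 1 / (2 * sigma**2); phi(x) is never computed explicitly.
    """
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

x1 = np.array([1.0, 2.0])
x2 = np.array([1.5, 1.8])
print(rbf_similarity(x1, x2, gamma=0.5))  # a similarity value in (0, 1]
```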
Mathematical Deconstruction
In standard PCA, we computed the covariance between two features, $k$ and $j$, as follows:

$\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left( x_j^{(i)} - \mu_j \right) \left( x_k^{(i)} - \mu_k \right)$
Since standardizing the features centers them at mean zero, for instance, $\mu_j = 0$ and $\mu_k = 0$, we can simplify the covariance between two features as follows:

$\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} x_k^{(i)}$
Let’s take a look at the general equation to calculate the covariance matrix, $\Sigma$:

$\Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)T}$
Bernhard Schölkopf generalized this approach (*Kernel principal component analysis*, B. Schölkopf, A. Smola, and K.R. Müller, pages 583-588, 1997) so that we can replace the dot products between examples in the original feature space with the nonlinear feature combinations via $\phi$:

$\Sigma = \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T$
To obtain the eigenvectors—the principal components—from this covariance matrix, $\Sigma$, we have to solve the following equation:

$\Sigma v = \lambda v$
$\Rightarrow \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T v = \lambda v$
$\Rightarrow v = \frac{1}{n \lambda} \sum_{i=1}^{n} \phi(x^{(i)}) \left( \phi(x^{(i)})^T v \right) = \frac{1}{n} \sum_{i=1}^{n} \alpha^{(i)} \phi(x^{(i)})$
Here, $\lambda$ and $v$ are the eigenvalues and eigenvectors of the covariance matrix, $ \Sigma $, and $ \alpha $ can be obtained by extracting the eigenvectors of the kernel (similarity) matrix, $ K$.
Side Note: Deriving the kernel matrix
The derivation of the kernel matrix can be shown as follows.
First, let’s write the covariance matrix in matrix notation, where $\phi(X)$ is an $n \times k$-dimensional matrix:

$\Sigma = \frac{1}{n} \phi(X)^T \phi(X)$
Now, we can write the eigenvector equation as follows:

$v = \frac{1}{n} \sum_{i=1}^{n} \alpha^{(i)} \phi(x^{(i)}) = \frac{1}{n} \phi(X)^T \alpha$
Since $\Sigma v = \lambda v$, we get:

$\frac{1}{n} \phi(X)^T \phi(X) \phi(X)^T \alpha = \lambda \phi(X)^T \alpha$
Multiplying it by $\phi(X)$ on both sides yields the following result:

$\frac{1}{n} \phi(X) \phi(X)^T \phi(X) \phi(X)^T \alpha = \lambda \phi(X) \phi(X)^T \alpha$
$\Rightarrow \frac{1}{n} \phi(X) \phi(X)^T \alpha = \lambda \alpha$
$\Rightarrow \frac{1}{n} K \alpha = \lambda \alpha$
Here, $K$ is the similarity (kernel) matrix:

$K = \phi(X) \phi(X)^T$
We use the kernel trick to avoid calculating the pairwise dot products of the examples, $x$, under $\phi$ explicitly by using a kernel function, $\kappa$, so that we don’t need to compute the eigenvectors of the covariance matrix in the feature space explicitly:

$\kappa(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)})$
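To make the “dot product under $\phi$” idea concrete, here is a small sketch (my own illustration, not part of the original derivation) using the homogeneous polynomial kernel $\kappa(x, y) = (x^T y)^2$ in two dimensions, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the kernel value equals the dot product of the explicitly mapped vectors:

```python
import numpy as np

def phi(x):
    # Explicit feature map of the degree-2 homogeneous polynomial kernel (2-D input).
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, y):
    # The same quantity computed in the original space -- the kernel trick.
    return (x @ y) ** 2

x, y = np.array([1.0, 3.0]), np.array([2.0, -1.0])
assert np.isclose(phi(x) @ phi(y), poly_kernel(x, y))  # identical similarity values
```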
Put simply, kernel PCA gives us the examples already projected onto the respective components, as opposed to the standard PCA approach, which constructs a transformation matrix.
The kernel function, or simply the kernel, calculates a dot product between two vectors, effectively measuring similarity.
The most commonly used kernels are as follows:
- Polynomial kernel: $\kappa(x^{(i)}, x^{(j)}) = \left( x^{(i)T} x^{(j)} + \theta \right)^p$. Here, $\theta$ is the threshold and $p$ is the power, both of which have to be specified by the user.
- Hyperbolic tangent (sigmoid) kernel: $\kappa(x^{(i)}, x^{(j)}) = \tanh \left( \eta x^{(i)T} x^{(j)} + \theta \right)$
- Gaussian kernel, or radial basis function (RBF): $\kappa(x^{(i)}, x^{(j)}) = \exp \left( -\frac{\| x^{(i)} - x^{(j)} \|^2}{2\sigma^2} \right)$
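As a rough sketch of the three kernels above (the function names and default parameter values are my own choices), in NumPy they can be written as:

```python
import numpy as np

def polynomial_kernel(x_i, x_j, theta=1.0, p=2):
    # (x_i . x_j + theta)^p
    return (x_i @ x_j + theta) ** p

def sigmoid_kernel(x_i, x_j, eta=0.01, theta=0.0):
    # tanh(eta * x_i . x_j + theta)
    return np.tanh(eta * (x_i @ x_j) + theta)

def rbf_kernel(x_i, x_j, sigma=1.0):
    # exp(-||x_i - x_j||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2 * sigma ** 2))
```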
Summary
To handle nonlinear mappings, Kernel PCA maps the data $\mathbf{X}$ into a higher-dimensional feature space $\mathcal{F}$ using a mapping function $\phi$:

$\phi: \mathbf{x}_i \mapsto \phi(\mathbf{x}_i) \in \mathcal{F}$
Instead of computing $\phi(\mathbf{X})$ explicitly (which could be computationally expensive or even impossible), Kernel PCA uses a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ that computes the inner product in the feature space:

$k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$
Kernel PCA Algorithm
Step 1: Compute the Kernel Matrix
Given $n$ data points, the kernel matrix $\mathbf{K}$ is an $n \times n$ matrix where each element is:

$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
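A minimal sketch of Step 1 for an RBF kernel, using SciPy's pairwise-distance helpers (this particular vectorized approach and the `gamma` value are my choices, not prescribed above):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel_matrix(X, gamma=15.0):
    """X: (n, d) data matrix -> (n, n) RBF kernel matrix K."""
    sq_dists = squareform(pdist(X, metric='sqeuclidean'))  # all pairwise squared distances
    return np.exp(-gamma * sq_dists)

X = np.random.RandomState(0).randn(5, 2)
K = rbf_kernel_matrix(X)
print(K.shape)  # (5, 5)
```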
Step 2: Center the Kernel Matrix
To ensure the data is centered in the feature space, we adjust $\mathbf{K}$:

$\mathbf{K}^{\text{centered}} = \mathbf{K} - \mathbf{1}_n \mathbf{K} - \mathbf{K} \mathbf{1}_n + \mathbf{1}_n \mathbf{K} \mathbf{1}_n$
where $ \mathbf{1}_n $ is an $ n \times n $ matrix with all elements equal to $ \frac{1}{n} $.
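In code, the centering step can be sketched as follows (continuing from a kernel matrix `K` built as in Step 1; the helper name is mine):

```python
import numpy as np

def center_kernel_matrix(K):
    """Center K in the feature space: K - 1n K - K 1n + 1n K 1n."""
    n = K.shape[0]
    one_n = np.ones((n, n)) / n  # the n x n matrix with all entries equal to 1/n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n
```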
Step 3: Eigenvalue Decomposition
Perform eigenvalue decomposition on the centered kernel matrix:

$\mathbf{K}^{\text{centered}} \mathbf{v}_i = \lambda_i \mathbf{v}_i$
Here, $ \mathbf{v}_i $ are the eigenvectors, and $ \lambda_i $ are the corresponding eigenvalues.
Step 4: Project the Data
The projection of the data into the new feature space is given by:

$\mathbf{Y} = \mathbf{K}^{\text{centered}} \mathbf{V}$
where $ \mathbf{V} $ contains the eigenvectors corresponding to the top eigenvalues.
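A sketch of Steps 3 and 4, assuming the centered matrix from Step 2 (the function name and the use of `numpy.linalg.eigh` are my choices; since $\mathbf{K}^{\text{centered}} \mathbf{v}_i = \lambda_i \mathbf{v}_i$, the product $\mathbf{K}^{\text{centered}} \mathbf{V}$ simply scales each top eigenvector by its eigenvalue):

```python
import numpy as np

def kernel_pca_project(K_centered, n_components=2):
    """Steps 3-4: eigendecompose the centered kernel matrix and project."""
    eigvals, eigvecs = np.linalg.eigh(K_centered)    # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]   # indices of the largest eigenvalues
    V = eigvecs[:, top]                              # top eigenvectors as columns
    return K_centered @ V                            # Y = K_centered V (column i = lambda_i * v_i)
```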
Interpretation of the Results
- The resulting projections $ \mathbf{Y} $ are the principal components in the feature space defined by the kernel.
- Selecting the top principal components (those with the largest eigenvalues) reduces the data’s dimensionality while preserving the most significant nonlinear structures.
Mathematical Example
Let’s consider a simple example with a Gaussian kernel:
Suppose you have data points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. The Gaussian kernel is defined as:

$k(\mathbf{x}_i, \mathbf{x}_j) = \exp \left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right)$
- Compute the kernel matrix $ \mathbf{K} $ where each element $ K_{ij} $ is the kernel function applied to data points $ \mathbf{x}_i $ and $ \mathbf{x}_j $.
- Center the kernel matrix to get $ \mathbf{K}^\text{centered} $.
- Perform eigenvalue decomposition on $ \mathbf{K}^\text{centered} $ to obtain the eigenvalues and eigenvectors.
- The data is then projected into the new space using the top eigenvectors.
This process allows Kernel PCA to uncover complex, nonlinear relationships in the data that would be invisible to traditional PCA.
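For completeness, here is a quick end-to-end usage sketch with scikit-learn's KernelPCA on the half-moon toy dataset (the parameter values, such as gamma=15, are illustrative only):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=100, random_state=123)        # 2-D data that is not linearly separable
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)   # RBF kernel PCA
X_kpca = kpca.fit_transform(X)                             # examples already projected onto the components
print(X_kpca.shape)  # (100, 2)
```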