Day148 - STAT Review: Unsupervised Learning (1)
Practical Statistics for Data Scientists: Principal Components Analysis (1) (Unsupervised Learning, A Simple Example and Computing the Principal Components)
Unsupervised Learning
Unsupervised learning involves statistical methods to extract meaning from data without labeled training data. Supervised learning aims to build a model that predicts a response variable using predictor variables. In contrast, unsupervised learning constructs a model without distinguishing between response and predictor variables.
Unsupervised learning can be used to achieve different goals. Sometimes, it can create a predictive rule without a labeled response.
In supervised learning (previous postings), we:
- Have labeled data (e.g., emails labeled as spam or not spam).
- Build models that predict a known response.
In unsupervised learning, we:
- Have no labeled output.
- Only have features (inputs).
- Aim to discover patterns and relationships hidden in the data without being told what to look for.
Here are the primary unsupervised learning methodologies.
- Clustering: Group similar observations.
- Example: Group customers based on their purchasing behavior.
- Dimensionality Reduction: Reduce many variables into a smaller, easier-to-handle set.
- Example: Thousands of sensor readings → a few key features.
- Exploratory Analysis: Understand the structure of a complex dataset.
- It is beneficial in big data scenarios.
How Unsupervised Learning Connects to Prediction
Even though unsupervised learning doesn’t directly predict, it supports predictive modeling. Sometimes, we want to predict a category without any labeled data. For example, we can predict an area’s vegetation type from satellite sensory data. Since we don’t have a response variable to train a model, clustering allows us to identify common patterns and categorize the regions.
Clustering is crucial for the “cold-start problem,” like launching new marketing campaigns or detecting new fraud types. Initially, there’s no data to train a model, but we can develop a predictive model as we gather data. Clustering accelerates learning by pinpointing population segments.
Unsupervised learning is also vital for regression and classification. In big data, subpopulations that are not well represented can lead to poor model performance. Clustering helps identify and label these subpopulations. Separate models can then be fit for each subpopulation, or the subpopulation identity can be encoded as a feature, allowing a single main model to use it as a predictor.
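To make the last idea concrete, here is a minimal Python sketch (not from the book) in which cluster labels, learned without any response variable, are appended as an extra predictor for a downstream supervised model. The data is synthetic, and K-means and logistic regression are only illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 500 observations, 4 numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy response, used only in the supervised step

# Step 1: discover subpopulations without using the response.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: include the cluster label as an additional predictor in the main model.
X_aug = np.column_stack([X, clusters])
model = LogisticRegression().fit(X_aug, y)
print(model.score(X_aug, y))
```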
Principal Component Analysis
Often, variables will vary together (covary), and some variation in one is duplicated by the variation in another (e.g., restaurant checks and tips).
Principal components analysis (PCA) is used to discover how numeric variables covary.
When we have many correlated variables, they often repeat similar information (e.g., income and credit card limit). Principal Components Analysis (PCA) creates new variables (principal components) that are linear combinations of the original variables and uncorrelated (orthogonal). The main goal is to capture as much variance (spread) as possible with fewer variables.
Key Terms for Principal Components Analysis
- Principal component
- A linear combination of the predictor variables.
- Loadings
- The weights that transform the predictors into the components.
- Synonyms: weights
- Scree-plot
- This graph illustrates the components’ variances, depicting their relative importance as either explained variance or the proportion of explained variance.
PCA combines multiple numeric predictors into a smaller set of variables, or principal components, which explain most of the variability while reducing the data dimensions. The weights reveal the original variables’ contributions to the principal components.
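Since the scree plot appears in the key terms above, here is a minimal Python sketch of one. The data is random and purely illustrative; with a real dataset you would fit PCA to the numeric predictors of interest.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random toy data standing in for a set of numeric predictors.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 5)),
                    columns=[f'x{i}' for i in range(1, 6)])

pca = PCA().fit(data)

# Screeplot: proportion of variance explained by each component.
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_)
plt.xlabel('Component')
plt.ylabel('Proportion of variance explained')
plt.show()
```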
A Simple Example
For two variables, $X_1$ and $X_2$, there are two principal components $Z_i$ ($i = 1$ or $2$): $Z_i = w_{i,1}X_1 + w_{i,2}X_2$. The weights ($w_{i,1},\ w_{i,2}$) are known as the component loadings. These transform the original variables into the principal components.
The first principal component, $Z_1$, is the linear combination that best accounts for the total variation. The second principal component, $Z_2$, is orthogonal to the first and captures as much of the remaining variation as possible. (If there were additional components, each additional one would be orthogonal to the others.)
Suppose we have two variables, say stock returns for Chevron (CVX) and ExxonMobil (XOM).
- In R:

```r
oil_px <- sp500_px[, c('CVX', 'XOM')]
pca <- princomp(oil_px)
pca$loadings
```

Output:

```
Loadings:
    Comp.1 Comp.2
CVX -0.747  0.665
XOM -0.665 -0.747

               Comp.1 Comp.2
SS loadings       1.0    1.0
Proportion Var    0.5    0.5
Cumulative Var    0.5    1.0
```
- In Python, we can use the scikit-learn implementation `sklearn.decomposition.PCA`:

```python
from sklearn.decomposition import PCA
import pandas as pd

pcs = PCA(n_components=2)
pcs.fit(oil_px)
loadings = pd.DataFrame(pcs.components_, columns=oil_px.columns)
loadings
```
The weights for CVX and XOM in the first principal component are -0.747 and -0.665, respectively; for the second principal component, they are 0.665 and -0.747. The first principal component is essentially an average of CVX and XOM, reflecting the correlation between the two energy companies. The second principal component measures when the stock prices of CVX and XOM diverge. Let's create a visualization in R as follows.
- In R:

```r
loadings <- pca$loadings
ggplot(data=oil_px, aes(x=CVX, y=XOM)) +
  geom_point(alpha=.3) +
  stat_ellipse(type='norm', level=.99) +
  geom_abline(intercept = 0, slope = loadings[2,1]/loadings[1,1]) +
  geom_abline(intercept = 0, slope = loadings[2,2]/loadings[1,2])
```
- The following code creates a similar visualization in Python:

```python
import numpy as np

def abline(slope, intercept, ax):
    """Calculate coordinates of a line based on slope and intercept"""
    x_vals = np.array(ax.get_xlim())
    return (x_vals, intercept + slope * x_vals)

ax = oil_px.plot.scatter(x='XOM', y='CVX', alpha=0.3, figsize=(4, 4))
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
ax.plot(*abline(loadings.loc[0, 'CVX'] / loadings.loc[0, 'XOM'], 0, ax),
        '--', color='C1')
ax.plot(*abline(loadings.loc[1, 'CVX'] / loadings.loc[1, 'XOM'], 0, ax),
        '--', color='C1')
```

The dashed lines indicate the direction of the two principal components: the first aligns with the ellipse's long axis, while the second aligns with the short axis. The first principal component accounts for most of the variability in the two stock returns. This is logical, as energy stock prices generally move together.
But we should note that:
- Signs don’t matter: The principal component remains the same if all loadings are flipped. For example, using weights of $0.747$ and $0.665$ for the first principal component is equivalent to using the negative weights; similarly, an infinite line defined by the origin and $(1, 1)$ is the same as one defined by the origin and $(-1,-1)$.
- PCA only works with numeric variables.
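As a quick numeric check of the first point, here is a small Python sketch (my own, on synthetic data) showing that negating a loading vector merely flips the sign of the resulting scores, leaving the component's direction and explained variance unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic two-variable data standing in for a pair of correlated returns.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([base + 0.1 * rng.normal(size=500),
                     base + 0.1 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

# Scores computed with the negated first loading vector are just the negated scores.
flipped = (X - X.mean(axis=0)) @ (-pca.components_[0])
print(np.allclose(flipped, -scores[:, 0]))  # True
```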
Computing Principal Components
Going from two variables to more variables is straightforward. For the first component, simply include the additional predictor variables in the linear combination, with weights chosen to collect as much of the covariation of all the predictors as possible into this first principal component.
Calculating principal components is a classic statistical method using either the correlation or covariance matrix, and it executes quickly without iteration. As mentioned, principal components analysis applies solely to numeric variables, not categorical ones. The complete process follows:
- In creating the first principal component, PCA arrives at the linear combination of predictor variables, maximizing the percentage of total variance explained.
- This linear combination then becomes the first “new” predictor, Z1.
- PCA repeats this process, using the same variables with different weights, to create a second new predictor, Z2. The weighting is done such that Z1 and Z2 are uncorrelated.
- The process continues until you have as many new variables, or Zi components, as original variables Xi.
- Choose to retain as many components as are needed to account for most of the variance.
- The result so far is a set of weights for each component. The final step is to convert the original data into new principal component scores by applying the weights to the original values. These new scores can then be used as the reduced predictor variables.
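The process above can also be written out directly. Below is a minimal Python sketch (my own, not the book's code) that computes the loadings, the proportion of variance explained, and the principal component scores from the covariance matrix of a toy numeric matrix via an eigendecomposition, which is one standard way to carry out these steps.

```python
import numpy as np

# Toy numeric data: 300 observations of 4 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

Xc = X - X.mean(axis=0)                 # center each variable
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh handles symmetric matrices

# Order components by decreasing variance explained.
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order]            # columns are the weight vectors w_i
explained = eigvals[order] / eigvals.sum()

scores = Xc @ loadings                  # principal component scores Z_i
print(explained)                        # use this to decide how many components to keep
```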