Day148 - STAT Review: Unsupervised Learning (1)
Practical Statistics for Data Scientists: Principal Components Analysis (1) (Unsupervised Learning, A Simple Example and Computing the Principal Components)
Unsupervised Learning
Unsupervised learning involves statistical methods to extract meaning from data without labeled training data. Supervised learning aims to build a model that predicts a response variable using predictor variables. In contrast, unsupervised learning constructs a model without distinguishing between response and predictor variables.
Unsupervised learning can be used to achieve different goals. Sometimes, it can create a predictive rule without a labeled response.
In supervised learning (previous postings), we:
- Have labeled data (e.g., emails labeled as spam or not spam).
- Build models that predict a known response.
In unsupervised learning, we:
- Have no labeled output.
- Only have features (inputs).
- Aim to discover patterns and relationships hidden in the data without being told what to look for.
Here are the primary unsupervised learning methodologies.
- Clustering: Group similar observations.
- Example: Group customers based on their purchasing behavior.
- Dimensionality Reduction: Reduce many variables into a smaller, easier-to-handle set.
- Example: Thousands of sensor readings → a few key features.
- Exploratory Analysis: Understand the structure of a complex dataset.
- It is beneficial in big data scenarios.
How Unsupervised Learning Connects to Prediction
Even though unsupervised learning doesn’t directly predict, it supports predictive modeling. Sometimes, we want to predict a category without any labeled data. For example, we can predict an area’s vegetation type from satellite sensory data. Since we don’t have a response variable to train a model, clustering allows us to identify common patterns and categorize the regions.
Clustering is crucial for the “cold-start problem,” like launching new marketing campaigns or detecting new fraud types. Initially, there’s no data to train a model, but we can develop a predictive model as we gather data. Clustering accelerates learning by pinpointing population segments.
Unsupervised learning is also vital for regression and classification. In big data, subpopulations that are not well represented can lead to poor model performance. Clustering helps identify and label these subpopulations. Separate models can then be fit for each subpopulation, or the subpopulation identity can be encoded as a feature, allowing a single main model to use it as a predictor.
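To make the last idea concrete, here is a minimal Python sketch (not from the book) in which cluster labels, learned without any response variable, are appended as an extra predictor for a downstream supervised model. The data is synthetic, and K-means and logistic regression are only illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 500 observations, 4 numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy response, used only in the supervised step

# Step 1: discover subpopulations without using the response.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: include the cluster label as an additional predictor in the main model.
X_aug = np.column_stack([X, clusters])
model = LogisticRegression().fit(X_aug, y)
print(model.score(X_aug, y))
```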
Principal Component Analysis
Often, variables will vary together (covary), and some variation in one is duplicated by the variation in another (e.g., restaurant checks and tips).
Principal components analysis (PCA) is used to discover how numeric variables covary.
When we have many correlated variables, they often repeat similar information (e.g., income and credit card limit). Principal Components Analysis (PCA) creates new variables (principal components) that are linear combinations of the original variables and uncorrelated (orthogonal). The main goal is to capture as much variance (spread) as possible with fewer variables.
Key Terms for Principal Components Analysis
- Principal component
- A linear combination of the predictor variables.
- Loadings
- The weights that transform the predictors into the components.
- Synonyms: weights
- Scree-plot
- This graph illustrates the components’ variances, depicting their relative importance as either explained variance or the proportion of explained variance.
PCA combines multiple numeric predictors into a smaller set of variables, or principal components, which explain most of the variability while reducing the data dimensions. The weights reveal the original variables’ contributions to the principal components.
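Since the scree plot appears in the key terms above, here is a minimal Python sketch of one. The data is random and purely illustrative; with a real dataset you would fit PCA to the numeric predictors of interest.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random toy data standing in for a set of numeric predictors.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 5)),
                    columns=[f'x{i}' for i in range(1, 6)])

pca = PCA().fit(data)

# Screeplot: proportion of variance explained by each component.
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_)
plt.xlabel('Component')
plt.ylabel('Proportion of variance explained')
plt.show()
```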
A Simple Example
For two variables, $X_1$ and $X_2$, there are two principal components $Z_i$ ($i = 1$ or $2$): $Z_i = w_{i,1}X_1 + w_{i,2}X_2$. The weights ($w_{i,1},\ w_{i,2}$) are known as the component loadings. These transform the original variables into the principal components.
The first principal component, $Z_1$, is the linear combination that best accounts for the total variation. The second principal component, $Z_2$, is orthogonal to the first and captures as much of the remaining variation as possible. (If there were additional components, each additional one would be orthogonal to the others.)
Suppose we have two variables, say stock returns for Chevron (CVX) and ExxonMobil (XOM).
- In R:

```r
oil_px <- sp500_px[, c('CVX', 'XOM')]
pca <- princomp(oil_px)
pca$loadings
```

Output:

```
Loadings:
    Comp.1 Comp.2
CVX -0.747  0.665
XOM -0.665 -0.747

               Comp.1 Comp.2
SS loadings       1.0    1.0
Proportion Var    0.5    0.5
Cumulative Var    0.5    1.0
```
- In Python, we can use the scikit-learn implementation `sklearn.decomposition.PCA`:

```python
from sklearn.decomposition import PCA
import pandas as pd

pcs = PCA(n_components=2)
pcs.fit(oil_px)
loadings = pd.DataFrame(pcs.components_, columns=oil_px.columns)
loadings
```
The weights for CVX and XOM in the first principal component are -0.747 and -0.665, respectively; for the second principal component, they are 0.665 and -0.747. The first principal component is essentially an average of CVX and XOM, reflecting the correlation between the two energy companies. The second principal component measures when the stock prices of CVX and XOM diverge. Let's create a visualization in R as follows.
- In R:

```r
loadings <- pca$loadings
ggplot(data=oil_px, aes(x=CVX, y=XOM)) +
  geom_point(alpha=.3) +
  stat_ellipse(type='norm', level=.99) +
  geom_abline(intercept = 0, slope = loadings[2,1]/loadings[1,1]) +
  geom_abline(intercept = 0, slope = loadings[2,2]/loadings[1,2])
```
- The following code creates a similar visualization in Python:

```python
import numpy as np

def abline(slope, intercept, ax):
    """Calculate coordinates of a line based on slope and intercept"""
    x_vals = np.array(ax.get_xlim())
    return (x_vals, intercept + slope * x_vals)

ax = oil_px.plot.scatter(x='XOM', y='CVX', alpha=0.3, figsize=(4, 4))
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
ax.plot(*abline(loadings.loc[0, 'CVX'] / loadings.loc[0, 'XOM'], 0, ax),
        '--', color='C1')
ax.plot(*abline(loadings.loc[1, 'CVX'] / loadings.loc[1, 'XOM'], 0, ax),
        '--', color='C1')
```

The dashed lines indicate the direction of the two principal components: the first aligns with the ellipse's long axis, while the second aligns with the short axis. The first principal component accounts for most of the variability in the two stock returns. This is logical, as energy stock prices generally move together.
But we should note that:
- Signs don’t matter: The principal component remains the same if all loadings are flipped. For example, using weights of $0.747$ and $0.665$ for the first principal component is equivalent to using the negative weights; similarly, an infinite line defined by the origin and $(1, 1)$ is the same as one defined by the origin and $(-1,-1)$.
- PCA only works with numeric variables.
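As a quick numeric check of the first point, here is a small Python sketch (my own, on synthetic data) showing that negating a loading vector merely flips the sign of the resulting scores, leaving the component's direction and explained variance unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic two-variable data standing in for a pair of correlated returns.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([base + 0.1 * rng.normal(size=500),
                     base + 0.1 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

# Scores computed with the negated first loading vector are just the negated scores.
flipped = (X - X.mean(axis=0)) @ (-pca.components_[0])
print(np.allclose(flipped, -scores[:, 0]))  # True
```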
Computing Principal Components
Going from two variables to more variables is straightforward. For the first component, simply include the additional predictor variables in the linear combination, with weights chosen to collect as much of the covariation of all the predictors as possible into this first principal component.
Calculating principal components is a classic statistical method using either the correlation or covariance matrix, and it executes quickly without iteration. As mentioned, principal components analysis applies solely to numeric variables, not categorical ones. The complete process follows:
- In creating the first principal component, PCA arrives at the linear combination of predictor variables, maximizing the percentage of total variance explained.
- This linear combination then becomes the first “new” predictor, Z1.
- PCA repeats this process, using the same variables with different weights, to create a second new predictor, Z2. The weighting is done such that Z1 and Z2 are uncorrelated.
- The process continues until you have as many new variables, or Zi components, as original variables Xi.
- Choose to retain as many components as are needed to account for most of the variance.
- The result so far is a set of weights for each component. The final step is to convert the original data into new principal component scores by applying the weights to the original values. These new scores can then be used as the reduced predictor variables.
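The process above can also be written out directly. Below is a minimal Python sketch (my own, not the book's code) that computes the loadings, the proportion of variance explained, and the principal component scores from the covariance matrix of a toy numeric matrix via an eigendecomposition, which is one standard way to carry out these steps.

```python
import numpy as np

# Toy numeric data: 300 observations of 4 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

Xc = X - X.mean(axis=0)                 # center each variable
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh handles symmetric matrices

# Order components by decreasing variance explained.
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order]            # columns are the weight vectors w_i
explained = eigvals[order] / eigvals.sum()

scores = Xc @ loadings                  # principal component scores Z_i
print(explained)                        # use this to decide how many components to keep
```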