
Practical Statistics for Data Scientists: Stepwise Regression & Model Selection in Depth


When building a regression model, the temptation is to include as many variables as possible to capture every detail. However, this approach can lead to overfitting, where the model performs exceptionally well on the training data but fails to generalize to new data.

  • Key Issues with Too Many Predictors
    • Overfitting: The model learns noise instead of patterns.
    • Computational Complexity: Larger models take longer to train and interpret.
    • Redundant Variables: Some predictors provide little or no additional information.
    • Multicollinearity: High correlation between predictors can distort results.
  • To address these issues, we can employ statistical measures such as Adjusted $R^2$, which assesses models according to their predictive capability, and AIC (Akaike Information Criterion), which balances model complexity against goodness of fit. Stepwise regression iteratively refines the model by adding or removing variables, while penalized regression techniques such as Ridge and Lasso reduce overfitting and improve generalization.

Adjusted $R^2$

The $R^2$ (Coefficient of Determination) measures how well the model explains the variance in the data. However, $R^2$ always increases when more variables are added, even if they are irrelevant. Adjusted $R^2$ penalizes unnecessary predictors, guarding against overfitting.

$R^2_{adj} = 1 - \big(\frac{(1-R^2)(n-1)}{n-P-1} \big)$

Where:

  • $n$ = number of observations.
  • $P$ = number of predictors.

If adding a predictor improves the model, Adjusted $R^2$ increases. If adding a predictor does not contribute significantly, Adjusted $R^2$ decreases.
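
A quick numeric illustration (with made-up values, not real data): a predictor that nudges $R^2$ up can still push Adjusted $R^2$ down.

    # Adjusted R^2 penalizes extra predictors (illustrative numbers only)
    def adjusted_r2(r2, n, p):
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(adjusted_r2(r2=0.810, n=100, p=5))  # ~0.7999
    print(adjusted_r2(r2=0.811, n=100, p=6))  # ~0.7988 -- higher R^2, lower adjusted R^2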

Akaike Information Criterion (AIC)

In the 1970s, Hirotugu Akaike introduced a model selection criterion that penalizes complexity while rewarding goodness of fit; minimizing AIC balances model simplicity against accuracy.

$AIC = 2P + n \log(RSS/n)$

Where:

  • $P$ = number of variables.
  • $RSS$ = residual sum of squares (error measure).
  • $n$ = number of observations.
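
Plugging hypothetical numbers into the formula shows the trade-off: each added variable costs 2 points of AIC, which a marginal drop in RSS may not repay.

    import numpy as np

    # AIC = 2P + n*log(RSS/n), per the formula above (hypothetical values)
    def aic(rss, n, p):
        return 2 * p + n * np.log(rss / n)

    print(aic(rss=1.00e6, n=500, p=5))   # ~3810.5
    print(aic(rss=0.999e6, n=500, p=6))  # ~3812.0 -- a 0.1% RSS gain doesn't repay the +2 penalty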

Variants of AIC

The formula for AIC might appear mysterious, yet it is fundamentally rooted in asymptotic results from information theory. Below are several variations of AIC.

| Metric | Description |
|--------|-------------|
| AICc | AIC corrected for small sample sizes. |
| BIC | Bayesian Information Criterion; penalizes extra variables more strongly than AIC. |
| Mallows $C_p$ | A variant of AIC developed by Colin Mallows. |
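
In practice these criteria are rarely computed by hand; a fitted statsmodels OLS result exposes them directly. (Note that statsmodels derives them from the log-likelihood, so the values differ from the RSS-based formula above by an additive constant.)

    import numpy as np
    import statsmodels.api as sm

    # Toy data just to show the attributes (not the house dataset)
    rng = np.random.default_rng(0)
    X_demo = sm.add_constant(rng.normal(size=(100, 3)))
    y_demo = X_demo @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(size=100)

    fit = sm.OLS(y_demo, X_demo).fit()
    print("AIC:", fit.aic)  # log-likelihood-based AIC
    print("BIC:", fit.bic)  # stronger complexity penalty than AIC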

We can explore methods to minimize AIC or maximize adjusted $R^2$ by examining all possible models, a technique referred to as all subset regression. However, this approach is computationally intensive and impractical for large datasets with numerous variables.
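
For a small candidate list, all subset regression can be sketched with itertools; the loop below visits every one of the $2^P - 1$ non-empty subsets, which is exactly why the approach blows up for large $P$. (A sketch that assumes the X and y built in the Python example further below.)

    from itertools import combinations
    import statsmodels.api as sm

    # Exhaustively score every subset of a small candidate list by AIC
    def best_subset(X, y, candidates):
        best_aic, best_vars = float('inf'), None
        for k in range(1, len(candidates) + 1):
            for subset in combinations(candidates, k):
                model = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
                if model.aic < best_aic:
                    best_aic, best_vars = model.aic, subset
        return best_vars, best_aic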

Stepwise Regression

Stepwise Regression is an automated feature selection method that adds or removes predictors based on their statistical significance.

One approach is to begin with a comprehensive model and then sequentially remove variables that do not contribute significantly, known as backward elimination. Conversely, another method is to start with a constant model and progressively add variables, referred to as forward selection. The steps for each are as follows.

Types of Stepwise Regression

  1. Forward Selection:
    • Starts with no predictors.
    • Adds one predictor at a time (choosing the best at each step).
    • Stops when adding more variables does not improve the model.
  2. Backward Elimination:
    • Starts with all predictors.
    • Removes the least significant variable at each step.
    • Stops when all remaining variables are statistically significant.
  3. Bidirectional (Both Directions):
    • A mix of forward selection and backward elimination.

  • In R

    library(MASS)
      
    # Fit full model
    house_full <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
                     Bedrooms + BldgGrade + PropertyType + NbrLivingUnits +
                     SqFtFinBasement + YrBuilt + YrRenovated +
                     NewConstruction, data=house, na.action=na.omit)
      
    # Perform stepwise regression
    step <- stepAIC(house_full, direction="both")
      
    # Print selected model
    print(step)
    

    stepAIC drops variables that do not improve AIC, retaining only the most relevant predictors.

  • In Python, unlike R, scikit-learn does not provide built-in stepwise regression, so we implement it manually. Forward selection looks like this:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Define outcome variable and predictors
    predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade',
                  'PropertyType', 'NbrLivingUnits', 'SqFtFinBasement', 'YrBuilt',
                  'YrRenovated', 'NewConstruction']

    # Convert categorical variables to numeric dummy columns
    X = pd.get_dummies(house[predictors], drop_first=True, dtype=int)
    X['NewConstruction'] = X['NewConstruction'].astype(int)

    y = house['AdjSalePrice']

    # Forward selection: add the predictor that lowers AIC the most;
    # stop when no remaining candidate improves on the current AIC
    def forward_selection(X, y):
        selected_features = []
        remaining_features = list(X.columns)
        # AIC of the intercept-only model is the baseline to beat
        current_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic

        while remaining_features:
            scores = {}
            for feature in remaining_features:
                model = sm.OLS(y, sm.add_constant(X[selected_features + [feature]])).fit()
                scores[feature] = model.aic

            best_feature = min(scores, key=scores.get)
            if scores[best_feature] >= current_aic:
                break  # no candidate lowers AIC; stop
            current_aic = scores[best_feature]
            selected_features.append(best_feature)
            remaining_features.remove(best_feature)
            print(f"Added {best_feature}, AIC: {current_aic:.2f}")

        return selected_features

    # Perform forward selection
    best_features = forward_selection(X, y)
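
    Backward elimination can be sketched in the same style: start from the full model and repeatedly drop the variable whose removal lowers AIC the most. A minimal sketch, reusing X, y, and statsmodels from the forward-selection example above:

    # Backward elimination: drop predictors while doing so lowers AIC
    def backward_elimination(X, y):
        selected = list(X.columns)
        current_aic = sm.OLS(y, sm.add_constant(X[selected])).fit().aic

        while len(selected) > 1:
            # AIC of the model with each single variable removed
            scores = {f: sm.OLS(y, sm.add_constant(X[[c for c in selected if c != f]])).fit().aic
                      for f in selected}
            worst = min(scores, key=scores.get)
            if scores[worst] >= current_aic:
                break  # every removal would raise AIC; stop
            current_aic = scores[worst]
            selected.remove(worst)
            print(f"Removed {worst}, AIC: {current_aic:.2f}")

        return selected

    best_features_bw = backward_elimination(X, y)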
      
    



Penalized Regression (Ridge & Lasso)

Penalized regression is similar in spirit to stepwise regression: it adds a constraint that penalizes the model for carrying too many large coefficients. But instead of eliminating predictors outright, as stepwise selection does, it shrinks their coefficients, in some cases to nearly zero.

Ridge Regression (L2 Regularization)

  • Adds a penalty to prevent large coefficients.
  • Keeps all variables but reduces their impact.
$\min \sum (y - \hat{y})^2 + \lambda \sum \beta^2$

Lasso Regression (L1 Regularization)

  • Forces some coefficients to be exactly zero, acting as an automatic feature selector.
$\min \sum (y - \hat{y})^2 + \lambda \sum \vert \beta \vert$

Ridge and Lasso are more stable alternatives to stepwise regression for preventing overfitting, and they handle correlated variables more gracefully.

  • In Python

    from sklearn.linear_model import Ridge, Lasso
      
    ridge = Ridge(alpha=1.0)  # Ridge regression
    lasso = Lasso(alpha=0.1)  # Lasso regression
      
    ridge.fit(X, y)
    lasso.fit(X, y)
      
    print("Ridge Coefficients:", ridge.coef_)
    print("Lasso Coefficients:", lasso.coef_)
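
    The alpha argument is the penalty strength $\lambda$ from the formulas above; in practice it is usually chosen by cross-validation rather than fixed by hand, and predictors are typically standardized first. A sketch with scikit-learn's RidgeCV and LassoCV, reusing X and y from the stepwise example:

    from sklearn.linear_model import RidgeCV, LassoCV

    # Pick the penalty strength by cross-validation instead of guessing
    ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
    lasso_cv = LassoCV(cv=5).fit(X, y)

    print("Best ridge alpha:", ridge_cv.alpha_)
    print("Best lasso alpha:", lasso_cv.alpha_)
    print("Lasso zeroed out:", [c for c, b in zip(X.columns, lasso_cv.coef_) if b == 0])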
    



Summary of Each Method

| Method | Pros | Cons |
|--------|------|------|
| Stepwise Regression | Simple, interpretable | Prone to overfitting; handles correlated variables poorly |
| Ridge Regression | Reduces overfitting; keeps all predictors | Cannot eliminate unimportant variables |
| Lasso Regression | Performs feature selection; prevents overfitting | May drop important correlated features |
| AIC/BIC/Mallows $C_p$ | Theoretically sound; penalizes complexity | Computationally expensive for large datasets |


