Day129 - STAT Review: Regression and Prediction (7)
Practical Statistics for Data Scientists: Stepwise Regression & Model Selection in Depth
When building a regression model, the temptation is to include as many variables as possible to capture every detail. However, this approach can lead to overfitting, where the model performs exceptionally well on the training data but fails to generalize to new data.
- Key Issues with Too Many Predictors
- Overfitting: The model learns noise instead of patterns.
- Computational Complexity: Larger models take longer to train and interpret.
- Redundant Variables: Some predictors provide little or no additional information.
- Multicollinearity: High correlation between predictors can distort results (a quick diagnostic check is sketched below).
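As a hedged illustration of that last point, the sketch below computes variance inflation factors (VIFs) with statsmodels; the `house` data frame and column names come from the house-price example used later in this post and are assumed to be already loaded.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Numeric predictors from the house-price example below (assumes `house` is loaded)
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade']
X = sm.add_constant(house[predictors])  # include an intercept column for meaningful VIFs

# A VIF above roughly 5-10 is a common rule of thumb for problematic collinearity
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```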
To address these issues, we can rely on statistical criteria such as Adjusted $R^2$, which assesses models according to their predictive contribution, and AIC (Akaike Information Criterion), which balances model complexity against goodness of fit. Stepwise regression iteratively refines the model by adding or removing variables, and penalized regression techniques, such as Ridge and Lasso, shrink coefficients to reduce overfitting and enhance generalization.
Adjusted $R^2$
The $R^2$ (Coefficient of Determination) measures how well the model explains the variance in the data. However, $R^2$ always increases when more variables are added, even if they are irrelevant. Adjusted $R^2$ penalizes unnecessary predictors, preventing overfitting:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - P - 1}$$

Where,
- $n$ = number of observations.
- $P$ = number of predictors.
If adding a predictor improves the model, Adjusted $R^2$ increases. If adding a predictor does not contribute significantly, Adjusted $R^2$ decreases.
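As a minimal sketch of the formula (on synthetic data, so the numbers are only illustrative), Adjusted $R^2$ can be computed directly from $R^2$, $n$, and $P$, and compared with the value statsmodels reports:

```python
import numpy as np
import statsmodels.api as sm

# Toy data: only the first two columns actually drive y; the third is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
n, p = len(y), X.shape[1]

# Adjusted R^2 from the formula above (statsmodels exposes the same value as rsquared_adj)
adj_r2 = 1 - (1 - model.rsquared) * (n - 1) / (n - p - 1)
print(adj_r2, model.rsquared_adj)  # the two values should agree
```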
Akaike Information Criterion (AIC)
In the 1970s, Hirotugu Akaike introduced a model selection criterion that penalizes complexity while rewarding goodness of fit; the goal is to minimize AIC, balancing model simplicity and accuracy. For regression with normally distributed errors, AIC can be written (up to an additive constant) as:

$$AIC = 2P + n\log(RSS/n)$$

Where:
- $P$ = number of variables.
- $RSS$ = residual sum of squares (error measure).
- $n$ = number of observations.
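A minimal sketch of this formula on synthetic data is below. Note that library implementations such as statsmodels' `model.aic` use the full log-likelihood, so absolute values differ by an additive constant, but model comparisons are unaffected as long as one convention is used consistently.

```python
import numpy as np
import statsmodels.api as sm

def aic_book(rss, n, p):
    # AIC as in the formula above; constants are dropped, so only differences matter
    return 2 * p + n * np.log(rss / n)

# Toy data: the third column is irrelevant, so including it should not lower AIC
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

for cols in ([0, 1], [0, 1, 2]):
    fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    rss = np.sum(fit.resid ** 2)
    print(cols, round(aic_book(rss, len(y), len(cols)), 2))
```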
Variants of AIC
The formula for AIC might appear mysterious, yet it is fundamentally rooted in asymptotic results from information theory. Below are several variants of AIC.
Metric | Description |
---|---|
AICc | AIC corrected for small sample sizes. |
BIC | Bayesian Information Criterion (more substantial penalty for extra variables). |
Mallows $C_p$ | Alternative to AIC, developed by Colin Mallows. |
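For reference, commonly cited forms of these criteria for regression with normal errors are sketched below (this is my notation, not the book's; here $P$ counts all estimated parameters including the intercept, and $\hat{\sigma}^2$ is the error-variance estimate from the full model):

$$AIC_c = AIC + \frac{2P(P + 1)}{n - P - 1}$$

$$BIC = n\log(RSS/n) + P\log(n)$$

$$C_p = \frac{RSS_P}{\hat{\sigma}^2} - n + 2P$$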
We can explore methods to minimize AIC or maximize adjusted $R^2$ by examining all possible models, a technique referred to as all subset regression. However, this approach is computationally intensive and impractical for large datasets with numerous variables.
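All subset regression can nevertheless be worth sketching on a deliberately small candidate set. The example below (reusing the `house` data frame and column names from the example later in this post, so it is an illustration rather than the book's code) enumerates every combination of four predictors and keeps the fit with the lowest AIC:

```python
from itertools import combinations
import statsmodels.api as sm

# A small candidate set keeps the exhaustive search tractable (2^4 - 1 = 15 models)
candidates = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'BldgGrade']
y = house['AdjSalePrice']

best_aic, best_subset = float('inf'), None
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        model = sm.OLS(y, sm.add_constant(house[list(subset)])).fit()
        if model.aic < best_aic:
            best_aic, best_subset = model.aic, subset

print(best_subset, round(best_aic, 2))
```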
Stepwise Regression
Stepwise Regression is an automated feature selection method that adds or removes predictors based on their statistical significance.
One approach is to begin with a comprehensive model and then sequentially remove variables that do not contribute significantly, known as backward elimination. Conversely, another method is to start with a constant model and progressively add variables, referred to as forward selection. The steps for each are as follows.
Types of Stepwise Regression
- Forward Selection:
- Starts with no predictors.
- Adds one predictor at a time (choosing the best at each step).
- Stops when adding more variables does not improve the model.
- Backward Elimination:
- Starts with all predictors.
- Removes the least significant variable at each step.
- Stops when all remaining variables are statistically significant.
- Bidirectional (Both Directions):
- Mix of forward selection and backward elimination.
- Variables can be added or removed at each step until no change improves the model.
In R
```r
library(MASS)

# Fit full model
house_full <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
                   Bedrooms + BldgGrade + PropertyType + NbrLivingUnits +
                   SqFtFinBasement + YrBuilt + YrRenovated + NewConstruction,
                 data=house, na.action=na.omit)

# Perform stepwise regression
step <- stepAIC(house_full, direction="both")

# Print selected model
print(step)
```
The model removes unnecessary variables and retains only the most relevant ones.
In Python, unlike R, scikit-learn does not have built-in stepwise regression, so we must implement it manually. Forward selection in Python is as follows.
```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Define outcome variable and predictors
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade',
              'PropertyType', 'NbrLivingUnits', 'SqFtFinBasement', 'YrBuilt',
              'YrRenovated', 'NewConstruction']

# Convert categorical variables
X = pd.get_dummies(house[predictors], drop_first=True)
X['NewConstruction'] = X['NewConstruction'].astype(int)
y = house['AdjSalePrice']

# Define function for forward selection (add the variable that lowers AIC most at each step)
def forward_selection(X, y):
    selected_features = []
    remaining_features = list(X.columns)
    current_aic = float('inf')
    while remaining_features:
        scores = {}
        for feature in remaining_features:
            model = sm.OLS(y, sm.add_constant(X[selected_features + [feature]])).fit()
            scores[feature] = model.aic
        best_feature = min(scores, key=scores.get)
        if scores[best_feature] >= current_aic:
            break  # stop when adding another variable no longer improves AIC
        current_aic = scores[best_feature]
        selected_features.append(best_feature)
        remaining_features.remove(best_feature)
        print(f"Added {best_feature}, AIC: {current_aic:.2f}")
    return selected_features

# Perform forward selection
best_features = forward_selection(X, y)
```
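For completeness, a minimal backward-elimination sketch under the same assumptions (it reuses `X`, `y`, and the AIC criterion from the block above; this is an illustration, not a drop-in replacement for R's stepAIC):

```python
import statsmodels.api as sm  # X and y are assumed to exist from the block above

def backward_elimination(X, y):
    selected_features = list(X.columns)
    current_aic = sm.OLS(y, sm.add_constant(X[selected_features])).fit().aic
    while len(selected_features) > 1:
        scores = {}
        for feature in selected_features:
            reduced = [f for f in selected_features if f != feature]
            scores[feature] = sm.OLS(y, sm.add_constant(X[reduced])).fit().aic
        worst_feature = min(scores, key=scores.get)  # dropping this variable helps AIC the most
        if scores[worst_feature] >= current_aic:
            break  # no removal improves AIC, so stop
        current_aic = scores[worst_feature]
        selected_features.remove(worst_feature)
        print(f"Removed {worst_feature}, AIC: {current_aic:.2f}")
    return selected_features

backward_features = backward_elimination(X, y)
```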
Penalized Regression (Ridge & Lasso)
Penalized regression is similar to stepwise regression as it incorporates constraints that penalize models for excessive variables. Instead of completely eliminating predictors, as in stepwise selection, it reduces coefficients, sometimes to nearly zero.
Ridge Regression (L2 Regularization)
- Adds a penalty to prevent large coefficients.
- Keeps all variables but reduces their impact.
Lasso Regression (L1 Regularization)
- Forces some coefficients to be exactly zero, acting as an automatic feature selector.
Ridge and Lasso are more stable alternatives to stepwise regression: they help prevent overfitting and handle correlated variables more gracefully.
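As a sketch of the underlying objectives (my notation, not the book's), both methods minimize the residual sum of squares plus a penalty whose strength is controlled by $\lambda$:

$$\hat{\beta}_{ridge} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^{T}\beta\right)^2 + \lambda\sum_{j=1}^{P}\beta_j^2$$

$$\hat{\beta}_{lasso} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^{T}\beta\right)^2 + \lambda\sum_{j=1}^{P}|\beta_j|$$

The L2 penalty shrinks coefficients smoothly toward zero, while the L1 penalty can set some of them exactly to zero, which is why Lasso doubles as a feature selector.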
In Python
```python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)  # Ridge regression
lasso = Lasso(alpha=0.1)  # Lasso regression

ridge.fit(X, y)
lasso.fit(X, y)

print("Ridge Coefficients:", ridge.coef_)
print("Lasso Coefficients:", lasso.coef_)
```
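In practice the penalty strength `alpha` is usually chosen by cross-validation, and the features are standardized first so the penalty treats them on a common scale. A hedged sketch (reusing `X` and `y` from above) with scikit-learn's RidgeCV and LassoCV:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Candidate penalty strengths; cross-validation picks the best one
alphas = np.logspace(-3, 3, 50)

# Standardize features so the penalty treats all coefficients comparably
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5))

ridge_cv.fit(X, y)
lasso_cv.fit(X, y)

print("Best ridge alpha:", ridge_cv[-1].alpha_)
print("Best lasso alpha:", lasso_cv[-1].alpha_)
print("Lasso coefficients set to zero:", (lasso_cv[-1].coef_ == 0).sum())
```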
Summary of Each Method
Method | Pros | Cons |
---|---|---|
Stepwise Regression | Simple, interpretable | Prone to overfitting, does not handle correlated variables well |
Ridge Regression | Reduces overfitting, keeps all predictors | Cannot eliminate unimportant variables |
Lasso Regression | Feature selection, prevents overfitting | May drop important correlated features |
AIC/BIC/Mallows $C_p$ | Theoretically sound, penalizes complexity | Computationally expensive for large datasets |