Day120 - STAT Review: Regression and Prediction (2)
Practical Statistics for Data Scientists: Assessing the Model, Cross Validation, Model Selection, and Prediction Using Regression
Assessing the Model - RMSE, RSE, R-squared
From a data science perspective, the key performance metric is the root mean square error, often abbreviated as RMSE. It is the square root of the average of the squared differences between predicted values ($\hat{y}_i$) and actual values ($y_i$).
This evaluates the model’s overall accuracy and provides a basis for comparison with other models, including those developed using machine learning techniques.
RSE, short for "residual standard error," is similar to RMSE but accounts for the p predictors in the model: the only difference is that the denominator reflects the degrees of freedom, $n - p - 1$, rather than the number of records, $n$. In practice, the difference between RMSE and RSE is very small for linear regression, particularly in big data applications.
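Written out, with $y_i$ the actual values, $\hat{y}_i$ the predicted values, and $p$ the number of predictors, the two metrics differ only in the denominator:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \qquad \mathrm{RSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}$$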
In R, the `summary` function calculates the RSE and other metrics for a regression model.

```r
summary(house_lm)

Call:
lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
    Bedrooms + BldgGrade, data = house, na.action = na.omit)

Residuals:
     Min       1Q   Median       3Q      Max
-1199479  -118908   -20977    87435  9473035

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -5.219e+05  1.565e+04 -33.342  < 2e-16 ***
SqFtTotLiving  2.288e+02  3.899e+00  58.694  < 2e-16 ***
SqFtLot       -6.047e-02  6.118e-02  -0.988    0.323
Bathrooms     -1.944e+04  3.625e+03  -5.363 8.27e-08 ***
Bedrooms      -4.777e+04  2.490e+03 -19.187  < 2e-16 ***
BldgGrade      1.061e+05  2.396e+03  44.277  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 261300 on 22681 degrees of freedom
Multiple R-squared:  0.5406,    Adjusted R-squared:  0.5405
F-statistic:  5338 on 5 and 22681 DF,  p-value: < 2.2e-16
```
In Python, `scikit-learn` provides several metrics for regression and classification. We use `mean_squared_error` to get RMSE and `r2_score` for the coefficient of determination.

```python
fitted = house_lm.predict(house[predictors])
RMSE = np.sqrt(mean_squared_error(house[outcome], fitted))
r2 = r2_score(house[outcome], fitted)
print(f'RMSE: {RMSE:.0f}')
print(f'r2: {r2:.4f}')
```
We can also use `statsmodels` to analyze the regression model more deeply. The pandas method `assign`, as demonstrated here, adds a constant column with a value of 1 to the predictors; this is necessary to model the intercept.

```python
model = sm.OLS(house[outcome], house[predictors].assign(const=1))
results = model.fit()
results.summary()
```
Another helpful metric is the coefficient of determination, also known as the R-squared statistic, $R^2$. R-squared ranges from 0 to 1 and measures the proportion of variation in the data that is explained by the model. It is primarily useful in explanatory regression, where it helps evaluate how well the model fits the data.
The denominator is directly related to Y’s variance. R’s output includes the adjusted R-squared value, which accounts for degrees of freedom and effectively penalizes adding additional predictors in a model. This value rarely differs significantly from R-squared in multiple regression with large data sets.
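In formula form, with $\bar{y}$ the mean of the actual values:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

and the adjusted version, which penalizes the $p$ predictors:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$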
Along with the estimated coefficients, both R and statsmodels report the standard error of each coefficient (SE) and its t-statistic, which is the coefficient divided by its standard error, $t_b = \hat{b} / \mathrm{SE}(\hat{b})$. The t-statistic, and its mirror image the p-value, indicates the extent to which a coefficient is "statistically significant", that is, outside the range of what a mere chance arrangement of predictor and target variables would produce. A higher t-statistic, combined with a lower p-value, signals a more significant predictor.
In practice, data scientists mainly use the t-statistic to decide whether a predictor should be included in a model. High t-statistics (with p-values near 0) suggest retention, while low t-statistics indicate the predictor could be dropped.
Cross-Validation
Classic statistical regression metrics (R-squared, F-statistics, and p-values) are all "in-sample" metrics: they are computed from the same data used to fit the model. Intuitively, we can instead split the data, fit the model on one part, and assess its performance on a part it has not seen. Typically, we use most of the data to fit the model and hold out a smaller portion (the holdout sample) to test it.
This holdout sample concept could be expanded to include multiple sequential holdout samples. The basic algorithm for k-fold cross-validation is outlined as follows:
- Reserve $1/k$ of the data as a holdout sample.
- Train the model using the remaining data.
- Evaluate the model on the $1/k$ holdout and document the necessary model assessment metrics.
- Reintroduce the initial $1/k$ of the data, then set aside the following $1/k$ (excluding records selected previously).
- Repeat steps 2 and 3.
- Continue this process until every record has served in the holdout.
- Calculate the average or otherwise consolidate the model assessment metrics.
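The steps above can be sketched with scikit-learn's `KFold` splitter, using RMSE as the assessment metric. The data here is synthetic for illustration (the house data is not reproduced in this note):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for the predictors and outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
rmse_scores = []
for train_idx, test_idx in kf.split(X):
    # Steps 1-2: reserve 1/k as holdout, train on the remaining data
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Step 3: evaluate on the holdout and record the metric
    pred = model.predict(X[test_idx])
    rmse_scores.append(np.sqrt(mean_squared_error(y[test_idx], pred)))

# Final step: average the assessment metric across folds
print(f'mean CV RMSE: {np.mean(rmse_scores):.3f}')
```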
Model Selection and Stepwise Regression
For certain problems, numerous variables can serve as predictors in a regression analysis. For instance, when estimating house value, additional factors such as basement size or year built could be included.
In R, it's easy to add these to the regression equation.

```r
house_full <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
                   Bedrooms + BldgGrade + PropertyType + NbrLivingUnits +
                   SqFtFinBasement + YrBuilt + YrRenovated + NewConstruction,
                 data = house, na.action = na.omit)
```
In Python, we need to convert the categorical and boolean variables into numbers.

```python
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade',
              'PropertyType', 'NbrLivingUnits', 'SqFtFinBasement', 'YrBuilt',
              'YrRenovated', 'NewConstruction']
X = pd.get_dummies(house[predictors], drop_first=True)
X['NewConstruction'] = [1 if nc else 0 for nc in X['NewConstruction']]
house_full = sm.OLS(house[outcome], X.assign(const=1))
results = house_full.fit()
results.summary()
```
However, adding additional variables doesn't automatically improve our model. Statisticians follow the principle of Occam’s Razor when selecting a model: everything else being equal, a simpler model is preferable to a more complex one.
Moreover, adding extra variables always lowers RMSE and raises R-squared on the training data. Therefore, these metrics are unsuitable for guiding model selection.
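A quick synthetic check (not from the book) illustrates the point: appending a column of pure noise to the design matrix can never lower the training-data R², even though the new column carries no real signal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=100)

# R-squared on the training data with the two real predictors
base = LinearRegression().fit(X, y)
r2_base = r2_score(y, base.predict(X))

# Append a pure-noise predictor and refit
X_noise = np.column_stack([X, rng.normal(size=100)])
noisy = LinearRegression().fit(X_noise, y)
r2_noise = r2_score(y, noisy.predict(X_noise))

print(f'R2 with 2 predictors:      {r2_base:.4f}')
print(f'R2 with noise column added: {r2_noise:.4f}')
```

Least squares can always set the new coefficient to zero, so the in-sample fit never gets worse; this is exactly why in-sample R² rewards complexity.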
Prediction Using Regression
In data science, regression is used mainly for prediction. Although regression is a well-established statistical method, the statistical tradition has focused more on explanatory modeling than on prediction.
Key Terms for Prediction Using Regression
- Prediction Interval
- An interval of uncertainty surrounding an individual predicted value.
- Extrapolation
- Extending a model beyond the range of the data used for fitting.
The Dangers of Extrapolation
Regression models should not be used to make predictions outside the data range (except for time series forecasting).
The model is valid only for predictor values where the data contains sufficient observations (and even where data is available, other problems may arise). For example, using the house_lm model fitted earlier to predict the price of a 5,000-square-foot empty lot produces an absurd result: –521,900 + 5,000 × –0.0605 = –$522,202. This occurs because the data includes only parcels with buildings and has no records for vacant land, so the model has no basis for predicting the sales price of such a property.