Day120 - STAT Review: Regression and Prediction (2)
Practical Statistics for Data Scientists: Assessing the Model, Cross Validation, Model Selection, and Prediction Using Regression
Assessing the Model - RMSE, RSE, R-squared
From a data science perspective, the key performance metric is the root mean square error, often abbreviated as RMSE. It is the square root of the average of the squared differences between predicted values ($\hat{y}_i$) and actual values ($y_i$).
This evaluates the model’s overall accuracy and provides a basis for comparison with other models, including those developed using machine learning techniques.
RSE, short for "residual standard error," is similar to RMSE but accounts for the p predictors in the model: the only difference is that the denominator reflects the degrees of freedom, $n - p - 1$, rather than the number of records, $n$. In practice, the difference between RMSE and RSE is very small for linear regression, particularly in big data applications.
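Written out, with $y_i$ the actual values, $\hat{y}_i$ the predicted values, and $p$ the number of predictors, the two metrics differ only in the denominator:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \qquad \mathrm{RSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}$$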
In R, the `summary` function calculates the RSE and other metrics for a regression model.

```r
summary(house_lm)

Call:
lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
    Bedrooms + BldgGrade, data = house, na.action = na.omit)

Residuals:
     Min       1Q   Median       3Q      Max
-1199479  -118908   -20977    87435  9473035

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -5.219e+05  1.565e+04 -33.342  < 2e-16 ***
SqFtTotLiving  2.288e+02  3.899e+00  58.694  < 2e-16 ***
SqFtLot       -6.047e-02  6.118e-02  -0.988    0.323
Bathrooms     -1.944e+04  3.625e+03  -5.363 8.27e-08 ***
Bedrooms      -4.777e+04  2.490e+03 -19.187  < 2e-16 ***
BldgGrade      1.061e+05  2.396e+03  44.277  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 261300 on 22681 degrees of freedom
Multiple R-squared:  0.5406,    Adjusted R-squared:  0.5405
F-statistic:  5338 on 5 and 22681 DF,  p-value: < 2.2e-16
```
In Python, `scikit-learn` provides several metrics for regression and classification. We use `mean_squared_error` to get RMSE and `r2_score` for the coefficient of determination.

```python
fitted = house_lm.predict(house[predictors])
RMSE = np.sqrt(mean_squared_error(house[outcome], fitted))
r2 = r2_score(house[outcome], fitted)
print(f'RMSE: {RMSE:.0f}')
print(f'r2: {r2:.4f}')
```
We can also use `statsmodels` to analyze the regression model more deeply. The pandas method `assign`, as demonstrated here, adds a constant column with a value of 1 to the predictors; this is necessary to model the intercept.

```python
model = sm.OLS(house[outcome], house[predictors].assign(const=1))
results = model.fit()
results.summary()
```
Another helpful metric is the coefficient of determination, also known as the R-squared statistic, $R^2$. R-squared ranges from 0 to 1 and measures the proportion of variation in the data that is explained by the model. It is primarily useful in explanatory regression, where it helps evaluate how well the model fits the data.
The denominator is directly related to Y’s variance. R’s output includes the adjusted R-squared value, which accounts for degrees of freedom and effectively penalizes adding additional predictors in a model. This value rarely differs significantly from R-squared in multiple regression with large data sets.
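In formula form, with $\bar{y}$ the mean of the actual values:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

and the adjusted version, which penalizes the $p$ predictors:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$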
Along with the estimated coefficients, both R and statsmodels report the standard error of each coefficient (SE) and its t-statistic, which is the coefficient divided by its standard error, $t_b = \hat{b} / \mathrm{SE}(\hat{b})$. The t-statistic, and its mirror image the p-value, indicates the extent to which a coefficient is "statistically significant", that is, outside the range of what a mere chance arrangement of predictor and target variables would produce. A higher t-statistic, combined with a lower p-value, signals a more significant predictor.
In practice, data scientists mainly use the t-statistic to decide whether a predictor should be included in a model. High t-statistics (with p-values near 0) suggest retention, while low t-statistics indicate the predictor could be dropped.
Cross-Validation
Classic statistical regression metrics (R-squared, F-statistics, and p-values) are all "in-sample" metrics: they are computed from the same data used to fit the model. Intuitively, we can instead split the data, fit the model on one part, and assess its performance on a part it has not seen. Typically, we use most of the data to fit the model and hold out a smaller portion (the holdout sample) to test it.
This holdout sample concept could be expanded to include multiple sequential holdout samples. The basic algorithm for k-fold cross-validation is outlined as follows:
- Reserve $1/k$ of the data as a holdout sample.
- Train the model using the remaining data.
- Evaluate the model on the $1/k$ holdout and document the necessary model assessment metrics.
- Reintroduce the initial $1/k$ of the data, then set aside the following $1/k$ (excluding records selected previously).
- Repeat steps 2 and 3.
- Continue this process until every record has served in the holdout.
- Calculate the average or otherwise consolidate the model assessment metrics.
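The steps above can be sketched with scikit-learn's `KFold` splitter, using RMSE as the assessment metric. The data here is synthetic for illustration (the house data is not reproduced in this note):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for the predictors and outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
rmse_scores = []
for train_idx, test_idx in kf.split(X):
    # Steps 1-2: reserve 1/k as holdout, train on the remaining data
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Step 3: evaluate on the holdout and record the metric
    pred = model.predict(X[test_idx])
    rmse_scores.append(np.sqrt(mean_squared_error(y[test_idx], pred)))

# Final step: average the assessment metric across folds
print(f'mean CV RMSE: {np.mean(rmse_scores):.3f}')
```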
Model Selection and Stepwise Regression
For certain problems, numerous variables can serve as predictors in a regression analysis. For instance, when estimating house value, additional factors such as basement size or year built could be included.
In R, it's easy to add these to the regression equation.

```r
house_full <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
                   Bedrooms + BldgGrade + PropertyType + NbrLivingUnits +
                   SqFtFinBasement + YrBuilt + YrRenovated + NewConstruction,
                 data = house, na.action = na.omit)
```
In Python, we need to convert the categorical and boolean variables into numbers.

```python
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade',
              'PropertyType', 'NbrLivingUnits', 'SqFtFinBasement', 'YrBuilt',
              'YrRenovated', 'NewConstruction']
X = pd.get_dummies(house[predictors], drop_first=True)
X['NewConstruction'] = [1 if nc else 0 for nc in X['NewConstruction']]
house_full = sm.OLS(house[outcome], X.assign(const=1))
results = house_full.fit()
results.summary()
```
However, adding additional variables doesn't automatically improve our model. Statisticians follow the principle of Occam’s Razor when selecting a model: everything else being equal, a simpler model is preferable to a more complex one.
Moreover, adding extra variables always lowers RMSE and raises R-squared on the training data. Therefore, these metrics are unsuitable for guiding model selection.
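A quick synthetic check (not from the book) illustrates the point: appending a column of pure noise to the design matrix can never lower the training-data R², even though the new column carries no real signal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=100)

# R-squared on the training data with the two real predictors
base = LinearRegression().fit(X, y)
r2_base = r2_score(y, base.predict(X))

# Append a pure-noise predictor and refit
X_noise = np.column_stack([X, rng.normal(size=100)])
noisy = LinearRegression().fit(X_noise, y)
r2_noise = r2_score(y, noisy.predict(X_noise))

print(f'R2 with 2 predictors:      {r2_base:.4f}')
print(f'R2 with noise column added: {r2_noise:.4f}')
```

Least squares can always set the new coefficient to zero, so the in-sample fit never gets worse; this is exactly why in-sample R² rewards complexity.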
Prediction Using Regression
In data science, regression is used mainly for prediction. Although regression is a well-established statistical method, the statistical tradition has focused more on explanatory modeling than on prediction.
Key Terms for Prediction Using Regression
- Prediction Interval
- An interval of uncertainty surrounding an individual predicted value.
- Extrapolation
- Extending a model beyond the range of the data used for fitting.
The Dangers of Extrapolation
Regression models should not be used to make predictions outside the data range (except for time series forecasting).
The model is valid only for predictor values where the data contains sufficient observations (and even where data is available, other problems may arise). For example, using the house_lm model fitted earlier to predict the price of a 5,000-square-foot empty lot produces an absurd result: –521,900 + 5,000 × –0.0605 = –$522,202. This occurs because the data includes only parcels with buildings and has no records for vacant land, so the model has no basis for predicting the sales price of such a property.