Day123 - STAT Review: Regression and Prediction (5)
Practical Statistics for Data Scientists: Regression Diagnostics - Outliers, Influential Observations, and Heteroskedasticity
Regression Diagnostics
In explanatory modeling, additional steps are taken to assess model fit, mainly through residual analysis. While these steps don’t directly address predictive accuracy, they offer valuable insights for predictions.
Key Terms for Regression Diagnostics
- Standardized Residuals
- Residuals divided by the standard error of the residuals.
- Outliers
- Records distant from the data (or the predicted outcome).
- Influential value
- A value or record that significantly impacts the regression equation.
- Leverage
- The degree of influence a single record has on a regression equation.
- Synonym: hat-value
- Non-normal Residuals
- Non-normally distributed residuals may violate specific technical regression criteria but are typically not an issue in data science.
- Heteroskedasticity
- When specific ranges of the outcome display residuals with higher variance, it may suggest that a predictor is absent from the equation.
- Partial Residual Plots
- A diagnostic plot that shows the relationship between the outcome variable and a single predictor, accounting for the other predictors.
- Synonym: added variable plot
Outliers
An outlier is an extreme value that lies far from most other observations. Just as outliers matter for estimates of location and variability, they can also cause problems in regression models.
In regression analysis, an outlier is a data point that significantly deviates from the predicted value. We detect outliers using the standardized residual, calculated by dividing the residual by its standard error. Standardized residuals represent how many standard errors a data point is from the regression line.
$\text{Residual} = y_{actual} - y_{predicted}$
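For reference, the standardized residual computed by R's rstandard and statsmodels' resid_studentized_internal (both used below) divides each residual by its estimated standard error, which depends on the observation's leverage $h_i$:

$$\text{Standardized residual}_i = \frac{y_i - \hat{y}_i}{\hat{\sigma}\sqrt{1 - h_i}}$$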
A common rule of thumb:
- If the absolute value of a standardized residual exceeds 3, the observation is considered a potential outlier.
- A stricter threshold, such as ±2.5, is sometimes used to flag more potential outliers.
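As a quick illustration of this rule of thumb, here is a minimal sketch on hypothetical data (the variable names and the injected outlier are made up for this example); it computes standardized residuals with statsmodels and flags observations beyond ±3:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Hypothetical data: a clean linear trend with one corrupted observation
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + rng.normal(0, 1, 100)
y[10] = 60  # inject an outlier far from the trend

X = sm.add_constant(pd.DataFrame({'x': x}))
fit = sm.OLS(y, X).fit()

# Standardized (internally studentized) residuals
sresid = OLSInfluence(fit).resid_studentized_internal

# Flag observations whose standardized residual exceeds +-3
outliers = np.where(np.abs(sresid) > 3)[0]
print(outliers)
```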
For example, let’s return to the King County house sales data used in prior postings and fit a regression model to all sales in zip code 98105.
In R:

```r
house_98105 <- house[house$ZipCode == 98105,]
lm_98105 <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
                 Bedrooms + BldgGrade, data=house_98105)
```
In Python:

```python
import pandas as pd
import statsmodels.api as sm

house_98105 = house.loc[house['ZipCode'] == 98105, ]
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade']
outcome = 'AdjSalePrice'

house_outlier = sm.OLS(house_98105[outcome],
                       house_98105[predictors].assign(const=1))
result_98105 = house_outlier.fit()
```
In R, we use the `rstandard` function to extract standardized residuals and the `order` function to identify the index of the smallest residual.

```r
sresid <- rstandard(lm_98105)
idx <- order(sresid)
sresid[idx[1]]
---
    20429
-4.326732
```
In Python, we use `statsmodels` and `OLSInfluence` to analyze the residuals.

```python
from statsmodels.stats.outliers_influence import OLSInfluence

influence = OLSInfluence(result_98105)
sresiduals = influence.resid_studentized_internal  # standardized residuals

# Identify the most extreme negative residual
outlier_index = sresiduals.idxmin()  # index of the most extreme outlier
outlier_value = sresiduals.min()     # the residual value itself
print(f'Index of most extreme negative residual: {outlier_index}')
print(f'Standardized residual value: {outlier_value}')
```
The biggest overestimate from the model is more than four standard errors above the regression line, corresponding to an overestimate of $757,754. The original data record corresponding to this outlier is as follows in R:

```r
house_98105[idx[1], c('AdjSalePrice', 'SqFtTotLiving', 'SqFtLot',
                      'Bathrooms', 'Bedrooms', 'BldgGrade')]
---
      AdjSalePrice SqFtTotLiving SqFtLot Bathrooms Bedrooms BldgGrade
             (dbl)         (int)   (int)     (dbl)    (int)     (int)
20429       119748          2900    7276         3        6         7
```
In Python:

```python
outlier = house_98105.loc[sresiduals.idxmin(), :]
print('AdjSalePrice', outlier[outcome])
print(outlier[predictors])
```
For a 2,900 sq ft house, a sale price of $119,748 is extremely low for this zip code. Upon further investigation, it turns out that this sale only involved a partial interest in the property (e.g., a legal or financial transaction that does not reflect full ownership). This is a data issue and should be removed from the analysis.
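Since this record reflects a partial-interest sale rather than a market transaction, one way to proceed is to drop it and refit. The following is a sketch, assuming the `outlier_index`, `predictors`, and `outcome` objects defined above:

```python
# Drop the partial-interest sale identified above and refit the model
house_clean = house_98105.drop(index=outlier_index)
model_clean = sm.OLS(house_clean[outcome],
                     house_clean[predictors].assign(const=1))
result_clean = model_clean.fit()
print(result_clean.params)  # coefficients after removing the data issue
```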
Influential Observations
An influential observation is a data point that, if removed, would significantly alter the regression equation. Unlike outliers, which exhibit large residuals, influential points do not necessarily have extreme residuals. Instead, they exert substantial leverage on the regression model.
For example, consider a dataset with a general trend. A single point far from the others in the x-direction but still close to the regression line pulls the line toward itself. If this point were removed, the regression line would shift significantly. This is an influential point.
Such a value has a significant impact on the regression even though it is not associated with a large residual in the full regression; it is said to have high leverage on the regression.
How to Detect Influential Observations?
Statisticians use leverage and Cook’s distance to measure influence.
- Hat-Values (Leverage)
- Measure how far a point’s predictor values are from the mean of all predictor values.
- A common rule of thumb: a hat-value above $2(P+1)/n$ indicates high leverage, meaning the point can strongly affect the regression.
- Cook’s Distance
- Combines leverage and residual size to measure influence.
- A rule of thumb: Cook's distance > $4/(n-P-1)$ indicates a highly influential observation. (Both thresholds can be checked directly from the fitted model, as sketched below.)
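Here is a minimal sketch of how these rules of thumb could be applied to the `result_98105` fit from above (the threshold formulas follow the list; `predictors` is assumed from the earlier Python block):

```python
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

influence = OLSInfluence(result_98105)
n = result_98105.nobs                  # number of observations
P = len(predictors)                    # number of predictors

hat_values = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]  # first element holds the distances

high_leverage = np.where(hat_values > 2 * (P + 1) / n)[0]
influential = np.where(cooks_d > 4 / (n - P - 1))[0]
print(f"{len(high_leverage)} high-leverage points, "
      f"{len(influential)} influential points by Cook's distance")
```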
We can visualize these metrics in a single plot.
In R:

```r
std_resid <- rstandard(lm_98105)
cooks_D <- cooks.distance(lm_98105)
hat_values <- hatvalues(lm_98105)
plot(subset(hat_values, cooks_D > 0.08), subset(std_resid, cooks_D > 0.08),
     xlab='hat_values', ylab='std_resid',
     cex=10*sqrt(subset(cooks_D, cooks_D > 0.08)), pch=16, col='lightgrey')
points(hat_values, std_resid, cex=10*sqrt(cooks_D))
abline(h=c(-2.5, 2.5), lty=2)
```
In Python:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import OLSInfluence

influence = OLSInfluence(result_98105)
fig, ax = plt.subplots(figsize=(5, 5))
ax.axhline(-2.5, linestyle='--', color='C1')
ax.axhline(2.5, linestyle='--', color='C1')
ax.scatter(influence.hat_matrix_diag, influence.resid_studentized_internal,
           s=1000 * np.sqrt(influence.cooks_distance[0]), alpha=0.5)
ax.set_xlabel('hat values')
ax.set_ylabel('studentized residuals')
plt.show()
```
The hat values are plotted on the x-axis, the residuals are plotted on the y-axis, and the size of the points is related to the value of Cook’s distance.

Identifying influential observations is useful mainly in smaller datasets when fitting a regression to predict future data. In larger datasets, a single observation is unlikely to carry enough weight to significantly alter the fitted equation, even if it is an extreme outlier. However, influence measures remain useful for anomaly detection.
Heteroskedasticity, Non-Normality, and Correlated Errors
The distribution of residuals mainly matters for the reliability of statistical inference, such as hypothesis tests and p-values, which are less important for data scientists focused on predictive accuracy.
Normally distributed errors indicate a well-fitting model, while non-normally distributed errors suggest that the model may be lacking some elements. For statistical inference to be valid, residuals should ideally be normally distributed, have constant variance, and be independent. One important consideration for data scientists is the standard method of calculating confidence intervals for predictions, which relies on these assumptions about residuals.
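To make that concrete, here is a supplementary sketch (not part of the original text) using the `result_98105` fit from above: statsmodels produces confidence and prediction intervals via get_prediction, and their validity rests on the residual assumptions just described.

```python
# Confidence and prediction intervals for the fitted model; these interval
# widths assume normally distributed, homoskedastic, independent residuals.
pred = result_98105.get_prediction(house_98105[predictors].assign(const=1))
intervals = pred.summary_frame(alpha=0.05)
print(intervals[['mean', 'mean_ci_lower', 'mean_ci_upper',
                 'obs_ci_lower', 'obs_ci_upper']].head())
```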
In regression analysis, residuals (errors) are the difference between the actual and predicted values. While residual distribution doesn’t always affect predictive accuracy, understanding residual patterns can help identify issues in the model.
Heteroskedasticity
- Heteroskedasticity refers to the lack of constant residual variance across all predicted values.
- Homoskedasticity is the ideal scenario in which residual variance remains constant for all predicted values.
A scatterplot of absolute residuals vs. predicted values is a simple way to visualize heteroskedasticity.
The following R code plots the absolute residuals versus the predicted values for the `lm_98105` regression fit in previous sections.

```r
df <- data.frame(resid = residuals(lm_98105), pred = predict(lm_98105))
ggplot(df, aes(pred, abs(resid))) + geom_point() + geom_smooth()
```

- `geom_point()` → plots the absolute residuals.
- `geom_smooth()` → adds a smooth trend line to reveal variance patterns.
In Python, the `seaborn` package has the `regplot` function to create a similar figure.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Absolute residuals vs. predicted values
fig, ax = plt.subplots(figsize=(5, 5))
sns.regplot(x=result_98105.fittedvalues, y=np.abs(result_98105.resid),
            scatter_kws={'alpha': 0.25}, line_kws={'color': 'C1'},
            lowess=True, ax=ax)
ax.set_xlabel('Predicted Value')
ax.set_ylabel('Absolute Residual')
ax.set_title('Residual vs. Predicted Plot')
plt.show()
```
To interpret this:
- If the residual spread widens at higher values, it suggests heteroskedasticity is present.
- Conversely, if the residuals are evenly distributed, there is no significant evidence of heteroskedasticity.
The figure below confirms that the variance of the residuals generally rises with higher-valued homes, though it remains notable for lower-valued homes as well. This plot shows that `lm_98105` exhibits heteroskedastic errors.

Why Should Data Scientists Care About Heteroskedasticity?
- Heteroskedasticity may suggest that:
- Prediction errors fluctuate across various price ranges.
- Confidence intervals for predictions may not be reliable.
- An important variable might be missing from the model (e.g., a categorical variable like location).