Day124 - STAT Review: Regression and Prediction (6)
Practical Statistics for Data Scientists: Partial Residual Plots and Nonlinearity, Polynomial & Spline Regression, and Generalized Additive Models
Partial Residual Plots and Nonlinearity
A partial residual plot helps visualize how well a single predictor explains the outcome after accounting for all other predictors. This allows us to detect nonlinearity in the predictor-response relationship.
A partial residual can be seen as a “synthetic outcome” value that combines the prediction of a single predictor with the actual residual from the full regression equation.
For predictor $X_i$, the partial residual is the ordinary residual plus the regression term associated with $X_i$:

$$\text{Partial residual} = \text{Residual} + \hat{b}_i X_i$$

Where:
- Residual: $y_{\text{actual}} - y_{\text{predicted}}$, the ordinary residual from the full model.
- $\hat{b}_i X_i$: the regression term for the specific predictor.

This isolates the effect of $X_i$ on the response variable.
Example: King County Housing Data
Take the example of the King County housing data. If the variable SqFtTotLiving (representing house size) truly had a linear relationship with price, the smooth curve of the partial residuals would follow the straight line of the regression term.
- In R, the `predict` function has an option to return the individual regression terms $\hat{b}_i X_i$:

```r
terms <- predict(lm_98105, type='terms')
partial_resid <- resid(lm_98105) + terms
```
The partial residual plot displays the predictor $X_i$ on the x-axis and the partial residuals on the y-axis. Using ggplot2 allows for the addition of a smooth curve for the partial residuals.
```r
df <- data.frame(SqFtTotLiving = house_98105[, 'SqFtTotLiving'],
                 Terms = terms[, 'SqFtTotLiving'],
                 PartialResid = partial_resid[, 'SqFtTotLiving'])
ggplot(df, aes(SqFtTotLiving, PartialResid)) +
  geom_point(shape=1) + scale_shape(solid = FALSE) +
  geom_smooth(linetype=2) +
  geom_line(aes(SqFtTotLiving, Terms))
```
- In Python, the `statsmodels` package includes the `sm.graphics.plot_ccpr` method for generating a partial residual plot:

```python
sm.graphics.plot_ccpr(result_98105, 'SqFtTotLiving')
```
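For intuition, here is a minimal sketch of computing the partial residuals by hand in Python, mirroring the R approach above. It assumes `result_98105` is a statsmodels OLS result fitted with a formula in which `SqFtTotLiving` enters linearly, and `house_98105` is the underlying DataFrame:

```python
import numpy as np
import matplotlib.pyplot as plt

# Regression term for the single predictor: b_hat * X
term = result_98105.params['SqFtTotLiving'] * house_98105['SqFtTotLiving']

# Partial residual = ordinary residual + regression term for this predictor
partial_resid = result_98105.resid + term

# Scatter the partial residuals and overlay the straight regression term
plt.scatter(house_98105['SqFtTotLiving'], partial_resid, alpha=0.3)
xs = np.sort(house_98105['SqFtTotLiving'].values)
plt.plot(xs, result_98105.params['SqFtTotLiving'] * xs, color='black')
plt.xlabel('SqFtTotLiving')
plt.ylabel('Partial residual')
plt.show()
```

If the relationship were truly linear, the smooth of the scattered points would track the black line.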

However, the actual plot reveals that the model underestimates prices for small houses (those under 1,000 sq ft) and overestimates prices for medium-sized houses (within the range of 2,000 to 3,000 sq ft). Furthermore, the impact of adding 500 square feet to a small home is significantly greater than the same addition in a larger home.
Polynomial and Spline Regression
The relationship between a response variable and a predictor variable is not always linear. For instance, the response to drug dosage is often nonlinear; doubling the dosage rarely leads to a doubled response. Similarly, product demand is not a linear function of marketing spending; demand often reaches saturation. Regression can be adjusted to account for these nonlinear effects.
Key Terms for Nonlinear Regression
- Polynomial Regression
- Adds polynomial terms (squares, cubes, etc.) to a regression model.
- Spline Regression
- Fits a smooth curve composed of a series of polynomial segments.
- Knots
- Values that separate the segments of the spline.
- Generalized Additive Models (GAM)
- Spline models that automatically determine knots.
Polynomial
Polynomial regression extends linear regression by adding squared, cubic, or higher-order terms of a predictor.
A quadratic regression (degree = 2) takes the form:

$$Y = b_0 + b_1 X + b_2 X^2 + e$$
A cubic regression (degree = 3) would add $X^3$, and so on.
- In R, polynomial regression can be performed using the `poly` function. For example, the following fits a quadratic polynomial for `SqFtTotLiving` with the King County housing data:

```r
lm(AdjSalePrice ~ poly(SqFtTotLiving, 2) + SqFtLot +
     BldgGrade + Bathrooms + Bedrooms,
   data=house_98105)
---
Call:
lm(formula = AdjSalePrice ~ poly(SqFtTotLiving, 2) + SqFtLot +
    BldgGrade + Bathrooms + Bedrooms, data = house_98105)

Coefficients:
            (Intercept)  poly(SqFtTotLiving, 2)1  poly(SqFtTotLiving, 2)2
             -402530.47               3271519.49                776934.02
                SqFtLot                BldgGrade                Bathrooms
                  32.56                135717.06                 -1435.12
               Bedrooms
               -9191.94
```
- In Python, using `statsmodels`, we include the squared term in the model definition with `I(SqFtTotLiving**2)`:

```python
import statsmodels.formula.api as smf

# Quadratic (degree=2) polynomial regression
model_poly = smf.ols(formula='AdjSalePrice ~ SqFtTotLiving + ' +
                             'I(SqFtTotLiving**2) + ' +
                             'SqFtLot + Bathrooms + Bedrooms + BldgGrade',
                     data=house_98105)
result_poly = model_poly.fit()
print(result_poly.summary())
```
The `statsmodels` implementation is limited to linear terms; the source code includes an implementation that works for polynomial regression as well. The intercept and polynomial coefficients differ from those in R because of different implementations, but the other coefficients and the predictions are identical.
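To see why the coefficients differ while the predictions agree, here is a hedged sketch of a simplified, single-predictor version (assuming `house_98105` is loaded). R's `poly` fits against orthogonalized polynomial columns, which span the same model space as the raw $X$, $X^2$ columns, so the fitted values are unchanged:

```python
import numpy as np
import statsmodels.api as sm

x = house_98105['SqFtTotLiving'].values
y = house_98105['AdjSalePrice'].values

raw = sm.add_constant(np.column_stack([x, x**2]))  # raw basis: 1, X, X^2
orth, _ = np.linalg.qr(raw)                        # orthogonalized basis, like R's poly()

fit_raw = sm.OLS(y, raw).fit()
fit_orth = sm.OLS(y, orth).fit()

# Different coefficients, identical fitted values
print(fit_raw.params)
print(fit_orth.params)
print(np.allclose(fit_raw.fittedvalues, fit_orth.fittedvalues))  # True
```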

A polynomial regression fit for the variable `SqFtTotLiving` (solid line) can be compared to a smooth line (dashed line).
The model features two coefficients for `SqFtTotLiving`: one represents the linear effect ($X$) and the other the quadratic effect ($X^2$). If the $X^2$ coefficient is significant, this indicates a curved relationship. Polynomial regression enables us to fit curves while maintaining a linear model structure.
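As a quick check of that significance claim, one could inspect the p-value of the squared term from the earlier fit. A sketch assuming `result_poly` from above; the exact column name patsy assigns is matched by substring to be safe:

```python
# Find the quadratic term among the fitted parameters
quad = [name for name in result_poly.pvalues.index
        if 'SqFtTotLiving ** 2' in name or 'SqFtTotLiving**2' in name][0]
print(quad, result_poly.pvalues[quad])  # small p-value -> significant curvature
```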
Splines
Polynomial regression can model curvature, but using higher-degree polynomials (like $X^3$, $X^4$, etc.) may result in instability and overfitting. In contrast, splines provide a more reliable solution.
The technical definition of a spline is a sequence of piecewise continuous polynomials. The polynomial pieces connect smoothly at fixed points called knots. Splines are more complex than polynomial regression, and statistical software typically handles the fitting details. The R package `splines` includes the function `bs` to create a b-spline term in a regression model. For instance, the following adds a b-spline term to the house regression model.
- In R:

```r
library(splines)
knots <- quantile(house_98105$SqFtTotLiving, p=c(.25, .5, .75))
lm_spline <- lm(AdjSalePrice ~ bs(SqFtTotLiving, knots=knots, degree=3) +
                  SqFtLot + Bathrooms + Bedrooms + BldgGrade,
                data=house_98105)
```
To set up the model, we define two parameters: the polynomial degree and the knot locations. The predictor `SqFtTotLiving` uses a cubic spline (degree=3). The `bs` function places knots at the boundaries by default; here we add extra knots at the lower quartile, the median, and the upper quartile.
- In Python, the `statsmodels` formula interface allows for using splines much as in R. We define the b-spline using `df`, which represents the degrees of freedom:

```python
from patsy import dmatrix

# Create the spline basis (df=6, degree=3)
spline_basis = dmatrix("bs(SqFtTotLiving, df=6, degree=3)", data=house_98105)

# Fit the spline regression
formula = ('AdjSalePrice ~ bs(SqFtTotLiving, df=6, degree=3) + '
           'SqFtLot + Bathrooms + Bedrooms + BldgGrade')
model_spline = smf.ols(formula=formula, data=house_98105)
result_spline = model_spline.fit()
print(result_spline.summary())
```
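If you prefer to control the knot locations explicitly, as in the R example, patsy's `bs` also accepts a `knots` argument. A sketch under the same assumptions:

```python
import numpy as np
import statsmodels.formula.api as smf

# Interior knots at the quartiles, mirroring the R example
knots = np.quantile(house_98105['SqFtTotLiving'], [0.25, 0.5, 0.75])

formula = ('AdjSalePrice ~ bs(SqFtTotLiving, knots=knots, degree=3) + '
           'SqFtLot + Bathrooms + Bedrooms + BldgGrade')
model_knots = smf.ols(formula=formula, data=house_98105)
print(model_knots.fit().summary())
```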

Unlike a linear term, where the coefficient has a clear interpretation, the coefficients for a spline term are not easily understood. Instead, visual displays are more effective for conveying the nature of the spline fit. Compared to the polynomial model, the spline model aligns more closely with the data’s smoothness, showcasing the greater flexibility of splines. In this context, the spline fit better represents the data.
However, this does not necessarily indicate that spline regression is a better model: it doesn’t make economic sense for very small homes (less than 1,000 square feet) to have a higher value than slightly larger homes. This may be an artifact of a confounding variable.
Generalized Additive Models
If you suspect a nonlinear relationship between the response variable and a predictor—whether based on prior knowledge or regression diagnostics—polynomial terms might lack the necessary flexibility, and spline terms require precise knot specification. In this context, Generalized Additive Models (GAM) offer a flexible approach that can automatically fit a spline regression.
- In R, we may use the `mgcv` package to fit a GAM model to the housing data:

```r
library(mgcv)
lm_gam <- gam(AdjSalePrice ~ s(SqFtTotLiving) + SqFtLot +
                Bathrooms + Bedrooms + BldgGrade,
              data=house_98105)
```
The term `s(SqFtTotLiving)` tells the `gam` function to find the “best” knots for a spline term.
- In Python, we can use the `pyGAM` package:

```python
import numpy as np
import pandas as pd
from pygam import LinearGAM, s, l

# Define predictors and outcome
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade']
outcome = 'AdjSalePrice'
X = house_98105[predictors].values
y = house_98105[outcome].values

# Define the GAM model: apply a smooth function (s) to SqFtTotLiving,
# keep the other predictors linear (l)
gam = LinearGAM(s(0, n_splines=12) + l(1) + l(2) + l(3) + l(4))

# Perform a grid search to optimize the smoothing parameters
gam.gridsearch(X, y)
```
`s(0, n_splines=12)`:
- The first predictor (`SqFtTotLiving`) is modeled nonlinearly using a spline.
- The number of splines (`n_splines=12`) controls the smoothness.

`l(1) + l(2) + l(3) + l(4)`:
- The other variables (`SqFtLot`, `Bathrooms`, etc.) remain linear.
The function automatically finds the best shape without manually choosing polynomial degrees or knots.
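Since spline coefficients are hard to read directly, a plot of the fitted smooth is the most useful output. Here is a minimal sketch using pyGAM's partial dependence utilities, assuming the fitted `gam` from above and matplotlib:

```python
import matplotlib.pyplot as plt

# Grid over the first term (SqFtTotLiving) and its fitted smooth
XX = gam.generate_X_grid(term=0)
pdep, confi = gam.partial_dependence(term=0, X=XX, width=0.95)

plt.plot(XX[:, 0], pdep)                      # fitted spline
plt.plot(XX[:, 0], confi, c='gray', ls='--')  # 95% confidence band
plt.xlabel('SqFtTotLiving')
plt.ylabel('Partial dependence')
plt.show()
```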
Comparing GAM to Other Methods
| Method | Pros | Cons |
|---|---|---|
| Linear Regression | Simple, interpretable | Cannot capture nonlinear trends |
| Polynomial Regression | Models curvature | Overfits with high-degree polynomials |
| Splines | Flexible, smooth | Requires manual knot selection |
| GAMs | Auto-select splines, interpretable | Computationally heavier |
GAMs are useful when relationships are nonlinear, but we don’t want to manually define transformations.
Summary
Regression is a widely used statistical method, built on a linear foundation, where each predictor variable has a coefficient indicating a linear relationship with the outcome. Advanced regression forms, like polynomial and spline regression, enable nonlinear relationships.
In data science, regression aims to predict new data values, using metrics based on predictive accuracy for out-of-sample data. Variable selection methods reduce dimensionality to create more compact models.
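As an illustration of that out-of-sample mindset, here is a minimal sketch (assuming `house_98105` is loaded and scikit-learn is available) that evaluates a model by RMSE on held-out data:

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

train, test = train_test_split(house_98105, test_size=0.3, random_state=0)

fit = smf.ols('AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + '
              'Bedrooms + BldgGrade', data=train).fit()

# Root mean squared error on data the model never saw
pred = fit.predict(test)
rmse = np.sqrt(np.mean((test['AdjSalePrice'] - pred) ** 2))
print(f'Out-of-sample RMSE: {rmse:,.0f}')
```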