Day122 - STAT Review: Regression and Prediction (4)
Practical Statistics for Data Scientists: Interpreting the Regression Equation - Correlation, Multicollinearity, Confounding Variables, and Interactions
Interpreting the Regression Equation
In data science, regression is primarily used to predict a dependent (outcome) variable. However, it can also be beneficial to analyze the equation itself, as it provides insights into the relationship between the predictors and the outcome.
Key Terms for Interpreting the Regression Equation
- Correlated Variables
- Variables that change together—when one rises, the other does as well, and vice versa (in the case of negative correlation, as one rises, the other falls). Strongly correlated predictor variables complicate the interpretation of individual coefficients.
- Multicollinearity
- Perfect or near-perfect correlation among predictor variables can make regression unstable or uncomputable.
- Also known as collinearity.
- Confounding Variables
- A critical predictor that, when omitted, leads to spurious relationships in a regression equation.
- Main Effects
- The relationship between a predictor and the outcome variable, independent of other variable influences.
- Interactions
- An interdependent relationship between two or more predictors and the response.
Correlated Predictors
In multiple regression, the predictor variables are frequently correlated with one another. The example below shows the regression coefficients for the fitted model `step_lm`.

In R:

```r
step_lm$coefficients
---
              (Intercept)             SqFtTotLiving                 Bathrooms
             6.178645e+06              1.992776e+02              4.239616e+04
                 Bedrooms                 BldgGrade PropertyTypeSingle Family
            -5.194738e+04              1.371596e+05              2.291206e+04
    PropertyTypeTownhouse           SqFtFinBasement                   YrBuilt
             8.447916e+04              7.046975e+00                       ...
```
In Python:

```python
print(f'Intercept: {best_model.intercept_:.3f}')
print('Coefficients')
for name, coef in zip(best_variables, best_model.coef_):
    print(f' {name}: {coef}')
```
The coefficient for `Bedrooms` is negative, implying that adding a bedroom decreases a house's value. This occurs because the predictor variables are correlated: larger houses tend to have more bedrooms, and it is size that drives value, not the number of bedrooms. For two homes of the same size, the one with more but smaller bedrooms is likely less desirable.
Correlated predictors complicate the interpretation of regression coefficients and inflate their standard errors. `Bedrooms`, house size, and `Bathrooms` are all correlated. The R example below illustrates this by refitting the regression without the variables `SqFtTotLiving`, `SqFtFinBasement`, and `Bathrooms`.
In R:

```r
update(step_lm, . ~ . - SqFtTotLiving - SqFtFinBasement - Bathrooms)
---
Call:
lm(formula = AdjSalePrice ~ Bedrooms + BldgGrade + PropertyType +
    YrBuilt, data = house, na.action = na.omit)

Coefficients:
              (Intercept)                   Bedrooms
                  4913973                      27151
                BldgGrade  PropertyTypeSingle Family
                   248998                     -19898
    PropertyTypeTownhouse                    YrBuilt
                   -47355                      -3212
```
The `update` function adds or removes variables from a model. Once the strongly correlated predictors are removed, the coefficient for `Bedrooms` turns positive, in line with our expectations: it is no longer competing with those predictors, so its value is more meaningful.
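For readers following the Python track, here is a minimal sketch of the same reduced fit; it assumes the `house` DataFrame and the scikit-learn setup from the earlier examples.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Refit without the predictors that are strongly correlated with Bedrooms
reduced_predictors = ['Bedrooms', 'BldgGrade', 'PropertyType', 'YrBuilt']
outcome = 'AdjSalePrice'

# One-hot encode PropertyType, dropping the first level as the reference
X = pd.get_dummies(house[reduced_predictors], drop_first=True)

reduced_lm = LinearRegression()
reduced_lm.fit(X, house[outcome])

for name, coef in zip(X.columns, reduced_lm.coef_):
    print(f'{name}: {coef:.0f}')  # Bedrooms should now come out positive
```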
Multicollinearity
An extreme case of correlated variables results in multicollinearity, a condition in which predictor variables are redundant. Perfect multicollinearity arises when one predictor variable can be expressed as a linear combination of others.
Multicollinearity occurs when:
- A variable is mistakenly included multiple times.
- $P$ dummies instead of $P−1$ dummies are created from a factor variable.
- Two variables are almost perfectly correlated with each other.
Multicollinearity in regression must be addressed: variables should be eliminated until multicollinearity is resolved. A regression does not have a well-defined solution in cases of perfect multicollinearity.
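The excerpt above does not cover detection, but a common diagnostic for multicollinearity is the variance inflation factor (VIF). Below is a minimal sketch, assuming the same `house` DataFrame; a frequent rule of thumb flags predictors with a VIF above 5-10 as problematic.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

numeric_predictors = ['SqFtTotLiving', 'SqFtFinBasement', 'Bathrooms',
                      'Bedrooms', 'BldgGrade', 'YrBuilt']
X = add_constant(house[numeric_predictors])  # VIF expects an intercept column

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # large values signal redundant (collinear) predictors
```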
Confounding Variables
With confounding variables, the problem is one of omission: an important variable is left out of the regression equation, which can distort the estimated relationships and make naive interpretation of the coefficients invalid.
Refer to the previously discussed King County regression equation, `house_lm`. In the King County house price example, location is a major confounding factor: the original model did not include a variable for location, even though neighborhood desirability heavily influences real estate prices. Omitting location leads to misleading coefficients for `SqFtLot`, `Bathrooms`, and `Bedrooms`. To account for location, we introduce `ZipGroup`, which classifies zip codes into five price tiers (1 = least expensive, 5 = most expensive); a sketch of how such a feature might be built follows the code below.
In R:

```r
lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
    Bedrooms + BldgGrade + PropertyType + ZipGroup, data = house,
    na.action = na.omit)
---
Coefficients:
              (Intercept)              SqFtTotLiving
               -6.666e+05                  2.106e+02
                  SqFtLot                  Bathrooms
                4.550e-01                  5.928e+03
                 Bedrooms                  BldgGrade
               -4.168e+04                  9.854e+04
PropertyTypeSingle Family      PropertyTypeTownhouse
                1.932e+04                 -7.820e+04
                ZipGroup2                  ZipGroup3
                5.332e+04                  1.163e+05
                ZipGroup4                  ZipGroup5
                1.784e+05                  3.384e+05
```
In Python:

```python
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms',
              'BldgGrade', 'PropertyType', 'ZipGroup']
outcome = 'AdjSalePrice'

X = pd.get_dummies(house[predictors], drop_first=True)

confounding_lm = LinearRegression()
confounding_lm.fit(X, house[outcome])

print(f'Intercept: {confounding_lm.intercept_:.3f}')
print('Coefficients')
for name, coef in zip(X.columns, confounding_lm.coef_):
    print(f' {name}: {coef}')
```
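As promised above, here is one plausible way a `ZipGroup`-style feature could be constructed: quintiles of the per-zip median sale price. This is only a sketch, assuming `house` has a `ZipCode` column; the book's exact procedure may differ.

```python
import pandas as pd

# Median sale price per zip code
zip_median = house.groupby('ZipCode')['AdjSalePrice'].median()

# Bin zip codes into five price tiers: 1 = least expensive, 5 = most expensive
zip_group = pd.qcut(zip_median, 5, labels=[1, 2, 3, 4, 5])

house['ZipGroup'] = house['ZipCode'].map(zip_group).astype('category')
```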
After adding `ZipGroup` to the model, we observe the following changes:

| Predictor | Before ZipGroup | After ZipGroup | Explanation |
| --- | --- | --- | --- |
| SqFtLot | Negative | Positive | More land increases property value, but without controlling for location, homes in urban areas (smaller lots) were overvalued. |
| Bathrooms | Negative | Positive (~$5,928) | A bathroom adds value, but in the original model homes with more bathrooms may have been in lower-value areas. |
| Bedrooms | Negative | Still negative | Adding bedrooms often reduces the size of the other rooms, making the house less desirable. |
| ZipGroup | Not included | Increases house value | Homes in the most expensive areas (ZipGroup 5) are worth ~$340,000 more than those in ZipGroup 1. |
Even after controlling for location, the negative coefficient for `Bedrooms` persists. This might occur because:
- House size matters more than bedroom count
  - Holding `SqFtTotLiving` constant, a house with more bedrooms has smaller rooms, which may be less appealing.
  - Example: a 2,000 sq ft house with three bedrooms (larger rooms) is often more desirable than a 2,000 sq ft house with five (smaller rooms).
- Multicollinearity with other features
  - Bedrooms correlate with house size, bathrooms, and overall layout; a quick correlation check is sketched after this list.
  - Since `SqFtTotLiving` is already in the model, `Bedrooms` may not add independent predictive power.
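To see the correlation that drives this, a short sketch (assuming the same `house` DataFrame) prints the pairwise correlation matrix of the size-related predictors:

```python
# Pairwise Pearson correlations; Bedrooms should correlate strongly
# with both SqFtTotLiving and Bathrooms
print(house[['Bedrooms', 'SqFtTotLiving', 'Bathrooms']].corr())
```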
Interactions and Main Effects in Regression
In multiple regression analysis, we often include main effects: the predictor variables directly related to the response variable. However, these predictors do not act independently; one variable's effect can depend on another's level. This dependency is captured using interaction terms.
The key assumption in a standard linear regression model is that the relationship between each predictor and the outcome is independent of other predictors. This assumption is often not valid in real-world data.
For example, in the King County Housing Data, we included variables like:
- SqFtTotLiving (house size)
- ZipGroup (location categorized into five groups)
- Bathrooms, Bedrooms, BldgGrade, etc. (other housing features)
A big house in a high-end neighborhood (ZipGroup 5) is likely worth much more per square foot than the same-sized house in a low-rent district (ZipGroup 1). This suggests that house size and location interact – the impact of adding square footage depends on location. We introduce interaction terms to model this dependency.
In R, interactions are added using the `*` operator:

```r
lm(formula = AdjSalePrice ~ SqFtTotLiving * ZipGroup + SqFtLot +
    Bathrooms + Bedrooms + BldgGrade + PropertyType, data = house,
    na.action = na.omit)
---
Coefficients:
              (Intercept)              SqFtTotLiving
               -4.853e+05                  1.148e+02
                ZipGroup2                  ZipGroup3
               -1.113e+04                  2.032e+04
                ZipGroup4                  ZipGroup5
                2.050e+04                 -1.499e+05
                  SqFtLot                  Bathrooms
                6.869e-01                 -3.619e+03
                 Bedrooms                  BldgGrade
               -4.180e+04                  1.047e+05
PropertyTypeSingle Family      PropertyTypeTownhouse
                1.357e+04                 -5.884e+04
  SqFtTotLiving:ZipGroup2    SqFtTotLiving:ZipGroup3
                3.260e+01                  4.178e+01
  SqFtTotLiving:ZipGroup4    SqFtTotLiving:ZipGroup5
                6.934e+01                  2.267e+02
```
The term `SqFtTotLiving * ZipGroup` expands to include:

- The main effects: `SqFtTotLiving`, `ZipGroup`
- Their interaction: `SqFtTotLiving:ZipGroup`
In Python:

```python
import statsmodels.formula.api as smf

model = smf.ols(formula='AdjSalePrice ~ SqFtTotLiving*ZipGroup + SqFtLot + '
                        'Bathrooms + Bedrooms + BldgGrade + PropertyType',
                data=house)
results = model.fit()
results.summary()
```
The result can be summarized as follows:

| Variable | Estimate | Interpretation |
| --- | --- | --- |
| SqFtTotLiving | $114.80 | Adding 1 square foot in ZipGroup 1 increases the price by $114.80 |
| ZipGroup2 | -$11,130 | Homes in ZipGroup 2 are worth less than those in ZipGroup 1 |
| ZipGroup5 | -$149,900 | Homes in ZipGroup 5 start lower but interact positively with size |
| SqFtTotLiving:ZipGroup2 | $32.60 | Additional price per square foot in ZipGroup 2 |
| SqFtTotLiving:ZipGroup5 | $226.70 | Additional price per square foot in ZipGroup 5 |

Thus, in ZipGroup 5, the effect of house size is:
$\text{SqFtTotLiving effect} + \text{interaction effect} = 114.8 + 226.7 = 341.5$
This means that adding a square foot in ZipGroup 5 increases the price by $341.5, which is almost three times the effect in the least expensive area (ZipGroup 1).
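The same numbers can be read straight off the fitted statsmodels results. A minimal sketch, assuming `ZipGroup` entered the formula as a categorical with tier 1 as the reference level (patsy then names the interaction terms `SqFtTotLiving:ZipGroup[T.2]`, and so on):

```python
# Effective price-per-square-foot slope in each ZipGroup tier
base_slope = results.params['SqFtTotLiving']
for level in [2, 3, 4, 5]:
    interaction = results.params.get(f'SqFtTotLiving:ZipGroup[T.{level}]', 0)
    print(f'ZipGroup {level}: {base_slope + interaction:.1f} per sq ft')
```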
We can conclude that an interaction term is needed when the relationship between a predictor and the response depends on the level of another predictor.