Day122 - STAT Review: Regression and Prediction (4)
Practical Statistics for Data Scientists: Interpreting the Regression Equation - Correlation, Multicollinearity, Confounding Variables, and Interactions
Interpreting the Regression Equation
In data science, regression is primarily used to predict a dependent (outcome) variable. However, it can also be beneficial to analyze the equation itself, as it provides insights into the relationship between the predictors and the outcome.
Key Terms for Interpreting the Regression Equation
- Correlated Variables
- Variables that change together—when one rises, the other does as well, and vice versa (in the case of negative correlation, as one rises, the other falls). Strongly correlated predictor variables complicate the interpretation of individual coefficients.
- Multicollinearity
- Perfect or near-perfect correlation among predictor variables can make regression unstable or uncomputable.
- Also known as collinearity.
- Confounding Variables
- A critical predictor that, when omitted, leads to spurious relationships in a regression equation.
- Main Effects
- The relationship between a predictor and the outcome variable, independent of other variable influences.
- Interactions
- An interdependent relationship between two or more predictors and the response.
Correlated Predictors
In multiple regression, the predictor variables are frequently correlated with one another. The example below shows the regression coefficients for the fitted model `step_lm`.

In R:

```r
step_lm$coefficients
---
              (Intercept)             SqFtTotLiving                 Bathrooms
             6.178645e+06              1.992776e+02              4.239616e+04
                 Bedrooms                 BldgGrade PropertyTypeSingle Family
            -5.194738e+04              1.371596e+05              2.291206e+04
    PropertyTypeTownhouse           SqFtFinBasement                   YrBuilt
             8.447916e+04              7.046975e+00                       ...
```
In Python:

```python
print(f'Intercept: {best_model.intercept_:.3f}')
print('Coefficients')
for name, coef in zip(best_variables, best_model.coef_):
    print(f' {name}: {coef}')
```
The coefficient for `Bedrooms` is negative, implying that adding a bedroom decreases a house's value. This occurs because the predictor variables are correlated: larger houses tend to have more bedrooms, and it is size that drives value, not the number of bedrooms. For two homes of the same size, the one with more but smaller bedrooms is likely less desirable.
Correlated predictors complicate the interpretation of regression coefficients and inflate their standard errors. `Bedrooms`, house size, and `Bathrooms` are all correlated. The R example below illustrates this by refitting the regression without the variables `SqFtTotLiving`, `SqFtFinBasement`, and `Bathrooms`.
In R:

```r
update(step_lm, . ~ . - SqFtTotLiving - SqFtFinBasement - Bathrooms)
---
Call:
lm(formula = AdjSalePrice ~ Bedrooms + BldgGrade + PropertyType +
    YrBuilt, data = house, na.action = na.omit)

Coefficients:
              (Intercept)                   Bedrooms
                  4913973                      27151
                BldgGrade  PropertyTypeSingle Family
                   248998                     -19898
    PropertyTypeTownhouse                    YrBuilt
                   -47355                      -3212
```
The `update` function adds or removes variables from a model. Once the strongly correlated predictors are removed, the coefficient for `Bedrooms` turns positive, in line with our expectations: it is no longer competing with those predictors, so its value is more meaningful.
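For readers following the Python track, here is a minimal sketch of the same reduced fit; it assumes the `house` DataFrame and the scikit-learn setup from the earlier examples.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Refit without the predictors that are strongly correlated with Bedrooms
reduced_predictors = ['Bedrooms', 'BldgGrade', 'PropertyType', 'YrBuilt']
outcome = 'AdjSalePrice'

# One-hot encode PropertyType, dropping the first level as the reference
X = pd.get_dummies(house[reduced_predictors], drop_first=True)

reduced_lm = LinearRegression()
reduced_lm.fit(X, house[outcome])

for name, coef in zip(X.columns, reduced_lm.coef_):
    print(f'{name}: {coef:.0f}')  # Bedrooms should now come out positive
```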
Multicollinearity
An extreme case of correlated variables results in multicollinearity, a condition in which predictor variables are redundant. Perfect multicollinearity arises when one predictor variable can be expressed as a linear combination of others.
Multicollinearity occurs when:
- A variable is mistakenly included multiple times.
- $P$ dummies instead of $P−1$ dummies are created from a factor variable.
- Two variables are almost perfectly correlated with each other.
Multicollinearity in regression must be addressed: variables should be eliminated until multicollinearity is resolved. A regression does not have a well-defined solution in cases of perfect multicollinearity.
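The excerpt above does not cover detection, but a common diagnostic for multicollinearity is the variance inflation factor (VIF). Below is a minimal sketch, assuming the same `house` DataFrame; a frequent rule of thumb flags predictors with a VIF above 5-10 as problematic.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

numeric_predictors = ['SqFtTotLiving', 'SqFtFinBasement', 'Bathrooms',
                      'Bedrooms', 'BldgGrade', 'YrBuilt']
X = add_constant(house[numeric_predictors])  # VIF expects an intercept column

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # large values signal redundant (collinear) predictors
```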
Confounding Variables
With confounding variables, the problem is one of omission: an important variable is left out of the regression equation, which can distort the estimated relationships and make naive interpretation of the coefficients invalid.
Refer to the previously discussed King County regression equation, `house_lm`. In the King County house price example, location is a major confounding factor: the original model did not include a variable for location, even though neighborhood desirability heavily influences real estate prices. Omitting location leads to misleading coefficients for `SqFtLot`, `Bathrooms`, and `Bedrooms`. To account for location, we introduce `ZipGroup`, which classifies zip codes into five price tiers (1 = least expensive, 5 = most expensive); a sketch of how such a feature might be built follows the code below.
In R:

```r
lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
    Bedrooms + BldgGrade + PropertyType + ZipGroup, data = house,
    na.action = na.omit)
---
Coefficients:
              (Intercept)              SqFtTotLiving
               -6.666e+05                  2.106e+02
                  SqFtLot                  Bathrooms
                4.550e-01                  5.928e+03
                 Bedrooms                  BldgGrade
               -4.168e+04                  9.854e+04
PropertyTypeSingle Family      PropertyTypeTownhouse
                1.932e+04                 -7.820e+04
                ZipGroup2                  ZipGroup3
                5.332e+04                  1.163e+05
                ZipGroup4                  ZipGroup5
                1.784e+05                  3.384e+05
```
In Python:

```python
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms',
              'BldgGrade', 'PropertyType', 'ZipGroup']
outcome = 'AdjSalePrice'

X = pd.get_dummies(house[predictors], drop_first=True)

confounding_lm = LinearRegression()
confounding_lm.fit(X, house[outcome])

print(f'Intercept: {confounding_lm.intercept_:.3f}')
print('Coefficients')
for name, coef in zip(X.columns, confounding_lm.coef_):
    print(f' {name}: {coef}')
```
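As promised above, here is one plausible way a `ZipGroup`-style feature could be constructed: quintiles of the per-zip median sale price. This is only a sketch, assuming `house` has a `ZipCode` column; the book's exact procedure may differ.

```python
import pandas as pd

# Median sale price per zip code
zip_median = house.groupby('ZipCode')['AdjSalePrice'].median()

# Bin zip codes into five price tiers: 1 = least expensive, 5 = most expensive
zip_group = pd.qcut(zip_median, 5, labels=[1, 2, 3, 4, 5])

house['ZipGroup'] = house['ZipCode'].map(zip_group).astype('category')
```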
After adding `ZipGroup` to the model, we observe the following changes:

| Predictor | Before ZipGroup | After ZipGroup | Explanation |
| --- | --- | --- | --- |
| SqFtLot | Negative | Positive | More land increases property value, but without controlling for location, homes in urban areas (smaller lots) were overvalued. |
| Bathrooms | Negative | Positive (~$5,928) | A bathroom adds value, but in the original model homes with more bathrooms may have been in lower-value areas. |
| Bedrooms | Negative | Still negative | Adding bedrooms often reduces the size of the other rooms, making the house less desirable. |
| ZipGroup | Not included | Increases house value | Homes in the most expensive areas (ZipGroup 5) are worth ~$340,000 more than those in ZipGroup 1. |
Even after controlling for location, the negative coefficient for `Bedrooms` persists. This might occur because:
- House size matters more than bedroom count
  - Holding `SqFtTotLiving` constant, a house with more bedrooms has smaller rooms, which may be less appealing.
  - Example: a 2,000 sq ft house with three bedrooms (larger rooms) is often more desirable than a 2,000 sq ft house with five (smaller rooms).
- Multicollinearity with other features
  - Bedrooms correlate with house size, bathrooms, and overall layout; a quick correlation check is sketched after this list.
  - Since `SqFtTotLiving` is already in the model, `Bedrooms` may not add independent predictive power.
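To see the correlation that drives this, a short sketch (assuming the same `house` DataFrame) prints the pairwise correlation matrix of the size-related predictors:

```python
# Pairwise Pearson correlations; Bedrooms should correlate strongly
# with both SqFtTotLiving and Bathrooms
print(house[['Bedrooms', 'SqFtTotLiving', 'Bathrooms']].corr())
```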
Interactions and Main Effects in Regression
In multiple regression analysis, we often include main effects: the predictor variables directly related to the response variable. However, these predictors do not act independently; one variable's effect can depend on another's level. This dependency is captured using interaction terms.
The key assumption in a standard linear regression model is that the relationship between each predictor and the outcome is independent of other predictors. This assumption is often not valid in real-world data.
For example, in the King County Housing Data, we included variables like:
- SqFtTotLiving (house size)
- ZipGroup (location categorized into five groups)
- Bathrooms, Bedrooms, BldgGrade, etc. (other housing features)
A big house in a high-end neighborhood (ZipGroup 5) is likely worth much more per square foot than the same-sized house in a low-rent district (ZipGroup 1). This suggests that house size and location interact – the impact of adding square footage depends on location. We introduce interaction terms to model this dependency.
In R, interactions are added using the `*` operator:

```r
lm(formula = AdjSalePrice ~ SqFtTotLiving * ZipGroup + SqFtLot +
    Bathrooms + Bedrooms + BldgGrade + PropertyType, data = house,
    na.action = na.omit)
---
Coefficients:
              (Intercept)              SqFtTotLiving
               -4.853e+05                  1.148e+02
                ZipGroup2                  ZipGroup3
               -1.113e+04                  2.032e+04
                ZipGroup4                  ZipGroup5
                2.050e+04                 -1.499e+05
                  SqFtLot                  Bathrooms
                6.869e-01                 -3.619e+03
                 Bedrooms                  BldgGrade
               -4.180e+04                  1.047e+05
PropertyTypeSingle Family      PropertyTypeTownhouse
                1.357e+04                 -5.884e+04
  SqFtTotLiving:ZipGroup2    SqFtTotLiving:ZipGroup3
                3.260e+01                  4.178e+01
  SqFtTotLiving:ZipGroup4    SqFtTotLiving:ZipGroup5
                6.934e+01                  2.267e+02
```
The term `SqFtTotLiving * ZipGroup` expands to include:

- The main effects: `SqFtTotLiving`, `ZipGroup`
- Their interaction: `SqFtTotLiving:ZipGroup`
In Python:

```python
import statsmodels.formula.api as smf

model = smf.ols(formula='AdjSalePrice ~ SqFtTotLiving*ZipGroup + SqFtLot + '
                        'Bathrooms + Bedrooms + BldgGrade + PropertyType',
                data=house)
results = model.fit()
results.summary()
```
The result can be summarized as follows:

| Variable | Estimate | Interpretation |
| --- | --- | --- |
| SqFtTotLiving | $114.80 | Adding 1 square foot in ZipGroup 1 increases the price by $114.80 |
| ZipGroup2 | -$11,130 | Homes in ZipGroup 2 are worth less than those in ZipGroup 1 |
| ZipGroup5 | -$149,900 | Homes in ZipGroup 5 start lower but interact positively with size |
| SqFtTotLiving:ZipGroup2 | $32.60 | Additional price per square foot in ZipGroup 2 |
| SqFtTotLiving:ZipGroup5 | $226.70 | Additional price per square foot in ZipGroup 5 |

Thus, in ZipGroup 5, the effect of house size is:
$\text{SqFtTotLiving effect} + \text{interaction effect} = 114.8 + 226.7 = 341.5$
This means that adding a square foot in ZipGroup 5 increases the price by $341.5, which is almost three times the effect in the least expensive area (ZipGroup 1).
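The same numbers can be read straight off the fitted statsmodels results. A minimal sketch, assuming `ZipGroup` entered the formula as a categorical with tier 1 as the reference level (patsy then names the interaction terms `SqFtTotLiving:ZipGroup[T.2]`, and so on):

```python
# Effective price-per-square-foot slope in each ZipGroup tier
base_slope = results.params['SqFtTotLiving']
for level in [2, 3, 4, 5]:
    interaction = results.params.get(f'SqFtTotLiving:ZipGroup[T.{level}]', 0)
    print(f'ZipGroup {level}: {base_slope + interaction:.1f} per sq ft')
```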
We can conclude that an interaction term is needed when the relationship between a predictor and the response depends on the level of another predictor.