
Practical Statistics for Data Scientists: Weighted Regression, and Interactions and Main Effects in Regression in Depth


Weighted Regression

Weighted regression is a variation of ordinary least squares (OLS) regression where each data point is assigned a weight.

Weighted regression is used when:

  • Some observations have higher precision and should be given more importance.
  • Rows represent aggregated cases, where some observations count more than others.

Instead of treating all observations equally, weighted regression assigns greater importance to more reliable observations.
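
Formally, instead of minimizing the ordinary residual sum of squares, weighted least squares chooses coefficients to minimize a weighted version, where $w_i$ is the weight on observation $i$:

$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} w_i \left( y_i - x_i^\top \beta \right)^2$

Observations with larger weights pull the fitted line toward themselves more strongly.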

Why Use Weighted Regression

There are two common use cases.

  • Inverse-Variance Weighting

    • Some observations have higher variance (less reliable).
    • The regression gives less weight to high-variance points and more to low-variance points (see the formula after this list).

    For example, older house sales might be unreliable due to inflation and market changes. We assign lower weights to older sales and higher weights to recent sales.

  • Aggregated Data

    • If rows represent multiple cases (e.g., survey data), weights indicate how many observations each row represents.
    • The model ensures that rows with higher counts contribute more.
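
For the first case, a standard choice (inverse-variance weighting) sets each observation's weight to the reciprocal of its variance, so the least reliable points contribute the least:

$w_i = \dfrac{1}{\sigma_i^2}$

where $\sigma_i^2$ is the known or estimated variance of observation $i$.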

Let's examine weighted regression on the King County housing data. Older sale prices are less reliable, so we weight each sale by its recency, measured as the number of years since 2005.

  • In R

    library(lubridate)
      
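    # Extract the sale year from DocumentDate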
    house$Year <- year(house$DocumentDate)
      
    # Compute weight (years since 2005)
    house$Weight <- house$Year - 2005
    
  • In Python

    # Extract the sale year from DocumentDate (e.g., '2014-09-16')
    house['Year'] = [int(date.split('-')[0]) for date in house.DocumentDate]
      
    # Compute weight (years since 2005)
    house['Weight'] = house.Year - 2005
    
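Equivalently, assuming DocumentDate is a parseable date string (e.g., '2014-09-16'), pandas can extract the year directly. A minimal sketch:

    import pandas as pd

    # Alternative: parse DocumentDate and take the year component
    house['Year'] = pd.to_datetime(house['DocumentDate']).dt.year
    house['Weight'] = house['Year'] - 2005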


Let’s fit a weighted regression model with the calculated weights.

  • In R

    house_wt <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + 
                   Bedrooms + BldgGrade, data=house, weights=Weight)
      
    # Standard (unweighted) regression for comparison
    house_lm <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + 
                   Bedrooms + BldgGrade, data=house)
      
    # Compare coefficients side by side
    round(cbind(house_lm=house_lm$coefficients, house_wt=house_wt$coefficients), digits=3)
    
  • In Python, most scikit-learn models accept sample weights through the sample_weight argument of their fit method.

    from sklearn.linear_model import LinearRegression
      
    # Define predictors and outcome
    predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade']
    outcome = 'AdjSalePrice'
      
    # Initialize model
    house_wt = LinearRegression()
      
    # Fit weighted regression
    house_wt.fit(house[predictors], house[outcome], sample_weight=house.Weight)
      
    # Print coefficients
    print("Coefficients:", house_wt.coef_)
    print("Intercept:", house_wt.intercept_)
    
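To mirror the R comparison above, we can also fit an unweighted model and put both sets of coefficients side by side; a sketch reusing the predictors defined above:

    import pandas as pd

    # Unweighted fit for comparison
    house_lm = LinearRegression()
    house_lm.fit(house[predictors], house[outcome])

    # Coefficients side by side: unweighted vs. weighted
    print(pd.DataFrame({'house_lm': house_lm.coef_, 'house_wt': house_wt.coef_},
                       index=predictors).round(3))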


From the comparison, we can observe the following.

  • The coefficients change slightly in the weighted regression.
  • The effect of square footage (SqFtTotLiving) increases in the weighted regression, meaning that in more recent sales, price responds more strongly to living area.

When to Use Weighted Regression

Weighted regression is beneficial in several scenarios:

  1. Data Reliability: When some data points are more trustworthy than others, such as newer sales figures, weighted regression enhances the analysis.
  2. Aggregated Data: It is effective for summarized data, like survey results, where the figures represent overall totals rather than individual responses.
  3. Variable Data Points: In cases where data points fluctuate significantly, such as experimental measurements, weighting can mitigate the impact of the noisiest observations (see the sketch below).

By considering the trustworthiness of different data, weighted regression improves the accuracy of predictions.
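
As a minimal, self-contained sketch of the third case, the snippet below fits a line to synthetic measurements (all values hypothetical) using inverse-variance weights, so the noisiest points count least:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    sigma = rng.uniform(0.5, 3.0, size=100)   # per-point noise level
    y = 2.0 * X.ravel() + 1.0 + rng.normal(0, sigma)

    # Inverse-variance weights: high-variance points count less
    model = LinearRegression()
    model.fit(X, y, sample_weight=1.0 / sigma**2)
    print(model.coef_, model.intercept_)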

Interactions and Main Effects

When interpreting the regression equation, statisticians often distinguish between main effects, the individual contributions of the predictor variables, and interactions between those predictors.

Main effects refer to the independent variables (predictors) that directly influence the dependent variable.

For example, in the King County Housing Data:

  • SqFtTotLiving (House Size): Larger houses tend to have higher prices.
  • ZipGroup (Location): Houses in more expensive zip codes generally sell for more.
  • Bathrooms, Bedrooms, BldgGrade: These also affect house prices.

Main effects assume that each predictor affects the outcome independently. But in reality, variables often interact.


An interaction effect occurs when one predictor's impact depends on another predictor's level.

For example, the effect of house size on price depends on location.

  • A 1,000 sq ft increase in a high-end neighborhood might raise the price significantly.
  • The same increase in a low-cost neighborhood might have less impact.

To explore this further, we will add an interaction term between SqFtTotLiving and ZipGroup.

$\text{AdjSalePrice} \sim \text{SqFtTotLiving} * \text{ZipGroup} + \text{other predictors}$

This expands into:

$\text{AdjSalePrice} \sim \text{SqFtTotLiving} + \text{ZipGroup} + (\text{SqFtTotLiving} \times \text{ZipGroup}) + \text{other predictors}$
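
With ZipGroup treated as a categorical variable (group 1 as the reference), the per-square-foot slope for a home in zip group $k$ is the main effect plus that group's interaction coefficient:

$\text{slope}_k = \beta_{\text{SqFtTotLiving}} + \beta_{\text{SqFtTotLiving:ZipGroup}_k}$

with the interaction coefficient equal to zero for the reference group.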

  • In Python, we can fit this regression using statsmodels.

    import statsmodels.formula.api as smf
      
    # Fit the model with an interaction between SqFtTotLiving and ZipGroup
    model = smf.ols(formula='AdjSalePrice ~ SqFtTotLiving*ZipGroup + SqFtLot + '
                            'Bathrooms + Bedrooms + BldgGrade + PropertyType',
                    data=house)
      
    results = model.fit()
    print(results.summary())
    
    • SqFtTotLiving*ZipGroup automatically includes:
      • The main effects: SqFtTotLiving and ZipGroup
      • The interaction terms: SqFtTotLiving:ZipGroup

From the regression output, we can observe the following estimates.

| Variable                | Estimate |
|-------------------------|---------:|
| SqFtTotLiving           | 114.8    |
| ZipGroup2               | -11,130  |
| ZipGroup3               | 20,320   |
| ZipGroup4               | 20,500   |
| ZipGroup5               | -149,900 |
| SqFtTotLiving:ZipGroup2 | 32.6     |
| SqFtTotLiving:ZipGroup3 | 41.8     |
| SqFtTotLiving:ZipGroup4 | 69.3     |
| SqFtTotLiving:ZipGroup5 | 226.7    |

The statsmodels package handles categorical variables (e.g., ZipGroup, PropertyType[T.Single Family]) and interaction terms (e.g., SqFtTotLiving:ZipGroup) automatically, expanding them into the dummy-coded columns shown above. There seems to be a significant interaction between location and house size. For homes in the lowest `ZipGroup` (the reference level), the slope is simply the main effect of SqFtTotLiving: $114.8 per square foot.

Conversely, for homes in the highest `ZipGroup`, the slope is the sum of the main effect and the SqFtTotLiving:ZipGroup5 interaction: roughly $115 + $227 = $342 per square foot. In other words, adding a square foot in the priciest zip code group increases the predicted sale price by nearly three times the average per-square-foot increase.
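
A quick sketch that computes the implied slope for each zip group from the estimates in the table above:

    # Implied per-square-foot slope for each zip group
    base = 114.8
    interaction = {'ZipGroup1': 0.0, 'ZipGroup2': 32.6, 'ZipGroup3': 41.8,
                   'ZipGroup4': 69.3, 'ZipGroup5': 226.7}
    for group, delta in interaction.items():
        print(f'{group}: ${base + delta:,.1f} per square foot')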

Importance of Interaction Effects

  • Avoid Misleading Conclusions: Relying solely on main effects can lead us to mistakenly believe that house size impacts price uniformly across different locations, which isn’t true.
  • Enhance Pricing Models: Real estate markets are significantly influenced by location; understanding interaction effects allows for more precise pricing models.
  • Inform Policy Decisions: For cities aiming to boost housing value, focusing on construction in high-demand areas is crucial.
