
Practical Statistics for Data Scientists: Weighted Regression, and Interactions and Main Effects in Regression in Depth


Weighted Regression

Weighted regression is a variation of ordinary least squares (OLS) regression where each data point is assigned a weight.

Weighted regression is used when:

  • Some observations have higher precision and should be given more importance.
  • Rows represent aggregated cases, where some observations count more than others.

Instead of treating all observations equally, weighted regression assigns greater importance to more reliable observations.
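
Formally, instead of minimizing the ordinary residual sum of squares, weighted least squares chooses coefficients to minimize a weighted version, where $w_i$ is the weight on observation $i$:

$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} w_i \left( y_i - x_i^\top \beta \right)^2$

Observations with larger weights pull the fitted line toward themselves more strongly.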

Why Use Weighted Regression

There are two common use cases.

  • Inverse-Variance Weighting

    • Some observations have higher variance (less reliable).
    • The regression gives less weight to high-variance points and more to low-variance points (see the formula after this list).

    For example, older house sales might be unreliable due to inflation and market changes. We assign lower weights to older sales and higher weights to recent sales.

  • Aggregated Data

    • If rows represent multiple cases (e.g., survey data), weights indicate how many observations each row represents.
    • The model ensures that rows with higher counts contribute more.
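
For the first case, a standard choice (inverse-variance weighting) sets each observation's weight to the reciprocal of its variance, so the least reliable points contribute the least:

$w_i = \dfrac{1}{\sigma_i^2}$

where $\sigma_i^2$ is the known or estimated variance of observation $i$.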

Let's examine weighted regression on the King County housing data. Older sale prices are less reliable, so we weight each sale by its recency, measured as the number of years since 2005.

  • In R

    library(lubridate)
      
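    # Extract the sale year from DocumentDate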
    house$Year <- year(house$DocumentDate)
      
    # Compute weight (years since 2005)
    house$Weight <- house$Year - 2005
    
  • In Python

    # Extract the sale year from DocumentDate (e.g., '2014-09-16')
    house['Year'] = [int(date.split('-')[0]) for date in house.DocumentDate]
      
    # Compute weight (years since 2005)
    house['Weight'] = house.Year - 2005
    
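Equivalently, assuming DocumentDate is a parseable date string (e.g., '2014-09-16'), pandas can extract the year directly. A minimal sketch:

    import pandas as pd

    # Alternative: parse DocumentDate and take the year component
    house['Year'] = pd.to_datetime(house['DocumentDate']).dt.year
    house['Weight'] = house['Year'] - 2005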


Let’s fit a weighted regression model with the calculated weights.

  • In R

    house_wt <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + 
                   Bedrooms + BldgGrade, data=house, weights=Weight)
      
    # Standard (unweighted) regression for comparison
    house_lm <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + 
                   Bedrooms + BldgGrade, data=house)
      
    # Compare coefficients side by side
    round(cbind(house_lm=house_lm$coefficients, house_wt=house_wt$coefficients), digits=3)
    
  • In Python, most scikit-learn models accept sample weights through the sample_weight argument of their fit method.

    from sklearn.linear_model import LinearRegression
      
    # Define predictors and outcome
    predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade']
    outcome = 'AdjSalePrice'
      
    # Initialize model
    house_wt = LinearRegression()
      
    # Fit weighted regression
    house_wt.fit(house[predictors], house[outcome], sample_weight=house.Weight)
      
    # Print coefficients
    print("Coefficients:", house_wt.coef_)
    print("Intercept:", house_wt.intercept_)
    
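To mirror the R comparison above, we can also fit an unweighted model and put both sets of coefficients side by side; a sketch reusing the predictors defined above:

    import pandas as pd

    # Unweighted fit for comparison
    house_lm = LinearRegression()
    house_lm.fit(house[predictors], house[outcome])

    # Coefficients side by side: unweighted vs. weighted
    print(pd.DataFrame({'house_lm': house_lm.coef_, 'house_wt': house_wt.coef_},
                       index=predictors).round(3))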


From the comparison, we can observe the following.

  • The coefficients change slightly in the weighted regression.
  • The effect of square footage (SqFtTotLiving) increases in the weighted regression, meaning that in more recent sales, price responds more strongly to living area.

When to Use Weighted Regression

Weighted regression is beneficial in several scenarios:

  1. Data Reliability: When some data points are more trustworthy than others, such as newer sales figures, weighted regression enhances the analysis.
  2. Aggregated Data: It is effective for summarized data, like survey results, where the figures represent overall totals rather than individual responses.
  3. Variable Data Points: In cases where data points fluctuate significantly, such as experimental measurements, weighting can mitigate the impact of the noisiest observations (see the sketch below).

By considering the trustworthiness of different data, weighted regression improves the accuracy of predictions.
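
As a minimal, self-contained sketch of the third case, the snippet below fits a line to synthetic measurements (all values hypothetical) using inverse-variance weights, so the noisiest points count least:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    sigma = rng.uniform(0.5, 3.0, size=100)   # per-point noise level
    y = 2.0 * X.ravel() + 1.0 + rng.normal(0, sigma)

    # Inverse-variance weights: high-variance points count less
    model = LinearRegression()
    model.fit(X, y, sample_weight=1.0 / sigma**2)
    print(model.coef_, model.intercept_)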

Interactions and Main Effects

When interpreting the regression equation, statisticians often distinguish between main effects, the individual contributions of the predictor variables, and interactions between those predictors.

Main effects refer to the independent variables (predictors) that directly influence the dependent variable.

For example, in the King County Housing Data:

  • SqFtTotLiving (House Size): Larger houses tend to have higher prices.
  • ZipGroup (Location): Houses in more expensive zip codes generally sell for more.
  • Bathrooms, Bedrooms, BldgGrade: These also affect house prices.

Main effects assume that each predictor affects the outcome independently. But in reality, variables often interact.


An interaction effect occurs when one predictor's impact depends on another predictor's level.

For example, the effect of house size on price depends on location.

  • A 1,000 sq ft increase in a high-end neighborhood might raise the price significantly.
  • The same increase in a low-cost neighborhood might have less impact.

To explore this further, we will add an interaction term between SqFtTotLiving and ZipGroup.

$\text{AdjSalePrice} \sim \text{SqFtTotLiving} * \text{ZipGroup} + \text{other predictors}$

This expands into:

$\text{AdjSalePrice} \sim \text{SqFtTotLiving} + \text{ZipGroup} + (\text{SqFtTotLiving} \times \text{ZipGroup}) + \text{other predictors}$
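
With ZipGroup treated as a categorical variable (group 1 as the reference), the per-square-foot slope for a home in zip group $k$ is the main effect plus that group's interaction coefficient:

$\text{slope}_k = \beta_{\text{SqFtTotLiving}} + \beta_{\text{SqFtTotLiving:ZipGroup}_k}$

with the interaction coefficient equal to zero for the reference group.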

  • In Python, we can fit this regression using statsmodels.

    import statsmodels.formula.api as smf
      
    # Fit the model with an interaction between SqFtTotLiving and ZipGroup
    model = smf.ols(formula='AdjSalePrice ~ SqFtTotLiving*ZipGroup + SqFtLot + '
                            'Bathrooms + Bedrooms + BldgGrade + PropertyType',
                    data=house)
      
    results = model.fit()
    print(results.summary())
    
    • SqFtTotLiving*ZipGroup automatically includes:
      • The main effects: SqFtTotLiving and ZipGroup
      • The interaction terms: SqFtTotLiving:ZipGroup

From the regression output, we can observe the following estimates.

| Variable                | Estimate |
|-------------------------|---------:|
| SqFtTotLiving           | 114.8    |
| ZipGroup2               | -11,130  |
| ZipGroup3               | 20,320   |
| ZipGroup4               | 20,500   |
| ZipGroup5               | -149,900 |
| SqFtTotLiving:ZipGroup2 | 32.6     |
| SqFtTotLiving:ZipGroup3 | 41.8     |
| SqFtTotLiving:ZipGroup4 | 69.3     |
| SqFtTotLiving:ZipGroup5 | 226.7    |

The statsmodels package handles categorical variables (e.g., ZipGroup, PropertyType[T.Single Family]) and interaction terms (e.g., SqFtTotLiving:ZipGroup) automatically, expanding them into the dummy-coded columns shown above. There seems to be a significant interaction between location and house size. For homes in the lowest `ZipGroup` (the reference level), the slope is simply the main effect of SqFtTotLiving: $114.8 per square foot.

Conversely, for homes in the highest `ZipGroup`, the slope is the sum of the main effect and the SqFtTotLiving:ZipGroup5 interaction: roughly $115 + $227 = $342 per square foot. In other words, adding a square foot in the priciest zip code group increases the predicted sale price by nearly three times the average per-square-foot increase.
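
A quick sketch that computes the implied slope for each zip group from the estimates in the table above:

    # Implied per-square-foot slope for each zip group
    base = 114.8
    interaction = {'ZipGroup1': 0.0, 'ZipGroup2': 32.6, 'ZipGroup3': 41.8,
                   'ZipGroup4': 69.3, 'ZipGroup5': 226.7}
    for group, delta in interaction.items():
        print(f'{group}: ${base + delta:,.1f} per square foot')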

Importance of Interaction Effects

  • Avoid Misleading Conclusions: Relying solely on main effects can lead us to mistakenly believe that house size impacts price uniformly across different locations, which isn’t true.
  • Enhance Pricing Models: Real estate markets are significantly influenced by location; understanding interaction effects allows for more precise pricing models.
  • Inform Policy Decisions: For cities aiming to boost housing value, focusing on construction in high-demand areas is crucial.
