
Practical Statistics for Data Scientists: Simple Linear Regression, Least Squares, and Multiple Linear Regression


The primary objective in statistics is to determine whether variable $X$ is related to variable $Y$, and if it is, to understand the nature of this relationship and use it to predict $Y$. Nowhere is the overlap between statistics and data science stronger than in prediction: forecasting an outcome (the target variable) from the values of predictor variables.

Training a model on data where the outcome is known, and then applying it to data where the outcome is unknown, is called supervised learning. Regression is also used in anomaly detection, since regression diagnostics applied during data analysis can identify unusual records.

Simple Linear Regression

Simple linear regression provides a model that describes the relationship between the magnitude of one variable and that of another: for example, as X increases, Y also increases, or as X increases, Y decreases.

Correlation serves as another way to measure the relationship between two variables.

Key Terms For Simple Linear Regression

  • Response
    • The variable we aim to predict
    • = dependent variable, $Y$ variable, target, outcome
  • Independent Variable
    • The variable used to predict the outcome.
    • = $X$ variable, feature, attribute, predictor
  • Record
    • The vector of predictor and outcome values for a specific individual or case.
    • = row, case, instance, example
  • Intercept
    • The intercept of the regression line represents the predicted value when $X=0$.
    • = $b_0, \beta_{0}$
  • Regression Coefficient
    • The slope of the regression line.
    • = slope, $b_1$, $\beta_{1}$, parameter estimates, weights
  • Fitted Values
    • The estimates $\hat{Y_i}$ obtained from the regression line.
    • = predicted values
  • Residuals
    • The difference between the observed values and the fitted values.
    • = errors
  • Least Squares
    • The method of fitting a regression by minimizing the sum of squared residuals.
    • = ordinary least squares, OLS

The Regression Equation

Simple linear regression estimates how much $Y$ will change when $X$ changes by a certain amount. With the correlation coefficient, the variables $X$ and $Y$ are interchangeable. With regression, we aim to predict the $Y$ variable from $X$ using a linear relationship as follows.

$Y = b_0 + b_1 X$

where,

  • $b_0$ : intercept (constant) and $b_1$: slope
  • In R, both appear as coefficients

Let's say we have a dataset containing the number of years a worker was exposed to cotton dust (Exposure) and a measure of lung capacity (PEFR, or peak expiratory flow rate). Let's draw a "best" line to predict the response PEFR as a function of the predictor variable Exposure.

  • In R, we use the lm function to fit a linear regression.

    model <- lm(PEFR ~ Exposure, data=lung)
    ---
    Call:
    lm(formula = PEFR ~ Exposure, data = lung)
      
    Coefficients:
    (Intercept)     Exposure
        424.583       -4.185
    

    The intercept $b_0$ is $424.583$, and can be interpreted as the predicted PEFR for a worker with zero years of exposure. The regression coefficient $b_1$ can be interpreted as follows: for each additional year that a worker is exposed to cotton dust, the worker's PEFR measurement is reduced by $4.185$.

  • In Python, we use LinearRegression from the scikit-learn package.

    from sklearn.linear_model import LinearRegression

    predictors = ['Exposure']
    outcome = 'PEFR'

    # fit PEFR as a linear function of Exposure
    model = LinearRegression()
    model.fit(lung[predictors], lung[outcome])

    print(f'Intercept: {model.intercept_:.3f}')
    print(f'Coefficient Exposure: {model.coef_[0]:.3f}')
    


    The result matches the R output above: an intercept of 424.583 and an Exposure coefficient of -4.185.
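
    Plugging these estimates into the fitted equation gives a quick worked example: a worker with, say, 10 years of exposure has a predicted PEFR of $424.583 - 4.185 \times 10 \approx 382.7$.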



Fitted Values and Residuals

In the real world, the data doesn't fall exactly on the line, so the regression equation includes an explicit error term $e_i$.

$Y_i = b_0 + b_1 X_i + e_i$

The fitted values, also called the predicted values, are typically denoted by $\hat{Y_i}$. So the equation becomes

$\hat{Y_i} = \hat{b_0} + \hat{b_1} X_i$

The notation $\hat{b_0}$ and $\hat{b_1}$ indicates the coefficients are estimated values.

We compute the residuals $\hat{e_i}$ by subtracting the predicted values from the original data:

$\hat{e_i} = Y_i - \hat{Y_i}$

  • In R

    fitted <- predict(model)
    resid <- residuals(model)
    
  • In Python, we use the fitted LinearRegression model's predict method as follows.

    fitted = model.predict(lung[predictors])
    residuals = lung[outcome] - fitted
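
    Because the largest residuals are the observations the fitted line explains worst, they connect back to the anomaly-detection use of regression mentioned earlier. Below is a minimal sketch, assuming the lung data frame and the residuals Series computed above:

    # the five records with the largest absolute residuals,
    # i.e., the observations the model explains worst
    worst = residuals.abs().nlargest(5)
    print(lung.loc[worst.index])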
    


Least Squares

How well does the model fit the data? To evaluate the fit, we use the residual sum of squares ($RSS$), which is the sum of the squared residual values. The goal of fitting is to minimize it.

$\text{RSS} = \sum_{i=1}^n \big( Y_i - \hat{Y_i} \big)^2 \\ = \sum_{i=1}^n \big( Y_i - \hat{b_0} - \hat{b_1}X_i \big)^2$

The estimates $\hat{b_0}$ and $\hat{b_1}$ are the values that minimize RSS.

The method of minimizing the sum of squared residuals is termed "Least Squares Regression" or "Ordinary Least Squares" (OLS) regression.

With the advent of big data, computational speed is still an important factor. Least squares, like the mean, is sensitive to outliers, although this tends to be a significant problem only in small or moderate-sized data sets.
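
For simple linear regression, minimizing RSS has a well-known closed-form solution, where $\bar{X}$ and $\bar{Y}$ denote the sample means:

$\hat{b_1} = \dfrac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}, \qquad \hat{b_0} = \bar{Y} - \hat{b_1}\bar{X}$

These are the same estimates that lm and LinearRegression return for a single predictor. As a rough sketch (assuming the lung data frame from the earlier example), they can be computed directly:

    import numpy as np

    # closed-form least squares estimates for PEFR ~ Exposure
    x = lung['Exposure'].values
    y = lung['PEFR'].values

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    print(b0, b1)  # should reproduce the coefficients from lm / LinearRegression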

Prediction Versus Explanation (Profiling)

Historically, regression has focused on identifying linear relationships between predictor variables and outcomes, with the goal of understanding and explaining those relationships; here the estimated slope of the regression equation is of primary interest. For example, an economist might analyze the connection between consumer spending and GDP growth to understand how the two variables are related.

With the rise of big data, regression is more often used to predict outcomes for new data rather than to explain data already in hand; the main focus is on the fitted values $\hat{Y}$. In marketing, regression is used to forecast revenue changes based on the size of an ad campaign. Universities use regression to predict students' GPA based on their SAT scores.

Multiple Linear Regression

For multiple linear regression, the equation is extended to include additional predictors; the relationship between each coefficient and its variable (feature) remains linear.

$Y = b_0+b_1X_1+b_2X_2+ \dots +b_p X_p + e$

Key Terms For Multiple Linear Regression

  • Root Mean Squared Error
    • The square root of the average squared error of the regression; this is the most widely used metric for comparing regression models.
    • = RMSE
  • Residual Standard Error
    • The same as the root mean squared error, but adjusted for degrees of freedom.
    • = RSE
  • R-Squared
    • The proportion of variance explained by the model, ranging from 0 to 1.
    • = coefficient of determination, $R^2$
  • t-Statistic
    • Dividing a predictor’s coefficient by its standard error offers a way to assess the significance of the variables in the model.

Example: King County Housing Data

An example of using multiple linear regression is estimating house values. County assessors need to determine the value of a house for tax assessment purposes. Below are the first few rows of the house data.frame.

head(house[, c('AdjSalePrice', 'SqFtTotLiving', 'SqFtLot', 'Bathrooms',
               'Bedrooms', 'BldgGrade')])
Source: local data frame [6 x 6]

  AdjSalePrice SqFtTotLiving SqFtLot Bathrooms Bedrooms BldgGrade
         (dbl)         (int)   (int)     (dbl)    (int)     (int)
1       300805          2400    9373      3.00        6         7
2      1076162          3764   20156      3.75        4        10
3       761805          2060   26036      1.75        4         8
4       442065          3200    8618      3.75        5         7
5       297065          1720    8620      1.75        4         7
6       411781           930    1012      1.50        2         8

The objective is to forecast the sales price based on the other variables.

  • In R

    The lm function handles the multiple regression case simply by including additional terms on the right-hand side of the formula; the argument na.action=na.omit directs the model to exclude records with missing values.

    house_lm <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data=house, na.action=na.omit)
    
  • In Python

    scikit-learn’s LinearRegression can be used for multiple linear regression as well.

    from sklearn.linear_model import LinearRegression

    predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade']
    outcome = 'AdjSalePrice'

    house_lm = LinearRegression()
    house_lm.fit(house[predictors], house[outcome])
    

Each model’s results are as follows.

  • In R

    house_lm
    ---
    Call:
    lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
        Bedrooms + BldgGrade, data = house, na.action = na.omit)
      
    Coefficients:
      (Intercept)  SqFtTotLiving        SqFtLot      Bathrooms
       -5.219e+05      2.288e+02     -6.047e-02     -1.944e+04
         Bedrooms      BldgGrade
       -4.777e+04      1.061e+05
    
  • In Python

    For a LinearRegression model, the intercept and coefficients are represented by the fields intercept_ and coef_ of the fitted model.

    print(f'Intercept: {house_lm.intercept_:.3f}')
    print('Coefficients:')
    for name, coef in zip(predictors, house_lm.coef_):
      print(f' {name}: {coef}')
    

    The interpretation of the coefficients follows the same principle as in simple linear regression: the predicted value $\hat{Y}$ changes by the coefficient $b_j$ for each unit change in $X_j$, assuming all the other variables, $X_k$ for $k \neq j$, remain the same.
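
    The metrics defined in the key terms above (RMSE and $R^2$) can be computed from the model's fitted values. Below is a minimal sketch, assuming the house data frame, predictors, outcome, and house_lm from this example, and using scikit-learn's mean_squared_error and r2_score helpers.

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    fitted = house_lm.predict(house[predictors])

    # RMSE: square root of the average squared residual
    rmse = np.sqrt(mean_squared_error(house[outcome], fitted))

    # R-squared: proportion of variance in the outcome explained by the model
    r2 = r2_score(house[outcome], fitted)

    print(f'RMSE: {rmse:.0f}')
    print(f'R-squared: {r2:.3f}')

    The residual standard error ($RSE$) differs from RMSE only in dividing by the residual degrees of freedom, $n - p - 1$, rather than by $n$.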
