Day134 - STAT Review: Classification (4)
Practical Statistics for Data Scientists: Logistic Regression (2) (GLM, Interpretation, Fitting the Model)
Logistic Regression and the GLM
In the logistic regression formula, the response is the log odds of the binary outcome being 1. Since we observe only the binary outcome, not the log odds, specialized statistical methods are required to fit the equation.
Logistic regression is a specific case of a generalized linear model (GLM) that was created to broaden the applications of linear regression.
What are GLMs?
Generalized Linear Models (GLMs) extend linear regression by introducing:
- A probability distribution (family):
  - Binomial (for logistic regression)
  - Poisson (for count data)
  - Gamma (for modeling time to failure)
- A link function:
  - Logit function for logistic regression
  - Log function for Poisson regression
Logistic regression is the most common type of GLM, where:
- The response variable follows a binomial distribution (0 or 1).
- The logit link function transforms probabilities into log-odds.
Other types of GLMs used in practice include:
- Poisson Regression: Used for count data (e.g., number of website visits).
- Negative Binomial Regression: Handles overdispersed count data.
- Gamma Regression: Used for modeling continuous positive responses like time-to-failure.
Unlike logistic regression, these GLMs require a more nuanced approach and extra caution. It's advisable to steer clear of them unless you are well acquainted with their utility and potential drawbacks.
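To make the family-and-link pairing concrete, here is a minimal sketch using `statsmodels` (the data frame `df` and its columns are hypothetical placeholders, not the book's loan data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: a binary outcome, a count outcome,
# and a positive continuous outcome
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'x': rng.normal(size=200),
    'y_binary': rng.integers(0, 2, size=200),
    'y_count': rng.poisson(3, size=200),
    'y_duration': rng.gamma(2.0, 1.0, size=200),
})

# Logistic regression: binomial family, logit link (the default)
logit_fit = smf.glm('y_binary ~ x', data=df,
                    family=sm.families.Binomial()).fit()

# Poisson regression: Poisson family, log link (the default)
pois_fit = smf.glm('y_count ~ x', data=df,
                   family=sm.families.Poisson()).fit()

# Gamma regression for positive continuous responses (e.g., time to failure)
gamma_fit = smf.glm('y_duration ~ x', data=df,
                    family=sm.families.Gamma(link=sm.families.links.Log())).fit()
```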
Predicted Values from Logistic Regression
Logistic regression's predicted value is expressed in log odds: $\hat{Y} = \log(\text{Odds}(Y=1))$. The logistic response function converts the log odds back to the predicted probability:

$$\hat{p} = \frac{1}{1 + e^{-\hat{Y}}}$$
For example, consider the predictions from the model `logistic_model` in R:

```r
pred <- predict(logistic_model)
summary(pred)
```
```
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-2.704774 -0.518825 -0.008539  0.002564  0.505061  3.509606
```
In Python, we can convert the predicted log probabilities into a data frame and use the `describe` method to get these characteristics of the distribution:

```python
pred = pd.DataFrame(logit_reg.predict_log_proba(X),
                    columns=loan_data[outcome].cat.categories)
pred.describe()
```
Converting these values to probabilities is a simple transform:
```r
prob <- 1/(1 + exp(-pred))
summary(prob)
```
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.06269 0.37313 0.49787 0.50000 0.62365 0.97096
```
The probabilities are directly available using the `predict_proba` method in scikit-learn:

```python
pred = pd.DataFrame(logit_reg.predict_proba(X),
                    columns=loan_data[outcome].cat.categories)
pred.describe()
```
These values range from 0 to 1 but do not by themselves declare whether a loan is predicted to default or be paid off; that requires a cutoff. We could label any value above 0.5 as a default, but a lower cutoff is often better when the aim is to identify members of a rare class, as in the sketch below.
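As a minimal sketch of applying a custom cutoff (assuming `pred` holds the `predict_proba` output from above, and that `'default'` is one of the outcome categories; the column name is an assumption):

```python
# Hypothetical cutoff: flag a loan as default when its predicted
# probability exceeds 0.3 instead of the conventional 0.5
cutoff = 0.3
flagged = (pred['default'] > cutoff).astype(int)

# A lower cutoff catches more of the rare class (higher recall)
# at the cost of more false alarms (lower precision)
print(flagged.mean())  # share of loans flagged as default
```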
Interpreting the Coefficients and Odds Ratios
One of the key strengths of logistic regression is that it generates a model that can be quickly evaluated on new data without requiring recomputation. Another benefit is the model’s interpretability in comparison to other classification methods.
The key concept is the odds ratio, which is easiest to grasp for a binary variable $X$:

$$\text{odds ratio} = \frac{\text{Odds}(Y=1 \mid X=1)}{\text{Odds}(Y=1 \mid X=0)}$$

This is interpreted as the odds that $Y=1$ when $X=1$ versus the odds that $Y=1$ when $X=0$. If the odds ratio is $2$, then the odds that $Y=1$ are two times higher when $X=1$ than when $X=0$.
- Odds: The ratio of the probability that an event happens to the probability that it doesn’t.
- Odds Ratio (OR): It compares the odds for two different values of a binary variable.
- For a binary variable $X \in \{0, 1\}$ with coefficient $\beta$, the odds ratio is $e^{\beta}$:
- If $\beta = 0$, then $\text{OR} = 1$ → no effect
- If $\beta > 0$, then $\text{OR} > 1$ → predictor increases the odds
- If $\beta < 0$, then $\text{OR} < 1$ → predictor decreases the odds
Why do we use odds ratios (and not probabilities)? In logistic regression, the model predicts the log-odds, the logarithm of the odds. This is because a linear model cannot be constrained to outputs between 0 and 1, but the log-odds scale stretches across all real numbers:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q$$
By exponentiating both sides, we get:

$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q}$$
This gives us the odds, which can easily be interpreted as multiplicative effects.
For example, suppose the coefficient for `purpose_small_business` is 1.21526. Then the odds ratio is:

$$e^{1.21526} \approx 3.37$$

This tells us:
- The odds of default for a small-business loan are 3.37 times higher than for a loan used to pay off credit card debt (the reference category).
- That’s a big difference – small business loans are riskier.
For numeric variables (e.g., `payment_inc_ratio`):
- A unit increase in the predictor increases the odds of the outcome by a factor of $e^{\beta}$.
Example: If `payment_inc_ratio` has $\beta = 0.0797$, then:

$$e^{0.0797} \approx 1.083$$

→ A unit increase in the payment-to-income ratio increases the odds of defaulting by about 8.3%.
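As a quick sketch of computing these odds ratios in code (assuming, as above, that `logit_reg` is a fitted scikit-learn `LogisticRegression` and `X` is a predictor data frame):

```python
import numpy as np
import pandas as pd

# Exponentiate each coefficient to get its multiplicative effect on the odds.
# coef_ has shape (1, n_features) for a binary outcome, hence coef_[0].
odds_ratios = pd.Series(np.exp(logit_reg.coef_[0]), index=X.columns)
print(odds_ratios.sort_values(ascending=False))

# Sanity checks for the examples above:
print(np.exp(1.21526))  # ≈ 3.37
print(np.exp(0.0797))   # ≈ 1.083
```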
Logistic vs Linear Regression
Feature | Linear Regression | Logistic Regression
---|---|---
Outcome | Continuous | Binary (0 or 1)
Model Fit | Least squares | Maximum likelihood estimation (MLE)
Output | Continuous prediction ($\hat{y}$) | Probability (between 0 and 1)
Interpretation | Coefficients = effect on outcome | Coefficients = effect on log-odds
Evaluation Metrics | RMSE, $R^2$ | Accuracy, precision, AUC, log loss
How is Logistic Regression Fitted?
Maximum Likelihood Estimation (MLE)
Since the response variable is binary, we can't use least squares. Instead, maximum likelihood estimation (MLE) is used:
- MLE finds the parameter values that maximize the likelihood of observing the data given the model.
- It iteratively updates the parameters to improve the fit, typically using optimization techniques such as Newton-Raphson or Fisher scoring (a minimal sketch follows below).
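To illustrate what MLE is doing under the hood, here is a minimal sketch on simulated data (a real fit would use `statsmodels` or scikit-learn; BFGS stands in here for Newton-Raphson as a quasi-Newton method):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # inverse logit: 1 / (1 + exp(-z))

# Hypothetical data: an intercept column plus one predictor
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, expit(X @ true_beta))

def neg_log_likelihood(beta):
    """Negative log-likelihood of the logistic model."""
    p = expit(X @ beta)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Minimizing the negative log-likelihood = maximizing the likelihood
result = minimize(neg_log_likelihood, x0=np.zeros(2), method='BFGS')
print(result.x)  # estimates should be close to true_beta
```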
Handling Factor (Categorical) Variables
- In R: categorical variables are handled using reference encoding. For a variable with $k$ levels, R creates $k-1$ dummy variables and compares them to the reference level.
- In Python: use `pd.get_dummies()` to convert to one-hot encoding. Just make sure to drop one level to avoid multicollinearity (e.g., `drop_first=True`), as in the sketch below.
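A minimal sketch of the Python approach (the tiny data frame and column names here are hypothetical placeholders, not the book's loan data):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical loan-like data with one categorical predictor
df = pd.DataFrame({
    'purpose': ['credit_card', 'small_business', 'credit_card', 'car',
                'small_business', 'car'],
    'payment_inc_ratio': [4.5, 9.1, 2.3, 6.7, 8.2, 3.9],
    'default': [0, 1, 0, 1, 1, 0],
})

# drop_first=True drops one level per factor (the reference category),
# which avoids perfect multicollinearity among the dummy columns
X = pd.get_dummies(df[['purpose', 'payment_inc_ratio']], drop_first=True)
y = df['default']

logit = LogisticRegression().fit(X, y)
print(dict(zip(X.columns, logit.coef_[0])))
```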