Day134 - STAT Review: Classification (4)
Practical Statistics for Data Scientists: Logistic Regression (2) (GLM, Interpretation, Fitting the Model)
Logistic Regression and the GLM
In the logistic regression formula, the response is the log odds of the binary outcome being 1. Since we observe only the binary outcome, not the log odds, specialized statistical methods are required to fit the equation.
Logistic regression is a specific case of a generalized linear model (GLM) that was created to broaden the applications of linear regression.
What are GLMs?
Generalized Linear Models (GLMs) extend linear regression by introducing:
- A probability distribution (family):
  - Binomial (for logistic regression)
  - Poisson (for count data)
  - Gamma (for modeling time to failure)
- A link function:
  - Logit function for logistic regression
  - Log function for Poisson regression
Logistic regression is the most common type of GLM, where:
- The response variable follows a binomial distribution (0 or 1).
- The logit link function transforms probabilities into log-odds.
Other types of GLMs used in practice include:
- Poisson Regression: Used for count data (e.g., number of website visits).
- Negative Binomial Regression: Handles overdispersed count data.
- Gamma Regression: Used for modeling continuous positive responses like time-to-failure.
Unlike logistic regression, these GLMs require a more nuanced approach and extra caution. It's advisable to steer clear of them unless you are well acquainted with their utility and potential drawbacks.
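To make the family-and-link pairing concrete, here is a minimal sketch using `statsmodels` (the data frame `df` and its columns are hypothetical placeholders, not the book's loan data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: a binary outcome, a count outcome,
# and a positive continuous outcome
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'x': rng.normal(size=200),
    'y_binary': rng.integers(0, 2, size=200),
    'y_count': rng.poisson(3, size=200),
    'y_duration': rng.gamma(2.0, 1.0, size=200),
})

# Logistic regression: binomial family, logit link (the default)
logit_fit = smf.glm('y_binary ~ x', data=df,
                    family=sm.families.Binomial()).fit()

# Poisson regression: Poisson family, log link (the default)
pois_fit = smf.glm('y_count ~ x', data=df,
                   family=sm.families.Poisson()).fit()

# Gamma regression for positive continuous responses (e.g., time to failure)
gamma_fit = smf.glm('y_duration ~ x', data=df,
                    family=sm.families.Gamma(link=sm.families.links.Log())).fit()
```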
Predicted Values from Logistic Regression
Logistic regression's predicted value is expressed in log odds: $\hat{Y} = \log(\text{Odds}(Y=1))$. The logistic response function converts the log odds back to the predicted probability:

$$\hat{p} = \frac{1}{1 + e^{-\hat{Y}}}$$
For example, consider the predictions from the model `logistic_model` in R:

```r
pred <- predict(logistic_model)
summary(pred)
```
```
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-2.704774 -0.518825 -0.008539  0.002564  0.505061  3.509606
```
In Python, we can convert the predicted log probabilities into a data frame and use the `describe` method to get these characteristics of the distribution:

```python
pred = pd.DataFrame(logit_reg.predict_log_proba(X),
                    columns=loan_data[outcome].cat.categories)
pred.describe()
```
Converting these values to probabilities is a simple transform:
```r
prob <- 1/(1 + exp(-pred))
summary(prob)
```
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.06269 0.37313 0.49787 0.50000 0.62365 0.97096
```
The probabilities are directly available using the `predict_proba` method in scikit-learn:

```python
pred = pd.DataFrame(logit_reg.predict_proba(X),
                    columns=loan_data[outcome].cat.categories)
pred.describe()
```
These values range from 0 to 1 but do not by themselves declare whether a loan is predicted to default or be paid off; that requires a cutoff. We could label any value above 0.5 as a default, but a lower cutoff is often better when the aim is to identify members of a rare class, as in the sketch below.
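As a minimal sketch of applying a custom cutoff (assuming `pred` holds the `predict_proba` output from above, and that `'default'` is one of the outcome categories; the column name is an assumption):

```python
# Hypothetical cutoff: flag a loan as default when its predicted
# probability exceeds 0.3 instead of the conventional 0.5
cutoff = 0.3
flagged = (pred['default'] > cutoff).astype(int)

# A lower cutoff catches more of the rare class (higher recall)
# at the cost of more false alarms (lower precision)
print(flagged.mean())  # share of loans flagged as default
```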
Interpreting the Coefficients and Odds Ratios
One of the key strengths of logistic regression is that it generates a model that can be quickly evaluated on new data without requiring recomputation. Another benefit is the model’s interpretability in comparison to other classification methods.
The key concept is the odds ratio, which is easiest to grasp for a binary variable $X$:

$$\text{odds ratio} = \frac{\text{Odds}(Y=1 \mid X=1)}{\text{Odds}(Y=1 \mid X=0)}$$

This is interpreted as the odds that $Y=1$ when $X=1$ versus the odds that $Y=1$ when $X=0$. If the odds ratio is $2$, then the odds that $Y=1$ are two times higher when $X=1$ than when $X=0$.
- Odds: The ratio of the probability that an event happens to the probability that it doesn’t.
- Odds Ratio (OR): It compares the odds for two different values of a binary variable.
- For a binary variable $X \in \{0, 1\}$ with coefficient $\beta$, the odds ratio is $e^{\beta}$:
- If $\beta = 0$, then $\text{OR} = 1$ → no effect
- If $\beta > 0$, then $\text{OR} > 1$ → predictor increases the odds
- If $\beta < 0$, then $\text{OR} < 1$ → predictor decreases the odds
Why do we use odds ratios (and not probabilities)? In logistic regression, the model predicts the log-odds, the logarithm of the odds. This is because a linear model cannot be constrained to outputs between 0 and 1, but the log-odds scale stretches across all real numbers:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q$$
By exponentiating both sides, we get:

$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q}$$
This gives us the odds, which can easily be interpreted as multiplicative effects.
For example, suppose the coefficient for `purpose_small_business` is 1.21526. Then the odds ratio is:

$$e^{1.21526} \approx 3.37$$

This tells us:
- The odds of default for a small-business loan are 3.37 times higher than for a loan used to pay off credit card debt (the reference category).
- That’s a big difference – small business loans are riskier.
For numeric variables (e.g., `payment_inc_ratio`):
- A unit increase in the predictor increases the odds of the outcome by a factor of $e^{\beta}$.
Example: If `payment_inc_ratio` has $\beta = 0.0797$, then:

$$e^{0.0797} \approx 1.083$$

→ A unit increase in the payment-to-income ratio increases the odds of defaulting by about 8.3%.
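As a quick sketch of computing these odds ratios in code (assuming, as above, that `logit_reg` is a fitted scikit-learn `LogisticRegression` and `X` is a predictor data frame):

```python
import numpy as np
import pandas as pd

# Exponentiate each coefficient to get its multiplicative effect on the odds.
# coef_ has shape (1, n_features) for a binary outcome, hence coef_[0].
odds_ratios = pd.Series(np.exp(logit_reg.coef_[0]), index=X.columns)
print(odds_ratios.sort_values(ascending=False))

# Sanity checks for the examples above:
print(np.exp(1.21526))  # ≈ 3.37
print(np.exp(0.0797))   # ≈ 1.083
```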
Logistic vs Linear Regression
Feature | Linear Regression | Logistic Regression
---|---|---
Outcome | Continuous | Binary (0 or 1)
Model Fit | Least squares | Maximum likelihood estimation (MLE)
Output | Continuous prediction ($\hat{y}$) | Probability (between 0 and 1)
Interpretation | Coefficients = effect on outcome | Coefficients = effect on log-odds
Evaluation Metrics | RMSE, $R^2$ | Accuracy, precision, AUC, log loss
How is Logistic Regression Fitted?
Maximum Likelihood Estimation (MLE)
Since the response variable is binary, we can't use least squares. Instead, maximum likelihood estimation (MLE) is used:
- MLE finds the parameter values that maximize the likelihood of observing the data given the model.
- It iteratively updates the parameters to improve the fit, typically using optimization techniques such as Newton-Raphson or Fisher scoring (a minimal sketch follows below).
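To illustrate what MLE is doing under the hood, here is a minimal sketch on simulated data (a real fit would use `statsmodels` or scikit-learn; BFGS stands in here for Newton-Raphson as a quasi-Newton method):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # inverse logit: 1 / (1 + exp(-z))

# Hypothetical data: an intercept column plus one predictor
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, expit(X @ true_beta))

def neg_log_likelihood(beta):
    """Negative log-likelihood of the logistic model."""
    p = expit(X @ beta)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Minimizing the negative log-likelihood = maximizing the likelihood
result = minimize(neg_log_likelihood, x0=np.zeros(2), method='BFGS')
print(result.x)  # estimates should be close to true_beta
```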
Handling Factor (Categorical) Variables
- In R: categorical variables are handled using reference encoding. For a variable with $k$ levels, R creates $k-1$ dummy variables and compares them to the reference level.
- In Python: use `pd.get_dummies()` to convert to one-hot encoding. Just make sure to drop one level to avoid multicollinearity (e.g., `drop_first=True`), as in the sketch below.
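A minimal sketch of the Python approach (the tiny data frame and column names here are hypothetical placeholders, not the book's loan data):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical loan-like data with one categorical predictor
df = pd.DataFrame({
    'purpose': ['credit_card', 'small_business', 'credit_card', 'car',
                'small_business', 'car'],
    'payment_inc_ratio': [4.5, 9.1, 2.3, 6.7, 8.2, 3.9],
    'default': [0, 1, 0, 1, 1, 0],
})

# drop_first=True drops one level per factor (the reference category),
# which avoids perfect multicollinearity among the dummy columns
X = pd.get_dummies(df[['purpose', 'payment_inc_ratio']], drop_first=True)
y = df['default']

logit = LogisticRegression().fit(X, y)
print(dict(zip(X.columns, logit.coef_[0])))
```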