
Practical Statistics for Data Scientists: Naive Bayes (Theoretical Approach, Code Source, & Prediction)


Classification

Data scientists frequently seek to automate decision-making for various business challenges—such as identifying whether emails are phishing, predicting whether customers will churn, and assessing whether web users are likely to click on an ad. All these scenarios represent classification problems, a type of supervised learning in which we first train a model on data with known outcomes and then apply that model to data with unknown outcomes.

Classification is a crucial prediction method that determines whether a record is a 1 or a 0 (phishing/not-phishing, click/don’t click, churn/don’t churn). A record can also be sorted into more than two groups, as with Gmail’s sorting of mail into “primary,” “social,” “promotional,” or “forums.”

We often seek more than binary classification; we want the predicted probability of a case belonging to a class. Many algorithms return a probability score instead of just a binary classification. In Python’s scikit-learn, logistic regression offers two prediction methods: predict (returns the class) and predict_proba (returns class probabilities). A sliding cutoff can convert the score to a decision.

The general approach is as follows.

  1. Set a cutoff probability for the class of interest, above which we regard a record as belonging to that class.
  2. Using any model, estimate the likelihood that a record belongs to the class of interest.
  3. If the probability surpasses the cutoff threshold, assign the new record to the class of interest.
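
The following is a minimal sketch of this approach using scikit-learn’s LogisticRegression on synthetic data (the dataset, model, and cutoff value here are illustrative placeholders, not the loan example discussed later):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data standing in for a real classification problem
    X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

    model = LogisticRegression()
    model.fit(X, y)

    # predict returns the class; predict_proba returns class probabilities
    probs = model.predict_proba(X)[:, 1]  # probability of the class of interest

    # Steps 1-3: apply a cutoff to the probability of the class of interest
    cutoff = 0.5  # an illustrative threshold; choose it to suit the problem
    predicted_class = (probs > cutoff).astype(int)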

Naïve Bayes

Naïve Bayes is a probabilistic classification algorithm that uses Bayes’ Theorem to predict the probability of a class given a set of predictor values. It assumes that the predictor variables are conditionally independent given the outcome.

It is called “naïve” because it assumes that the predictors are independent, which is often unrealistic in real-world data. However, despite this simplification, Naïve Bayes works surprisingly well in many classification problems.

Key Terms for Naive Bayes

  • Conditional Probability
    • The probability of observing a particular event (for example, $X = i$) given another event (for instance, $Y = i$), is written as $P(X_i \vert Y_i).$
  • Posterior Probability
    • This refers to the probability of an outcome after incorporating predictor information, in contrast to the prior probability of outcomes that does not take predictor information into account.

Key Probability Concepts for Naïve Bayes

To understand how Naïve Bayes works, we first need to define some important probability terms.

  • Conditional Probability

    • The probability of an event occurring, given that another event has already occurred.

      $P(A \vert B) = \frac{P(A \cap B)}{P(B)}$

    • For example, P(Spam $\vert$ Contains “Buy now”) is the probability that an email is spam, given that it contains the phrase “Buy now.”


  • Bayes’ Theorem

    • Bayes’ theorem is the fundamental formula behind Naïve Bayes Classification.

      $P(Y \vert X) = \frac{P(X \vert Y) \cdot P(Y)}{P(X)}$

    • Where:

      • $P(Y \vert X)$ = Posterior probability (the probability of class $Y$ given that the predictors $X$ are observed)
      • $P(X \vert Y)$ = Likelihood (the probability of the predictor values $X$ given class $Y$)
      • $P(Y)$ = Prior probability (probability of class $Y$ before seeing predictor values)
      • $P(X)$ = Evidence (overall probability of the predictor values)
    • In simpler terms, we calculate the probability of a class after seeing predictor values.
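
To make the formula concrete, here is a small worked example in Python with made-up numbers (the 20% spam rate and the two likelihoods below are purely illustrative):

    # Hypothetical numbers for illustration only
    p_spam = 0.20                      # P(Y): prior probability an email is spam
    p_buynow_given_spam = 0.60         # P(X|Y): "Buy now" appears in spam
    p_buynow_given_ham = 0.05          # "Buy now" appears in legitimate email

    # P(X): overall probability of seeing "Buy now" (law of total probability)
    p_buynow = p_buynow_given_spam * p_spam + p_buynow_given_ham * (1 - p_spam)

    # Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
    p_spam_given_buynow = p_buynow_given_spam * p_spam / p_buynow
    print(p_spam_given_buynow)         # 0.75: posterior probability of spam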

How Does Naïve Bayes Work?

  • Exact Bayesian Classification (Theoretical Approach)
    • If we had perfect data, the ideal way to classify a new record would be to: 1) find all records in the dataset that match the new record’s predictor values exactly; 2) count how often each outcome occurs among these matched records; and 3) assign the most frequent class to the new record (a pandas sketch of this idea follows this list).
  • But this approach is impractical.
    • With many predictor variables, exact matches become very rare.
    • Even in a large dataset, there might be no exact matches for some new observations.
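
To see why, here is a sketch of exact-match classification in pandas; it assumes a loan_data frame with the predictor columns used later in this post, and the category labels are hypothetical:

    # Hypothetical new record: small-business loan, mortgage, employed > 1 year
    matches = loan_data[(loan_data['purpose_'] == 'small_business') &
                        (loan_data['home_'] == 'MORTGAGE') &
                        (loan_data['emp_len_'] == '> 1 Year')]

    # Count how often each outcome occurs among the exact matches
    print(matches['outcome'].value_counts(normalize=True))

    # The most frequent outcome would be the prediction, but with many
    # predictors `matches` is often empty, which is why this approach fails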

The Naïve Bayes Approximation

To make Bayesian classification practical, Naïve Bayes makes a simplifying assumption: Each predictor is treated as independent, so we calculate probabilities separately for each predictor and then combine them.

  1. Compute the conditional probability of each predictor variable for each class.

    $P(X_j \vert Y = i)$

  2. Multiply these probabilities together, assuming independence.

    $P(X_1 \vert Y = i) \times P(X_2 \vert Y = i) \times \dots \times P(X_n \vert Y = i)$

  3. Multiply by the prior probability of the class.

    $P(Y = i) \times \left( P(X_1 \vert Y = i) \times P(X_2 \vert Y = i) \times \dots \times P(X_n \vert Y = i) \right)$

  4. Normalize by dividing by the total probability across all classes.

  5. Assign the most probable class.

The Naïve Bayes Formula

The full probability calculation for classifying a record is:

$P(Y = i \vert X_1, X_2, ..., X_n) = \frac{P(Y = i) \cdot \prod_{j=1}^{n} P(X_j \vert Y = i)}{\sum_{k} P(Y = k) \cdot \prod_{j=1}^{n} P(X_j \vert Y = k)}$

Instead of searching for exact matches, Naïve Bayes estimates probabilities for each predictor separately and combines them to make a final classification.
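
As a quick sanity check on this formula, here is a tiny hand computation in Python with two classes and made-up conditional probabilities (all numbers are illustrative only):

    # Hypothetical priors and per-predictor conditional probabilities
    priors = {'default': 0.5, 'paid off': 0.5}
    cond = {
        'default':  [0.30, 0.60, 0.70],   # P(X_j | Y=default) for 3 predictors
        'paid off': [0.10, 0.50, 0.80],   # P(X_j | Y=paid off)
    }

    # Numerator of the formula: prior times the product of the conditionals
    scores = {}
    for cls in priors:
        score = priors[cls]
        for p in cond[cls]:
            score *= p
        scores[cls] = score

    # Denominator: normalize so the posteriors sum to 1 across classes
    total = sum(scores.values())
    posteriors = {cls: s / total for cls, s in scores.items()}
    print(posteriors)   # approximately {'default': 0.759, 'paid off': 0.241}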

For example, let’s say we want to classify whether a loan will be paid off or defaulted based on three categorical variables:

  • Purpose of loan (e.g., debt consolidation, home improvement)
  • Home ownership (e.g., renting, mortgage)
  • Employment length (e.g., <1 year, >1 year)

  • In R

    library(klaR)
      
    # Train Naïve Bayes Model
    naive_model <- NaiveBayes(outcome ~ purpose_ + home_ + emp_len_,
                              data = na.omit(loan_data))
      
    # View Conditional Probabilities
    naive_model$table
    
  • In Python

    from sklearn.naive_bayes import MultinomialNB
    import pandas as pd
      
    # Prepare Data (Convert Categorical Variables)
    predictors = ['purpose_', 'home_', 'emp_len_']
    outcome = 'outcome'
    X = pd.get_dummies(loan_data[predictors], prefix='', prefix_sep='')
    y = loan_data[outcome]
      
    # Train Naïve Bayes Model
    naive_model = MultinomialNB(alpha=0.01, fit_prior=True)
    naive_model.fit(X, y)
    

    The model computes conditional probabilities for each predictor, then combines them to predict whether the loan will be paid off or defaulted.
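
To see what the scikit-learn model learned, the estimated per-class feature probabilities can be inspected (this assumes the naive_model and X objects from the block above; feature_log_prob_ stores them on a log scale):

    import numpy as np

    # Empirical probabilities of each dummy-coded feature within each class,
    # recovered from the log scale; one row per class, one column per feature
    cond_probs = pd.DataFrame(np.exp(naive_model.feature_log_prob_),
                              index=naive_model.classes_,
                              columns=X.columns)
    print(cond_probs.T)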

Making Predictions With Naïve Bayes

Predicting A New Loan Application

Suppose a new loan application has:

  • Purpose: Small Business
  • Home Ownership: Mortgage
  • Employment Length: > 1 Year

Let’s apply Naïve Bayes to predict the probability of each class.

# Extract New Loan Features
new_loan = X.loc[146:146, :]

# Predict Class
print('Predicted Class:', naive_model.predict(new_loan)[0])

# Get Class Probabilities
probabilities = pd.DataFrame(naive_model.predict_proba(new_loan),
                             columns=loan_data[outcome].cat.categories)
print('Predicted Probabilities:', probabilities)

The example output will be as follows.

---
Predicted Class: default
Predicted Probabilities:
    default  paid off
0  0.653696  0.346304

The likelihood of the borrower defaulting is 65.4%. Conversely, there is a 34.6% chance that the borrower will repay the loan. The model assigns the most likely outcome as default.

Handling Numeric Predictor Variables

In its basic form, Naïve Bayes works only with categorical predictors, but we can adapt it for numerical variables using:

  1. Binning (convert numerical values into categories, e.g., income groups; a short sketch follows the Gaussian example below).
  2. Gaussian Naïve Bayes (assume numerical predictors follow a normal distribution).
  • In Python, Gaussian Naïve Bayes is applied as follows.

    from sklearn.naive_bayes import GaussianNB
      
    # Train Gaussian Naïve Bayes (For Continuous Variables)
    gnb = GaussianNB()
    gnb.fit(X, y)
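
The binning alternative from item 1 above can be done with pandas before fitting; a minimal sketch, assuming a hypothetical numeric column annual_inc in loan_data and arbitrary bin edges:

    import pandas as pd

    # Discretize a hypothetical numeric predictor into ordered income groups
    loan_data['income_group'] = pd.cut(loan_data['annual_inc'],
                                       bins=[0, 40000, 80000, float('inf')],
                                       labels=['low', 'medium', 'high'])

    # The new categorical column can then be dummy-coded and used like the
    # other categorical predictors with MultinomialNB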
    

Limitations of Naïve Bayes

Naïve Bayes has some limitations, and there are specific scenarios in which the model may perform poorly.

  • One significant issue arises with strongly correlated predictors, which violate the independence assumption fundamental to Naïve Bayes.
  • Additionally, the model struggles with rare categories; if a predictor category does not appear in the training data, the model assigns it a zero probability.

The second issue can be addressed with Laplace smoothing, which assigns a small non-zero probability to categories that never appear in the training data, mitigating the zero-probability problem during prediction.
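
In scikit-learn’s MultinomialNB, this corresponds to the additive smoothing parameter alpha that was already set in the model above; a brief sketch of the idea, reusing the X and y from earlier:

    from sklearn.naive_bayes import MultinomialNB

    # alpha is the additive (Laplace/Lidstone) smoothing parameter:
    # alpha=1.0 is classic Laplace smoothing, while the alpha=0.01 used above
    # adds only a tiny pseudo-count; either way, categories never seen in the
    # training data get a small non-zero probability instead of exactly zero
    smoothed_model = MultinomialNB(alpha=1.0, fit_prior=True)
    smoothed_model.fit(X, y)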


