Day139 - STAT Review: Statistical Machine Learning (1)
Practical Statistics for Data Scientists: K-Nearest Neighbors (1) (Example, Distance Metrics, and One Hot Encoder)
What Is Statistical Machine Learning?
Statistical Machine Learning refers to data-driven, automated methods used for prediction tasks like classification (e.g., “yes” vs. “no”) and regression (predicting a continuous value).
- These techniques differ from classical statistics, which focuses on explaining relationships through fixed models (like linear regression).
- Instead, statistical machine learning methods let the data speak for itself without assuming a fixed structure (like linearity).
For example:
- K-Nearest Neighbors (KNN): Predict by looking at similar data points.
- Ensemble Learning + Decision Trees: Use multiple decision trees together (like Random Forests and Boosting) to get strong predictive results.
Machine Learning vs. Statistics
| Machine Learning | Statistics |
|---|---|
| Focuses on performance & scalability | Focuses on understanding & inference |
| Optimizes for predictive accuracy | Explains underlying data structure |
| Example: Boosting, Neural Nets | Example: Generalized Linear Models |
There’s no hard line between them—both communities contribute to techniques like bagging and boosting.
K-Nearest Neighbors (KNN)
KNN is a simple yet powerful technique. It doesn’t build a model—it predicts directly from the data by comparing a new point to nearby points in the training data.
- How It Works:
- For a new data point, find its $K$ nearest neighbors (based on a distance metric); a minimal sketch follows this list.
- If classification: assign the majority class among those neighbors.
- If regression: average the target values of those neighbors.
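These steps are easy to see in a minimal from-scratch sketch in Python. The helper name and toy data below are made up for illustration; the worked example later in this post uses R and scikit-learn instead.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # 1. Distance from x_new to every training record (Euclidean here)
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Indices of the k closest records
    nearest = np.argsort(dists)[:k]
    # 3. Majority class among those neighbors
    #    (for regression, return y_train[nearest].mean() instead)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (made-up values): two predictors, two classes
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(['paid off', 'paid off', 'default', 'default'])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 'paid off'
```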
Key Terms for K-Nearest Neighbors
- Neighbor
- A record that has similar predictor values to another record.
- Distance Metrics
- Measures that sum up in a single number how far one record is from another.
- Standardization
- Subtract the mean and divide by the standard deviation.
- Also called normalization in this context.
- z-score
- The value that results after standardization (see the short sketch after this list).
- $\frac{x-\bar{x}}{s}$
- $K$
- The number of neighbors considered in the nearest neighbor calculation.
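A quick numeric sketch of the z-score calculation (the sample values are made up):

```python
import numpy as np

# Made-up sample of payment_inc_ratio values
x = np.array([4.0, 6.5, 8.0, 9.0, 12.5])
z = (x - x.mean()) / x.std(ddof=1)   # subtract the mean, divide by the std deviation
print(z.round(2))                    # -> roughly [-1.27, -0.48, 0., 0.32, 1.43]
```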
Important Considerations:
- All features must be numeric.
- Feature scaling is crucial, or large-valued variables dominate.
- Value of K matters: small K = more noise; large K = more smoothing.
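One way to handle the scaling concern in scikit-learn is to standardize inside a pipeline. A minimal sketch, where `X`, `y`, and `X_new` are placeholders for your predictors, outcome, and new records:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Convert every predictor to a z-score before the neighbor search,
# so no single large-valued variable dominates the distance.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))
# knn_scaled.fit(X, y)
# knn_scaled.predict(X_new)
```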
KNN is a simple classification method that doesn't require model fitting like regression. However, its application can be complex due to feature scaling, similarity measurement, and the chosen value of K. We will demonstrate KNN with a classification example.
Example: Predicting Loan Default with KNN
In this example:
- Predict whether a loan will default or be paid off.
- Features: debt-to-income ratio (`dti`) and loan-payment-to-income ratio (`payment_inc_ratio`).

Both ratios are multiplied by 100. For a small set of 200 loans, designated as `loan200`, which includes known binary outcomes (either default or no-default, recorded in `outcome200`), we set $K = 20$. The KNN estimate for predicting a new loan, referred to as `newloan`, with a `dti` of 22.5 and a `payment_inc_ratio` of 9 can be calculated in R as follows:
```r
newloan <- loan200[1, 2:3, drop=FALSE]   # hold out the first record as the new loan
knn_pred <- knn(train=loan200[-1, 2:3], test=newloan, cl=loan200[-1, 1], k=20)
knn_pred == 'paid off'
```
```
[1] TRUE
```
While R has a native `knn` function, the contributed R package `FNN` (Fast Nearest Neighbor) scales more effectively to big data and provides more flexibility.
In Python, scikit-learn offers a fast, efficient KNN implementation:

```python
from sklearn.neighbors import KNeighborsClassifier

predictors = ['payment_inc_ratio', 'dti']
outcome = 'outcome'

newloan = loan200.loc[0:0, predictors]   # the new loan to classify (first row)
X = loan200.loc[1:, predictors]          # remaining 199 loans as training predictors
y = loan200.loc[1:, outcome]             # known outcomes for the training loans

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X, y)
knn.predict(newloan)
```
The figure below gives a visual display of this example. The loan to be predicted is the cross in the middle. Squares (paid off) and circles (default) represent the training data. The large black circle shows the boundary of the nearest 20 points, which contain 9 defaulted and 11 paid-off loans, so the predicted outcome is that the loan is paid off. If we considered only the three nearest neighbors, however, the prediction would be a default.

- Majority is paid off, so KNN predicts paid off
- But we also get a probability: $\text{Probability of default} = \frac{9}{20} = 0.45$
Employing a probability score enables classification rules beyond the simple majority vote (a cutoff of 0.5). This is particularly crucial in scenarios involving imbalanced classes.
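In scikit-learn, these vote fractions come from `predict_proba`. Continuing the fitted classifier from the example above (the column order follows the classifier's `classes_` attribute):

```python
# Fraction of the 20 neighbors in each class,
# e.g. array([[0.45, 0.55]]) for classes ['default', 'paid off']
knn.predict_proba(newloan)
```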
KNN is a powerful baseline algorithm. It predicts new values based on similarity to previous examples. But its success depends heavily on preprocessing the data (especially scaling), choosing $K$ wisely, and handling class imbalance properly.
Distance Metrics
Similarity, or nearness, is determined using a distance metric: a function that sums up in a single number how far one record, ($x_1$, $x_2$, …, $x_p$), is from another, ($u_1$, $u_2$, …, $u_p$).
The Euclidean distance is the most widely used metric for comparing two vectors; think of it as the straight-line ("as the crow flies") distance between two points in space. To calculate it, subtract one vector from the other, square the resulting differences, sum the squared values, and take the square root of that sum: $\text{distance} = \sqrt{(x_1 - u_1)^2 + (x_2 - u_2)^2 + \dots + (x_p - u_p)^2}$
Another common distance metric for numeric data is Manhattan distance, which sums the absolute differences ("block by block," like driving on a city grid): $\text{distance} = \vert x_1 - u_1\vert + \vert x_2 - u_2 \vert + \dots + \vert x_p - u_p\vert$. Use it when movement is restricted to horizontal/vertical steps (as on a grid) or when you want to be more tolerant of outliers.
Many other metrics measure the distance between vectors. For numeric data, Mahalanobis distance is useful because it accounts for correlation between variables, treating highly correlated variables as effectively one in terms of distance. It is more complex, however: it requires computing the covariance matrix, which makes it computationally heavier than Euclidean or Manhattan distance.
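A minimal sketch of the three metrics with SciPy; the two records and the small sample used for the covariance matrix are made up:

```python
import numpy as np
from scipy.spatial import distance

# Two made-up loan records: [payment_inc_ratio, dti]
x = np.array([9.0, 22.5])
u = np.array([4.0, 15.0])

print(distance.euclidean(x, u))   # straight-line distance
print(distance.cityblock(x, u))   # Manhattan (sum of absolute differences)

# Mahalanobis needs the inverse covariance matrix of the data (made-up sample here)
data = np.array([[9.0, 22.5], [4.0, 15.0], [7.5, 18.0], [3.0, 10.0]])
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, u, VI))
```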
Measured distances between two vectors can be skewed by features on a large scale, such as income and loan amounts in loan data, typically in the tens or hundreds of thousands; ratio variables then contribute little in comparison. This is why predictors should be standardized before computing distances.
One Hot Encoder
Most statistical and machine learning models require factor (string) variables to be converted to a series of binary dummy variables that convey the same information.
Instead of using a single variable for home occupant status like "owns with a mortgage," we create four binary variables: "owns with a mortgage—Y/N," "owns with no mortgage—Y/N," "rents—Y/N," and "other—Y/N." This single predictor then yields a vector with one 1 and three 0s, a form that statistical and machine learning algorithms can use. The term one hot encoding originates from digital circuit terminology, describing configurations where only one bit is positive (hot).
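A minimal sketch with pandas; the column name and values below are hypothetical, chosen to mirror the example above:

```python
import pandas as pd

# Hypothetical home-occupancy column
df = pd.DataFrame({'home_ownership': ['MORTGAGE', 'OWN', 'RENT', 'OTHER', 'RENT']})

# Each row becomes a vector with a single 1 ("hot") and the rest 0s
dummies = pd.get_dummies(df['home_ownership'], dtype=int)
print(dummies)
```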
Special Note for Regression Models:
- One-hot encoding can cause multicollinearity (a problem where one feature can be predicted from others).
- So we drop one column to avoid redundancy.
- Example: if you know the other 3 values, you can figure out the 4th.
But this is not a problem for KNN, because KNN doesn’t fit a linear equation—so you can keep all dummy columns.
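With pandas, dropping one dummy column is just the `drop_first` flag; a sketch continuing the hypothetical column above:

```python
# For regression-style models: drop one level to avoid perfect multicollinearity
pd.get_dummies(df['home_ownership'], drop_first=True, dtype=int)

# For KNN: keep all the dummy columns
pd.get_dummies(df['home_ownership'], dtype=int)
```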