Day140 - STAT Review: Statistical Machine Learning (2)
Practical Statistics for Data Scientists: K-Nearest Neighbors (2) (Standardization & Choosing K)
Standardization
In measurement, we are often more interested in "how different from the average" rather than "how much." Standardization, also known as normalization, places all variables on similar scales by subtracting the mean and dividing by the standard deviation. This ensures that a variable does not disproportionately influence a model simply due to the scale of its original measurement.
The formula is: $z = \frac{x - \bar{x}}{s}$, where:
- $x$ = the original value
- $\bar{x}$ = the mean of the feature
- $s$ = the standard deviation
This gives us the z-score, which tells us how many standard deviations a value is from the mean.
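As a quick numeric check of the formula, here is a minimal numpy sketch; the values are made up, and `ddof=1` matches the sample standard deviation $s$:

```python
import numpy as np

# Hypothetical feature values (not from the loan data)
x = np.array([1200.0, 1687.0, 2500.0, 900.0, 3100.0])

# z-score: subtract the mean, divide by the sample standard deviation
z = (x - x.mean()) / x.std(ddof=1)
print(z.round(2))
```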
Why is standardization necessary? Features with larger numeric scales can dominate distance-based algorithms such as K-Nearest Neighbors (KNN).
In the loan example:
- `revol_bal` is in dollars (e.g., 1,687)
- `payment_inc_ratio` might be single digits (e.g., 2.3)

So when KNN calculates distance (see the quick numeric sketch below):
- It mostly notices differences in `revol_bal` (the total revolving credit available to the applicant, in dollars)
- It largely ignores the more important small-scale features like `dti` and `revol_util` (the percent of the credit being used)
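Here is a minimal numeric sketch of that effect. The values are hypothetical (only the first row roughly matches the new loan shown below); the point is that on raw scales the squared `revol_bal` difference swamps every other term in the Euclidean distance.

```python
import numpy as np

# Hypothetical loans: [payment_inc_ratio, dti, revol_bal, revol_util]
a = np.array([2.39,  1.0, 1687.0,  9.4])   # the new loan (roughly as shown below)
b = np.array([2.50,  1.2, 5000.0, 10.0])   # similar borrower, larger revolving balance
c = np.array([9.00, 30.0, 1690.0, 80.0])   # very different borrower, similar balance

def euclid(u, v):
    return np.sqrt(((u - v) ** 2).sum())

print(euclid(a, b))   # ~3313: almost entirely the revol_bal difference
print(euclid(a, c))   # ~77:   so KNN on raw data calls c the "closer" neighbor
```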
The new record to be classified:

```r
newloan
---
  payment_inc_ratio dti revol_bal revol_util
1            2.3932   1      1687        9.4
```

The dollar value of `revol_bal` is considerably larger than that of the other variables. The `knn` function returns the indices of the nearest neighbors through the `nn.index` attribute, which lets us display the top five closest rows in `loan_df`.
In R:

```r
library(FNN)  # FNN::knn exposes the neighbor indices via the nn.index attribute

loan_df <- model.matrix(~ -1 + payment_inc_ratio + dti + revol_bal +
                          revol_util, data=loan_data)
newloan <- loan_df[1, , drop=FALSE]
loan_df <- loan_df[-1,]
outcome <- loan_data[-1, 1]
knn_pred <- knn(train=loan_df, test=newloan, cl=outcome, k=5)
loan_df[attr(knn_pred, "nn.index"),]
---
      payment_inc_ratio  dti revol_bal revol_util
35537           1.47212 1.46      1686       10.0
33652           3.38178 6.37      1688        8.4
25864           2.36303 1.39      1691        3.5
42954           1.28160 7.14      1684        3.9
43600           4.12244 8.98      1684        7.2
```
In Python, after the model fit, we can use the `kneighbors` method of scikit-learn's `KNeighborsClassifier` to identify the five closest rows in the training set:

```python
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

predictors = ['payment_inc_ratio', 'dti', 'revol_bal', 'revol_util']
outcome = 'outcome'

newloan = loan_data.loc[0:0, predictors]
X = loan_data.loc[1:, predictors]
y = loan_data.loc[1:, outcome]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
nbrs = knn.kneighbors(newloan)
X.iloc[nbrs[1][0], :]
```
The `revol_bal` values of these neighbors are very close to that of the new record, while the other predictor variables vary widely and have little influence on which neighbors are chosen.
Compare this to KNN applied to standardized data, using the R `scale` function, which computes the z-score for each variable.
In R, standardizing the predictors with `scale` before running `knn` (the `_std` objects hold the standardized data):

```r
loan_df <- model.matrix(~ -1 + payment_inc_ratio + dti + revol_bal +
                          revol_util, data=loan_data)
loan_std <- scale(loan_df)
newloan_std <- loan_std[1, , drop=FALSE]
loan_std <- loan_std[-1,]
loan_df <- loan_df[-1,]  # keep loan_df aligned with loan_std so neighbors can be shown on the original scale
outcome <- loan_data[-1, 1]
knn_pred <- knn(train=loan_std, test=newloan_std, cl=outcome, k=5)
loan_df[attr(knn_pred, "nn.index"),]
---
      payment_inc_ratio  dti revol_bal revol_util
2081            2.61091 1.03      1218        9.7
1439            2.34343 0.51       278        9.9
30216           2.71200 1.34      1075        8.5
28543           2.39760 0.74      2917        7.4
44738           2.34309 1.37       488        7.2
```
In Python, `sklearn.preprocessing.StandardScaler` is fit on the predictors and used to transform the data before training the KNN model:

```python
newloan = loan_data.loc[0:0, predictors]
X = loan_data.loc[1:, predictors]
y = loan_data.loc[1:, outcome]

scaler = preprocessing.StandardScaler()
scaler.fit(X * 1.0)

X_std = scaler.transform(X * 1.0)
newloan_std = scaler.transform(newloan * 1.0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_std, y)
nbrs = knn.kneighbors(newloan_std)
X.iloc[nbrs[1][0], :]
```
The five nearest neighbors are now similar across all variables, giving a much more sensible result. The neighbors are displayed on the original scale, but KNN itself was fit to the standardized data, and the new loan is predicted on that scale as well.
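For completeness, a minimal sketch of turning those neighbors into an actual prediction, assuming `knn` and `newloan_std` from the Python block above are still in scope:

```python
# Majority vote among the 5 standardized nearest neighbors
pred = knn.predict(newloan_std)
# Class proportions among those neighbors (e.g., 3 of 5 -> 0.6)
prob = knn.predict_proba(newloan_std)
print(pred, prob)
```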
💡 Extra Tips:
- Other Ways to Scale (sketched in code below):
  - Median & IQR (interquartile range): more robust to outliers
  - 0–1 Scaling (Min-Max): compresses all values into [0, 1]; formula: $\frac{x - \text{min}}{\text{max} - \text{min}}$
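A minimal sklearn sketch of both alternatives; the tiny array is made up purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single feature containing one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max: (x - min) / (max - min); the outlier defines the max
print(MinMaxScaler().fit_transform(x).ravel())

# Robust: (x - median) / IQR; far less affected by the outlier
print(RobustScaler().fit_transform(x).ravel())
```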
Important Insight: Standardizing implies that every feature is equally important. But if we know one feature (e.g., `payment_inc_ratio`) is more predictive, we can give it more weight in the scaling, as sketched below.
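A minimal sketch of that idea, assuming `X_std`, `newloan_std`, and `y` from the standardized Python block above; the weight of 2 is arbitrary and just for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Column order follows predictors = ['payment_inc_ratio', 'dti', 'revol_bal', 'revol_util'];
# giving payment_inc_ratio twice the weight doubles its influence on the distance.
weights = np.array([2.0, 1.0, 1.0, 1.0])

X_weighted = X_std * weights
newloan_weighted = newloan_std * weights

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_weighted, y)
nbrs = knn.kneighbors(newloan_weighted)
```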
Choosing K
The number $K$ is how many nearby data points are consulted when making each prediction. It strongly affects the model's sensitivity to variations in the training data, and therefore how well predictions align with actual values.
Generally, if $K$ is too low, we may overfit by including noise in the data. Higher values of $K$ provide smoothing that lowers the risk of overfitting to the training data. Conversely, if $K$ is too high, we may oversmooth the data and lose KNN's ability to capture the local structure, which is one of its main advantages.
The best practice is to try multiple $K$ values and use cross-validation to find the best one for your specific dataset. The common choices are $K = 3, 5, 7, …, 15$ (often odd to avoid ties).
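As a sketch of that search, assuming the standardized `X_std` and `y` from the earlier Python blocks and scikit-learn's `GridSearchCV`:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try odd values of K with 5-fold cross-validation
param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_std, y)

print(search.best_params_)   # the K with the best cross-validated accuracy
print(search.best_score_)
```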
Overfitting vs. Oversmoothing:
- Overfitting happens when K is too small: the model memorizes noise in training data → high variance.
- Oversmoothing happens when K is too large: the model averages too much and misses patterns → high bias.
Bias-Variance Trade-off
The tension between oversmoothing and overfitting is an instance of the bias-variance trade-off, a ubiquitous problem in statistical model fitting.
| Concept  | What It Means |
|----------|---------------|
| Bias     | Error from wrong assumptions. High bias = underfitting. |
| Variance | Error from sensitivity to small changes in the training data. High variance = overfitting. |
In KNN, this usually plays out as follows:
- Small K → low bias, high variance
- Large K → high bias, low variance
When a flexible model is overfit, the variance increases. You can reduce this by using a simpler model, but the bias may increase due to the loss of flexibility in modeling the real underlying situation. A general approach to handling this trade-off is through cross-validation.
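A minimal sketch of how cross-validation exposes the trade-off, again assuming `X_std` and `y` from the earlier blocks; the chosen $K$ values are arbitrary. Very small $K$ tends to score near-perfectly on the training data but worse under cross-validation (high variance), while very large $K$ scores similarly on both but lower overall (high bias).

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 5, 15, 51, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_std, y)
    train_acc = knn.score(X_std, y)                      # accuracy on the training data
    cv_acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_std, y, cv=5).mean()      # 5-fold cross-validated accuracy
    print(f'K={k:>3}  train={train_acc:.3f}  cv={cv_acc:.3f}')
```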