Day140 - STAT Review: Statistical Machine Learning (2)
Practical Statistics for Data Scientists: K-Nearest Neighbors (2) (Standardization & Choosing K)
Standardization
In measurement, we are often more interested in "how different from the average" rather than "how much." Standardization, also known as normalization, places all variables on similar scales by subtracting the mean and dividing by the standard deviation. This ensures that a variable does not disproportionately influence a model simply due to the scale of its original measurement.
The formula is: $z = \frac{x - \bar{x}}{s}$, where:
- $x$ = the original value
- $\bar{x}$ = the mean of the feature
- $s$ = the standard deviation
This gives us the z-score, which tells us how many standard deviations a value is from the mean.
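As a quick numeric check of the formula, here is a minimal numpy sketch; the values are made up, and `ddof=1` matches the sample standard deviation $s$:

```python
import numpy as np

# Hypothetical feature values (not from the loan data)
x = np.array([1200.0, 1687.0, 2500.0, 900.0, 3100.0])

# z-score: subtract the mean, divide by the sample standard deviation
z = (x - x.mean()) / x.std(ddof=1)
print(z.round(2))
```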
Why is standardization necessary? Features with larger numeric scales can dominate distance-based algorithms such as K-Nearest Neighbors (KNN).
In the loan example:
- `revol_bal` is in dollars (e.g., 1,687)
- `payment_inc_ratio` might be single digits (e.g., 2.3)

So when KNN calculates distance (see the quick numeric sketch below):
- It mostly notices differences in `revol_bal` (the total revolving credit available to the applicant, in dollars)
- It largely ignores the more important small-scale features like `dti` and `revol_util` (the percent of the credit being used)
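Here is a minimal numeric sketch of that effect. The values are hypothetical (only the first row roughly matches the new loan shown below); the point is that on raw scales the squared `revol_bal` difference swamps every other term in the Euclidean distance.

```python
import numpy as np

# Hypothetical loans: [payment_inc_ratio, dti, revol_bal, revol_util]
a = np.array([2.39,  1.0, 1687.0,  9.4])   # the new loan (roughly as shown below)
b = np.array([2.50,  1.2, 5000.0, 10.0])   # similar borrower, larger revolving balance
c = np.array([9.00, 30.0, 1690.0, 80.0])   # very different borrower, similar balance

def euclid(u, v):
    return np.sqrt(((u - v) ** 2).sum())

print(euclid(a, b))   # ~3313: almost entirely the revol_bal difference
print(euclid(a, c))   # ~77:   so KNN on raw data calls c the "closer" neighbor
```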
The new record to be classified:

```r
newloan
---
  payment_inc_ratio dti revol_bal revol_util
1            2.3932   1      1687        9.4
```

The dollar value of `revol_bal` is considerably larger than that of the other variables. The `knn` function returns the indices of the nearest neighbors through the `nn.index` attribute, which lets us display the top five closest rows in `loan_df`.
In R:

```r
library(FNN)  # FNN::knn exposes the neighbor indices via the nn.index attribute

loan_df <- model.matrix(~ -1 + payment_inc_ratio + dti + revol_bal +
                          revol_util, data=loan_data)
newloan <- loan_df[1, , drop=FALSE]
loan_df <- loan_df[-1,]
outcome <- loan_data[-1, 1]
knn_pred <- knn(train=loan_df, test=newloan, cl=outcome, k=5)
loan_df[attr(knn_pred, "nn.index"),]
---
      payment_inc_ratio  dti revol_bal revol_util
35537           1.47212 1.46      1686       10.0
33652           3.38178 6.37      1688        8.4
25864           2.36303 1.39      1691        3.5
42954           1.28160 7.14      1684        3.9
43600           4.12244 8.98      1684        7.2
```
In Python, after the model fit, we can use the `kneighbors` method of scikit-learn's `KNeighborsClassifier` to identify the five closest rows in the training set:

```python
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

predictors = ['payment_inc_ratio', 'dti', 'revol_bal', 'revol_util']
outcome = 'outcome'

newloan = loan_data.loc[0:0, predictors]
X = loan_data.loc[1:, predictors]
y = loan_data.loc[1:, outcome]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
nbrs = knn.kneighbors(newloan)
X.iloc[nbrs[1][0], :]
```
The `revol_bal` values of these neighbors are very close to that of the new record, while the other predictor variables vary widely and have little influence on which neighbors are chosen.
Compare this to KNN applied to standardized data, using the R `scale` function, which computes the z-score for each variable.
In R, standardizing the predictors with `scale` before running `knn` (the `_std` objects hold the standardized data):

```r
loan_df <- model.matrix(~ -1 + payment_inc_ratio + dti + revol_bal +
                          revol_util, data=loan_data)
loan_std <- scale(loan_df)
newloan_std <- loan_std[1, , drop=FALSE]
loan_std <- loan_std[-1,]
loan_df <- loan_df[-1,]  # keep loan_df aligned with loan_std so neighbors can be shown on the original scale
outcome <- loan_data[-1, 1]
knn_pred <- knn(train=loan_std, test=newloan_std, cl=outcome, k=5)
loan_df[attr(knn_pred, "nn.index"),]
---
      payment_inc_ratio  dti revol_bal revol_util
2081            2.61091 1.03      1218        9.7
1439            2.34343 0.51       278        9.9
30216           2.71200 1.34      1075        8.5
28543           2.39760 0.74      2917        7.4
44738           2.34309 1.37       488        7.2
```
In Python, `sklearn.preprocessing.StandardScaler` is fit on the predictors and used to transform the data before training the KNN model:

```python
newloan = loan_data.loc[0:0, predictors]
X = loan_data.loc[1:, predictors]
y = loan_data.loc[1:, outcome]

scaler = preprocessing.StandardScaler()
scaler.fit(X * 1.0)

X_std = scaler.transform(X * 1.0)
newloan_std = scaler.transform(newloan * 1.0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_std, y)
nbrs = knn.kneighbors(newloan_std)
X.iloc[nbrs[1][0], :]
```
The five nearest neighbors are now similar across all variables, giving a much more sensible result. The neighbors are displayed on the original scale, but KNN itself was fit to the standardized data, and the new loan is predicted on that scale as well.
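For completeness, a minimal sketch of turning those neighbors into an actual prediction, assuming `knn` and `newloan_std` from the Python block above are still in scope:

```python
# Majority vote among the 5 standardized nearest neighbors
pred = knn.predict(newloan_std)
# Class proportions among those neighbors (e.g., 3 of 5 -> 0.6)
prob = knn.predict_proba(newloan_std)
print(pred, prob)
```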
💡 Extra Tips:
- Other Ways to Scale (sketched in code below):
  - Median & IQR (interquartile range): more robust to outliers
  - 0–1 Scaling (Min-Max): compresses all values into [0, 1]; formula: $\frac{x - \text{min}}{\text{max} - \text{min}}$
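A minimal sklearn sketch of both alternatives; the tiny array is made up purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single feature containing one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max: (x - min) / (max - min); the outlier defines the max
print(MinMaxScaler().fit_transform(x).ravel())

# Robust: (x - median) / IQR; far less affected by the outlier
print(RobustScaler().fit_transform(x).ravel())
```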
Important Insight: Standardizing implies that every feature is equally important. But if we know one feature (e.g., `payment_inc_ratio`) is more predictive, we can give it more weight in the scaling, as sketched below.
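A minimal sketch of that idea, assuming `X_std`, `newloan_std`, and `y` from the standardized Python block above; the weight of 2 is arbitrary and just for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Column order follows predictors = ['payment_inc_ratio', 'dti', 'revol_bal', 'revol_util'];
# giving payment_inc_ratio twice the weight doubles its influence on the distance.
weights = np.array([2.0, 1.0, 1.0, 1.0])

X_weighted = X_std * weights
newloan_weighted = newloan_std * weights

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_weighted, y)
nbrs = knn.kneighbors(newloan_weighted)
```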
Choosing K
The number $K$ is how many nearby data points are consulted when making each prediction. It strongly affects the model's sensitivity to variations in the training data, and therefore how well predictions align with actual values.
Generally, if $K$ is too low, we may overfit by including noise in the data. Higher values of $K$ provide smoothing that lowers the risk of overfitting to the training data. Conversely, if $K$ is too high, we may oversmooth the data and lose KNN's ability to capture the local structure, which is one of its main advantages.
The best practice is to try multiple $K$ values and use cross-validation to find the best one for your specific dataset. The common choices are $K = 3, 5, 7, …, 15$ (often odd to avoid ties).
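As a sketch of that search, assuming the standardized `X_std` and `y` from the earlier Python blocks and scikit-learn's `GridSearchCV`:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try odd values of K with 5-fold cross-validation
param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_std, y)

print(search.best_params_)   # the K with the best cross-validated accuracy
print(search.best_score_)
```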
Overfitting vs. Oversmoothing:
- Overfitting happens when K is too small: the model memorizes noise in training data → high variance.
- Oversmoothing happens when K is too large: the model averages too much and misses patterns → high bias.
Bias-Variance Trade-off
The tension between oversmoothing and overfitting is an instance of the bias-variance trade-off, a ubiquitous problem in statistical model fitting.
| Concept  | What It Means |
|----------|---------------|
| Bias     | Error from wrong assumptions. High bias = underfitting. |
| Variance | Error from sensitivity to small changes in the training data. High variance = overfitting. |
In KNN, this usually plays out as follows:
- Small K → low bias, high variance
- Large K → high bias, low variance
When a flexible model is overfit, the variance increases. You can reduce this by using a simpler model, but the bias may increase due to the loss of flexibility in modeling the real underlying situation. A general approach to handling this trade-off is through cross-validation.
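A minimal sketch of how cross-validation exposes the trade-off, again assuming `X_std` and `y` from the earlier blocks; the chosen $K$ values are arbitrary. Very small $K$ tends to score near-perfectly on the training data but worse under cross-validation (high variance), while very large $K$ scores similarly on both but lower overall (high bias).

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 5, 15, 51, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_std, y)
    train_acc = knn.score(X_std, y)                      # accuracy on the training data
    cv_acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_std, y, cv=5).mean()      # 5-fold cross-validated accuracy
    print(f'K={k:>3}  train={train_acc:.3f}  cv={cv_acc:.3f}')
```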