Day141 - STAT Review: Statistical Machine Learning (3)
Practical Statistics for Data Scientists: K-Nearest Neighbors (3) (KNN as a Feature Engine) & Tree Models (1) (Key Concepts)
KNN as a Feature Engine
KNN is not always the best-performing final model, especially compared to more advanced methods such as random forests or gradient boosting. It is, however, valuable as a feature engineering tool.
The idea is to use KNN to generate new features that capture local behavior in the data, and then feed those features into a more robust model.
In practical model fitting, KNN contributes this "local knowledge" in a staged process alongside other classification techniques:
- KNN is run on the data, and a classification (or quasi-probability of a class) is derived for each record.
- That result is added as a new feature to the record, and another classification method is then run on the data. Thus, the original predictor variables are used twice.
You may wonder if using some predictors twice creates multicollinearity. It does not, as the second-stage model incorporates highly local information from a few nearby records, providing additional, non-redundant information.
How It Works – Step-by-Step:
- Run KNN on your dataset (either classification or regression).
- For each record, calculate:
- Its predicted class, or
- The probability (e.g., likelihood of default)
- Add that output as a new feature in the dataset.
- Now, train a more robust model (e.g., logistic regression, decision tree, or XGBoost) on the entire dataset, including the new KNN-based feature.
For example, consider the King County housing data we reviewed earlier. When pricing a home for sale, a realtor determines the price based on recently sold comparable homes, commonly called “comps.”
In essence, realtors utilize a manual approach similar to KNN; they assess the sale prices of comparable homes to estimate a property’s selling price. By using KNN on recent sales data, we can create a new feature for a statistical model that mimics real estate professionals.
The predicted value is the sales price, and the existing predictor variables may include location, total square footage, type of structure, lot size, and the number of bedrooms and bathrooms. The new predictor variable (feature) we introduce through KNN is the KNN predictor for each record, similar to the realtors’ comps. Since we are predicting a numerical value, we utilize the average of the K-Nearest Neighbors instead of a majority vote, a method known as KNN regression.
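Below is a minimal sketch of how such a comps-style feature could be built with scikit-learn's `KNeighborsRegressor`. The DataFrame and column names here are illustrative stand-ins, not the actual King County schema:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the King County data; column names are illustrative.
rng = np.random.default_rng(0)
n = 200
house_data = pd.DataFrame({
    'sqft_living': rng.integers(600, 4000, n),
    'bedrooms': rng.integers(1, 6, n),
    'bathrooms': rng.integers(1, 4, n),
})
house_data['sale_price'] = (
    200 * house_data['sqft_living'] + 10000 * house_data['bedrooms']
    + rng.normal(0, 20000, n)
)

# KNN is distance-based, so standardize the predictors first.
X = StandardScaler().fit_transform(
    house_data[['sqft_living', 'bedrooms', 'bathrooms']])

# KNN regression: predict each home's price as the average price of its
# K nearest neighbors, an automated analogue of a realtor's "comps".
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, house_data['sale_price'])
house_data['comps_price'] = knn.predict(X)
```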
Similarly, for the loan data we can create features representing different aspects of the loan process. For example, the following R code builds a feature representing a borrower's creditworthiness.
In R:

```r
borrow_df <- model.matrix(~ -1 + dti + revol_bal + revol_util + open_acc +
                            delinq_2yrs_zero + pub_rec_zero, data=loan_data)
borrow_knn <- knn(borrow_df, test=borrow_df, cl=loan_data[, 'outcome'],
                  prob=TRUE, k=20)
prob <- attr(borrow_knn, "prob")
borrow_feature <- ifelse(borrow_knn == 'default', prob, 1 - prob)
summary(borrow_feature)
---
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.400   0.500   0.501   0.600   0.950
```
In Python, we use the `predict_proba` method of the trained scikit-learn model to get the probabilities:

```python
from sklearn.neighbors import KNeighborsClassifier

predictors = ['dti', 'revol_bal', 'revol_util', 'open_acc',
              'delinq_2yrs_zero', 'pub_rec_zero']
outcome = 'outcome'
X = loan_data[predictors]
y = loan_data[outcome]

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X, y)
loan_data['borrower_score'] = knn.predict_proba(X)[:, 1]
loan_data['borrower_score'].describe()
```
The result is a feature that predicts the likelihood of a borrower defaulting based on their credit history.
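To complete the staged process described earlier, this KNN-derived score then joins the original predictors in a second-stage model. A minimal sketch, continuing from the Python snippet above and using logistic regression as an (assumed) second-stage learner:

```python
from sklearn.linear_model import LogisticRegression

# Second stage: original predictors plus the KNN-derived feature,
# so the original variables are used twice.
stage2_predictors = predictors + ['borrower_score']
X2 = loan_data[stage2_predictors]
y2 = loan_data[outcome]

# Any stronger learner could be substituted here (tree ensemble, XGBoost, ...).
# In practice the KNN feature should be built out-of-fold to avoid leakage.
stage2_model = LogisticRegression(max_iter=1000)
stage2_model.fit(X2, y2)
```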
Tree Models
Tree models, also known as Classification and Regression Trees (CART), decision trees, or simply trees, are an effective and popular method for machine learning classification (and regression).
- Classification: Predicting categories (e.g., will a customer default? Yes/No)
- Regression: Predicting continuous values (e.g., what price will this house sell for?)
These models, along with their more powerful derivatives like random forests and boosted trees (see “Bagging and the Random Forest” and “Boosting”), form the foundation of the most widely used and robust predictive modeling tools in data science for both regression and classification.
Think of it like:
A series of yes/no questions that narrow down your dataset until a final prediction is made.
For example:

```
Is income > $50,000?
├── Yes → Is age > 30?
│   ├── Yes → Predict: No Default
│   └── No  → Predict: Default
└── No → Predict: Default
```
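A quick illustration of this rule structure, fitting scikit-learn's `DecisionTreeClassifier` on synthetic income/age data (the variables and thresholds are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic income/age data for illustration only.
rng = np.random.default_rng(1)
n = 500
income = rng.uniform(20_000, 120_000, n)
age = rng.uniform(18, 70, n)
# Invented rule: low-income or young borrowers default.
default = ((income < 50_000) | (age < 30)).astype(int)

X = np.column_stack([income, age])
tree = DecisionTreeClassifier(max_depth=2).fit(X, default)

# Print the learned if-then rules, which mirror the diagram above.
print(export_text(tree, feature_names=['income', 'age']))
```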
Key Terms for Trees
- Recursive Partitioning
  - Repeatedly divide and subdivide the data with the goal of making the outcomes in each final subdivision as homogeneous as possible (see the sketch after this list).
  - Start with all your data.
  - At each step, find the best feature and split value that divide the data into two groups that are as pure as possible.
  - Repeat this process recursively on the resulting partitions.
- Split Value
  - A predictor value that divides the records into those where the predictor is less than the split value and those where it is greater.
- Node
  - A point in the decision tree where the data is divided according to a split value.
- Leaf
  - The end of a set of if-then rules, or branches of the tree; the leaf a record lands in determines its classification.
- Loss
  - The number of misclassifications at a stage of the splitting process; the more losses, the greater the impurity.
- Impurity
  - The extent to which a mix of classes is found in a subpartition of the data (the more mixed, the more impure). Synonym: heterogeneity; antonyms: homogeneity, purity.
  - Common impurity metrics:
    - Gini impurity
    - Entropy (from information theory)
    - Mean Squared Error (MSE) for regression trees
- Pruning
  - Trimming a mature tree by cutting back branches to prevent overfitting.
  - Pre-pruning: stop growing early (e.g., if a node has fewer than 5 records).
  - Post-pruning: grow the full tree, then cut back based on validation performance.
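To make the recursive-partitioning and impurity ideas concrete, here is a minimal sketch in plain Python (not the full CART algorithm): it computes Gini impurity and searches for the best single split of one numeric predictor. The toy data are invented for illustration:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Find the split value of x that minimizes weighted Gini impurity."""
    best_value, best_score = None, float('inf')
    for value in np.unique(x)[1:]:          # candidate thresholds
        left, right = y[x < value], y[x >= value]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(y)
        if score < best_score:
            best_value, best_score = value, score
    return best_value, best_score

# Toy example: income as the predictor, default (1) as the outcome.
x = np.array([25, 30, 45, 52, 60, 75, 80, 90], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
print(best_split(x, y))   # splits at income 52, perfectly separating classes
```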
A tree model consists of a set of simple "if-then-else" rules. Unlike linear and logistic regression, trees can capture complex, non-linear patterns in the data. And unlike KNN or naive Bayes, simple tree models express predictor relationships in an easily interpretable form. Below is a comparison table.
Comparison with Other Models
| Model | Strengths | Weaknesses |
|---|---|---|
| Linear/Logistic Regression | Simple, interpretable | Assumes linearity |
| KNN | Flexible, no training step | Not interpretable; sensitive to scaling |
| Naive Bayes | Fast, probabilistic | Assumes independence |
| Decision Trees | Highly interpretable; handles non-linearity well | Prone to overfitting unless pruned |