Day141 - STAT Review: Statistical Machine Learning (3)
Practical Statistics for Data Scientists: K-Nearest Neighbors (3) (KNN as a Feature Engine) & Tree Models (1) (Key Concepts)
KNN as a Feature Engine
KNN is not always the best-performing final model, especially compared to more advanced methods such as random forests or gradient boosting. It is, however, valuable as a feature engineering tool.
The idea is to use KNN to generate new features that capture local behavior in the data, and then feed those features into a more robust model.
In practical model fitting, KNN contributes this "local knowledge" in a staged process alongside other classification techniques:
- KNN is run on the data, and a classification (or quasi-probability of a class) is derived for each record.
- That result is added as a new feature to the record, and another classification method is then run on the data. Thus, the original predictor variables are used twice.
You may wonder if using some predictors twice creates multicollinearity. It does not, as the second-stage model incorporates highly local information from a few nearby records, providing additional, non-redundant information.
How It Works – Step-by-Step:
- Run KNN on your dataset (either classification or regression).
- For each record, calculate:
- Its predicted class, or
- The probability (e.g., likelihood of default)
- Add that output as a new feature in the dataset.
- Now, train a more robust model (e.g., logistic regression, decision tree, or XGBoost) on the entire dataset, including the new KNN-based feature.
For example, consider the King County housing data we reviewed earlier. When pricing a home for sale, a realtor determines the price based on recently sold comparable homes, commonly called “comps.”
In essence, realtors utilize a manual approach similar to KNN; they assess the sale prices of comparable homes to estimate a property’s selling price. By using KNN on recent sales data, we can create a new feature for a statistical model that mimics real estate professionals.
The predicted value is the sales price, and the existing predictor variables may include location, total square footage, type of structure, lot size, and the number of bedrooms and bathrooms. The new predictor variable (feature) we introduce through KNN is the KNN predictor for each record, similar to the realtors’ comps. Since we are predicting a numerical value, we utilize the average of the K-Nearest Neighbors instead of a majority vote, a method known as KNN regression.
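Below is a minimal sketch of how such a comps-style feature could be built with scikit-learn's `KNeighborsRegressor`. The DataFrame and column names here are illustrative stand-ins, not the actual King County schema:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the King County data; column names are illustrative.
rng = np.random.default_rng(0)
n = 200
house_data = pd.DataFrame({
    'sqft_living': rng.integers(600, 4000, n),
    'bedrooms': rng.integers(1, 6, n),
    'bathrooms': rng.integers(1, 4, n),
})
house_data['sale_price'] = (
    200 * house_data['sqft_living'] + 10000 * house_data['bedrooms']
    + rng.normal(0, 20000, n)
)

# KNN is distance-based, so standardize the predictors first.
X = StandardScaler().fit_transform(
    house_data[['sqft_living', 'bedrooms', 'bathrooms']])

# KNN regression: predict each home's price as the average price of its
# K nearest neighbors, an automated analogue of a realtor's "comps".
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, house_data['sale_price'])
house_data['comps_price'] = knn.predict(X)
```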
Similarly, for the loan data we can create features representing different aspects of the loan process. For example, the following R code builds a feature representing a borrower's creditworthiness.
In R:

```r
borrow_df <- model.matrix(~ -1 + dti + revol_bal + revol_util + open_acc +
                            delinq_2yrs_zero + pub_rec_zero, data=loan_data)
borrow_knn <- knn(borrow_df, test=borrow_df, cl=loan_data[, 'outcome'],
                  prob=TRUE, k=20)
prob <- attr(borrow_knn, "prob")
borrow_feature <- ifelse(borrow_knn == 'default', prob, 1 - prob)
summary(borrow_feature)
---
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.400   0.500   0.501   0.600   0.950
```
In Python, we use the `predict_proba` method of the trained scikit-learn model to get the probabilities:

```python
from sklearn.neighbors import KNeighborsClassifier

predictors = ['dti', 'revol_bal', 'revol_util', 'open_acc',
              'delinq_2yrs_zero', 'pub_rec_zero']
outcome = 'outcome'
X = loan_data[predictors]
y = loan_data[outcome]

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X, y)
loan_data['borrower_score'] = knn.predict_proba(X)[:, 1]
loan_data['borrower_score'].describe()
```
The result is a feature that predicts the likelihood of a borrower defaulting based on their credit history.
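To complete the staged process described earlier, this KNN-derived score then joins the original predictors in a second-stage model. A minimal sketch, continuing from the Python snippet above and using logistic regression as an (assumed) second-stage learner:

```python
from sklearn.linear_model import LogisticRegression

# Second stage: original predictors plus the KNN-derived feature,
# so the original variables are used twice.
stage2_predictors = predictors + ['borrower_score']
X2 = loan_data[stage2_predictors]
y2 = loan_data[outcome]

# Any stronger learner could be substituted here (tree ensemble, XGBoost, ...).
# In practice the KNN feature should be built out-of-fold to avoid leakage.
stage2_model = LogisticRegression(max_iter=1000)
stage2_model.fit(X2, y2)
```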
Tree Models
Tree models, also known as Classification and Regression Trees (CART), decision trees, or simply trees, are an effective and popular method for machine learning classification (and regression).
- Classification: Predicting categories (e.g., will a customer default? Yes/No)
- Regression: Predicting continuous values (e.g., what price will this house sell for?)
These models, along with their more powerful derivatives like random forests and boosted trees (see “Bagging and the Random Forest” and “Boosting”), form the foundation of the most widely used and robust predictive modeling tools in data science for both regression and classification.
Think of it like:
A series of yes/no questions that narrow down your dataset until a final prediction is made.
For example:

```
Is income > $50,000?
├── Yes → Is age > 30?
│   ├── Yes → Predict: No Default
│   └── No  → Predict: Default
└── No → Predict: Default
```
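A quick illustration of this rule structure, fitting scikit-learn's `DecisionTreeClassifier` on synthetic income/age data (the variables and thresholds are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic income/age data for illustration only.
rng = np.random.default_rng(1)
n = 500
income = rng.uniform(20_000, 120_000, n)
age = rng.uniform(18, 70, n)
# Invented rule: low-income or young borrowers default.
default = ((income < 50_000) | (age < 30)).astype(int)

X = np.column_stack([income, age])
tree = DecisionTreeClassifier(max_depth=2).fit(X, default)

# Print the learned if-then rules, which mirror the diagram above.
print(export_text(tree, feature_names=['income', 'age']))
```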
Key Terms for Trees
- Recursive Partitioning
  - Repeatedly divide and subdivide the data with the goal of making the outcomes in each final subdivision as homogeneous as possible (see the sketch after this list).
  - Start with all your data.
  - At each step, find the best feature and split value that divide the data into two groups that are as pure as possible.
  - Repeat this process recursively on the resulting partitions.
- Split Value
  - A predictor value that divides the records into those where the predictor is less than the split value and those where it is greater.
- Node
  - A point in the decision tree where the data is divided according to a split value.
- Leaf
  - The end of a set of if-then rules, or branches of the tree; the leaf a record lands in determines its classification.
- Loss
  - The number of misclassifications at a stage of the splitting process; the more losses, the greater the impurity.
- Impurity
  - The extent to which a mix of classes is found in a subpartition of the data (the more mixed, the more impure). Synonym: heterogeneity; antonyms: homogeneity, purity.
  - Common impurity metrics:
    - Gini impurity
    - Entropy (from information theory)
    - Mean Squared Error (MSE) for regression trees
- Pruning
  - Trimming a mature tree by cutting back branches to prevent overfitting.
  - Pre-pruning: stop growing early (e.g., if a node has fewer than 5 records).
  - Post-pruning: grow the full tree, then cut back based on validation performance.
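To make the recursive-partitioning and impurity ideas concrete, here is a minimal sketch in plain Python (not the full CART algorithm): it computes Gini impurity and searches for the best single split of one numeric predictor. The toy data are invented for illustration:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Find the split value of x that minimizes weighted Gini impurity."""
    best_value, best_score = None, float('inf')
    for value in np.unique(x)[1:]:          # candidate thresholds
        left, right = y[x < value], y[x >= value]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(y)
        if score < best_score:
            best_value, best_score = value, score
    return best_value, best_score

# Toy example: income as the predictor, default (1) as the outcome.
x = np.array([25, 30, 45, 52, 60, 75, 80, 90], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
print(best_split(x, y))   # splits at income 52, perfectly separating classes
```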
A tree model consists of a set of simple "if-then-else" rules. Unlike linear and logistic regression, trees can capture complex, non-linear patterns in the data. And unlike KNN or naive Bayes, simple tree models express predictor relationships in an easily interpretable form. Below is a comparison table.
Comparison with Other Models
| Model | Strengths | Weaknesses |
|---|---|---|
| Linear/Logistic Regression | Simple, interpretable | Assumes linearity |
| KNN | Flexible, no training step | Not interpretable; sensitive to scaling |
| Naive Bayes | Fast, probabilistic | Assumes independence |
| Decision Trees | Highly interpretable; handles non-linearity well | Prone to overfitting unless pruned |