Day44 ML Review - K-Nearest Neighbors (1)
Basic Concepts, How It Works, and Parametric & Non-Parametric Models
The K-Nearest Neighbor (KNN) classifier is especially interesting because it fundamentally differs from the learning algorithms we’ve discussed. KNN is a typical example of a “lazy learner.” It is called “lazy” not because of its apparent simplicity but because it doesn’t learn a discriminative function from the training data; instead, it memorizes the training dataset.
Key Concepts
- Instance-Based Learning: KNN is an instance-based learning algorithm that memorizes the training dataset and makes predictions by comparing new data points directly to the stored instances.
- Non-Parametric: KNN does not make any assumptions about the underlying data distribution. This makes it flexible but potentially sensitive to irrelevant features and noise in the data.
- Lazy Learning: KNN is called a lazy learner because it doesn’t learn an explicit model during the training phase. Lazy learning defers the generalization or model-building step until a query is made: when we “train” a KNN model, it doesn’t create a generalized model from the data, it simply stores the training set, and all of the real computation happens in the prediction phase (the short timing sketch below illustrates this).
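To see the “lazy” behavior directly, here is a small timing sketch (my own illustration on synthetic data, not from the original post): with a brute-force KNN, fit() returns almost immediately because it only stores the data, while predict() pays the full cost of the distance computations.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 20,000 points with 20 features, two classes
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# algorithm='brute' forces exhaustive distance computation at prediction time
knn = KNeighborsClassifier(n_neighbors=5, algorithm='brute')

start = time.perf_counter()
knn.fit(X, y)              # "training" is essentially storing the dataset
fit_time = time.perf_counter() - start

start = time.perf_counter()
knn.predict(X[:1_000])     # distances to all 20,000 stored points are computed here
predict_time = time.perf_counter() - start

print(f"fit: {fit_time:.4f} s, predict: {predict_time:.4f} s")
```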
How KNN Works
- Choosing K: The algorithm first selects the number of neighbors $K$ (a positive integer). The value of $K$ is a critical hyperparameter that controls the model’s balance between bias and variance.
- Distance Metric: KNN typically uses a distance metric (e.g., Euclidean distance) to measure the similarity between instances. The distance between two points $A = (x_1, y_1)$ and $B = (x_2, y_2)$ in a 2D space using Euclidean distance is calculated as:
$\text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Other metrics, such as Manhattan distance, Minkowski distance, or Hamming distance, can also be used depending on the problem.
- Finding Neighbors: For a given test instance, the algorithm computes the distance to all training data points and identifies the $K$ closest instances (neighbors).
- Prediction (a minimal from-scratch sketch of these steps follows this list):
- Classification: The most common class among the $K$ nearest neighbors is assigned as the predicted class for the test instance. This is typically done using a majority vote.
- Regression: The predicted value is the average (or sometimes the weighted average) of the values of the $K$ nearest neighbors.
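To make the four steps above concrete, here is a minimal from-scratch sketch (my own illustration using NumPy and made-up data, not code from the original): it computes Euclidean distances, picks the $K$ closest training points, and predicts by majority vote for classification or by averaging for regression.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5, task="classification"):
    """Predict the label (or value) of a single query point with plain KNN."""
    # Step 2: Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))

    # Step 3: indices of the K closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]

    # Step 4: majority vote for classification, average for regression
    if task == "classification":
        return Counter(neighbor_labels).most_common(1)[0][0]
    return neighbor_labels.mean()

# Tiny usage example with made-up 2D points and two classes
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0], [7.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 3.0]), k=3))  # -> 0
```

In practice a library implementation such as scikit-learn’s KNeighborsClassifier (shown later in this post) is preferable, since it adds efficient neighbor-search structures like KD-trees and ball trees.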
Advantages of KNN
- Simplicity: KNN is easy to understand and implement.
- No Training Phase: Since KNN is a lazy learner, training amounts to little more than storing the data, making it quick to set up.
- Flexibility: It can be used for both classification and regression tasks.
Disadvantages of KNN
- Computational Complexity: KNN can be computationally expensive, especially with large datasets, since it requires calculating the distance between the test point and all training points for each prediction.
- Memory Intensive: KNN stores all the training data, which can be memory-intensive for large datasets.
- Sensitivity to Irrelevant Features: KNN can be significantly affected by irrelevant or redundant features, which can distort distance measurements.
- Choice of $K$: The choice of $K$ is crucial. A small $K$ can lead to overfitting (high variance), while a large $K$ can lead to underfitting (high bias); cross-validation is a common way to tune it, as in the sketch below.
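Because of this bias-variance trade-off, $K$ is usually tuned empirically. The sketch below is my own example (assuming scikit-learn and the Iris dataset, which the original does not mention): it scores a range of candidate values with 5-fold cross-validation and keeps the best one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Try a range of K values: small K -> high variance, large K -> high bias
scores = {}
for k in range(1, 22, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best K = {best_k}, mean CV accuracy = {scores[best_k]:.3f}")
```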
Example of KNN in Scikit-learn
Here’s how we can implement KNN for a classification task using Scikit-learn:
```python
from sklearn.neighbors import KNeighborsClassifier

# 5 nearest neighbors, Minkowski metric with p=2 (i.e., Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')

# Train the model (fit the model) on the standardized training features
knn.fit(X_train_std, y_train)

# Predict on the test set (standardized with the same scaler as the training data)
y_pred = knn.predict(X_test_std)
```
In the context of K-Nearest Neighbors (KNN), setting p=2 selects the Minkowski distance with parameter $p = 2$, which corresponds exactly to the Euclidean distance; $p = 1$ would give the Manhattan distance instead.
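As a quick sanity check (my own snippet using SciPy’s distance functions, not part of the original), Minkowski distance with $p = 2$ matches the Euclidean distance, and $p = 1$ matches the Manhattan distance:

```python
from scipy.spatial.distance import cityblock, euclidean, minkowski

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 8.0]

print(euclidean(a, b))        # 7.0710... (sqrt(3^2 + 4^2 + 5^2))
print(minkowski(a, b, p=2))   # identical to the Euclidean distance
print(cityblock(a, b))        # 12.0 (Manhattan distance: 3 + 4 + 5)
print(minkowski(a, b, p=1))   # identical to the Manhattan distance
```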
Side note
In machine learning, algorithms can be broadly classified into parametric and non-parametric models based on their approaches to learning and prediction.
| Feature | Parametric Models | Non-Parametric Models |
|---|---|---|
| Number of Parameters | Fixed, determined before training | Flexible, can grow with the size of the data |
| Assumptions | Strong assumptions about the data distribution | Few or no assumptions about the data distribution |
| Flexibility | Less flexible, may underfit complex data | More flexible, can capture complex patterns |
| Data Requirement | Requires less data | Requires more data to avoid overfitting |
| Computation | Generally faster and more efficient | More computationally intensive |
| Risk of Overfitting | Lower, but may underfit | Higher, but can be controlled with regularization techniques |
Choosing Between Parametric and Non-Parametric Models
- Parametric Models are typically chosen when you have substantial prior knowledge about the form of the data distribution or when computational efficiency and simplicity are essential.
- Non-Parametric Models are preferred when flexibility is crucial and you want the model to learn complex patterns from the data without imposing rigid assumptions (the sketch below makes this contrast concrete).
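To make the contrast concrete, here is a small sketch (my own example with scikit-learn and the Iris dataset, not from the original): logistic regression, a parametric model, compresses the training set into a fixed number of coefficients, while KNN, a non-parametric model, effectively keeps the whole training set as its “model”.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Parametric: the number of learned parameters is fixed by the feature/class count,
# not by the number of training samples
logreg = LogisticRegression(max_iter=1000).fit(X, y)
n_params = logreg.coef_.size + logreg.intercept_.size
print(f"Logistic regression stores {n_params} parameters")  # 15 (3 classes x 4 features + 3 intercepts)

# Non-parametric: the stored "parameters" are the training samples themselves
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(f"KNN stores all {X.shape[0]} training samples")       # 150
```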