Day44 ML Review - K-Nearest Neighbors (1)
Basic Concepts, How It Works, and Parametric & Non-Parametric Models
The K-Nearest Neighbor (KNN) classifier is especially interesting because it fundamentally differs from the learning algorithms we’ve discussed. KNN is a typical example of a “lazy learner.” It is called “lazy” not because of its apparent simplicity but because it doesn’t learn a discriminative function from the training data; instead, it memorizes the training dataset.
Key Concepts
- Instance-Based Learning: KNN is an instance-based learning algorithm that memorizes the training dataset and makes predictions by comparing new data points directly to the stored instances.
- Non-Parametric: KNN does not make any assumptions about the underlying data distribution. This makes it flexible but potentially sensitive to irrelevant features and noise in the data.
- Lazy Learning: KNN is called a lazy learner because it doesn’t learn an explicit model during the training phase. Lazy Learning is a machine learning algorithm that delays the generalization or model-building process until a query is made. When we train a KNN model, it doesn’t create a generalized model based on the data. Instead, it simply stores the training data. All the computation happens in the prediction phase.
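To make the "lazy" behavior concrete, here is a minimal sketch (my own illustration; the Iris dataset and the timing with time.perf_counter are assumptions, not part of the original text). It shows that fit() essentially just stores the training data, while the distance computations are deferred to prediction time.

```python
import time
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load a small example dataset (150 samples, 4 features).
X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)

# "Training" a lazy learner: fit() stores/indexes the training data
# rather than estimating a fixed set of model parameters.
t0 = time.perf_counter()
knn.fit(X, y)
print(f"fit:     {time.perf_counter() - t0:.6f} s")

# The distance computations happen here, at prediction time,
# so on larger datasets predict() is where the cost shows up.
t0 = time.perf_counter()
knn.predict(X)
print(f"predict: {time.perf_counter() - t0:.6f} s")
```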
How KNN Works
- Choosing K: The algorithm first selects the number of neighbors $K$ (a positive integer). The value of $K$ is a critical hyperparameter that controls the model’s balance between bias and variance.
- Distance Metric: KNN typically uses a distance metric (e.g., Euclidean distance) to measure the similarity between instances. The distance between two points $A = (x_1, y_1)$ and $B = (x_2, y_2)$ in a 2D space using Euclidean distance is calculated as:
$\text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Other metrics, such as Manhattan distance, Minkowski distance, or Hamming distance, can also be used depending on the problem.
- Finding Neighbors: For a given test instance, the algorithm computes the distance to all training data points and identifies the $K$ closest instances (neighbors).
- Prediction (see the from-scratch sketch after this list):
  - Classification: The most common class among the $K$ nearest neighbors is assigned as the predicted class for the test instance. This is typically done using a majority vote.
  - Regression: The predicted value is the average (or sometimes the weighted average) of the values of the $K$ nearest neighbors.
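To tie these steps together, here is a minimal from-scratch sketch (using NumPy; the function name, toy data, and expected outputs are illustrative choices of mine, not from the original post). It computes Euclidean distances, selects the $K$ closest training points, and predicts by majority vote for classification or by averaging for regression.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    # Step 2: Euclidean distance from the query point to every training point.
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: indices of the k closest training instances.
    nearest_idx = np.argsort(distances)[:k]
    nearest_labels = y_train[nearest_idx]
    # Step 4: majority vote for classification, mean for regression.
    if task == "classification":
        return Counter(nearest_labels).most_common(1)[0][0]
    return nearest_labels.mean()

# Toy example: two classes in 2D.
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0], [7.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 3.0]), k=3))  # expected: 0
print(knn_predict(X_train, y_train, np.array([6.5, 7.5]), k=3))  # expected: 1
```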
Advantages of KNN
- Simplicity: KNN is easy to understand and implement.
- No Explicit Training Phase: Since KNN is a lazy learner, “fitting” amounts to storing the training data, making it quick to set up.
- Flexibility: KNN can be used for both classification and regression tasks.
Disadvantages of KNN
- Computational Complexity: KNN can be computationally expensive, especially with large datasets, since it requires calculating the distance between the test point and all training points for each prediction.
- Memory Intensive: KNN stores all the training data, which can be memory-intensive for large datasets.
- Sensitivity to Irrelevant Features: KNN can be significantly affected by irrelevant or redundant features, which can distort distance measurements.
- Choice of $K$: The choice of $K$ is crucial. A small $K$ can lead to overfitting (high variance), while a large $K$ can lead to underfitting (high bias).
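Because the choice of $K$ drives the bias–variance trade-off, a common remedy is to pick it by cross-validation. The sketch below is my own illustration (the Iris data, the standardization step, the candidate range of $K$, and 5-fold cross-validation are assumptions, not from the post); it scores several values of $K$ and keeps the best one. Standardizing the features first also mitigates KNN’s sensitivity to differing feature scales.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Evaluate a range of K values with 5-fold cross-validation.
scores = {}
for k in range(1, 26, 2):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best K = {best_k}, CV accuracy = {scores[best_k]:.3f}")
```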
Example of KNN in Scikit-learn
Here’s how we can implement KNN for a classification task using Scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
# Initialize the classifier with 5 neighbors and Euclidean distance (Minkowski with p=2)
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
# Train the model (fit the model)
knn.fit(X_train_std, y_train)
# Predict on the test set
y_pred = knn.predict(X_test_std)
In the context of K-Nearest Neighbors (KNN), setting p=2 means choosing the Minkowski distance with parameter $p = 2$, which corresponds exactly to the Euclidean distance.
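The variables X_train_std, y_train, and X_test_std above are assumed to come from an earlier preprocessing step. For completeness, here is a self-contained version of the same idea; the Iris data, the train/test split, and the standardization are my own illustrative choices. Recall that the Minkowski distance is $\left(\sum_i |a_i - b_i|^p\right)^{1/p}$, so $p = 2$ gives the Euclidean distance and $p = 1$ gives the Manhattan distance.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data and split into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Standardize features (fit the scaler on the training set only).
sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Minkowski metric with p=2 is the Euclidean distance (p=1 would be Manhattan).
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)

y_pred = knn.predict(X_test_std)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
```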
Side note
In machine learning, algorithms can be broadly classified into parametric and nonparametric models based on their approaches to learning and prediction.
| Feature | Parametric Models | Non-Parametric Models |
| --- | --- | --- |
| Number of Parameters | Fixed, determined before training | Flexible, can grow with the size of the data |
| Assumptions | Strong assumptions about the data distribution | Few or no assumptions about the data distribution |
| Flexibility | Less flexible, may underfit complex data | More flexible, can capture complex patterns |
| Data Requirement | Requires less data | Requires more data to avoid overfitting |
| Computation | Generally faster and more efficient | More computationally intensive |
| Risk of Overfitting | Lower, but may underfit | Higher, but can be controlled with regularization techniques |
Choosing Between Parametric and Non-Parametric Models
- Parametric Models are typically chosen when you have substantial prior knowledge about the form of the data distribution or when computational efficiency and simplicity are essential.
- Non-Parametric Models are preferred when flexibility is crucial and you want the model to learn complex patterns from the data without imposing rigid assumptions.
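As a rough illustration of the distinction (the specific models and dataset here are my own choices, not from the text): a parametric model such as logistic regression compresses the training data into a fixed number of coefficients, whereas a fitted KNN model keeps every training instance around.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Parametric: the learned representation is a fixed-size set of coefficients,
# regardless of how many training samples are used.
logreg = LogisticRegression(max_iter=1000).fit(X, y)
print("logistic regression parameters:", logreg.coef_.size + logreg.intercept_.size)

# Non-parametric: the "model" grows with the data; KNN stores all training samples.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("KNN stored training samples:", knn.n_samples_fit_)
```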