Day45 ML Review - K-Nearest Neighbors (2)
Distance Metrics: Euclidean, Manhattan, Minkowski & Chebyshev Distance, and Cosine Similarity
Distance Metrics
From the previous posting, here’s how we can implement KNN for a classification task using Scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
# Minkowski distance with p=2 corresponds to the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
# Train the model (fit the model)
knn.fit(X_train_std, y_train)
# Predict on the (standardized) test set
y_pred = knn.predict(X_test_std)
In the context of K-Nearest Neighbors (KNN), setting p=2 refers to choosing the Minkowski distance with parameter $p=2$, which corresponds to the Euclidean distance.
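To see this equivalence concretely, here is a minimal sketch using SciPy's distance helpers (assuming SciPy is available; it is not used in the snippet above):
from scipy.spatial.distance import euclidean, minkowski

a, b = [3, 4], [7, 1]
# Minkowski distance with p=2 reduces to the plain Euclidean distance
print(minkowski(a, b, p=2))  # 5.0
print(euclidean(a, b))       # 5.0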
Let’s explore various distance metrics such as Minkowski, Euclidean, Manhattan, and Chebyshev distances. These metrics calculate the “distance” or “similarity” between two points in a space. The choice of distance metric can have a significant impact on the performance of machine learning algorithms like K-Nearest Neighbors (KNN).
1. Euclidean Distance
Euclidean distance is the most commonly used distance metric. It represents the straight-line distance between two points in Euclidean space.
- Formula: For two points $A = (x_1, x_2, \dots, x_n)$ and $B = (y_1, y_2, \dots, y_n)$ in an $n$-dimensional space, the Euclidean distance $D$ is given by:
  $D(A,B) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$
- Example (see the code sketch after this list): For points $A=(3,4)$ and $B=(7,1)$:
  $D(A,B) = \sqrt{(7-3)^2+(1-4)^2} = \sqrt{16+9} = \sqrt{25} = 5$
- Applications:
- K-Nearest Neighbors (KNN)
- Clustering (e.g., K-Means)
- Principal Component Analysis (PCA)
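To make the formula concrete, here is a minimal NumPy sketch (assuming NumPy is available) that reproduces the worked example above:
import numpy as np

A = np.array([3, 4])
B = np.array([7, 1])

# Square root of the sum of squared coordinate differences (L2 norm of A - B)
euclidean_dist = np.sqrt(np.sum((A - B) ** 2))
print(euclidean_dist)          # 5.0
print(np.linalg.norm(A - B))   # equivalent: 5.0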
2. Manhattan Distance
Manhattan distance (also known as the L1 norm or taxicab distance) is the sum of the absolute differences of the coordinates. It measures how far apart two points are if we can only move along grid lines (like streets in a city).
- Formula: For two points $A=(x_1, x_2, \dots, x_n)$ and $B=(y_1, y_2, \dots, y_n)$ in an $n$-dimensional space, the Manhattan distance $D$ is given by:
  $D(A, B) = \sum_{i=1}^{n} |x_i - y_i|$
- Example (see the code sketch after this list): For points $A=(3,4)$ and $B=(7,1)$:
  $D(A, B) = |7-3| + |1-4| = 4 + 3 = 7$
- Applications:
- Urban planning (movement along grid-like paths)
- Robustness to outliers in high-dimensional spaces
- Sparse data where many coordinates are zero
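As a quick check of the example above, a minimal NumPy sketch:
import numpy as np

A = np.array([3, 4])
B = np.array([7, 1])

# Sum of absolute coordinate differences (L1 / taxicab distance)
manhattan_dist = np.sum(np.abs(A - B))
print(manhattan_dist)  # 7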
3. Minkowski Distance
Minkowski distance is a generalization of both Euclidean and Manhattan distances. It includes these and other distances as special cases based on the parameter $p$.
- Formula: The Minkowski distance $D$ between two points $A$ and $B$ is given by:
  $D(A, B) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
  where:
- $A = (x_1, x_2, \dots, x_n)$ and $B = (y_1, y_2, \dots, y_n)$ are two points in $n$-dimensional space.
- $p$ is a parameter that determines the type of distance:
- $p=1$: Manhattan distance
- $p=2$: Euclidean distance
- $p=\infty$: Chebyshev distance (maximum difference)
- Example (see the code sketch after this list): For points $A=(3,4)$ and $B=(7,1)$, the Minkowski distance gives the Manhattan result ($D = 7$) when $p=1$ and the Euclidean result ($D = 5$) when $p=2$; for $p=3$, $D = (|7-3|^3 + |1-4|^3)^{1/3} = (64 + 27)^{1/3} \approx 4.50$.
- Applications:
- Custom distance metrics
- Flexibility to adjust sensitivity to differences between points
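The following sketch implements the general Minkowski formula with NumPy and shows how $p$ interpolates between the metrics above (the helper function name is just for illustration):
import numpy as np

def minkowski_distance(a, b, p):
    # (sum of |x_i - y_i|^p)^(1/p); p=1 -> Manhattan, p=2 -> Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

A = np.array([3, 4])
B = np.array([7, 1])

print(minkowski_distance(A, B, p=1))  # 7.0  (Manhattan)
print(minkowski_distance(A, B, p=2))  # 5.0  (Euclidean)
print(minkowski_distance(A, B, p=3))  # ~4.498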
4. Chebyshev Distance
Chebyshev distance (also known as the L$\infty$ norm or maximum metric) measures the greatest of the absolute differences between the coordinates of a pair of points. It represents the minimum number of moves required to reach one point from another in a grid where diagonal movement is allowed.
- Formula: For two points $A = (x_1, x_2, \dots, x_n)$ and $B = (y_1, y_2, \dots, y_n)$ in an $n$-dimensional space, the Chebyshev distance $D$ is given by:
  $D(A, B) = \max_{1 \le i \le n} |x_i - y_i|$
- Example (see the code sketch after this list): For points $A = (3, 4)$ and $B = (7, 1)$:
  $D(A, B) = \max(|7-3|, |1-4|) = \max(4, 3) = 4$
- Applications:
- Chess (King’s move): The distance a king moves between two squares on a chessboard.
- Board games and grid-based algorithms: Where diagonal movement is possible.
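A minimal NumPy sketch reproducing the example above:
import numpy as np

A = np.array([3, 4])
B = np.array([7, 1])

# Largest absolute coordinate difference (L-infinity norm)
chebyshev_dist = np.max(np.abs(A - B))
print(chebyshev_dist)  # 4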
5. Cosine Similarity
Cosine similarity is not a traditional distance metric, but it is commonly used to measure the cosine of the angle between two non-zero vectors in an inner product space. It is frequently used in high-dimensional spaces to assess the similarity between two vectors. Cosine distance can then be derived as $1 -$ cosine similarity.
- Formula: For two vectors $A = (x_1, x_2, \dots, x_n)$ and $B = (y_1, y_2, \dots, y_n)$, the cosine similarity $S$ is given by:
  $S(A, B) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \cdot \sqrt{\sum_{i=1}^{n} y_i^2}}$
Cosine Distance is then:
$\text{Cosine Distance} = 1 - S(A, B)$
- Example (see the code sketch after this list): For vectors $A = (1, 2, 3)$ and $B = (4, 5, 6)$:
$S(A, B) = \frac{(1 \times 4) + (2 \times 5) + (3 \times 6)}{\sqrt{1^2 + 2^2 + 3^2} \cdot \sqrt{4^2 + 5^2 + 6^2}} = \frac{4 + 10 + 18}{\sqrt{14} \cdot \sqrt{77}} = \frac{32}{\sqrt{1078}} \approx 0.9746$
Cosine Distance is then:
$\text{Cosine Distance} = 1 - 0.9746 = 0.0254 $
- Applications:
- Text mining and document similarity: Measuring similarity between documents based on word frequency vectors.
- Recommendation systems: Similarity between user preferences.
- High-dimensional data: Where Euclidean distance may be less meaningful due to the curse of dimensionality.
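A minimal NumPy sketch reproducing the cosine example above:
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
cos_dist = 1 - cos_sim

print(round(cos_sim, 4))   # 0.9746
print(round(cos_dist, 4))  # 0.0254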