Rudolf Adamkovič Personal site


K-nearest neighbors

a.k.a. KNN or \(k\)-NN.

Define

… as a nonparametric classification method that assigns to \(x\) the
most common class among the \(k\) nearest neighbors of \(x\), as
measured by Euclidean distance.
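
The definition can be sketched directly in NumPy. A minimal
illustration, not the library implementation; the helper name
knn_classify is hypothetical:

```python
import numpy as np
from collections import Counter

def knn_classify(x, data, target, k=3):
    """Assign to x the most common class among its k nearest neighbors."""
    distances = np.linalg.norm(data - x, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]           # indices of the k nearest neighbors
    return Counter(target[nearest]).most_common(1)[0][0]
```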

Discuss

Close in precision to the optimal Bayes classifier: asymptotically,
the error rate of the 1-nearest-neighbor rule is at most twice the
Bayes error rate (Cover and Hart, 1967).

Parameterize
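
The main parameter is the neighborhood size \(k\): small values track
noise, large values blur class boundaries. A sketch of choosing \(k\)
by cross-validation; the synthetic blob data set is an assumption for
illustration only:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, for illustration only.
data, target = make_blobs(n_samples=100, centers=2, random_state=0)

# Mean 5-fold cross-validation accuracy for each candidate k.
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), data, target, cv=5)
    print(k, round(scores.mean(), 3))
```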

Explore

Import NumPy, Matplotlib, and Scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

Ensure reproducibility by seeding NumPy's random number generator.
(Seeding the standard library's ‘random’ module would not affect
NumPy or Scikit-learn.)

np.random.seed(0)

Use a dark theme. ☺

plt.style.use('dark_background')
plt.rcParams.update({'savefig.transparent': True})

Define the training data: feature vectors and their class labels.
(In Scikit-learn, these go by ‘data’ and ‘target’, respectively.)

training_data = np.array([[1, 1],
                          [1, 2],
                          [2, 1],
                          [6, 8],
                          [7, 7],
                          [8, 6]])
training_target = np.array([0, 0, 0, 1, 1, 1])
training = np.c_[training_data, training_target]

training
 x  y  class
 1  1  0
 1  2  0
 2  1  0
 6  8  1
 7  7  1
 8  6  1

Visualize the training data.

training_class_0_mask = training[:, 2] == 0
training_class_0_x = training[training_class_0_mask][:, 0]
training_class_0_y = training[training_class_0_mask][:, 1]

training_class_1_mask = training[:, 2] == 1
training_class_1_x = training[training_class_1_mask][:, 0]
training_class_1_y = training[training_class_1_mask][:, 1]

plt.figure(figsize=(5, 5))
plt.scatter(training_class_0_x, training_class_0_y, marker='D')
plt.scatter(training_class_1_x, training_class_1_y, marker='s')
plt.grid(True, alpha = 0.25)

[Figure: scatter plot of the training data; class 0 as diamonds, class 1 as squares]

Define the test data.

test_data = np.array([[1, 6],
                      [1, 8],
                      [3, 8],
                      [6, 1],
                      [8, 1],
                      [8, 3]])

test_data
 x  y
 1  6
 1  8
 3  8
 6  1
 8  1
 8  3

Visualize the test data.

test_data_x = test_data[:, 0]
test_data_y = test_data[:, 1]

plt.figure(figsize=(5, 5))
plt.scatter(test_data_x, test_data_y, marker='*')
plt.grid(True, alpha = 0.25)

[Figure: scatter plot of the test data as stars]
  1. Create a KNN model with the neighborhood size \(k = 3\).
  2. Train, or fit, the model to the training data.

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(training_data, training_target)
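
As a cross-check, the kneighbors method reports the distances and
indices of the nearest training points. The setup is repeated here so
the sketch runs on its own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

training_data = np.array([[1, 1], [1, 2], [2, 1], [6, 8], [7, 7], [8, 6]])
training_target = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(training_data, training_target)

# The three nearest training points to (1, 6) are at indices 1, 0, 2,
# i.e. the class-0 points (1, 2), (1, 1), and (2, 1), at distances
# 4, 5, and sqrt(26).
distances, indices = knn.kneighbors([[1, 6]])
print(indices)
print(distances)
```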

Classify the test data.

test_target = knn.predict(test_data)
test = np.c_[test_data, test_target]

test
 x  y  class
 1  6  0
 1  8  1
 3  8  1
 6  1  0
 8  1  1
 8  3  1
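
The vote behind each prediction can be inspected with predict_proba,
which for KNN reports the fraction of the \(k\) neighbors belonging
to each class. A self-contained sketch that repeats the setup above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

training_data = np.array([[1, 1], [1, 2], [2, 1], [6, 8], [7, 7], [8, 6]])
training_target = np.array([0, 0, 0, 1, 1, 1])
test_data = np.array([[1, 6], [1, 8], [3, 8], [6, 1], [8, 1], [8, 3]])

knn = KNeighborsClassifier(n_neighbors = 3).fit(training_data, training_target)

# Each row: fraction of the 3 nearest neighbors in class 0 and class 1.
# Unanimous neighborhoods give 1.0; split votes, as for (1, 8), give 1/3 vs 2/3.
print(knn.predict_proba(test_data))
```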

Visualize the test data next to the training data.

both = np.r_[training, test]

both_class_0_mask = both[:, 2] == 0
both_class_0_x = both[both_class_0_mask][:, 0]
both_class_0_y = both[both_class_0_mask][:, 1]

both_class_1_mask = both[:, 2] == 1
both_class_1_x = both[both_class_1_mask][:, 0]
both_class_1_y = both[both_class_1_mask][:, 1]

plt.figure(figsize=(5, 5))
plt.scatter(both_class_0_x, both_class_0_y, marker='D')
plt.scatter(both_class_1_x, both_class_1_y, marker='s')
plt.grid(True, alpha = 0.25)

[Figure: combined scatter plot of training and test data, grouped by class]


© 2024 Rudolf Adamkovič under GNU General Public License version 3.
Made with Emacs and secret alien technologies of yesteryear.