kNN: k-Nearest Neighbors with Python

Negar Khojasteh
Feb 8, 2020

I wrote a post about implementing Naive Bayes (with Python code). Naive Bayes is simple and quick to code, but there are other classifiers with higher accuracy. In this post, I'll talk about k-nearest neighbors (kNN), which in most cases is more accurate than Naive Bayes.

kNN is a vector space classifier. Take a look at this post to learn more about the vector space model in ML.

Vector space model in machine learning

The simplest way to remember this is to think of features as dimensions. For example, in the blog post above the author talks about cars. Features of a car could be its "model", "max speed", etc. If you pick three features, you can plot each instance (e.g., each car) as a point in a three-dimensional space (x, y, z).

Three features of the cars as three dimensions (graph adapted from here)
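To make "features as dimensions" concrete, here is a minimal sketch (the features and numbers are made up for illustration): each car becomes a point in feature space, and how similar two cars are is just the distance between their points.

```python
import numpy as np

# Each car is a point in 3-dimensional feature space:
# (max speed in km/h, weight in kg, engine size in liters) -- made-up values
car_a = np.array([220.0, 1500.0, 2.0])
car_b = np.array([180.0, 1300.0, 1.6])

# The "closeness" of two instances is the distance between their points,
# e.g. the Euclidean distance that kNN commonly uses.
distance = np.linalg.norm(car_a - car_b)
print(distance)
```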

However, as the number of features grows (larger n), it becomes more and more difficult to visualize the n-dimensional space.

Thankfully, to learn and use kNN we don't need to visualize high-dimensional spaces directly. Keep in mind, though, that just as we can divide 3D space with planes (e.g., the xy plane, the yz plane, etc.), we can divide n-dimensional space with surfaces and hypersurfaces.

Here is a great example of kNN that I’ve adapted from the class I took.

Imagine you want to classify the star as O or X.

What would be the class of the star considering only its nearest neighbor? The answer is O, because the nearest neighbor of the star is an O. So for 1-NN the answer is O.

How about 3-NN? This time we consider the 3 nearest neighbors; two of them are X and one is O, so the star is classified as X.

9-NN classifies the star as O, and 15-NN classifies it as X.
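Here is a tiny from-scratch sketch of this walkthrough. The coordinates and labels are made up so that, as in the picture, the answer flips between 1-NN and 3-NN:

```python
import numpy as np
from collections import Counter

# Made-up 2D training points and their labels ("O" or "X")
X_train = np.array([[0.1, 0.0], [0.5, 0.0], [0.0, 0.6],
                    [2.0, 2.0], [2.2, 1.8], [1.9, 2.3]])
y_train = np.array(["O", "X", "X", "O", "O", "O"])

star = np.array([0.0, 0.0])  # the point we want to classify

# Distance from the star to every training point, nearest first
distances = np.linalg.norm(X_train - star, axis=1)
order = np.argsort(distances)

for k in (1, 3):
    nearest_labels = y_train[order[:k]]
    vote = Counter(nearest_labels).most_common(1)[0][0]
    print(f"{k}-NN predicts: {vote}")  # 1-NN -> O, 3-NN -> X
```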

Here is a more formal definition (assuming we are classifying documents):

kNN = k nearest neighbors

kNN classification rule for k = 1: assign each test document to the class of its single nearest neighbor in the training set.

Considering only one document is not robust, since that one document in the training set might be labeled wrong. It's better to have k > 1:

kNN classification rule for k > 1: assign each test document to the majority class of its k nearest neighbors in the training set.

Note that kNN works well with large training sets, but for small training sets it can be quite inaccurate. On the other hand, for very large data sets kNN is computationally costly and inefficient, since every prediction has to be compared against the whole training set.

If you're interested in learning more about kNN, take a look at this post, which also explains how to decide on the value of k.

A simple introduction to kNN
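One common way to choose k (not necessarily the approach in that post) is to try several candidate values and keep the one with the best cross-validated accuracy. A minimal sketch with scikit-learn, using the built-in iris data just as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k and keep the one with the best cross-validated accuracy
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```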

There are so many good resources online on implementing and using kNN, but I found most of them a bit complicated for beginners. So I wrote a simple piece of code with only the essentials :)

In this code I've used the implementation from scikit-learn (highly recommended, easy to code!).

see my code on nbviewer
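For reference, the core of using scikit-learn's KNeighborsClassifier looks roughly like the sketch below. The dataset, split, and value of k here are my own illustrative choices, not necessarily what's in the notebook:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Any labeled dataset works; iris is just a convenient built-in example
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbors
knn.fit(X_train, y_train)                  # "training" just stores the data

predictions = knn.predict(X_test)
print(accuracy_score(y_test, predictions))
```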

Other posts you might be interested in:

kNN by Swapna Patil

Popular clustering methods

KNN from data fiction

KNN with python code
