KNN: Turning Data Proximity into Predictive Power
- Ashish John Edward
- Oct 19, 2024
- 11 min read
Updated: Oct 26, 2024
In a world where data is king, K-Nearest Neighbours (KNN) emerges as a powerful yet intuitive tool that can solve a wide range of business problems. Imagine being able to predict which customer will buy a product or estimate the price of a house based on similar data points from the past. KNN does exactly this, offering businesses a way to make data-driven decisions without complicated assumptions. Whether you're a marketer predicting customer behaviour or a real estate analyst estimating property prices, KNN can help you make smarter, faster, and more informed decisions by leveraging the power of proximity in your data. Ready to learn how this simple algorithm can revolutionize decision-making in your office? Let's dive in.

In predictive modelling, we often seek algorithms that are not only powerful but also easy to understand and implement. Enter K-Nearest Neighbours (KNN) — a simple yet highly effective algorithm that brings intuition to the forefront of machine learning. Like a good neighbor who knows what's going on around the block, KNN uses the proximity of data points to make its predictions. In this article, we dig deep into the inner workings of KNN, uncover how it learns, and explore its practical applications.
K-Nearest Neighbours (KNN) is a non-parametric, instance-based learning algorithm that can be used for both classification and regression tasks. The core idea is straightforward: when given a new data point to classify or predict, KNN looks at the closest K neighbors to determine what the outcome should be.
For example, if we have a dataset of fruits categorized by their size and color, KNN will classify a new fruit by looking at the most similar fruits in the dataset.
Why Use KNN?
One of the strengths of KNN is that it requires no training phase, meaning that it stores the entire dataset and only performs computations when making predictions. This makes it highly adaptable and easy to implement. However, it also means that the algorithm can be computationally expensive, especially when dealing with large datasets, as KNN needs to calculate the distance from the new data point to every other point in the dataset.
Another advantage is its simplicity and interpretability. The decision-making process in KNN is transparent — you can literally see the data points being compared and how the majority vote or distance influences the outcome.
Intuition Behind KNN
At its heart, KNN works on the principle of similarity. It assumes that similar data points are close to each other in feature space. Therefore, to predict the outcome of a new data point, KNN calculates the distance between this point and all other points in the training set. Based on the proximity to its nearest neighbors, it makes a decision.
In classification, KNN assigns the label that is most common among the K nearest neighbors. In regression, it takes the average of the target values of the neighbors to predict the output.
To put it simply, KNN doesn’t assume any specific distribution of data (like linear or logistic regression does). Instead, it lets the data “speak for itself.”
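Before getting hands-on, here is a minimal sketch of KNN classification using Scikit-learn, the library mentioned later in this article. The tiny fruit dataset below is made up purely for illustration (size in cm, colour encoded as 0 = green, 1 = orange):

```python
# A minimal KNN classification sketch with scikit-learn.
# The fruit data is hypothetical: [size_cm, colour] with colour 0=green, 1=orange.
from sklearn.neighbors import KNeighborsClassifier

X = [[7, 1], [8, 1], [6, 0], [5, 0]]   # feature vectors for four fruits
y = ["orange", "orange", "apple", "apple"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)          # "training" just stores the data; no model is built

# The three nearest stored fruits vote on the label of a new fruit.
print(knn.predict([[7, 1]]))
```

Note that `fit` here does no real work beyond storing the dataset, which is exactly the "no training phase" property described above.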
Real-life use cases for KNN
Let’s look at some real-life use cases of KNN before we get our hands dirty and work through the algorithm ourselves.
Zillow: Real Estate Price Prediction
Zillow, a leading online real estate marketplace, uses KNN to help estimate property prices through its Zestimate tool. For home buyers and sellers, having an accurate estimate of a property’s value is essential. Zillow employs KNN to predict the price of a house by comparing it to similar properties in the same neighbourhood, based on attributes like square footage, number of bedrooms, and recent sale prices of neighbouring houses. By finding homes with similar characteristics, KNN allows Zillow to predict how much a home is worth with a high degree of accuracy.

This algorithm not only helps buyers and sellers make more informed decisions but also enhances the credibility of Zillow as a trusted source for real estate pricing. The Zestimate tool has become one of the most popular features on Zillow, offering users an instant estimate of home values. By grouping homes based on KNN, Zillow can analyse market trends and predict pricing fluctuations based on historical data and comparable sales, providing value not only to individual buyers but also to the broader real estate market.
Tinder: Match Recommendations
Tinder, one of the most popular dating apps, uses KNN to enhance its match recommendation system. In online dating, providing relevant match suggestions is essential to keeping users engaged and improving their chances of finding meaningful connections. KNN helps Tinder compare users based on their behaviour, such as swiping patterns, age preferences, and interests. By identifying "neighbors" who have similar profiles and dating behaviours, KNN allows Tinder to recommend users who are likely to be good matches for each other.

This method helps Tinder users discover potential partners who align with their preferences, increasing user satisfaction. The more relevant the matches, the longer users stay active on the app, and the higher the likelihood of successful matches. KNN’s ability to analyse multiple factors and find patterns in user behaviour has led to improved match quality, which is one of the reasons Tinder remains a leader in the dating app market. Personalized matches driven by KNN have boosted user retention, helping Tinder grow its global user base and monetize through premium subscriptions.
Alibaba: Fraud Detection
Alibaba, the Chinese e-commerce giant, leverages KNN in its fraud detection system. With millions of transactions happening daily, identifying fraudulent activities is crucial for maintaining trust and security on the platform. KNN helps Alibaba analyse customer behaviour patterns and transaction histories to detect anomalies. By comparing a current transaction with past similar transactions, KNN identifies unusual behaviours, such as a sudden spike in purchases from a specific account, abnormal payment methods, or changes in delivery locations.

When such anomalies are detected, Alibaba’s system flags them for further investigation. KNN is particularly useful in this context because it can compare transactions across various dimensions (e.g., frequency, amount, location) and find the ones that deviate significantly from normal behaviour. By using KNN for fraud detection, Alibaba has improved its ability to prevent fraud in real-time, protecting both buyers and sellers from malicious activities. This has helped maintain trust in Alibaba’s platform while minimizing financial losses due to fraud.
Pinterest: Image Recognition for Visual Search
Pinterest, a leading platform for sharing visual content, uses KNN to power its image recognition and visual search capabilities. With users heavily reliant on image discovery, Pinterest needed a way to help users find similar images and content related to their interests. KNN allows Pinterest to analyze visual features such as colors, shapes, and patterns within images. When a user pins an image or performs a visual search, KNN compares the new image to a vast database of other images, grouping them based on similarity. This helps Pinterest deliver highly accurate and visually relevant search results.

The visual search enhancement brought by KNN ensures that users can easily find aesthetically or contextually similar content, improving the overall user experience. For instance, if a user pins a home décor idea, KNN suggests similar décor designs that match the user’s tastes. By facilitating such personalized discovery, Pinterest keeps users engaged longer, encouraging them to explore more content. This has directly contributed to increased user retention and engagement, as users spend more time on the platform discovering new ideas. As a result, Pinterest has become a go-to platform for visual content discovery, driving higher user interaction and more frequent platform visits.
How KNN Works: Step by Step
Let’s break down how KNN works into simple steps:

While machine learning algorithms like KNN are often implemented in Python or R using libraries such as Scikit-learn, you can also gain a fundamental understanding of KNN through MS Excel. Excel allows you to manually perform each step of the algorithm, providing hands-on experience with KNN calculations.
Problem Statement: Predicting Whether a Customer Will Purchase a Product
You are the marketing manager of an e-commerce platform and want to predict whether a customer will purchase a product based on two features: Age and Income. You have a dataset of past customer purchases. Your task is to predict whether a new customer (Age = 40, Income = 58000) will purchase the product or not using the KNN algorithm.
Step 1: Dataset Setup
In Excel, we will create a small dataset with the following features:
Age: The age of the customer.
Income: The annual income of the customer.
Purchased: 1 if the customer purchased the product, 0 if not.

Now, your task is to predict whether the new customer (Age = 40, Income = 58000) will purchase the product based on the purchasing patterns of previous customers.
Step 2: Calculate Euclidean Distance
KNN works by calculating the Euclidean distance between the new customer and each of the other customers in the dataset.
Euclidean Distance Formula:

d = sqrt((p1 - q1)^2 + (p2 - q2)^2)

Where p1 and p2 are the Age and Income of the new customer, and q1 and q2 are the Age and Income of an existing customer.
Steps in Excel:
Add a new column titled Distance.
Use the following formula to calculate the Euclidean distance between the new customer and each existing customer: =SQRT((B2 - 40)^2 + (C2 - 58000)^2)
Drag the formula down for all rows to calculate the distance for each customer.
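The same calculation the Excel formula performs can be sketched in a few lines of Python:

```python
# Python equivalent of the Excel formula =SQRT((B2 - 40)^2 + (C2 - 58000)^2)
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points in feature space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

new_customer = (40, 58000)
# Distance from the new customer to Customer 2 (Age = 32, Income = 60000)
print(euclidean_distance((32, 60000), new_customer))
```

Notice that with raw units the income term dwarfs the age term; in practice, features are usually scaled to comparable ranges before running KNN.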
Step 3: Sort the Data by Distance
Once the distances are calculated, sort the data in ascending order of distance to find the closest neighbors.
Highlight the entire dataset (including the distances).
Go to Data → Sort → Sort by Distance (ascending).
Step 4: Select the Nearest Neighbors (K = 3)
In this example, let's assume K = 3, meaning we will look at the three closest customers to the new customer. After sorting by distance, pick the top 3 nearest neighbors. Look at their Purchased values (either 0 or 1).
Step 5: Predict the Class (Purchased or Not)
Once you've identified the three nearest neighbors, count how many of the nearest neighbors purchased the product (Purchased = 1) and how many did not (Purchased = 0).
Use the COUNTIF function in Excel to count how many of the neighbours have Purchased = 1: =COUNTIF(D2:D4, 1), where D2:D4 is the range of the Purchased column for the nearest neighbours.
Prediction:
If the majority of the nearest neighbors purchased the product, predict that the new customer will purchase the product (i.e., Purchased = 1).
If the majority did not purchase the product, predict that the new customer will not purchase the product (i.e., Purchased = 0).
Step 6: Conclusion
Once you've completed these steps, you’ll have predicted whether the new customer (Age = 40, Income = 58000) will purchase the product based on the purchasing behaviour of the three nearest neighbors.
Worked Example (Results)
After computing the Euclidean distance for each customer and sorting by distance, the three nearest neighbors are:
Customer 2 (Age = 32, Income = 60000, Purchased = 1)
Customer 5 (Age = 23, Income = 55000, Purchased = 1)
Customer 1 (Age = 25, Income = 50000, Purchased = 1)

Here is the visual representation of the KNN problem for predicting whether a customer will purchase a product based on their Age and Income:
· Green points represent customers who purchased the product.
· Red points represent customers who did not purchase the product.
· Blue point represents the new customer (Age = 40, Income = 58000) for whom we want to make a prediction.
Using KNN, you can now see how the new customer is positioned relative to the existing customers. By calculating the Euclidean distance from the new customer to the others, you can predict whether they are likely to purchase the product by observing their nearest neighbors.
Prediction:
All three nearest neighbors purchased the product (Purchased = 1).
Therefore, we predict that the new customer will also purchase the product (Purchased = 1).
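As a sanity check, the worked example can be re-run with scikit-learn. Only the three nearest neighbours listed above are used here, since the rest of the dataset is not shown:

```python
# Re-checking the worked example with scikit-learn,
# using only the three nearest neighbours from the article.
from sklearn.neighbors import KNeighborsClassifier

X = [[32, 60000], [23, 55000], [25, 50000]]  # Customers 2, 5, and 1: [Age, Income]
y = [1, 1, 1]                                # all three purchased

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# All three neighbours have Purchased = 1, so the vote is unanimous.
print(knn.predict([[40, 58000]]))
```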
K-Nearest Neighbors (KNN) & Regression
K-Nearest Neighbors (KNN) can also be applied to regression tasks! In KNN regression, instead of predicting a class label (as in classification), KNN predicts a numerical value by averaging the values of the K nearest neighbors.
Let's walk through how we can apply KNN for regression in Excel with a step-by-step problem, just as we did for classification.
Problem Statement: Predicting House Prices Using KNN Regression
You are a real estate analyst, and you want to predict the price of a house based on its size (in square feet) and the number of bedrooms. You have data from previously sold houses, and you want to use KNN to predict the price of a new house that has not been sold yet.
Step 1: Dataset Setup
In Excel, we will create a small dataset with the following features:
Size (sq ft): The size of the house.
Bedrooms: The number of bedrooms in the house.
Price (in USD): The price at which the house was sold.
The dataset looks like this:

Your task is to predict the price of the new house (Size = 1550 sq ft, Bedrooms = 3) using KNN regression.
Step 2: Calculate Euclidean Distance
As in KNN classification, we will calculate the Euclidean distance between the new house and each of the houses in the dataset.
Euclidean Distance Formula for Regression:

d = sqrt((p1 - q1)^2 + (p2 - q2)^2)

Where p1 and p2 are the size and number of bedrooms of the new house, and q1 and q2 are the size and number of bedrooms of an existing house in the dataset.
Steps in Excel:
Add a new column titled Distance.
Use the following formula to calculate the Euclidean distance between the new house and each existing house: =SQRT((B2 - 1550)^2 + (C2 - 3)^2)
This formula calculates the distance between the new house (Size = 1550, Bedrooms = 3) and each house in the dataset.
Step 3: Sort the Data by Distance
Once you have calculated the distances for all houses, sort the data in ascending order based on the distance. This will give you the houses that are closest to the new house.
Highlight the entire dataset (including the distances).
Go to Data → Sort → Sort by Distance (ascending).
Step 4: Select the Nearest Neighbors (K = 3)
For this regression task, we will set K = 3, meaning we will consider the 3 nearest neighbors (houses) to the new house. After sorting by distance, choose the top 3 neighbors.
Step 5: Predict the Price Using Averaging
Once you've selected the 3 nearest neighbors, take the average of their prices to predict the price of the new house. Use =AVERAGE(D2:D4) to compute the mean, where D2:D4 is the range of the Price column for the top 3 nearest neighbors.
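The averaging step is the only difference from classification; in Python it is a one-liner over the neighbours' prices (the three prices from the worked example below):

```python
# KNN regression's prediction step: the mean of the K nearest targets,
# equivalent to Excel's =AVERAGE(D2:D4).
prices = [300000, 320000, 350000]  # prices of the 3 nearest houses
predicted = sum(prices) / len(prices)
print(round(predicted, 2))
```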
Step 6: Conclusion
The predicted price of the new house will be the average of the prices of the 3 nearest neighbors.
Worked Example (Results)
After computing the Euclidean distance for each house and sorting by distance, the three nearest neighbors are:
House 1: Price = 300000
House 2: Price = 320000
House 3: Price = 350000
Prediction:
The predicted price of the new house is the average of the 3 nearest neighbours' prices:
Predicted Price = (300000 + 320000 + 350000) / 3 = USD 323,333.33

This visualization helps demonstrate how the new house's price is predicted by averaging the prices of the nearest houses.
The blue points represent the existing houses in the dataset, while the green points highlight the three nearest neighbors used for the prediction. The red point represents the new house, with its predicted price of USD 323,333.33 based on the KNN algorithm.
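For readers who want to reproduce this walkthrough outside Excel, here is a scikit-learn sketch of KNN regression. The house features below are illustrative placeholders rather than the article's actual dataset; only the three nearest prices come from the example above:

```python
# KNN regression with scikit-learn. House features are hypothetical:
# [size_sq_ft, bedrooms]; with weights='uniform' (the default),
# the prediction is the plain average of the K nearest prices.
from sklearn.neighbors import KNeighborsRegressor

X = [[1500, 3], [1600, 3], [1700, 3], [2400, 4]]
y = [300000, 320000, 350000, 500000]

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
# Predict the price of the new house (Size = 1550 sq ft, Bedrooms = 3).
print(knn.predict([[1550, 3]]))
```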
By evaluating factors like dataset size, dimensionality, and computational power, you can decide whether KNN is the right algorithm for your problem. In short, KNN is most effective on small-to-medium, low-dimensional datasets where interpretability matters; it struggles when data is very large or high-dimensional, because every prediction requires computing the distance to every stored point.
K-Nearest Neighbors (KNN) is more than just a simple algorithm—it’s a powerful tool that enables businesses to make informed, data-driven decisions. From predicting customer purchases to estimating real estate prices, KNN thrives on proximity-based predictions, making it highly intuitive and practical across various industries. Its versatility in both classification and regression tasks, combined with its ease of implementation, makes it a go-to algorithm for many real-world applications.
Whether you're navigating fraud detection, personalized recommendations, or price predictions, KNN offers a transparent and effective approach that can transform your data into actionable insights. By leveraging the power of KNN, you open the door to smarter, faster, and more accurate business decisions.