Machine Learning with K-Nearest Neighbors (KNN) using scikit-learn
- Mahalakshmi Adabala
- Aug 2, 2023
- 5 min read
Machine learning is a rapidly evolving field that enables computers to learn patterns and make intelligent decisions based on data. One of the simplest yet most effective algorithms in machine learning is the K-Nearest Neighbors (KNN) algorithm. KNN is a supervised learning algorithm used for classification and regression tasks. In this blog, we will explore the fundamentals of the KNN algorithm and implement it using the popular Python library, scikit-learn.

Understanding the K-Nearest Neighbors Algorithm
The K-Nearest Neighbors algorithm is based on the principle that similar data points tend to belong to the same class. In other words, the algorithm makes predictions by finding the K closest data points to a given query point and then determines the majority class among those K neighbors for classification tasks or computes the average for regression tasks.
Here's a step-by-step breakdown of the KNN algorithm:
1. Load the Data: First, we need a labeled dataset that contains samples with known classes for training our model.
2. Choose the Value of K: The hyperparameter "K" represents the number of nearest neighbors to consider when making a prediction. It's crucial to select an appropriate value for K, as it can significantly impact the algorithm's performance.
3. Calculate Distances: For each data point in the dataset, the algorithm calculates the distance (e.g., Euclidean distance) between the data point and the query point for which we want to make a prediction.
4. Select K Neighbors: The K nearest data points to the query point are selected based on the calculated distances.
5. Majority Vote or Averaging: For classification tasks, the algorithm predicts the class that occurs most frequently among the K neighbors. For regression tasks, it predicts the average value of the target variable for the K neighbors.
6. Make Predictions: The algorithm uses the majority vote or averaging to make predictions for the query point.
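To make these steps concrete, here is a minimal from-scratch sketch of KNN classification using NumPy. The knn_predict helper and the toy data points are purely illustrative (this is not scikit-learn code yet); it mirrors steps 3-5 above for a single query point:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Step 3: Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 4: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 5: majority vote among the neighbors' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: two points per class in two dimensions
X_train = np.array([[1.0, 1.0], [1.5, 1.8], [5.0, 5.2], [5.5, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, query=np.array([1.2, 1.4]), k=3))  # prints 0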
Implementing K-Nearest Neighbors with scikit-learn
Now, let's walk through an example of implementing K-Nearest Neighbors using scikit-learn, a powerful Python library for machine learning.
Step 1: Installing scikit-learn
Before we start, make sure you have scikit-learn installed. If not, you can install it using pip:
pip install scikit-learn
Step 2: Importing Necessary Libraries
Let's import the required libraries for our implementation:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
Step 3: Load and Preprocess the Data
For this example, we will use the famous Iris dataset available in scikit-learn, which contains samples of iris flowers along with their species labels. Let's load the data and split it into training and testing sets:
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Create and Train the KNN Model
Now, we can create a KNN classifier and train it on our training data:
# Create a KNN classifier with K=3
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model 
knn.fit(X_train, y_train)
Step 5: Make Predictions and Evaluate the Model
Finally, we can use our trained model to make predictions on the test set and evaluate its performance:
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
What are the Pros and Cons of KNN?
K-Nearest Neighbors (KNN) is a simple and intuitive machine learning algorithm, but like any other algorithm, it has its strengths and weaknesses. Let's explore the pros and cons of KNN:
Pros:
- Simple and Easy to Implement: KNN is straightforward to understand and implement, making it a great starting point for beginners in machine learning. 
- No Training Phase: Unlike algorithms that require extensive training on the dataset, KNN is an instance-based, lazy learning algorithm. It doesn't have a separate training phase and uses the entire dataset when making predictions. 
- Versatile: KNN can be used for both classification and regression tasks, making it adaptable to various types of problems (a short regression sketch follows this list). 
- Non-Parametric: KNN is a non-parametric algorithm, which means it makes no assumptions about the underlying data distribution. This makes it effective for complex and nonlinear relationships. 
- Interpretable: The KNN algorithm's decision-making process is transparent and easy to interpret since it relies on the closest data points. 
- No Model Building: KNN doesn't build an explicit model during the training phase, which can save computational time and resources. 
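As mentioned in the "Versatile" point above, the same idea works for regression: the prediction is the average of the K neighbors' target values instead of a majority vote. A minimal sketch, using scikit-learn's bundled diabetes dataset purely for illustration (the variable names here are separate from the Iris example):
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# A small regression dataset bundled with scikit-learn
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Prediction is the mean target value of the 5 nearest neighbors
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(Xr_train, yr_train)
print("R^2 on the test set:", round(reg.score(Xr_test, yr_test), 2))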
Cons:
- Computational Complexity: The main drawback of KNN is its computational complexity during the prediction phase. As the dataset grows larger, the time required to make predictions increases significantly. 
- Memory Usage: KNN needs to store the entire dataset in memory for prediction, which can be a problem when dealing with large datasets. 
- Choosing the Right K: Selecting an appropriate value for K is crucial. A small K might lead to overfitting, while a large K can lead to underfitting. Determining the optimal K value often requires experimentation. 
- Sensitive to Noise and Outliers: KNN is sensitive to noisy data and outliers. Outliers can heavily influence the prediction, leading to potentially inaccurate results. 
- Distance Metric Selection: The choice of distance metric in KNN (e.g., Euclidean, Manhattan) can significantly impact the algorithm's performance. The distance metric should be chosen carefully based on the nature of the data (see the sketch after this list). 
- Imbalanced Data: In classification tasks with imbalanced classes, KNN tends to favor the majority class, leading to biased predictions. 
- Curse of Dimensionality: As the number of features (dimensions) increases, the performance of KNN can degrade, as the notion of distance becomes less meaningful in high-dimensional spaces. 
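Some of these drawbacks can be softened in practice. Because distances are dominated by features with large scales, it is common to standardize the features first; scikit-learn also lets you pick the distance metric and use distance-weighted voting, which reduces the pull of far-away neighbors. A minimal sketch, assuming the Iris X_train/X_test split from Step 3:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so no single feature dominates the distance,
# use the Manhattan metric, and weight votes by inverse distance
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="manhattan", weights="distance"),
)
model.fit(X_train, y_train)
print("Test accuracy:", round(model.score(X_test, y_test), 2))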
In summary, KNN is a powerful and flexible algorithm thanks to its simplicity and versatility, but it may not always be the best choice for large datasets or high-dimensional data. Understanding the trade-offs and characteristics of KNN can help you make informed decisions about when to use it and when to consider alternative algorithms.
Conclusion
K-Nearest Neighbors is a simple yet powerful machine learning algorithm for classification and regression tasks. In this blog, we explored the basics of the KNN algorithm and implemented it using scikit-learn with Python. Remember to choose the right value of K and preprocess your data appropriately to achieve better results. KNN is just one of the many algorithms available in the vast world of machine learning, and mastering it is a stepping stone toward building more complex and sophisticated models. Happy learning and experimenting!
Author - Vandita Chauhan
