Machine Learning with K-Nearest Neighbors (KNN) using scikit-learn
- Mahalakshmi Adabala
- Aug 2, 2023
- 5 min read
Machine learning is a rapidly evolving field that enables computers to learn patterns and make intelligent decisions based on data. One of the simplest yet most effective algorithms in machine learning is the K-Nearest Neighbors (KNN) algorithm. KNN is a supervised learning algorithm used for classification and regression tasks. In this blog, we will explore the fundamentals of the KNN algorithm and implement it using the popular Python library, scikit-learn.

Understanding the K-Nearest Neighbors Algorithm
The K-Nearest Neighbors algorithm is based on the principle that similar data points tend to belong to the same class. In other words, the algorithm makes predictions by finding the K closest data points to a given query point and then determines the majority class among those K neighbors for classification tasks or computes the average for regression tasks.
Here's a step-by-step breakdown of the KNN algorithm:
1. Load the Data: First, we need a labeled dataset that contains samples with known classes for training our model.
2. Choose the Value of K: The hyperparameter "K" represents the number of nearest neighbors to consider when making a prediction. It's crucial to select an appropriate value for K, as it can significantly impact the algorithm's performance.
3. Calculate Distances: For each data point in the dataset, the algorithm calculates the distance (e.g., Euclidean distance) between the data point and the query point for which we want to make a prediction.
4. Select K Neighbors: The K nearest data points to the query point are selected based on the calculated distances.
5. Majority Vote or Averaging: For classification tasks, the algorithm predicts the class that occurs most frequently among the K neighbors. For regression tasks, it predicts the average value of the target variable for the K neighbors.
6. Make Predictions: The algorithm uses the majority vote or averaging to make predictions for the query point.
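To make these steps concrete, here is a minimal from-scratch sketch of KNN classification using NumPy. The knn_predict helper and the toy data points are purely illustrative (this is not scikit-learn code yet); it mirrors steps 3-5 above for a single query point:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Step 3: Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 4: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 5: majority vote among the neighbors' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: two points per class in two dimensions
X_train = np.array([[1.0, 1.0], [1.5, 1.8], [5.0, 5.2], [5.5, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, query=np.array([1.2, 1.4]), k=3))  # prints 0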
Implementing K-Nearest Neighbors with scikit-learn
Now, let's walk through an example of implementing K-Nearest Neighbors using scikit-learn, a powerful Python library for machine learning.
Step 1: Installing scikit-learn
Before we start, make sure you have scikit-learn installed. If not, you can install it using pip:
pip install scikit-learn
Step 2: Importing Necessary Libraries
Let's import the required libraries for our implementation:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
Step 3: Load and Preprocess the Data
For this example, we will use the famous Iris dataset available in scikit-learn, which contains samples of iris flowers along with their species labels. Let's load the data and split it into training and testing sets:
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Create and Train the KNN Model
Now, we can create a KNN classifier and train it on our training data:
# Create a KNN classifier with K=3
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model 
knn.fit(X_train, y_train)
Step 5: Make Predictions and Evaluate the Model
Finally, we can use our trained model to make predictions on the test set and evaluate its performance:
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
What are the Pros and Cons of KNN?
K-Nearest Neighbors (KNN) is a simple and intuitive machine learning algorithm, but like any other algorithm, it has its strengths and weaknesses. Let's explore the pros and cons of KNN:
Pros:
- Simple and Easy to Implement: KNN is straightforward to understand and implement, making it a great starting point for beginners in machine learning. 
- No Training Phase: Unlike algorithms that require extensive training on the dataset, KNN is an instance-based, lazy learning algorithm. It doesn't have a separate training phase and uses the entire dataset when making predictions. 
- Versatile: KNN can be used for both classification and regression tasks, making it adaptable to various types of problems (a short regression sketch follows this list). 
- Non-Parametric: KNN is a non-parametric algorithm, which means it makes no assumptions about the underlying data distribution. This makes it effective for complex and nonlinear relationships. 
- Interpretable: The KNN algorithm's decision-making process is transparent and easy to interpret since it relies on the closest data points. 
- No Model Building: KNN doesn't build an explicit model during the training phase, which can save computational time and resources. 
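As mentioned in the "Versatile" point above, the same idea works for regression: the prediction is the average of the K neighbors' target values instead of a majority vote. A minimal sketch, using scikit-learn's bundled diabetes dataset purely for illustration (the variable names here are separate from the Iris example):
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# A small regression dataset bundled with scikit-learn
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Prediction is the mean target value of the 5 nearest neighbors
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(Xr_train, yr_train)
print("R^2 on the test set:", round(reg.score(Xr_test, yr_test), 2))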
Cons:
- Computational Complexity: The main drawback of KNN is its computational complexity during the prediction phase. As the dataset grows larger, the time required to make predictions increases significantly. 
- Memory Usage: KNN needs to store the entire dataset in memory for prediction, which can be a problem when dealing with large datasets. 
- Choosing the Right K: Selecting an appropriate value for K is crucial. A small K might lead to overfitting, while a large K can lead to underfitting. Determining the optimal K value often requires experimentation. 
- Sensitive to Noise and Outliers: KNN is sensitive to noisy data and outliers. Outliers can heavily influence the prediction, leading to potentially inaccurate results. 
- Distance Metric Selection: The choice of distance metric in KNN (e.g., Euclidean, Manhattan) can significantly impact the algorithm's performance. The distance metric should be chosen carefully based on the nature of the data (see the sketch after this list). 
- Imbalanced Data: In classification tasks with imbalanced classes, KNN tends to favor the majority class, leading to biased predictions. 
- Curse of Dimensionality: As the number of features (dimensions) increases, the performance of KNN can degrade, as the notion of distance becomes less meaningful in high-dimensional spaces. 
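Some of these drawbacks can be softened in practice. Because distances are dominated by features with large scales, it is common to standardize the features first; scikit-learn also lets you pick the distance metric and use distance-weighted voting, which reduces the pull of far-away neighbors. A minimal sketch, assuming the Iris X_train/X_test split from Step 3:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so no single feature dominates the distance,
# use the Manhattan metric, and weight votes by inverse distance
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="manhattan", weights="distance"),
)
model.fit(X_train, y_train)
print("Test accuracy:", round(model.score(X_test, y_test), 2))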
In summary, KNN is a powerful and flexible algorithm thanks to its simplicity and versatility, but it may not always be the best choice for large datasets or high-dimensional data. Understanding the trade-offs and characteristics of KNN can help you make informed decisions about when to use it and when to consider alternative algorithms.
Conclusion
K-Nearest Neighbors is a simple yet powerful machine learning algorithm for classification and regression tasks. In this blog, we explored the basics of the KNN algorithm and implemented it using scikit-learn with Python. Remember to choose the right value of K and preprocess your data appropriately to achieve better results. KNN is just one of the many algorithms available in the vast world of machine learning, and mastering it is a stepping stone toward building more complex and sophisticated models. Happy learning and experimenting!
Author - Vandita Chauhan
