
Machine Learning with K-Nearest Neighbors (KNN) using scikit-learn

Machine learning is a rapidly evolving field that enables computers to learn patterns and make intelligent decisions based on data. One of the simplest yet most effective algorithms in machine learning is the K-Nearest Neighbors (KNN) algorithm. KNN is a supervised learning algorithm used for classification and regression tasks. In this blog, we will explore the fundamentals of the KNN algorithm and implement it using the popular Python library, scikit-learn.



Understanding the K-Nearest Neighbors Algorithm

The K-Nearest Neighbors algorithm is based on the principle that similar data points tend to belong to the same class. In other words, the algorithm makes predictions by finding the K closest data points to a given query point and then determines the majority class among those K neighbors for classification tasks or computes the average for regression tasks.


Here's a step-by-step breakdown of the KNN algorithm:


1. Load the Data: First, we need a labeled dataset that contains samples with known classes for training our model.

2. Choose the Value of K: The hyperparameter "K" represents the number of nearest neighbors to consider when making a prediction. It's crucial to select an appropriate value for K, as it can significantly impact the algorithm's performance.

3. Calculate Distances: For each data point in the dataset, the algorithm calculates the distance (e.g., Euclidean distance) between the data point and the query point for which we want to make a prediction.

4. Select K Neighbors: The K nearest data points to the query point are selected based on the calculated distances.

5. Majority Vote or Averaging: For classification tasks, the algorithm predicts the class that occurs most frequently among the K neighbors. For regression tasks, it predicts the average value of the target variable for the K neighbors.

6. Make Predictions: The algorithm uses the majority vote or averaging to make predictions for the query point.
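
The six steps above translate almost directly into a few lines of NumPy. The snippet below is a minimal, illustrative sketch of the classification case, assuming X_train and y_train are NumPy arrays; the function name knn_predict and the choice of Euclidean distance are our own illustration, not part of any library:


import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 3: Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 4: indices of the K closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 5-6: majority vote among the labels of the K neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]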


Implementing K-Nearest Neighbors with scikit-learn

Now, let's walk through an example of implementing K-Nearest Neighbors using scikit-learn, a powerful Python library for machine learning.


Step 1: Installing scikit-learn

Before we start, make sure you have scikit-learn installed. If not, you can install it using pip:



pip install scikit-learn


Step 2: Importing Necessary Libraries

Let's import the required libraries for our implementation:


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

Step 3: Load and Preprocess the Data

For this example, we will use the famous Iris dataset available in scikit-learn, which contains samples of iris flowers along with their species labels. Let's load the data and split it into training and testing sets:


from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
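
Because KNN makes predictions based on distances, features measured on larger numeric scales can dominate the neighbor search. The Iris features are all in centimeters, so scaling is not strictly required here, but as a common preprocessing step you could standardize the features with scikit-learn's StandardScaler right after the split (shown below as an optional sketch):


from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)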

Step 4: Create and Train the KNN Model

Now, we can create a KNN classifier and train it on our training data:


# Create a KNN classifier with K=3
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model 
knn.fit(X_train, y_train)

Step 5: Make Predictions and Evaluate the Model

Finally, we can use our trained model to make predictions on the test set and evaluate its performance:


# Make predictions on the test set
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")




What are the Pros and Cons of KNN?

K-Nearest Neighbors (KNN) is a simple and intuitive machine learning algorithm, but like any other algorithm, it has its strengths and weaknesses. Let's explore the pros and cons of KNN:


Pros:

  • Simple and Easy to Implement: KNN is straightforward to understand and implement, making it a great starting point for beginners in machine learning.

  • No Training Phase: Unlike other algorithms that require extensive training on the dataset, KNN is an instance-based, lazy learning algorithm. It doesn't have a separate training phase and uses the entire dataset when making predictions.

  • Versatile: KNN can be used for both classification and regression tasks, making it adaptable to various types of problems.

  • Non-Parametric: KNN is a non-parametric algorithm, which means it makes no assumptions about the underlying data distribution. This makes it effective for complex and nonlinear relationships.

  • Interpretable: The KNN algorithm's decision-making process is transparent and easy to interpret since it relies on the closest data points.

  • No Model Building: KNN doesn't build an explicit model during the training phase, which can save computational time and resources.


Cons:

  • Computational Complexity: The main drawback of KNN is its computational complexity during the prediction phase. As the dataset grows larger, the time required to make predictions increases significantly.

  • Memory Usage: KNN needs to store the entire dataset in memory for prediction, which can be a problem when dealing with large datasets.

  • Choosing the Right K: Selecting an appropriate value for K is crucial. A small K might lead to overfitting, while a large K can lead to underfitting. Determining the optimal K value often requires experimentation (see the cross-validation sketch after this list).

  • Sensitive to Noise and Outliers: KNN is sensitive to noisy data and outliers. Outliers can heavily influence the prediction, leading to potentially inaccurate results.

  • Distance Metric Selection: The choice of distance metric in KNN (e.g., Euclidean, Manhattan) can significantly impact the algorithm's performance. The distance metric should be chosen carefully based on the nature of the data.

  • Imbalanced Data: In classification tasks with imbalanced classes, KNN tends to favor the majority class, leading to biased predictions.

  • Curse of Dimensionality: As the number of features (dimensions) increases, the performance of KNN can degrade, as the notion of distance becomes less meaningful in high-dimensional spaces.
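
A common way to address the "Choosing the Right K" and "Distance Metric Selection" concerns above is to let cross-validation do the experimenting. The sketch below applies scikit-learn's GridSearchCV to the training split from the earlier walkthrough; the particular grid values are illustrative choices, not recommendations:


from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search over several K values and two distance metrics with 5-fold cross-validation
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "metric": ["euclidean", "manhattan"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Cross-validated accuracy:", grid.best_score_)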


In summary, KNN's simplicity and versatility make it a powerful and flexible algorithm, but it may not always be the best choice for large datasets or high-dimensional data. Understanding the trade-offs and characteristics of KNN can help you make informed decisions about when to use it and when to consider alternative algorithms.


Conclusion

K-Nearest Neighbors is a simple yet powerful machine-learning algorithm for classification and regression tasks. In this blog, we explored the basics of the KNN algorithm and implemented it using sci-kit-learn with Python. Remember to choose the right value of K and preprocess your data appropriately to achieve better results. KNN is just one of the many algorithms available in the vast world of machine learning, and mastering it is a stepping stone toward building more complex and sophisticated models. Happy learning and experimenting!

Author - Vandita Chauhan


