This project implements the K-Nearest Neighbors (KNN) classification algorithm from scratch using Python and NumPy. The goal is to classify data points based on their nearest neighbors and evaluate the accuracy of the classification.
Features:
- Implementation of the KNN algorithm without using external machine learning libraries.
- Support for loading and processing dataset files (CSV format).
- Customizable number of neighbors (K) for classification.
- Computation of Euclidean distance for similarity measurement.
- Evaluation of accuracy by comparing predicted labels with actual test labels.
- Visualization of dataset using Matplotlib.
How it works:
- Data Loading: The dataset is loaded from CSV files containing labeled feature data.
- Training: The training dataset is stored for later lookup (there is no explicit training step, since KNN is a lazy learner).
- Prediction: For each test sample:
  - Compute the Euclidean distance between the test sample and all training samples.
  - Identify the K closest points (neighbors) in the training dataset.
  - Determine the most frequent label among the K nearest neighbors.
  - Assign this label to the test sample as the predicted class.
- Evaluation: The predicted labels are compared with the actual labels from the test set to compute accuracy.
The helper below computes the Euclidean distance between two points, $d(x_1, x_2) = \sqrt{\sum_i (x_{1,i} - x_{2,i})^2}$:

```python
import numpy as np

def euclidean_distance(x1, x2):
    # Straight-line (L2) distance between two feature vectors
    return np.sqrt(np.sum((x1 - x2) ** 2))
```
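For example, the distance between the points (0, 0) and (3, 4) is 5:

```python
print(euclidean_distance(np.array([0, 0]), np.array([3, 4])))  # 5.0
```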
The `KNN` class memorizes the training data in `fit` and defers all work to prediction time:

```python
from collections import Counter

class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy learner: simply store the training data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        return [self._predict(x) for x in X]

    def _predict(self, x):
        # Distances from x to every training sample
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k closest training samples
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # Majority vote among the k nearest neighbors
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]
```
Accuracy is the fraction of predictions that match the true labels:

```python
def calculate_accuracy(y_true, y_pred):
    # Count label matches and divide by the number of test samples
    correct = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
    return correct / len(y_true)
```
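For instance, if two of three predictions match the true labels:

```python
print(calculate_accuracy([0, 1, 1], [0, 1, 0]))  # 0.6666...
```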
Usage:
- Load the training and test datasets from CSV files (a sketch of a possible loader follows this list).
- Initialize the KNN classifier with a chosen value of K.
- Train the classifier using `fit(train_data, train_labels)`.
- Predict the labels of test samples using `predict(test_data)`.
- Compute the accuracy of the predictions.
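The example below also assumes a small CSV-loading helper. A minimal sketch, assuming each row stores the label first and the features after it (the `trainYX`/`testYX` file names suggest this layout) with no header row:

```python
def read_csv(path):
    # Hypothetical loader: each row is label, feature_1, ..., feature_n
    data = np.loadtxt(path, delimiter=',')
    return data[:, 1:], data[:, 0]  # (features, labels)
```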
Example:

```python
# Load training data
train_data, train_labels = read_csv('trainYX.csv')

# Load test data
test_data, test_labels = read_csv('testYX.csv')

# Initialize KNN classifier with k=5
clf = KNN(k=5)
clf.fit(train_data, train_labels)

# Predict on test data
predictions = clf.predict(test_data)

# Calculate accuracy
accuracy = calculate_accuracy(test_labels, predictions)
print(f'Accuracy: {accuracy:.4f}')
```
The Iris dataset is visualized with Matplotlib to show the class structure the KNN classifier exploits (each point is colored by its true class).
```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap

# Load the Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Scatter plot of petal length vs. petal width, colored by class
cmap = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
plt.scatter(X[:, 2], X[:, 3], c=y, cmap=cmap, edgecolor='k', s=20)
plt.show()
```
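To connect the plot back to the classifier, the from-scratch `KNN` can be run on the Iris train/test split created above (an illustrative extension, not part of the original script):

```python
# Classify the held-out Iris samples with the from-scratch KNN
clf = KNN(k=5)
clf.fit(X_train, y_train)
iris_predictions = clf.predict(X_test)
print(f'Iris accuracy: {calculate_accuracy(y_test, iris_predictions):.4f}')
```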
- The KNN model predicts the label for each test data point.
- The predicted labels are compared with the actual test labels.
- The accuracy is computed as:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Test Samples}}$$

- This provides an evaluation of how well the KNN classifier performs on the given dataset.
Possible future improvements:
- Implementing weighted KNN, where closer neighbors have higher influence (a sketch follows this list).
- Optimizing distance calculations for large datasets.
- Exploring different distance metrics such as Manhattan distance.
- Using KD-Trees or Ball Trees for faster neighbor searches.
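To illustrate the first item, here is a minimal sketch of distance-weighted voting (a hypothetical `_predict_weighted` method, not part of the current implementation):

```python
def _predict_weighted(self, x):
    # Weight each neighbor's vote by the inverse of its distance,
    # so closer neighbors influence the prediction more strongly.
    distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
    k_indices = np.argsort(distances)[:self.k]
    votes = {}
    for i in k_indices:
        weight = 1.0 / (distances[i] + 1e-8)  # small epsilon avoids division by zero
        votes[self.y_train[i]] = votes.get(self.y_train[i], 0.0) + weight
    # Return the label with the largest total weight
    return max(votes, key=votes.get)
```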
This project demonstrates how the K-Nearest Neighbors algorithm works, providing a hands-on approach to understanding distance-based classification. It can be extended for various applications, including image classification and recommendation systems.