Mosquito Species Classification System

Overview

This project implements a machine learning system for classifying mosquito species based on their genomic and morphological features. It uses a Random Forest classifier whose hyperparameters are tuned with GridSearchCV.

Purpose

Mosquito species identification is critical for:

Disease vector surveillance and control
Ecological studies
Public health initiatives

This system provides an automated way to classify mosquito species using their genetic markers and physical characteristics.

Project Structure

mosquito_classification/
├── create_sample_data.py    # Generates synthetic genomic data
├── mosquito_classifier.py   # Main classification script
├── mosquito_genomic_data.csv # Dataset (generated)
└── README.md                # This documentation

Requirements

This project requires:

Python 3.x
pandas
scikit-learn
numpy

The installation steps below put these packages in a virtual environment to avoid conflicts with system packages.

Installation

Clone or download this repository

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate

Install required packages:
```
pip install pandas scikit-learn numpy
```

Usage

Generate Sample Data

If you don't have your own dataset, you can generate synthetic data:

python3 create_sample_data.py

This creates a file called mosquito_genomic_data.csv with 500 samples across 4 species.
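The contents of create_sample_data.py are not reproduced in this README, so the following is only a minimal sketch of how a dataset with the documented column layout could be generated. The distributions, the species-dependent shifts, the 0/1/2 SNP encoding, and the function name make_synthetic_data are all assumptions made for illustration, not the script's actual logic.

```python
import numpy as np
import pandas as pd

SPECIES = ["Anopheles_gambiae", "Aedes_aegypti", "Culex_pipiens", "Anopheles_stephensi"]
MORPHOLOGY = ["body_length", "wing_width", "proboscis_length", "leg_length", "thorax_width"]


def make_synthetic_data(n_samples=500, seed=0):
    """Build a toy DataFrame with the column layout described in this README."""
    rng = np.random.default_rng(seed)
    # Give every sample a species index in 0..3.
    labels = rng.integers(0, len(SPECIES), size=n_samples)

    data = {}
    # Gene expression: continuous values whose mean shifts with the species (assumed).
    for i in range(1, 11):
        data[f"gene_expr_{i}"] = rng.normal(loc=5.0 + 0.5 * labels, scale=1.0)
    # SNPs: genotype coded as 0, 1 or 2 copies of the minor allele (assumed encoding).
    for i in range(1, 21):
        data[f"snp_{i}"] = rng.integers(0, 3, size=n_samples)
    # Morphology: measurements with a species-dependent mean (assumed).
    for name in MORPHOLOGY:
        data[name] = rng.normal(loc=3.0 + 0.3 * labels, scale=0.2)

    df = pd.DataFrame(data)
    df["species"] = [SPECIES[k] for k in labels]
    return df


df = make_synthetic_data()
print(df.shape)
```

Calling df.to_csv("mosquito_genomic_data.csv", index=False) on the result would write a file with the same name the real script produces.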

Run Classification

To train and evaluate the classification model:

python3 mosquito_classifier.py

The script will:

Load the dataset
Split it into training and testing sets
Train a Random Forest model using grid search for hyperparameter optimization
Evaluate model performance
Display the most important features
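The load, split, train, and evaluate steps above can be sketched roughly as follows. This is not the code in mosquito_classifier.py: it uses stand-in data from scikit-learn's make_classification instead of the CSV, and the 80/20 split, the 3-fold cross-validation, and the reduced grid are assumptions chosen so the sketch runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data: 500 samples, 35 features, 4 classes (same dimensions as the README's dataset).
X, y = make_classification(
    n_samples=500, n_features=35, n_informative=10, n_classes=4, random_state=0
)

# Hold out 20% of the samples for testing; the ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Deliberately small grid so the sketch finishes quickly.
param_grid = {"n_estimators": [50], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training set,
# so the search object can predict directly.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("best params:", search.best_params_)
print("cv accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(test_accuracy, 3))
```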

Dataset Information

The dataset contains the following features:

Gene expression data: 10 gene expression levels (gene_expr_1 to gene_expr_10)
SNP data: 20 single nucleotide polymorphisms (snp_1 to snp_20)
Morphological features:
- body_length
- wing_width
- proboscis_length
- leg_length
- thorax_width
Target variable: species (Anopheles_gambiae, Aedes_aegypti, Culex_pipiens, Anopheles_stephensi)
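As a rough illustration, the column layout listed above can be written down as name lists and used to check a dataset before training. The constant names and the helper missing_columns are hypothetical and not part of the project's scripts.

```python
# Column names as documented in this README.
GENE_COLS = [f"gene_expr_{i}" for i in range(1, 11)]
SNP_COLS = [f"snp_{i}" for i in range(1, 21)]
MORPH_COLS = ["body_length", "wing_width", "proboscis_length", "leg_length", "thorax_width"]
FEATURE_COLS = GENE_COLS + SNP_COLS + MORPH_COLS
TARGET_COL = "species"


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in FEATURE_COLS + [TARGET_COL] if name not in present]


print(len(FEATURE_COLS))  # 35 feature columns: 10 + 20 + 5
print(missing_columns(FEATURE_COLS))  # ['species']
```

With pandas, the same check could be run as missing_columns(df.columns) right after reading the CSV.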

Model Details

The Random Forest classifier is optimized using GridSearchCV with the following parameters:

n_estimators: [100, 200]
max_depth: [None, 10, 20]
min_samples_split: [2, 5]
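A grid search evaluates every combination of the listed values. The short sketch below enumerates them with the standard library rather than scikit-learn, purely to show the size of the search space; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

# Cartesian product of all value lists: 2 * 3 * 2 candidates.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))  # 12
for candidate in candidates[:3]:
    print(candidate)
```

The README does not state the number of cross-validation folds. With scikit-learn's default of 5 folds, 12 candidates would mean 60 model fits plus one final refit of the best configuration.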

The best parameters found are:

max_depth: None
min_samples_split: 2
n_estimators: 200

Results

On the generated synthetic dataset, the model achieves:

Cross-validation accuracy: 99%
Test set accuracy: 100%

Top features for classification:

body_length
gene_expr_7
gene_expr_6
leg_length
gene_expr_4
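A ranking like the one above is typically read from a fitted Random Forest's feature_importances_ attribute. The sketch below shows one way to do that; it trains on stand-in data with placeholder feature names, and the helper top_features is hypothetical rather than taken from mosquito_classifier.py.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_features(model, feature_names, k=5):
    """Return the k largest impurity-based importances as a sorted Series."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)


# Stand-in data with placeholder names; the real script would use the CSV columns.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
names = [f"feature_{i}" for i in range(8)]
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

top = top_features(model, names)
print(top)
```

Impurity-based importances sum to 1 across all features, so the values are relative shares rather than absolute effect sizes.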

Future Improvements

Potential enhancements include:

Model persistence functionality
Interactive prediction for new samples
Additional visualization of results
Support for more species
Feature selection techniques
Testing with real-world data
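As one possible starting point for the model persistence item, a fitted classifier can be saved and reloaded with joblib, which is installed alongside scikit-learn. This is only a sketch on stand-in data; the file name mosquito_rf.joblib is a hypothetical choice.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model; in the project this would be the tuned Random Forest.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mosquito_rf.joblib"  # hypothetical file name
    joblib.dump(model, path)      # serialize the fitted model to disk
    restored = joblib.load(path)  # read it back into a new object

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Saved models should be loaded with the same scikit-learn version that produced them, and joblib files from untrusted sources should not be loaded.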

License

This project is provided as open-source software for educational and research purposes.
