Book Recommendations Scraper

A high-performance TypeScript application for scraping the politeianet.gr online bookstore. It collects book data with robust error handling, intelligent concurrency, and comprehensive logging.

Features

Intelligent Concurrency
- Adaptive batch processing with dynamic sizing
- Rate limiting to respect server constraints
- Smart queue management with deduplication
- Concurrent processing with configurable limits

Robust Error Handling
- Automatic retries with exponential backoff
- Custom error types for different failure scenarios
- Comprehensive error tracking and logging
- Graceful error recovery in batch processing

Performance Optimizations
- Event throttling to prevent system overload
- Efficient memory usage with Set data structures
- Moving average calculations for processing statistics
- Adaptive concurrency based on system performance

Prerequisites

- Node.js (v16 or higher)
- TypeScript
- npm or yarn

Installation

Clone the repository
Install dependencies:

npm install

Configuration

Create a .env file in the project root with your configuration:

BASE_URL=https://www.politeianet.gr/
OUTPUT_FILE=books.csv
BOOK_LIST_PATH=/index.php?option=com_virtuemart&Itemid=506
HEADLESS=true
MAX_CONCURRENT=5
RATE_LIMIT_PER_MINUTE=30
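
As an illustration of how these settings might be read and validated, here is a minimal sketch. The `ScraperConfig` interface, the `loadConfig` and `minRequestIntervalMs` helpers, and the fallback defaults are all assumptions made for this example; the project's actual config.ts may be organized differently.

```typescript
// Hypothetical config shape; field names mirror the .env keys above.
interface ScraperConfig {
  baseUrl: string;
  outputFile: string;
  bookListPath: string;
  headless: boolean;
  maxConcurrent: number;
  rateLimitPerMinute: number;
}

type Env = Record<string, string | undefined>;

// Parse a positive integer, using the fallback when the variable is unset.
function parsePositiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Build a typed config from environment variables.
// The default values below are assumptions for this sketch.
function loadConfig(env: Env): ScraperConfig {
  const baseUrl = env.BASE_URL;
  if (!baseUrl) throw new Error("BASE_URL is required");
  return {
    baseUrl,
    outputFile: env.OUTPUT_FILE ?? "books.csv",
    bookListPath: env.BOOK_LIST_PATH ?? "/",
    headless: (env.HEADLESS ?? "true").toLowerCase() !== "false",
    maxConcurrent: parsePositiveInt("MAX_CONCURRENT", env.MAX_CONCURRENT, 5),
    rateLimitPerMinute: parsePositiveInt("RATE_LIMIT_PER_MINUTE", env.RATE_LIMIT_PER_MINUTE, 30),
  };
}

// Minimum spacing between requests implied by the per-minute rate limit.
function minRequestIntervalMs(config: ScraperConfig): number {
  return 60000 / config.rateLimitPerMinute;
}
```

In a Node.js process this could be called as `loadConfig(process.env)` once the .env file has been loaded. With `RATE_LIMIT_PER_MINUTE=30`, requests would be spaced at least 2000 ms apart.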

Usage

The scraper operates in two phases:

1. Collect Book Links

Scrapes all book links from the listing pages.
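
The Features list mentions queue management with deduplication backed by Set data structures. The sketch below shows one way collected links could be deduplicated before the details phase. The `LinkQueue` class, its method names, and the fragment-stripping rule are hypothetical; the real linkQueue.ts may work differently.

```typescript
// Hypothetical deduplicating queue; not the project's actual implementation.
class LinkQueue {
  private readonly seen = new Set<string>();
  private readonly pending: string[] = [];

  // Queue a link. Returns true if it was new, false if it was a duplicate.
  add(url: string): boolean {
    const key = LinkQueue.normalize(url);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(key);
    return true;
  }

  // Remove and return up to `size` links for the next batch.
  nextBatch(size: number): string[] {
    return this.pending.splice(0, size);
  }

  // Number of links still waiting to be processed.
  get length(): number {
    return this.pending.length;
  }

  // Assumption: links that differ only by their #fragment point to the same page.
  private static normalize(url: string): string {
    const hashIndex = url.indexOf("#");
    return (hashIndex === -1 ? url : url.slice(0, hashIndex)).trim();
  }
}
```

Because membership checks on a `Set` are constant time on average, the cost of `add` does not grow with the number of links already collected.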

2. Scrape Book Details

Processes the collected links to gather detailed book information. If a batch fails, it is retried automatically with exponential backoff:

- First retry: 5-second delay
- Second retry: 10-second delay
- Third retry: 20-second delay
- After 3 failed attempts, the batch is marked as failed and the scraper continues with the next batch
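
The retry schedule above doubles the delay each time, starting from 5 seconds. A minimal sketch of that pattern follows. The names `retryDelayMs` and `runBatchWithRetry` are hypothetical, and the sketch assumes one initial attempt followed by up to three retries, which is one possible reading of the schedule rather than the project's confirmed behavior.

```typescript
// Delay before the nth retry (1-based): 5000, 10000, 20000 ms with the default base.
function retryDelayMs(retry: number, baseDelayMs = 5000): number {
  return baseDelayMs * 2 ** (retry - 1);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

type BatchResult<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Run a batch, retrying with exponential backoff.
// Assumption: one initial attempt plus up to `maxRetries` retries.
// The `wait` parameter is injectable so the delays can be observed in tests.
async function runBatchWithRetry<T>(
  runBatch: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 5000,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<BatchResult<T>> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    // Attempt 0 runs immediately; each retry waits first.
    if (attempt > 0) await wait(retryDelayMs(attempt, baseDelayMs));
    try {
      return { ok: true, value: await runBatch() };
    } catch (error) {
      lastError = error;
    }
  }
  // All attempts failed: report the failure so the caller can move on to the next batch.
  return { ok: false, error: lastError };
}
```

Returning a result object instead of throwing lets the caller record the failed batch and continue, which matches the "marked as failed and continues" behavior described above.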

Project Structure

src/
├── services/           # Core services
│   ├── browser.ts     # Browser automation service
│   ├── linkQueue.ts   # Queue management service
│   └── storage.ts     # Data persistence service
├── config.ts          # Configuration management
├── detailsScraper.ts  # Book details scraping logic
├── linkScraper.ts     # Book links collection logic
├── logger.ts          # Logging implementation
├── types.ts           # TypeScript type definitions
└── utils.ts           # Utility functions

Output

The scraper generates a CSV file with the following information for each book:

- Title
- Author
- Number of recommendations
- Source URL
- Scraping timestamp
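
For illustration, one record with these fields could be serialized to a CSV row as sketched below. The `BookRecord` interface, the column names and order in `CSV_HEADER`, and the ISO 8601 timestamp format are assumptions for this example; the actual file written by storage.ts may differ.

```typescript
// Hypothetical record shape based on the fields listed above.
interface BookRecord {
  title: string;
  author: string;
  recommendations: number;
  sourceUrl: string;
  scrapedAt: Date;
}

// Assumed column names and order; the real output file may use different ones.
const CSV_HEADER = "title,author,recommendations,source_url,scraped_at";

// Quote a field if it contains a comma, quote, or line break (RFC 4180 style),
// doubling any embedded quotes.
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Serialize one book as a single CSV row, matching CSV_HEADER's column order.
function toCsvRow(book: BookRecord): string {
  return [
    book.title,
    book.author,
    String(book.recommendations),
    book.sourceUrl,
    book.scrapedAt.toISOString(),
  ]
    .map(escapeCsvField)
    .join(",");
}
```

Quoting matters here because book titles often contain commas or quotation marks, which would otherwise shift the columns.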

License

ISC
