A high-performance TypeScript application for scraping the politeianet.gr
online bookstore. It scrapes book data with robust error handling, intelligent concurrency, and comprehensive logging.
Intelligent Concurrency
- Adaptive batch processing with dynamic sizing
- Rate limiting to respect server constraints
- Smart queue management with deduplication
- Concurrent processing with configurable limits
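As a rough illustration of the queue deduplication listed above, here is a minimal sketch. The `LinkQueue` class and its method names are hypothetical and are not taken from `src/services/linkQueue.ts`; the sketch only assumes that links are plain URL strings.

```typescript
// Hypothetical sketch: a FIFO queue that ignores URLs it has already seen.
class LinkQueue {
  private readonly seen = new Set<string>(); // every URL ever enqueued
  private readonly pending: string[] = [];   // URLs still waiting to be processed

  // Adds a URL unless it was enqueued before; returns true if it was added.
  enqueue(url: string): boolean {
    if (this.seen.has(url)) {
      return false;
    }
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Removes and returns up to `size` URLs, oldest first.
  nextBatch(size: number): string[] {
    return this.pending.splice(0, size);
  }

  // Number of URLs still waiting.
  get length(): number {
    return this.pending.length;
  }
}
```

Because `seen` is never cleared, a URL that was already handed out in a batch is still rejected if it is discovered again on a later listing page.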
Robust Error Handling
- Automatic retries with exponential backoff
- Custom error types for different failure scenarios
- Comprehensive error tracking and logging
- Graceful error recovery in batch processing
Performance Optimizations
- Event throttling to prevent system overload
- Efficient memory usage with Set data structures
- Moving average calculations for processing statistics
- Adaptive concurrency based on system performance
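The moving average mentioned above could look roughly like the following. This is a sketch under assumptions: the `MovingAverage` class, its window size, and the idea of feeding it per-batch durations are all hypothetical and not taken from the project source.

```typescript
// Hypothetical sketch: average of the most recent `windowSize` samples.
class MovingAverage {
  private readonly windowSize: number;
  private readonly samples: number[] = [];
  private sum = 0;

  constructor(windowSize: number) {
    if (!Number.isInteger(windowSize) || windowSize <= 0) {
      throw new RangeError("windowSize must be a positive integer");
    }
    this.windowSize = windowSize;
  }

  // Records a new sample and drops the oldest one once the window is full.
  add(value: number): void {
    this.samples.push(value);
    this.sum += value;
    if (this.samples.length > this.windowSize) {
      this.sum -= this.samples.shift() as number;
    }
  }

  // Current average, or 0 when no samples have been recorded yet.
  get value(): number {
    return this.samples.length === 0 ? 0 : this.sum / this.samples.length;
  }
}
```

A scraper might, for example, record the duration of each batch and shrink the batch size when the average rises, although how this project uses its statistics is not specified here.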
Prerequisites
- Node.js (v16 or higher)
- TypeScript
- npm or yarn
Installation
- Clone the repository
- Install dependencies:
  npm install
Create a .env file in the project root with your configuration:
BASE_URL=https://www.politeianet.gr/
OUTPUT_FILE=books.csv
BOOK_LIST_PATH=/index.php?option=com_virtuemart&Itemid=506
HEADLESS=true
MAX_CONCURRENT=5
RATE_LIMIT_PER_MINUTE=30
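For illustration, the variables above could be read and validated along these lines. This is only a sketch: the `ScraperConfig` interface and the `loadConfig` function are hypothetical names, and the actual logic in `src/config.ts` may differ (for example, it may supply defaults instead of throwing).

```typescript
// Hypothetical shape for the settings defined in .env.
interface ScraperConfig {
  baseUrl: string;
  outputFile: string;
  bookListPath: string;
  headless: boolean;
  maxConcurrent: number;
  rateLimitPerMinute: number;
}

type Env = Record<string, string | undefined>;

// Returns the value for `key`, or throws if it is missing or empty.
function requireString(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") {
    throw new Error(`Missing required setting: ${key}`);
  }
  return value;
}

// Parses `key` as a positive integer, or throws if it is not one.
function requirePositiveInt(env: Env, key: string): number {
  const parsed = Number(requireString(env, key));
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`Setting ${key} must be a positive integer`);
  }
  return parsed;
}

// Builds a typed config object from an environment-like record.
function loadConfig(env: Env): ScraperConfig {
  return {
    baseUrl: requireString(env, "BASE_URL"),
    outputFile: requireString(env, "OUTPUT_FILE"),
    bookListPath: requireString(env, "BOOK_LIST_PATH"),
    // Assumption: only the literal string "true" enables headless mode.
    headless: requireString(env, "HEADLESS") === "true",
    maxConcurrent: requirePositiveInt(env, "MAX_CONCURRENT"),
    rateLimitPerMinute: requirePositiveInt(env, "RATE_LIMIT_PER_MINUTE"),
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)` once the .env file has been loaded into the environment.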
The scraper operates in two phases:
1. Link collection: scrapes all book links from the listing pages.
2. Details collection: processes the collected links to gather detailed book information. If a batch fails, it is retried automatically with exponential backoff:
- First retry: 5 second delay
- Second retry: 10 second delay
- Third retry: 20 second delay
- After 3 failed attempts, the batch is marked as failed and the scraper continues with the next batch
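The retry schedule above (5, 10, then 20 seconds) doubles the delay on each retry. A minimal sketch of that pattern follows. The names `retryDelayMs` and `runBatchWithRetry` are hypothetical, and the sketch assumes one initial attempt followed by up to three retries, which is one possible reading of the list above rather than the confirmed behavior.

```typescript
const BASE_DELAY_MS = 5000; // delay before the first retry
const MAX_RETRIES = 3;      // assumption: retries after the initial attempt

// Delay before the given retry (1-based): 5000, 10000, 20000 ms.
function retryDelayMs(retry: number): number {
  return BASE_DELAY_MS * 2 ** (retry - 1);
}

type Sleep = (ms: number) => Promise<void>;

// Default sleep based on setTimeout; tests can inject a fake instead.
const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task`, retrying with exponential backoff when it throws.
// Resolves to true on success, false once all retries are exhausted.
async function runBatchWithRetry(
  task: () => Promise<void>,
  sleep: Sleep = realSleep,
): Promise<boolean> {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    if (attempt > 0) {
      await sleep(retryDelayMs(attempt));
    }
    try {
      await task();
      return true;
    } catch {
      // Swallow the error and fall through to the next retry, if any.
    }
  }
  return false; // caller marks the batch as failed and moves on
}
```

Returning `false` instead of throwing lets the caller record the failed batch and continue with the next one, matching the last bullet above.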
Project Structure
src/
├── services/ # Core services
│   ├── browser.ts # Browser automation service
│   ├── linkQueue.ts # Queue management service
│   └── storage.ts # Data persistence service
├── config.ts # Configuration management
├── detailsScraper.ts # Book details scraping logic
├── linkScraper.ts # Book links collection logic
├── logger.ts # Logging implementation
├── types.ts # TypeScript type definitions
└── utils.ts # Utility functions
The scraper generates a CSV file with the following information for each book:
- Title
- Author
- Number of recommendations
- Source URL
- Scraping timestamp
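To show how one such record might be written as a CSV line, here is a small sketch. The `BookRecord` field names, the column order, and the quoting rules (RFC 4180 style) are assumptions made for illustration; the format actually produced by `src/services/storage.ts` may differ.

```typescript
// Hypothetical record shape mirroring the fields listed above.
interface BookRecord {
  title: string;
  author: string;
  recommendations: number;
  sourceUrl: string;
  scrapedAt: string; // assumed to be an ISO 8601 timestamp
}

// Quotes a field when it contains a comma, quote, or line break,
// doubling any embedded quotes.
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Serializes one book as a single CSV line (without a trailing newline).
function toCsvRow(book: BookRecord): string {
  return [
    book.title,
    book.author,
    String(book.recommendations),
    book.sourceUrl,
    book.scrapedAt,
  ]
    .map(escapeCsvField)
    .join(",");
}
```

Quoting matters here because book titles and author names frequently contain commas.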
License: ISC