Data Preparation Guide

Main Workflow

Download the dataset from the link provided in the table below.
Unzip the dataset and place it in any directory.
- Make sure all data files are in the same directory.
- Necessary files for Yelp:
  - yelp_academic_dataset_business.json,
  - yelp_academic_dataset_user.json,
  - yelp_academic_dataset_review.json
- Necessary files for Amazon:
  - Industrial_and_Scientific.csv,
  - Musical_Instruments.csv,
  - Video_Games.csv,
  - Industrial_and_Scientific.jsonl,
  - Musical_Instruments.jsonl,
  - Video_Games.jsonl,
  - meta_Industrial_and_Scientific.jsonl,
  - meta_Musical_Instruments.jsonl,
  - meta_Video_Games.jsonl
- Necessary files for Goodreads:
  - goodreads_books_children.json,
  - goodreads_reviews_children.json,
  - goodreads_books_comics_graphic.json,
  - goodreads_reviews_comics_graphic.json,
  - goodreads_books_poetry.json,
  - goodreads_reviews_poetry.json
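
Before running the processing script, it can help to confirm that every file listed above is actually present in the chosen directory. The following is a minimal sketch of such a check, not part of the repository: `REQUIRED_FILES` and `missing_files` are hypothetical names introduced here, and the file names are copied from the lists above.

```python
from pathlib import Path

# Hypothetical helper tables; the file names come from the lists in this guide.
AMAZON_CATEGORIES = ["Industrial_and_Scientific", "Musical_Instruments", "Video_Games"]
GOODREADS_GENRES = ["children", "comics_graphic", "poetry"]

REQUIRED_FILES = {
    "yelp": [
        "yelp_academic_dataset_business.json",
        "yelp_academic_dataset_user.json",
        "yelp_academic_dataset_review.json",
    ],
    # Each Amazon category needs a .csv, a .jsonl, and a meta_*.jsonl file.
    "amazon": [
        pattern.format(category)
        for pattern in ("{}.csv", "{}.jsonl", "meta_{}.jsonl")
        for category in AMAZON_CATEGORIES
    ],
    # Each Goodreads genre needs a books file and a reviews file.
    "goodreads": [
        pattern.format(genre)
        for genre in GOODREADS_GENRES
        for pattern in ("goodreads_books_{}.json", "goodreads_reviews_{}.json")
    ],
}


def missing_files(data_dir, dataset):
    """Return the required file names for `dataset` that are absent from `data_dir`."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES[dataset] if not (root / name).is_file()]
```

For example, `missing_files("<path_to_raw_dataset>", "amazon")` returns an empty list when all nine Amazon files sit in the same directory, and the names of the absent files otherwise.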
Run the data_process.py script to prepare the data for the simulation and recommendation tasks.

python data_process.py --input <path_to_raw_dataset> --output <path_to_processed_dataset>

Dataset	len(review)	len(business)	len(user)	link
Yelp	-	-	-	download
Industrial_and_Scientific	412.9K	25.8K	51.0K	review, meta, rating_only
Musical_Instruments	511.8K	24.6K	57.4K	review, meta, rating_only
Video_Games	814.6K	25.6K	94.8K	review, meta, rating_only
Amazon (total)	1,739,300	76,000	-	-
Goodreads_Children	734,640	124,082	-	review, meta
Goodreads_Comics&Graphic	542,338	89,411	-	review, meta
Goodreads_Poetry	154,555	36,514	-	review, meta
Goodreads (total)	1,431,533	250,007	-	-
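
The len(review) figures in the table can be roughly sanity-checked against a downloaded file by counting its records. The sketch below assumes one JSON object per line, which holds for the .jsonl files by definition but is only an assumption for the .json files; `count_records` is a hypothetical helper, not something provided by data_process.py. The table does not say whether its counts refer to raw or processed data, so a mismatch is not necessarily an error.

```python
from pathlib import Path


def count_records(path):
    """Count non-empty lines, i.e. records in a line-delimited JSON file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())
```

Calling `count_records` on a review file such as Video_Games.jsonl streams the file line by line, so it works on large files without loading them into memory.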