This repository implements a multi-objective recommender system built on candidate ranker models and optimized for large-scale e-commerce datasets, predicting user interactions such as clicks, carts, and orders.
This repository contains code created during the Kaggle competition OTTO - Multi-Objective Recommender System, held in January 2023 (Kaggle Profile for Myles Dunlap). The approach in this repository finished in the top 2% of 2,587 competing teams.
The objective of the competition was to predict e-commerce clicks, carts, and orders for a one-week period. The main approach I used was a Candidate Ranker model, commonly employed in large-dataset recommender systems. Candidate ranker models are explained in the References section [1, 2].
The basic concept behind a candidate ranker model is illustrated in Figure 1 below.
Figure 1: A Large Dataset Candidate Ranker Process [1]

In candidate ranker models, the item corpus, which often contains millions or billions of items, must be narrowed down before items can be recommended to users. The first stage generates a set of candidate items (see Candidates Generation). For this competition, hand-crafted rules, Word2Vec, and Bayesian Personalized Ranking (BPR) were used to generate variable numbers of candidate-item pairs. The code for generating the candidates is located in co_vis_matrices.py and candidates.py.
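As a concrete illustration of this candidate-generation step, the sketch below shows how item similarities might be produced with gensim's Word2Vec and how candidates might be scored with the implicit library's BPR model. The file name, column names, and hyperparameters are assumptions for illustration only; the actual logic lives in candidates.py.

```python
# Illustrative sketch of Word2Vec and BPR candidate generation.
# File name, column names, and hyperparameters are assumptions, not the
# exact settings used in candidates.py.
import numpy as np
import polars as pl
from gensim.models import Word2Vec
from implicit.bpr import BayesianPersonalizedRanking
from scipy.sparse import csr_matrix

events = pl.read_parquet("train_events.parquet")  # columns: session, aid, ts, type

# Word2Vec: treat each session's sequence of item ids as a "sentence".
item_seqs = (events.sort(["session", "ts"])
                   .group_by("session", maintain_order=True)
                   .agg(pl.col("aid"))
                   .get_column("aid")
                   .to_list())
sentences = [[str(a) for a in seq] for seq in item_seqs]
w2v = Word2Vec(sentences=sentences, vector_size=32, window=3, min_count=1, workers=4)
similar_items = w2v.wv.most_similar(sentences[0][0], topn=20)  # 20 nearest items to one item

# BPR: build a sparse session-by-item matrix of implicit feedback and fit BPR
# (implicit >= 0.5 expects a user-by-item matrix).
sess = events.get_column("session").to_numpy()
aids = events.get_column("aid").to_numpy()
user_items = csr_matrix((np.ones(len(sess)), (sess, aids)))
bpr = BayesianPersonalizedRanking(factors=64, iterations=100)
bpr.fit(user_items)
item_ids, scores = bpr.recommend(0, user_items[0], N=20)  # 20 candidates for session 0
```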
Co-visitation matrices (source) describe products frequently viewed and bought together. Three co-visitation matrices were used to provide similar items that were:
- Clicked
- Clicked/Carted/Ordered
- Bought [3].
The RAPIDS cuDF GPU data science library was used to quickly process these matrices instead of relying on multi-core dataframe libraries.
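The sketch below illustrates the core idea behind a co-visitation matrix on the GPU: pair events from the same session, keep pairs that occur close together in time, count co-occurrences, and keep the strongest partners per item. The file name, 24-hour window, and top-20 cutoff are assumptions; the full implementation (including chunked processing and event-type weighting) is in co_vis_matrices.py.

```python
# Illustrative co-visitation matrix sketch with cuDF. The real implementation
# processes the events in chunks and applies event-type weights; this version
# shows only the core idea. Assumes ts is in seconds.
import cudf

events = cudf.read_parquet("train_events.parquet")  # columns: session, aid, ts, type

# Pair every event with every other event from the same session.
pairs = events.merge(events, on="session", suffixes=("", "_right"))

# Keep distinct item pairs that occurred within 24 hours of each other.
day = 24 * 60 * 60
pairs = pairs[(pairs.aid != pairs.aid_right) &
              ((pairs.ts - pairs.ts_right).abs() < day)]

# Count co-occurrences and keep the 20 strongest partners per item.
pairs["wgt"] = 1
co_vis = (pairs.groupby(["aid", "aid_right"])
               .agg({"wgt": "sum"})
               .reset_index()
               .sort_values(["aid", "wgt"], ascending=[True, False]))
co_vis["rank"] = co_vis.groupby("aid").cumcount()
top_20 = co_vis[co_vis["rank"] < 20]
```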
The competition provided a large dataset for training and testing, with a time span of approximately 6 weeks, sourced from OTTO, the largest German online shop. The dataset contains the following data points (in millions):
- 12.9M sessions (user activity)
- 1.8M items (products)
- 194.7M clicks
- 16.8M carts
- 5.1M orders
The goal was to recommend the next 20 items a user would click, cart, and order.
- Item and User Similarities:
  - Word2Vec: used for generating item similarities.
  - Implicit: used for Bayesian Personalized Ranking (BPR).
  - Research paper: BPR: Bayesian Personalized Ranking from Implicit Feedback
- RAPIDS AI: GPU acceleration for data science operations (e.g., dataframes).
- POLARS: a fast dataframe library written in Rust. It supports lazy execution and is optimized for multi-threaded operations.
- XGBoost: used to build an XGB Ranker model. The model used a pairwise objective function to refine item recommendations from the candidate generation stage. The XGB Ranker model code can be found in xgb_ranker.py. Additionally, the XGB model was trained on the GPU.
```python
# Model Setup
xgb_params = {'objective': 'rank:pairwise',
              'tree_method': 'gpu_hist',
              'learning_rate': hps.learning_rate,
              'max_depth': hps.max_depth,
              'colsample_bytree': hps.colsample_bytree,
              }

model = xgb.train(xgb_params,
                  dtrain=dtrain,
                  evals=[(dtrain, 'train'),
                         (dvalid, 'valid')],
                  num_boost_round=hps.num_boost_round,
                  verbose_eval=hps.verbose_eval,
                  early_stopping_rounds=hps.early_stopping_rounds,
                  )
```
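With a pairwise ranking objective, XGBoost also needs to know which candidate rows belong to the same session; one common way to supply this is DMatrix.set_group. The sketch below shows how the training matrix might carry that grouping and how the ranker's scores might be turned into top-20 recommendations per session. The file, column, and feature names are illustrative assumptions, and the actual grouping and inference code is in xgb_ranker.py.

```python
# Hypothetical sketch: group candidate rows by session for rank:pairwise and
# keep each session's 20 highest-scoring candidates at inference time.
# 'model' refers to the booster trained above; names here are illustrative.
import pandas as pd
import xgboost as xgb

FEATURES = ['item_clicks', 'item_carts', 'w2v_sim']  # illustrative feature names

train_df = pd.read_parquet('train_candidates.parquet').sort_values('session')  # groups must be contiguous
group_sizes = train_df.groupby('session').size().to_numpy()

dtrain = xgb.DMatrix(train_df[FEATURES], label=train_df['target'])
dtrain.set_group(group_sizes)  # one entry per session: its number of candidate rows

# At inference, score every (session, candidate) pair and keep the top 20 per session.
test_df = pd.read_parquet('test_candidates.parquet')
test_df['score'] = model.predict(xgb.DMatrix(test_df[FEATURES]))
top_20 = (test_df.sort_values(['session', 'score'], ascending=[True, False])
                 .groupby('session')
                 .head(20))
```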
Because of the dataset's size, out-of-memory errors (i.e., insufficient RAM) were a common issue. To overcome this, data was loaded from disk in chunks and fed to the XGB model in batches. This technique is implemented in the following code, found in xgb_ranker.py.
```python
import numpy as np
import cudf
import xgboost as xgb

class IterLoadForDMatrix(xgb.core.DataIter):
    """Iterates over a dataframe in batches so XGBoost can build its
    DMatrix without materializing all of the data on the GPU at once."""

    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.it = 0  # set iterator to 0
        self.batch_size = batch_size
        self.batches = int(np.ceil(len(df) / self.batch_size))
        super().__init__()

    def reset(self):
        '''Reset the iterator.'''
        self.it = 0

    def next(self, input_data):
        '''Yield the next batch of data.'''
        if self.it == self.batches:
            return 0  # Return 0 when there are no more batches.
        a = self.it * self.batch_size
        b = min((self.it + 1) * self.batch_size, len(self.df))
        dt = cudf.DataFrame(self.df.iloc[a:b])
        input_data(data=dt[self.features], label=dt[self.target])  # , weight=dt['weight'])
        self.it += 1
        return 1
```
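A short sketch of how this iterator might be consumed, assuming an illustrative candidate dataframe train_df, a FEATURES list, and a 'target' column:

```python
# Hypothetical usage: build a quantized DMatrix from the batch iterator so the
# whole training dataframe never has to live on the GPU at once.
# (xgb.DeviceQuantileDMatrix is named xgb.QuantileDMatrix in newer XGBoost releases.)
train_iter = IterLoadForDMatrix(train_df, FEATURES, 'target', batch_size=256*1024)
dtrain = xgb.DeviceQuantileDMatrix(train_iter, max_bin=256)
```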
```python
from sklearn.model_selection import GroupKFold

def stratify_folds(df):
    # Group K-Fold: rows are grouped by session so that a session never
    # appears in both the training and validation splits of a fold.
    skf = GroupKFold(n_splits=5)
    strat = {}
    for fold, (train_idx, valid_idx) in enumerate(
            skf.split(df, df['target'], groups=df['session'])):
        strat[fold] = {'fold': fold,
                       'train_idx': train_idx,
                       'valid_idx': valid_idx,
                       'num_rows': len(df)}
    return strat
```