This repository implements a multi-objective recommender system built on candidate ranker models and optimized for large-scale e-commerce datasets, predicting user interactions such as clicks, carts, and orders.
This repository contains code created during the Kaggle competition OTTO - Multi-Objective Recommender System, held in January 2023 (Kaggle Profile for Myles Dunlap). The approach in this repository finished in the top 2% of 2,587 competing teams.
The objective of the competition was to predict e-commerce clicks, carts, and orders for a one-week period. The main approach I used was a Candidate Ranker model, commonly employed in large-dataset recommender systems. Candidate ranker models are explained in the References section [1, 2].
The basic concept behind a candidate ranker model is illustrated in Figure 1 below.
Figure 1: A Large Dataset Candidate Ranker Process [1]

In candidate ranker models, the item corpus, which often contains millions or billions of items, must be narrowed down before items can be recommended to users. The first stage generates a set of candidate items (see Candidates Generation). For this competition, hand-crafted rules, Word2Vec, and Bayesian Personalized Ranking (BPR) were used to generate variable numbers of candidate-item pairs. The code for generating the candidates is located in co_vis_matrices.py and candidates.py.
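As a concrete illustration of this candidate-generation step, the sketch below shows how item similarities might be produced with gensim's Word2Vec and how candidates might be scored with the implicit library's BPR model. The file name, column names, and hyperparameters are assumptions for illustration only; the actual logic lives in candidates.py.

```python
# Illustrative sketch of Word2Vec and BPR candidate generation.
# File name, column names, and hyperparameters are assumptions, not the
# exact settings used in candidates.py.
import numpy as np
import polars as pl
from gensim.models import Word2Vec
from implicit.bpr import BayesianPersonalizedRanking
from scipy.sparse import csr_matrix

events = pl.read_parquet("train_events.parquet")  # columns: session, aid, ts, type

# Word2Vec: treat each session's sequence of item ids as a "sentence".
item_seqs = (events.sort(["session", "ts"])
                   .group_by("session", maintain_order=True)
                   .agg(pl.col("aid"))
                   .get_column("aid")
                   .to_list())
sentences = [[str(a) for a in seq] for seq in item_seqs]
w2v = Word2Vec(sentences=sentences, vector_size=32, window=3, min_count=1, workers=4)
similar_items = w2v.wv.most_similar(sentences[0][0], topn=20)  # 20 nearest items to one item

# BPR: build a sparse session-by-item matrix of implicit feedback and fit BPR
# (implicit >= 0.5 expects a user-by-item matrix).
sess = events.get_column("session").to_numpy()
aids = events.get_column("aid").to_numpy()
user_items = csr_matrix((np.ones(len(sess)), (sess, aids)))
bpr = BayesianPersonalizedRanking(factors=64, iterations=100)
bpr.fit(user_items)
item_ids, scores = bpr.recommend(0, user_items[0], N=20)  # 20 candidates for session 0
```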
Co-visitation matrices (source) describe products frequently viewed and bought together. Three co-visitation matrices were used to provide similar items that were:
- Clicked
- Clicked/Carted/Ordered
- Bought [3].
The RAPIDS cuDF GPU data science library was used to quickly process these matrices instead of relying on multi-core dataframe libraries.
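The sketch below illustrates the core idea behind a co-visitation matrix on the GPU: pair events from the same session, keep pairs that occur close together in time, count co-occurrences, and keep the strongest partners per item. The file name, 24-hour window, and top-20 cutoff are assumptions; the full implementation (including chunked processing and event-type weighting) is in co_vis_matrices.py.

```python
# Illustrative co-visitation matrix sketch with cuDF. The real implementation
# processes the events in chunks and applies event-type weights; this version
# shows only the core idea. Assumes ts is in seconds.
import cudf

events = cudf.read_parquet("train_events.parquet")  # columns: session, aid, ts, type

# Pair every event with every other event from the same session.
pairs = events.merge(events, on="session", suffixes=("", "_right"))

# Keep distinct item pairs that occurred within 24 hours of each other.
day = 24 * 60 * 60
pairs = pairs[(pairs.aid != pairs.aid_right) &
              ((pairs.ts - pairs.ts_right).abs() < day)]

# Count co-occurrences and keep the 20 strongest partners per item.
pairs["wgt"] = 1
co_vis = (pairs.groupby(["aid", "aid_right"])
               .agg({"wgt": "sum"})
               .reset_index()
               .sort_values(["aid", "wgt"], ascending=[True, False]))
co_vis["rank"] = co_vis.groupby("aid").cumcount()
top_20 = co_vis[co_vis["rank"] < 20]
```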
The competition provided a large dataset for training and testing, with a time span of approximately 6 weeks, sourced from OTTO, the largest German online shop. The dataset contains the following data points (in millions):
- 12.9M sessions (user activity)
- 1.8M items (products)
- 194.7M clicks
- 16.8M carts
- 5.1M orders
The goal was to recommend the next 20 items a user would click, cart, and order.
- Item and User Similarities:
  - Word2Vec: used for generating item similarities.
  - Implicit: used for Bayesian Personalized Ranking (BPR).
  - Research paper: BPR: Bayesian Personalized Ranking from Implicit Feedback
- RAPIDS AI: GPU acceleration for data science operations (e.g., dataframes).
- POLARS: a fast dataframe library written in Rust. It supports lazy execution and is optimized for multi-threaded operations.
- XGBoost: used to build an XGB Ranker model. The model used a pairwise objective function to refine item recommendations from the candidate generation stage. The XGB Ranker model code can be found in xgb_ranker.py. Additionally, the XGB model was trained on the GPU.
```python
# Model Setup
xgb_params = {'objective': 'rank:pairwise',
              'tree_method': 'gpu_hist',
              'learning_rate': hps.learning_rate,
              'max_depth': hps.max_depth,
              'colsample_bytree': hps.colsample_bytree,
              }

model = xgb.train(xgb_params,
                  dtrain=dtrain,
                  evals=[(dtrain, 'train'),
                         (dvalid, 'valid')],
                  num_boost_round=hps.num_boost_round,
                  verbose_eval=hps.verbose_eval,
                  early_stopping_rounds=hps.early_stopping_rounds,
                  )
```
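With a pairwise ranking objective, XGBoost also needs to know which candidate rows belong to the same session; one common way to supply this is DMatrix.set_group. The sketch below shows how the training matrix might carry that grouping and how the ranker's scores might be turned into top-20 recommendations per session. The file, column, and feature names are illustrative assumptions, and the actual grouping and inference code is in xgb_ranker.py.

```python
# Hypothetical sketch: group candidate rows by session for rank:pairwise and
# keep each session's 20 highest-scoring candidates at inference time.
# 'model' refers to the booster trained above; names here are illustrative.
import pandas as pd
import xgboost as xgb

FEATURES = ['item_clicks', 'item_carts', 'w2v_sim']  # illustrative feature names

train_df = pd.read_parquet('train_candidates.parquet').sort_values('session')  # groups must be contiguous
group_sizes = train_df.groupby('session').size().to_numpy()

dtrain = xgb.DMatrix(train_df[FEATURES], label=train_df['target'])
dtrain.set_group(group_sizes)  # one entry per session: its number of candidate rows

# At inference, score every (session, candidate) pair and keep the top 20 per session.
test_df = pd.read_parquet('test_candidates.parquet')
test_df['score'] = model.predict(xgb.DMatrix(test_df[FEATURES]))
top_20 = (test_df.sort_values(['session', 'score'], ascending=[True, False])
                 .groupby('session')
                 .head(20))
```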
Because of the dataset's size, out-of-memory errors (i.e., insufficient RAM) were a common issue. To overcome this, data was loaded from disk in chunks and fed to the XGB model in batches. This technique is implemented in the following code, found in xgb_ranker.py.
```python
import numpy as np
import cudf
import xgboost as xgb

class IterLoadForDMatrix(xgb.core.DataIter):
    """Iterates over a dataframe in batches so XGBoost can build its
    DMatrix without materializing all of the data on the GPU at once."""

    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.it = 0  # set iterator to 0
        self.batch_size = batch_size
        self.batches = int(np.ceil(len(df) / self.batch_size))
        super().__init__()

    def reset(self):
        '''Reset the iterator.'''
        self.it = 0

    def next(self, input_data):
        '''Yield the next batch of data.'''
        if self.it == self.batches:
            return 0  # Return 0 when there are no more batches.
        a = self.it * self.batch_size
        b = min((self.it + 1) * self.batch_size, len(self.df))
        dt = cudf.DataFrame(self.df.iloc[a:b])
        input_data(data=dt[self.features], label=dt[self.target])  # , weight=dt['weight'])
        self.it += 1
        return 1
```
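A short sketch of how this iterator might be consumed, assuming an illustrative candidate dataframe train_df, a FEATURES list, and a 'target' column:

```python
# Hypothetical usage: build a quantized DMatrix from the batch iterator so the
# whole training dataframe never has to live on the GPU at once.
# (xgb.DeviceQuantileDMatrix is named xgb.QuantileDMatrix in newer XGBoost releases.)
train_iter = IterLoadForDMatrix(train_df, FEATURES, 'target', batch_size=256*1024)
dtrain = xgb.DeviceQuantileDMatrix(train_iter, max_bin=256)
```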
```python
from sklearn.model_selection import GroupKFold

def stratify_folds(df):
    # Group K-Fold: rows are grouped by session so that a session never
    # appears in both the training and validation splits of a fold.
    skf = GroupKFold(n_splits=5)
    strat = {}
    for fold, (train_idx, valid_idx) in enumerate(
            skf.split(df, df['target'], groups=df['session'])):
        strat[fold] = {'fold': fold,
                       'train_idx': train_idx,
                       'valid_idx': valid_idx,
                       'num_rows': len(df)}
    return strat
```