Description
Looking at the Reward Trainer example, the token lengths of the chosen and rejected sequences can differ, because padding is applied to the chosen batch and the rejected batch separately. So if the longest chosen sequence is 256 tokens and the longest rejected sequence is 128 tokens, each batch is padded to its own maximum. But the loss computation seems to take the difference of logits between chosen and rejected. How is this achieved when the token sizes of chosen and rejected are not the same?
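
For reference, my understanding of the loss being computed is roughly the following (a minimal sketch, not the actual TRL source; I am assuming the model returns one scalar reward per sequence, e.g. from a `num_labels=1` sequence-classification head):

```python
import torch
import torch.nn.functional as F

# Sketch of the pairwise loss as I understand it (not the actual TRL source).
# rewards_chosen / rewards_rejected are assumed to be one scalar per sequence,
# e.g. the logits of a num_labels=1 sequence-classification head.
rewards_chosen = torch.tensor([[1.2], [0.3]])    # shape (batch_size, 1)
rewards_rejected = torch.tensor([[0.4], [0.9]])  # shape (batch_size, 1)

# -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)  # a single scalar loss
```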
I know the token sizes will differ because in `RewardTrainer`, if you do not specify `data_collator`, `RewardDataCollatorWithPadding` is used with the init parameters `tokenizer=tokenizer` and `max_length=max_length`. The problem is that its `padding` parameter is not set to `'max_length'`, and that is what makes the lengths of chosen and rejected differ, as the snippet below illustrates.
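
To make concrete why the two batches end up with different lengths, here is a minimal illustration (the model name and example strings are placeholders, not taken from the TRL example) of tokenizing chosen and rejected separately with dynamic padding:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model choice
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

chosen = ["a deliberately long chosen response " * 8, "short chosen"]
rejected = ["short rejected", "tiny"]

# padding=True (dynamic padding) pads each list only to the longest member
# of *that* list, which is what happens when `padding` is not set to
# 'max_length'
batch_chosen = tokenizer(chosen, padding=True, return_tensors="pt")
batch_rejected = tokenizer(rejected, padding=True, return_tensors="pt")

print(batch_chosen["input_ids"].shape)    # padded to the longest chosen sequence
print(batch_rejected["input_ids"].shape)  # padded to the longest rejected sequence (shorter)
```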