
Confused with Reward Training Data Collator #686

Closed
@fahadh4ilyas

Description


Looking at the Reward Trainer example, the token lengths of the chosen and rejected sequences vary and are not the same, because padding is applied separately to each batch of chosen and each batch of rejected examples. So, if the longest chosen sequence is 256 tokens and the longest rejected sequence is 128 tokens, each batch is padded to its own maximum. But the loss seems to be computed from the difference of logits between chosen and rejected. How is this achieved when the token lengths of chosen and rejected are not the same?

I know the token lengths will not be the same because in RewardTrainer, if you do not specify a data_collator, RewardDataCollatorWithPadding is used with the init parameters tokenizer=tokenizer and max_length=max_length. The problem is that the padding parameter is not set to 'max_length', and that makes the lengths of chosen and rejected differ.
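For reference, the per-batch padding behavior described above can be sketched as follows. Note that a pairwise ranking loss of the -logsigmoid(reward_chosen - reward_rejected) form only needs one scalar reward per sequence, so mismatched token lengths between the two batches are not an obstacle for the loss itself. The `pad_batch` and `pairwise_loss` helpers below are toy illustrations, not TRL's actual collator or trainer code:

```python
import math

def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest sequence *in this batch* only
    (mimics dynamic per-batch padding, i.e. padding not set to 'max_length')."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Chosen and rejected batches are padded independently,
# so they end up with different sequence lengths.
chosen   = pad_batch([[1, 2, 3, 4], [5, 6]])  # padded to length 4
rejected = pad_batch([[7, 8], [9]])           # padded to length 2

assert len(chosen[0]) == 4 and len(rejected[0]) == 2  # lengths differ

def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_c - r_r)): operates on one scalar reward per
    sequence, so the two sequences' token lengths never need to match."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the chosen reward exceeds the rejected reward, regardless of how each side was padded.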
