Description
Looking at the Reward Trainer example, the token lengths of the chosen and rejected sequences can differ, because padding is applied to the chosen batch and the rejected batch separately. So if the longest chosen sequence is 256 tokens and the longest rejected sequence is 128 tokens, each batch is padded to its own maximum. But the loss computation seems to take the difference of logits between chosen and rejected. How is this achieved when the token sizes of chosen and rejected are not the same?
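
For reference, my understanding of the loss being computed is roughly the following (a minimal sketch, not the actual TRL source; I am assuming the model returns one scalar reward per sequence, e.g. from a `num_labels=1` sequence-classification head):

```python
import torch
import torch.nn.functional as F

# Sketch of the pairwise loss as I understand it (not the actual TRL source).
# rewards_chosen / rewards_rejected are assumed to be one scalar per sequence,
# e.g. the logits of a num_labels=1 sequence-classification head.
rewards_chosen = torch.tensor([[1.2], [0.3]])    # shape (batch_size, 1)
rewards_rejected = torch.tensor([[0.4], [0.9]])  # shape (batch_size, 1)

# -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)  # a single scalar loss
```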
I know the token sizes will differ because in `RewardTrainer`, if you do not specify `data_collator`, `RewardDataCollatorWithPadding` is used with the init parameters `tokenizer=tokenizer` and `max_length=max_length`. The problem is that its `padding` parameter is not set to `'max_length'`, and that is what makes the lengths of chosen and rejected differ, as the snippet below illustrates.
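
To make concrete why the two batches end up with different lengths, here is a minimal illustration (the model name and example strings are placeholders, not taken from the TRL example) of tokenizing chosen and rejected separately with dynamic padding:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model choice
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

chosen = ["a deliberately long chosen response " * 8, "short chosen"]
rejected = ["short rejected", "tiny"]

# padding=True (dynamic padding) pads each list only to the longest member
# of *that* list, which is what happens when `padding` is not set to
# 'max_length'
batch_chosen = tokenizer(chosen, padding=True, return_tensors="pt")
batch_rejected = tokenizer(rejected, padding=True, return_tensors="pt")

print(batch_chosen["input_ids"].shape)    # padded to the longest chosen sequence
print(batch_rejected["input_ids"].shape)  # padded to the longest rejected sequence (shorter)
```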