How do I use a custom dataset and replace the reward function? #23

xiaoAugenstern · 2025-04-23T08:54:12Z

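Thanks to the authors for this work; it is very interesting. I have two questions:

1. How to use my own dataset:

In Seg-zero/training_scripts/seg_zero_7b.yaml:

data:
  train_files: Ricky06662/refCOCOg_2k_840
  val_files: None
  prompt_key: problem   # this corresponds to the prompt

The dataset on Hugging Face is stored as refCOCOg_2k_840/train/0000.parquet. Do we need to convert our own data to .parquet as well?

I ask because the custom dataset config in EasyR1 also has the keys below, and I don't know how they are used. How do I point the config at my own local dataset path?

  prompt_key: problem
  answer_key: answer
  image_key: images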

2. How to replace the reward function:

In Seg-zero/training_scripts/seg_zero_7b.yaml:

reward:
    reward_type: function
    compute_score: seg_strict # should I change this to my own?

seg_strict is defined in Seg-zero/verl/utils/reward_score/seg_strict

def seg_compute_score(predict_str: str, ground_truth: str) -> float:
    thinking_format_reward = seg_thinking_format_reward(predict_str)
    segmentation_format_reward = seg_segmentation_format_reward(predict_str)
    iou_reward = seg_iou_reward(predict_str, ground_truth)
    point_l1_reward = seg_point_l1_reward(predict_str, ground_truth)
    box_l1_reward = seg_box_l1_reward(predict_str, ground_truth)
    
    reward = iou_reward + thinking_format_reward + segmentation_format_reward + point_l1_reward + box_l1_reward
    return reward

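So should I write my own reward function under Seg-zero/verl/utils/reward_score/, for example gec.py,

and then set:

reward:
    reward_type: function
    compute_score: gec

### Could the authors publish a tutorial covering the dataset-processing pipeline and how to swap in a custom reward function? Thank you very much 🙏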

LiuRicky · 2025-04-24T02:06:16Z

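Thanks for your interest, and thanks for answering the other issues earlier.

1. For building the dataset, you can refer to the code below. It generates a dataset file with the same columns. You need to adapt the code to your data and generate the bbox and point yourself (see #11).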

import json

import cv2
from datasets import Dataset, DatasetDict, Features, Image, Value
from tqdm import tqdm


def scale_box_coordinates(bbox_2d, x_factor, y_factor):
    """
    Scale bounding-box coordinates.

    bbox_2d: [x1, y1, x2, y2]
    """
    # Scale each coordinate and round to the nearest integer
    scaled_bbox = [
        int(bbox_2d[0] * x_factor + 0.5),  # x1
        int(bbox_2d[1] * y_factor + 0.5),  # y1
        int(bbox_2d[2] * x_factor + 0.5),  # x2
        int(bbox_2d[3] * y_factor + 0.5)   # y2
    ]

    return scaled_bbox


def scale_point_coordinates(point_2d, x_factor, y_factor):
    """
    Scale center-point coordinates.

    point_2d: [x, y]
    """
    # Scale each coordinate and round to the nearest integer
    scaled_point = [
        int(point_2d[0] * x_factor + 0.5),  # x
        int(point_2d[1] * y_factor + 0.5)   # y
    ]

    return scaled_point


def create_local_dataset(train_data, output_dir, image_resize):
    """
    Build the dataset and save it to local disk.
    """
    # 2. Process the data: load each image, convert BGR to RGB, resize to a square
    def process_split(split_data, image_resize):
        processed_data = split_data.copy()
        images = []
        for img_path in split_data['image']:
            img = cv2.imread(img_path)
            if img is None:
                # cv2.imread returns None instead of raising when a file cannot be read
                raise FileNotFoundError(f"Cannot read image: {img_path}")
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = cv2.resize(img, (image_resize, image_resize), interpolation=cv2.INTER_AREA)
            images.append(img)

        processed_data['image'] = images
        return processed_data

    # 3. Create the dataset (train split only)
    dataset = DatasetDict({
        'train': Dataset.from_dict(
            process_split(train_data, image_resize),
            features=Features({
                'id': Value('string'),
                'problem': Value('string'),
                'solution': Value('string'),
                'image': Image(),
                'img_height': Value('int64'),
                'img_width': Value('int64')
            })
        )
    })

    # 4. Save to local disk
    dataset.save_to_disk(output_dir)
    print(f"Dataset saved to: {output_dir}")

    return dataset


# Usage example
if __name__ == "__main__":
    data_path_list = [
        "your data annotation json"
    ]

    data = []
    for data_path in data_path_list:
        with open(data_path, 'r') as f:
            data.extend(json.load(f))

    image_resize = 840

    id_list = []
    problem_list = []
    solution_list = []
    image_list = []
    img_height_list = []
    img_width_list = []

    for item in tqdm(data):
        id_list.append(item['id'])
        problem_list.append(item['problem'])

        image_list.append(item['image_path'])

        # Read the image only to record its original height and width
        image = cv2.imread(item['image_path'])
        if image is None:
            raise FileNotFoundError(f"Cannot read image: {item['image_path']}")
        height, width = image.shape[:2]

        img_height_list.append(height)
        img_width_list.append(width)

        # Factors that map original pixel coordinates onto the resized image;
        # pass them to scale_box_coordinates / scale_point_coordinates
        x_factor = image_resize / width
        y_factor = image_resize / height
        solution = "<bbox>.......<point>......"  # change format here
        solution_list.append(solution)

    train_data = {
        'id': id_list,
        'problem': problem_list,
        'solution': solution_list,
        'image': image_list,
        'img_height': img_height_list,
        'img_width': img_width_list
    }

    dataset = create_local_dataset(
        train_data=train_data,
        output_dir="your save path",
        image_resize=image_resize
    )
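
As a quick check of the scaling helpers in the script above, the sketch below applies them to a made-up annotation. The image size, box, and point are illustrative values only, and the two helpers are condensed copies so the snippet runs on its own. It stops at the scaled coordinates on purpose, since the format of the solution string is left to you.

```python
def scale_box_coordinates(bbox_2d, x_factor, y_factor):
    """Scale [x1, y1, x2, y2] and round to the nearest integer."""
    factors = (x_factor, y_factor, x_factor, y_factor)
    return [int(v * f + 0.5) for v, f in zip(bbox_2d, factors)]


def scale_point_coordinates(point_2d, x_factor, y_factor):
    """Scale [x, y] and round to the nearest integer."""
    return [int(point_2d[0] * x_factor + 0.5), int(point_2d[1] * y_factor + 0.5)]


# Hypothetical annotation on a 1680x420 image (width x height).
width, height = 1680, 420
image_resize = 840
bbox = [100, 50, 300, 200]
point = [200, 125]

x_factor = image_resize / width    # 0.5
y_factor = image_resize / height   # 2.0

scaled_bbox = scale_box_coordinates(bbox, x_factor, y_factor)
scaled_point = scale_point_coordinates(point, x_factor, y_factor)
print(scaled_bbox)   # [50, 100, 150, 400]
print(scaled_point)  # [100, 250]
```

The labels have to be scaled with the same factors as the image, otherwise the boxes and points no longer line up with the resized pixels.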

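2. For the reward functions, you can search the codebase for the keyword seg_strict and modify every place where it appears. A simpler way is to edit seg_strict.py directly and change the reward inside it to whatever you need.

This is a temporary workaround. We will publish more detailed instructions in the coming weeks.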
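
To make the reward-function point concrete, here is a minimal sketch of what a custom reward could look like if it keeps the (predict_str, ground_truth) -> float signature of seg_compute_score. Everything in it is an assumption for illustration: the name gec_compute_score, the <think>/<answer> output format, and the two sub-rewards. It does not show how Seg-Zero looks up the function, so the places that reference seg_strict still need to be found and updated by hand.

```python
import re

# Assumed output format for this sketch: <think>...</think> then <answer>...</answer>.
_FORMAT_RE = re.compile(r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def gec_format_reward(predict_str: str) -> float:
    """1.0 if the response follows the assumed think/answer layout, else 0.0."""
    return 1.0 if _FORMAT_RE.match(predict_str) else 0.0


def gec_accuracy_reward(predict_str: str, ground_truth: str) -> float:
    """1.0 if the extracted answer equals the ground truth after trimming whitespace."""
    match = _ANSWER_RE.search(predict_str)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0


def gec_compute_score(predict_str: str, ground_truth: str) -> float:
    """Sum the sub-rewards, mirroring how seg_compute_score adds its terms."""
    return gec_format_reward(predict_str) + gec_accuracy_reward(predict_str, ground_truth)


good = "<think>fix the verb tense</think><answer>She went home.</answer>"
print(gec_compute_score(good, "She went home."))              # 2.0
print(gec_compute_score(good, "He went home."))               # 1.0
print(gec_compute_score("She went home.", "She went home."))  # 0.0
```

Keeping each term in its own function, as seg_strict does, makes it easier to log the terms separately and to reweight them later.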

unira-zwj · 2025-04-24T07:55:47Z

What is the rationale for the size scaling here, and why are the scale factors computed against 840?
Thank you very much.

LiuRicky · 2025-04-25T02:17:13Z

Scaling brings images of different sizes to the same size. 840 is just the default setting; you can set it to 1024 or another value.

unira-zwj · 2025-04-25T06:30:37Z

OK, thank you.
Does the image have to be resized so that width and height are equal? Qwen2.5-VL should support dynamic resolution, so can I resize to a non-square size here?
Thanks.

LiuRicky · 2025-04-26T03:31:42Z

Either works.
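
If you do resize to a non-square size, the only change on the annotation side is that the two scale factors use different targets. A minimal sketch, assuming a hypothetical target of 1120x840 (width x height); the helper names and numbers are illustrative only:

```python
def scale_factors(orig_width, orig_height, target_width, target_height):
    """Return (x_factor, y_factor) mapping original pixels onto the resized image."""
    return target_width / orig_width, target_height / orig_height


def scale_bbox(bbox_2d, x_factor, y_factor):
    """Scale [x1, y1, x2, y2] with the same rounding as the script earlier in the thread."""
    x1, y1, x2, y2 = bbox_2d
    return [
        int(x1 * x_factor + 0.5),
        int(y1 * y_factor + 0.5),
        int(x2 * x_factor + 0.5),
        int(y2 * y_factor + 0.5),
    ]


# Hypothetical 2240x840 image resized to 1120x840: only the x axis shrinks.
x_factor, y_factor = scale_factors(2240, 840, 1120, 840)
scaled = scale_bbox([400, 300, 1000, 600], x_factor, y_factor)
print(x_factor, y_factor)  # 0.5 1.0
print(scaled)              # [200, 300, 500, 600]
```

When resizing the image itself, note that cv2.resize takes the target size as (width, height).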
