How do I customize my own dataset and swap the reward function? #23
Comments
Thanks for your interest, and thanks for answering the other issues earlier.

1. For dataset construction, you can refer to the code below. It directly generates a dataset file with the same columns as ours. You need to adapt the code to your own data and generate the bbox and point annotations (see #11).

from datasets import Dataset, DatasetDict, Features, Image, Value
import json

import cv2
from tqdm import tqdm


def scale_box_coordinates(bbox_2d, x_factor, y_factor):
    """
    Scale bounding-box coordinates.
    bbox_2d: [x1, y1, x2, y2]
    """
    # Scale each coordinate and round to the nearest integer
    scaled_bbox = [
        int(bbox_2d[0] * x_factor + 0.5),  # x1
        int(bbox_2d[1] * y_factor + 0.5),  # y1
        int(bbox_2d[2] * x_factor + 0.5),  # x2
        int(bbox_2d[3] * y_factor + 0.5)   # y2
    ]
    return scaled_bbox


def scale_point_coordinates(point_2d, x_factor, y_factor):
    """
    Scale center-point coordinates.
    point_2d: [x, y]
    """
    # Scale each coordinate and round to the nearest integer
    scaled_point = [
        int(point_2d[0] * x_factor + 0.5),  # x
        int(point_2d[1] * y_factor + 0.5)   # y
    ]
    return scaled_point


def create_local_dataset(train_data, output_dir, image_resize):
    """
    Create the dataset and save it to local disk.
    """
    # Process the data: load every image and resize it to a square
    def process_split(split_data, image_resize):
        processed_data = split_data.copy()
        images = []
        for img_path in split_data['image']:
            img = cv2.imread(img_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = cv2.resize(img, (image_resize, image_resize), interpolation=cv2.INTER_AREA)
            images.append(img)
        processed_data['image'] = images
        return processed_data

    # Create the dataset (modified to contain only the train split)
    dataset = DatasetDict({
        'train': Dataset.from_dict(
            process_split(train_data, image_resize),
            features=Features({
                'id': Value('string'),
                'problem': Value('string'),
                'solution': Value('string'),
                'image': Image(),
                'img_height': Value('int64'),
                'img_width': Value('int64')
            })
        )
    })

    # Save to local disk
    dataset.save_to_disk(output_dir)
    print(f"Dataset saved to: {output_dir}")
    return dataset


# Usage example
if __name__ == "__main__":
    data_path_list = [
        "your data annotation json"
    ]
    data = []
    for data_path in data_path_list:
        with open(data_path, 'r') as f:
            data.extend(json.load(f))

    image_resize = 840
    id_list = []
    problem_list = []
    solution_list = []
    image_list = []
    img_height_list = []
    img_width_list = []
    for item in tqdm(data):
        id_list.append(item['id'])
        problem_list.append(item['problem'])
        image_list.append(item['image_path'])
        # print(item['image_path'])
        image = cv2.imread(item['image_path'])
        height, width = image.shape[:2]
        img_height_list.append(height)
        img_width_list.append(width)
        # Factors that map original coordinates onto the resized image;
        # apply them with the scale_* helpers above when building the solution
        x_factor = image_resize / width
        y_factor = image_resize / height
        solution = "<bbox>.......<point>......"  # change format here
        solution_list.append(solution)

    train_data = {
        'id': id_list,
        'problem': problem_list,
        'solution': solution_list,
        'image': image_list,
        'img_height': img_height_list,
        'img_width': img_width_list
    }
    dataset = create_local_dataset(
        train_data=train_data,
        output_dir="your save path",
        image_resize=image_resize
    )

2. For reward functions, you can search the whole repository for the keyword seg_strict and modify every place where it appears. A simpler way is to edit seg_strict.py directly and change the reward inside it to the type you want.

The above is a temporary solution. In the coming weeks we will publish more detailed instructions.
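To illustrate the second point, the sketch below shows what a replacement reward might look like if it scores a predicted box and point against a ground-truth box. Everything in it is an assumption for illustration: the names `box_iou`, `point_in_box`, and `custom_reward`, the 0.5 threshold, and the scoring rule are hypothetical and are not taken from seg_strict.py. Parsing the model output into a box and a point is left out because it depends on the solution format you choose.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_in_box(point, box):
    """True if [x, y] lies inside (or on the border of) [x1, y1, x2, y2]."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def custom_reward(pred_box, pred_point, gt_box, iou_threshold=0.5):
    """Hypothetical reward: 1.0 for a good box plus 1.0 for a good point."""
    reward = 0.0
    if box_iou(pred_box, gt_box) > iou_threshold:
        reward += 1.0
    if point_in_box(pred_point, gt_box):
        reward += 1.0
    return reward


if __name__ == "__main__":
    # Perfect box and a point inside the ground truth
    print(custom_reward([0, 0, 10, 10], [5, 5], [0, 0, 10, 10]))
    # Box and point both far from the ground truth
    print(custom_reward([0, 0, 10, 10], [50, 50], [20, 20, 30, 30]))
```

A thresholded IoU gives a sparse, strict signal; returning the raw IoU instead would give a smoother one. Which is preferable depends on your task.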
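What is the rationale behind the size scaling here? Why is the scale factor computed against 840?
Scaling brings images of different sizes to the same size. 840 is only the default setting; you can also set it to 1024 or another value.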
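A small worked example may make the scaling concrete. It assumes a hypothetical 1680x1120 image and a made-up box, and uses a helper named `scale_box_to_square` that is not part of the script in this thread; it only repeats the same rounding rule (`int(v * factor + 0.5)`) with the target size as a parameter.

```python
def scale_box_to_square(bbox_2d, width, height, target=840):
    """Map an [x1, y1, x2, y2] box from a width x height image to a target x target image."""
    x_factor = target / width
    y_factor = target / height
    return [
        int(bbox_2d[0] * x_factor + 0.5),  # x1
        int(bbox_2d[1] * y_factor + 0.5),  # y1
        int(bbox_2d[2] * x_factor + 0.5),  # x2
        int(bbox_2d[3] * y_factor + 0.5),  # y2
    ]


if __name__ == "__main__":
    box = [100, 200, 500, 600]  # hypothetical annotation in original pixels
    # Default 840: x_factor = 0.5, y_factor = 0.75
    print(scale_box_to_square(box, width=1680, height=1120))  # [50, 150, 250, 450]
    # Same box with a 1024 target instead of the default
    print(scale_box_to_square(box, width=1680, height=1120, target=1024))
```

Because width and height are scaled independently, the aspect ratio of the box changes whenever the original image is not square, which is why the annotations have to be scaled with the same factors as the image.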
OK, thank you.
Either is fine.
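Thanks to the authors for this work; it is very interesting. I have two questions, though:
1. How to customize my own dataset:
In Seg-zero/training_scripts/seg_zero_7b.yaml,
the dataset on Hugging Face is in the format
refCOCOg_2k_840/train/0000.parquet
Do we need to convert our own data to .parquet like this? I ask because I looked at easyr1, where the custom datasets have this too, and I do not know how it is used. How do I point the config at my own local dataset path?
2. How to swap the reward function:
In Seg-zero/training_scripts/seg_zero_7b.yaml,
seg_strict is located at
Seg-zero/verl/utils/reward_score/seg_strict
So should I write my own reward function under Seg-zero/verl/utils/reward_score/, for example gec.py,
and then:
### Could the authors provide a tutorial on the dataset processing pipeline and on how to substitute one's own reward function? Thank you very much 🙏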