Hi,
Thanks for your effort!
When I evaluate deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on MATH-500 using the code provided in this repo, I cannot reproduce the reported performance: I get only 0.756, whereas open-r1 reports 0.816 and DeepSeek reports 0.839 in their technical report. The script I'm using is below:
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --save-details \
    --output-dir $OUTPUT_DIR
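
For reference, here is a minimal Python sketch of how I sanity-check the aggregate score from the saved details. It assumes the --save-details output lands under $OUTPUT_DIR/details/ as parquet files and that each row carries a metrics entry containing an extractive_match score; both the directory layout and the metric name are assumptions on my side and may differ across lighteval versions.

import glob

import pandas as pd

# Assumed layout of lighteval's saved details; adjust the path if your
# version writes the parquet files elsewhere.
details_files = glob.glob(
    "data/evals/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/details/**/*.parquet",
    recursive=True,
)
df = pd.concat(pd.read_parquet(f) for f in details_files)
print(len(df))  # expect 500 rows for math_500

# "metrics" / "extractive_match" are assumed names for the per-sample score.
scores = df["metrics"].apply(lambda m: m["extractive_match"])
print(scores.mean())  # this matches the 0.756 I observe

So the gap does not seem to come from how I aggregate the per-sample results.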
Thanks for looking into this issue, and thanks again for your work!