Hi,
Thanks for your effort!
When I evaluate deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on MATH-500 using the code provided in this repo, I cannot reproduce the reported performance: I get only 0.756, whereas open-r1 reports 0.816 and DeepSeek reports 0.839 in their technical report. The script I'm using is below:
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --save-details \
    --output-dir $OUTPUT_DIR
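
For reference, here is a minimal Python sketch of how I sanity-check the aggregate score from the saved details. It assumes the --save-details output lands under $OUTPUT_DIR/details/ as parquet files and that each row carries a metrics entry containing an extractive_match score; both the directory layout and the metric name are assumptions on my side and may differ across lighteval versions.

import glob

import pandas as pd

# Assumed layout of lighteval's saved details; adjust the path if your
# version writes the parquet files elsewhere.
details_files = glob.glob(
    "data/evals/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/details/**/*.parquet",
    recursive=True,
)
df = pd.concat(pd.read_parquet(f) for f in details_files)
print(len(df))  # expect 500 rows for math_500

# "metrics" / "extractive_match" are assumed names for the per-sample score.
scores = df["metrics"].apply(lambda m: m["extractive_match"])
print(scores.mean())  # this matches the 0.756 I observe

So the gap does not seem to come from how I aggregate the per-sample results.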
Thanks for looking into this issue, and thanks again for your work!