Commit cec57f3

Add GPQA Diamond and fix evaluation deps (#196)
* Add GPQA Diamond
* Add table
* Fix README
* Up
* Fixes
* Ignore logs
* Fix
* Pin deps
* Fix GRPO
* Add Llama 70B tables
* Restore dp
* Pin lighteval
* Use bfloat16
* Tune table
* Add note
1 parent f8cbb98 commit cec57f3

9 files changed: +202 −148 lines changed


Makefile

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ evaluate:
 	fi \
 	),))
 	$(if $(filter tensor,$(PARALLEL)),export VLLM_WORKER_MULTIPROC_METHOD=spawn &&,) \
-	MODEL_ARGS="pretrained=$(MODEL),dtype=float16,$(PARALLEL_ARGS),max_model_length=32768,gpu_memory_utilisation=0.8" && \
+	MODEL_ARGS="pretrained=$(MODEL),dtype=bfloat16,$(PARALLEL_ARGS),max_model_length=32768,gpu_memory_utilisation=0.8" && \
 	lighteval vllm $$MODEL_ARGS "custom|$(TASK)|0|0" \
 		--custom-tasks src/open_r1/evaluate.py \
 		--use-chat-template \
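
For context, with `PARALLEL=tensor` the updated target now effectively runs the following (a sketch assembled from the hunk above, not the literal recipe; the target's remaining flags are truncated in this view):

```shell
# Sketch: what `make evaluate PARALLEL=tensor ...` expands to after this change.
# MODEL, TASK and PARALLEL_ARGS are substituted by make; shown here as shell variables.
export VLLM_WORKER_MULTIPROC_METHOD=spawn && \
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,$PARALLEL_ARGS,max_model_length=32768,gpu_memory_utilisation=0.8" && \
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template
# (the recipe continues with further flags not shown in the hunk above)
```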

README.md

Lines changed: 90 additions & 27 deletions
@@ -50,23 +50,23 @@ To install `uv`, follow the [UV Installation Guide](https://docs.astral.sh/uv/ge
 
 
 ```shell
-uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
+uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip --link-mode=copy
 ```
 
 Next, install vLLM:
 
 ```shell
-uv pip install vllm>=0.7.0
+uv pip install vllm==0.7.1
 
 # For CUDA 12.1
-pip install vllm>=0.7.0 --extra-index-url https://download.pytorch.org/whl/cu121
+uv pip install vllm==0.7.1 --extra-index-url https://download.pytorch.org/whl/cu121 --index-strategy unsafe-best-match --link-mode=copy
 export LD_LIBRARY_PATH=$(python -c "import site; print(site.getsitepackages()[0] + '/nvidia/nvjitlink/lib')"):$LD_LIBRARY_PATH
 ```
 
 This will also install PyTorch `v2.5.1` and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:
 
 ```shell
-pip install -e ".[dev]"
+GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]" --link-mode=copy
 ```
 
 Next, log into your Hugging Face and Weights and Biases accounts as follows:
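
Since the README stresses that the vLLM binaries are compiled against PyTorch `v2.5.1`, a quick post-install sanity check (a suggestion, not part of this diff) is:

```shell
# Optional: confirm the pinned versions resolved as expected
# (version strings may carry a CUDA suffix, e.g. 2.5.1+cu121)
python -c "import torch, vllm; print('torch', torch.__version__, '| vllm', vllm.__version__)"
```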
@@ -141,30 +141,46 @@ We use `lighteval` to evaluate models, with custom tasks defined in `src/open_r1
 
 ```shell
 MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
-TASK=aime24
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8"
 OUTPUT_DIR=data/evals/$MODEL
 
+# AIME 2024
+TASK=aime24
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+    --custom-tasks src/open_r1/evaluate.py \
+    --use-chat-template \
+    --output-dir $OUTPUT_DIR
+
+# MATH-500
+TASK=math_500
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+    --custom-tasks src/open_r1/evaluate.py \
+    --use-chat-template \
+    --output-dir $OUTPUT_DIR
+
+# GPQA Diamond
+TASK=gpqa:diamond
 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --custom-tasks src/open_r1/evaluate.py \
     --use-chat-template \
-    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
     --output-dir $OUTPUT_DIR
 ```
 
+> [!IMPORTANT]
+> You must set `max_model_length=32768` in the `vllm` command to align with the `generation_size` we define per eval. Without this, `lighteval` will throw an error.
+
 To increase throughput across multiple GPUs, use _data parallel_ as follows:
 
 ```shell
 NUM_GPUS=8
 MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
 TASK=aime24
 OUTPUT_DIR=data/evals/$MODEL
 
 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --custom-tasks src/open_r1/evaluate.py \
     --use-chat-template \
-    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
     --output-dir $OUTPUT_DIR
 ```
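
Since the three task invocations added above differ only in `TASK`, they can equivalently be run in a loop (a convenience sketch, not part of the README):

```shell
# Run AIME 2024, MATH-500 and GPQA Diamond back-to-back with the same MODEL_ARGS
for TASK in aime24 math_500 "gpqa:diamond"; do
    lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
        --custom-tasks src/open_r1/evaluate.py \
        --use-chat-template \
        --output-dir $OUTPUT_DIR
done
```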

@@ -173,58 +189,105 @@ For large models which require sharding across GPUs, use _tensor parallel_ and r
 ```shell
 NUM_GPUS=8
 MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
-MODEL_ARGS="pretrained=$MODEL,dtype=float16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
 TASK=aime24
 OUTPUT_DIR=data/evals/$MODEL
 
 export VLLM_WORKER_MULTIPROC_METHOD=spawn
 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --custom-tasks src/open_r1/evaluate.py \
     --use-chat-template \
-    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
     --output-dir $OUTPUT_DIR
 ```
 
 You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.
 
 To evaluate on a single GPU:
+
 ```shell
 make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
 ```
 
 To use Data Parallelism:
+
 ```shell
 make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
 ```
 
 To use Tensor Parallelism:
+
 ```shell
 make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
 ```
-## Reproducing Deepseek's evaluation results on MATH-500
-We are able to reproduce Deepseek's reported results on the MATH-500 Benchmark:
-| Model | MATH-500 (HF lighteval) | MATH-500 (DeepSeek Reported) |
-| :-------------------------- | :-------: | :----------------------------: |
-| DeepSeek-R1-Distill-Qwen-1.5B | 81.6 | 83.9 |
-| DeepSeek-R1-Distill-Qwen-7B | 91.8 | 92.8 |
-| DeepSeek-R1-Distill-Qwen-14B | 94.2 | 93.9 |
-| DeepSeek-R1-Distill-Qwen-32B | 95.0 | 94.3 |
-| DeepSeek-R1-Distill-Llama-8B | 85.8 | 89.1 |
-| DeepSeek-R1-Distill-Llama-70B | 93.4 | 94.5 |
 
+## Reproducing Deepseek's evaluation results
+
+> [!NOTE]
+> The DeepSeek-R1 paper uses sampling with a temperature of 0.6, a top-p value of 0.95, and 64 responses per query to estimate `pass@1`. Below, we report the results from greedy decoding, which likely explains the small 1-3σ discrepancies between our results and theirs.
+
+### MATH-500
+
+We are able to reproduce Deepseek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:
 
+| Model                         | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
+|:------------------------------|:-----------------------:|:----------------------------:|
+| DeepSeek-R1-Distill-Qwen-1.5B | 81.2                    | 83.9                         |
+| DeepSeek-R1-Distill-Qwen-7B   | 91.8                    | 92.8                         |
+| DeepSeek-R1-Distill-Qwen-14B  | 94.2                    | 93.9                         |
+| DeepSeek-R1-Distill-Qwen-32B  | 95.0                    | 94.3                         |
+| DeepSeek-R1-Distill-Llama-8B  | 85.4                    | 89.1                         |
+| DeepSeek-R1-Distill-Llama-70B | 93.4                    | 94.5                         |
 
 To reproduce these results use the following command:
+
+```shell
+NUM_GPUS=1 # Set to 8 for 32B and 70B models
+MODEL=deepseek-ai/{model_name}
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
+OUTPUT_DIR=data/evals/$MODEL
+
+lighteval vllm $MODEL_ARGS "custom|math_500|0|0" \
+    --custom-tasks src/open_r1/evaluate.py \
+    --use-chat-template \
+    --output-dir $OUTPUT_DIR
+```
+
+Alternatively, you can launch Slurm jobs as follows:
+
 ```shell
-sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B math_500
-sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-7B math_500
-sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-14B math_500
-sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-32B math_500 tp
-sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-8B math_500
-sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-70B math_500 tp
+python scripts/run_benchmarks.py --model-id={model_id} --benchmarks math_500
 ```
 
+### GPQA Diamond
+
+We are able to reproduce Deepseek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:
+
+| Model                         | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
+|:------------------------------|:---------------------------:|:--------------------------------:|
+| DeepSeek-R1-Distill-Qwen-1.5B | 33.3                        | 33.8                             |
+| DeepSeek-R1-Distill-Qwen-7B   | 48.4                        | 49.1                             |
+| DeepSeek-R1-Distill-Qwen-14B  | 55.6                        | 59.1                             |
+| DeepSeek-R1-Distill-Qwen-32B  | 58.6                        | 62.1                             |
+| DeepSeek-R1-Distill-Llama-8B  | 51.0                        | 49.0                             |
+| DeepSeek-R1-Distill-Llama-70B | 65.2                        | 65.2                             |
+
+To reproduce these results use the following command:
+
+```shell
+NUM_GPUS=1 # Set to 8 for 32B and 70B models
+MODEL=deepseek-ai/{model_name}
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
+OUTPUT_DIR=data/evals/$MODEL
+
+lighteval vllm $MODEL_ARGS "custom|gpqa:diamond|0|0" \
+    --custom-tasks src/open_r1/evaluate.py \
+    --use-chat-template \
+    --output-dir $OUTPUT_DIR
+```
 
+```shell
+python scripts/run_benchmarks.py --model-id={model_id} --benchmarks gpqa
+```
 
 ## Data generation
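
The `pass@1` protocol mentioned in the NOTE of the README hunk above is the standard unbiased estimator computed from sampled completions: with n samples per query and c of them correct, pass@k is averaged over queries, and for k = 1 it reduces to the fraction of correct samples. As a reference (our reading of the DeepSeek protocol, not something defined in this repo):

$$\text{pass@}k = \mathbb{E}_{\text{queries}}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right], \qquad \text{pass@}1 = \mathbb{E}_{\text{queries}}\!\left[ \frac{c}{n} \right], \quad n = 64 \text{ in the DeepSeek-R1 paper.}$$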

logs/.gitkeep

Whitespace-only changes.

setup.py

Lines changed: 4 additions & 4 deletions
@@ -53,17 +53,17 @@
     "huggingface-hub[cli]>=0.19.2,<1.0",
     "isort>=5.12.0",
     "liger_kernel==0.5.2",
-    "lighteval @ git+https://github.com/huggingface/lighteval.git@0e462692436e1f0575bdb4c6ef63453ad9bde7d4#egg=lighteval[math]",
-    "math-verify>=0.3.3", # Used for math verification in grpo
+    "lighteval @ git+https://github.com/huggingface/lighteval.git@86f62259f105ae164f655e0b91c92a823a742724#egg=lighteval[math]",
+    "math-verify==0.5.2", # Used for math verification in grpo
     "packaging>=23.0",
     "parameterized>=0.9.0",
     "pytest",
     "safetensors>=0.3.3",
     "sentencepiece>=0.1.99",
-    "torch>=2.5.1",
+    "torch==2.5.1",
     "transformers @ git+https://github.com/huggingface/transformers.git@main",
     "trl @ git+https://github.com/huggingface/trl.git@main",
-    "vllm>=0.7.1",
+    "vllm==0.7.1",
     "wandb>=0.19.1",
 ]

slurm/eval_callback.slurm

Lines changed: 0 additions & 75 deletions
This file was deleted.
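
With this Slurm callback removed, benchmark runs are launched through `scripts/run_benchmarks.py`, as shown in the README diff above. A sweep over several distilled checkpoints could look like the following sketch (only the `--model-id` and `--benchmarks` flags are taken from the diff; the loop itself is illustrative):

```shell
# Illustrative sweep using the launcher referenced in the README diff
for MODEL_ID in deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
                deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
                deepseek-ai/DeepSeek-R1-Distill-Qwen-32B; do
    python scripts/run_benchmarks.py --model-id=$MODEL_ID --benchmarks math_500
    python scripts/run_benchmarks.py --model-id=$MODEL_ID --benchmarks gpqa
done
```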
