@@ -50,23 +50,23 @@ To install `uv`, follow the [UV Installation Guide](https://docs.astral.sh/uv/ge
``` shell
- uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
+ uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip --link-mode=copy
```
Next, install vLLM:
``` shell
- uv pip install vllm>=0.7.0
+ uv pip install vllm==0.7.1

# For CUDA 12.1
- pip install vllm>=0.7.0 --extra-index-url https://download.pytorch.org/whl/cu121
+ uv pip install vllm==0.7.1 --extra-index-url https://download.pytorch.org/whl/cu121 --index-strategy unsafe-best-match --link-mode=copy
export LD_LIBRARY_PATH=$(python -c "import site; print(site.getsitepackages()[0] + '/nvidia/nvjitlink/lib')"):$LD_LIBRARY_PATH
```

This will also install PyTorch `v2.5.1` and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:

``` shell
- pip install -e ".[dev]"
+ GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]" --link-mode=copy
```

Next, log into your Hugging Face and Weights and Biases accounts as follows:
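
For reference, a minimal sketch of the login step (assuming the `huggingface_hub` and `wandb` CLIs are available, e.g. from the `dev` extras above):

``` shell
# Log in to the Hugging Face Hub (paste an access token with write permissions when prompted)
huggingface-cli login

# Log in to Weights & Biases so training runs are tracked
wandb login
```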
@@ -141,30 +141,46 @@ We use `lighteval` to evaluate models, with custom tasks defined in `src/open_r1

``` shell
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
- TASK=aime24
+ MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8"
OUTPUT_DIR=data/evals/$MODEL

+ # AIME 2024
+ TASK=aime24
+ lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+
+ # MATH-500
+ TASK=math_500
+ lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+
+ # GPQA Diamond
+ TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
- --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```

+ > [!IMPORTANT]
+ > You must set `max_model_length=32768` in the `vllm` command to align with the `generation_size` we define per eval. Without this, `lighteval` will throw an error.
+
To increase throughput across multiple GPUs, use _data parallel_ as follows:

``` shell
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
+ MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
- --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```
@@ -173,58 +189,105 @@ For large models which require sharding across GPUs, use _tensor parallel_ and r
``` shell
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
- MODEL_ARGS="pretrained=$MODEL,dtype=float16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
+ MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
- --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```

You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.

To evaluate on a single GPU:
+
``` shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
```

To use Data Parallelism:
+
``` shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
```

To use Tensor Parallelism:
+
``` shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
```
- ## Reproducing Deepseek's evaluation results on MATH-500
- We are able to reproduce Deepseek's reported results on the MATH-500 Benchmark:
- | Model                         | MATH-500 (HF lighteval) | MATH-500 (DeepSeek Reported) |
- | :---------------------------- | :---------------------: | :--------------------------: |
- | DeepSeek-R1-Distill-Qwen-1.5B | 81.6                    | 83.9                         |
- | DeepSeek-R1-Distill-Qwen-7B   | 91.8                    | 92.8                         |
- | DeepSeek-R1-Distill-Qwen-14B  | 94.2                    | 93.9                         |
- | DeepSeek-R1-Distill-Qwen-32B  | 95.0                    | 94.3                         |
- | DeepSeek-R1-Distill-Llama-8B  | 85.8                    | 89.1                         |
- | DeepSeek-R1-Distill-Llama-70B | 93.4                    | 94.5                         |

+ ## Reproducing Deepseek's evaluation results
+
+ > [!NOTE]
+ > The DeepSeek-R1 paper uses sampling with a temperature of 0.6, a top-p value of 0.95, and 64 responses per query to estimate `pass@1`. Below, we report the results from greedy decoding, which likely explains the small 1-3σ discrepancies between our results and theirs.
+
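Concretely, with $k = 64$ sampled responses per query, the pass@1 estimate amounts to the average per-sample correctness (the standard estimator; the exact scoring lives in the custom tasks in `src/open_r1/evaluate.py`):

$$
\text{pass@1} = \frac{1}{k} \sum_{i=1}^{k} p_i,
$$

where $p_i$ is 1 if the $i$-th sampled response is judged correct and 0 otherwise. Greedy decoding instead yields a single deterministic response per query, so small differences from the sampled estimates are expected.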
+ ### MATH-500
+
+ We are able to reproduce Deepseek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:

+ | Model                         | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
+ | :---------------------------- | :---------------------: | :--------------------------: |
+ | DeepSeek-R1-Distill-Qwen-1.5B | 81.2                    | 83.9                         |
+ | DeepSeek-R1-Distill-Qwen-7B   | 91.8                    | 92.8                         |
+ | DeepSeek-R1-Distill-Qwen-14B  | 94.2                    | 93.9                         |
+ | DeepSeek-R1-Distill-Qwen-32B  | 95.0                    | 94.3                         |
+ | DeepSeek-R1-Distill-Llama-8B  | 85.4                    | 89.1                         |
+ | DeepSeek-R1-Distill-Llama-70B | 93.4                    | 94.5                         |

To reproduce these results, use the following command:
+
+ ``` shell
+ NUM_GPUS=1 # Set to 8 for 32B and 70B models
+ MODEL=deepseek-ai/{model_name}
+ MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
+ OUTPUT_DIR=data/evals/$MODEL
+
+ lighteval vllm $MODEL_ARGS "custom|math_500|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+ ```
+
+ Alternatively, you can launch Slurm jobs as follows:
+
``` shell
- sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B math_500
- sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-7B math_500
- sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-14B math_500
- sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-32B math_500 tp
- sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-8B math_500
- sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-70B math_500 tp
+ python scripts/run_benchmarks.py --model-id={model_id} --benchmarks math_500
```

+ ### GPQA Diamond
+
+ We are able to reproduce Deepseek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:
+
+ | Model                         | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
+ | :---------------------------- | :-------------------------: | :------------------------------: |
+ | DeepSeek-R1-Distill-Qwen-1.5B | 33.3                        | 33.8                             |
+ | DeepSeek-R1-Distill-Qwen-7B   | 48.4                        | 49.1                             |
+ | DeepSeek-R1-Distill-Qwen-14B  | 55.6                        | 59.1                             |
+ | DeepSeek-R1-Distill-Qwen-32B  | 58.6                        | 62.1                             |
+ | DeepSeek-R1-Distill-Llama-8B  | 51.0                        | 49.0                             |
+ | DeepSeek-R1-Distill-Llama-70B | 65.2                        | 65.2                             |
+
+ To reproduce these results, use the following command:
+
+ ``` shell
+ NUM_GPUS=1 # Set to 8 for 32B and 70B models
+ MODEL=deepseek-ai/{model_name}
+ MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
+ OUTPUT_DIR=data/evals/$MODEL
+
+ lighteval vllm $MODEL_ARGS "custom|gpqa:diamond|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+ ```

+ Alternatively, you can launch Slurm jobs as follows:
+
+ ``` shell
+ python scripts/run_benchmarks.py --model-id={model_id} --benchmarks gpqa
+ ```
## Data generation