
Commit 96e74e2

Add config files for models (#131)
* added the different configs through yaml files
* edited task examples path

Co-authored-by: Nathan Habib <[email protected]>
1 parent 0f5e257 commit 96e74e2

20 files changed: +169 additions, -210 deletions

.gitignore

Lines changed: 0 additions & 10 deletions

@@ -164,29 +164,19 @@ tests/.data
 tests/data

 # outputs folder
-examples/*/outputs
-examples/*/NeMo_experiments
-examples/*/nemo_experiments
-examples/*/.hydra
-examples/*/wandb
-examples/*/data
 wandb
 dump.py

 docs/sources/source/test_build/

 # Checkpoints, config files and temporary files created in tutorials.
-examples/neural_graphs/*.chkpt
-examples/neural_graphs/*.yml
-
 .hydra/
 nemo_experiments/

 .ruff_cache

 tmp.py

-examples
 benchmark_output
 prod_env

README.md

Lines changed: 21 additions & 13 deletions

@@ -10,9 +10,6 @@ We're releasing it with the community in the spirit of building in the open.
 Note that it is still very much early so don't expect 100% stability ^^'
 In case of problems or question, feel free to open an issue!

-## News
-- **Feb 08, 2024**: Release of `lighteval`
-
 ## Installation

 Clone the repo:
@@ -98,7 +95,7 @@ Here, `--tasks` refers to either a _comma-separated_ list of supported tasks fro
 suite|task|num_few_shot|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}
 ```

-or a file path like [`tasks_examples/recommended_set.txt`](./tasks_examples/recommended_set.txt) which specifies multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA benchmark run:
+or a file path like [`examples/tasks/recommended_set.txt`](./examples/tasks/recommended_set.txt) which specifies multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA benchmark run:

 ```shell
 accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
@@ -118,7 +115,20 @@ accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
     --output_dir="./evals/"
 ```

-See the [`tasks_examples/recommended_set.txt`](./tasks_examples/recommended_set.txt) file for a list of recommended task configurations.
+See the [`examples/tasks/recommended_set.txt`](./examples/tasks/recommended_set.txt) file for a list of recommended task configurations.
+
+### Evaluating a model with a complex configuration
+
+If you want to evaluate a model by spinning up inference endpoints, or use adapter/delta weights, or more complex configuration options, you can load models using a configuration file. This is done as follows:
+
+```shell
+accelerate launch --multi_gpu --num_processes=<num_gpus> run_evals_accelerate.py \
+    --model_config_path="<path to your model configuration>" \
+    --tasks <task parameters> \
+    --output_dir output_dir
+```
+
+Examples of possible configuration files are provided in `examples/model_configs`.

 ### Evaluating a large model with pipeline parallelism

@@ -127,15 +137,13 @@ To evaluate models larger that ~40B parameters in 16-bit precision, you will nee
 ```shell
 # PP=2, DP=4 - good for models < 70B params
 accelerate launch --multi_gpu --num_processes=4 run_evals_accelerate.py \
-    --model_args="pretrained=<path to model on the hub>" \
-    --model_parallel \
+    --model_args="pretrained=<path to model on the hub>,model_parallel=True" \
     --tasks <task parameters> \
     --output_dir output_dir

 # PP=4, DP=2 - good for huge models >= 70B params
 accelerate launch --multi_gpu --num_processes=2 run_evals_accelerate.py \
-    --model_args="pretrained=<path to model on the hub>" \
-    --model_parallel \
+    --model_args="pretrained=<path to model on the hub>,model_parallel=True" \
     --tasks <task parameters> \
     --output_dir output_dir
 ```
@@ -147,7 +155,7 @@ To evaluate a model on all the benchmarks of the [Open LLM Leaderboard](https://
 ```shell
 accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
     --model_args "pretrained=<model name>" \
-    --tasks tasks_examples/open_llm_leaderboard_tasks.txt \
+    --tasks examples/tasks/open_llm_leaderboard_tasks.txt \
     --override_batch_size 1 \
     --output_dir="./evals/"
 ```
@@ -220,7 +228,7 @@ However, we are very grateful to the Harness and HELM teams for their continued
 - [metrics](https://github.com/huggingface/lighteval/tree/main/src/lighteval/metrics): All the available metrics you can use. They are described in metrics, and divided between sample metrics (applied at the sample level, such as a prediction accuracy) and corpus metrics (applied over the whole corpus). You'll also find available normalisation functions.
 - [models](https://github.com/huggingface/lighteval/tree/main/src/lighteval/models): Possible models to use. We cover transformers (base_model), with adapter or delta weights, as well as TGI models locally deployed (it's likely the code here is out of date though), and brrr/nanotron models.
 - [tasks](https://github.com/huggingface/lighteval/tree/main/src/lighteval/tasks): Available tasks. The complete list is in `tasks_table.jsonl`, and you'll find all the prompts in `tasks_prompt_formatting.py`. Popular tasks requiring custom logic are exceptionally added in the [extended tasks](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended).
-- [tasks_examples](https://github.com/huggingface/lighteval/tree/main/tasks_examples) contains a list of available tasks you can launch. We advise using tasks in the `recommended_set`, as it's possible that some of the other tasks need double checking.
+- [examples/tasks](https://github.com/huggingface/lighteval/tree/main/examples/tasks) contains a list of available tasks you can launch. We advise using tasks in the `recommended_set`, as it's possible that some of the other tasks need double checking.
 - [tests](https://github.com/huggingface/lighteval/tree/main/tests) contains our test suite, that we run at each PR to prevent regressions in metrics/prompts/tasks, for a subset of important tasks.

 ## Customisation
@@ -291,7 +299,7 @@ if __name__ == "__main__":

 You can then give your custom metric to lighteval by using `--custom-tasks path_to_your_file` when launching it.

-To see an example of a custom metric added along with a custom task, look at `tasks_examples/custom_tasks_with_custom_metrics/ifeval/ifeval.py`.
+To see an example of a custom metric added along with a custom task, look at `examples/tasks/custom_tasks_with_custom_metrics/ifeval/ifeval.py`.

 ## Available metrics
 ### Metrics for multiple choice tasks
@@ -414,7 +422,7 @@ source <path_to_your_venv>/activate #or conda activate yourenv
 cd <path_to_your_lighteval>/lighteval

 export CUDA_LAUNCH_BLOCKING=1
-srun accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=your model name" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir
+srun accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=your model name" --tasks examples/tasks/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir
 ```

 ## Releases

New file in `examples/model_configs/` (base-model configuration; the exact filename is not captured in this extract)

Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+model:
+  type: "base" # can be base, tgi, or endpoint
+  base_params:
+    model_args: "pretrained=HuggingFaceH4/zephyr-7b-beta,revision=main" # pretrained=model_name,trust_remote_code=boolean,revision=revision_to_use,model_parallel=True ...
+    dtype: "bfloat16"
+  merged_weights: # Ignore this section if you are not using PEFT models
+    delta_weights: false # set to True of your model should be merged with a base model, also need to provide the base model name
+    adapter_weights: false # set to True of your model has been trained with peft, also need to provide the base model name
+    base_model: null # path to the base_model
+  generation:
+    multichoice_continuations_start_space: false # Whether to force multiple choice continuations to start with a space
+    no_multichoice_continuations_start_space: false # Whether to force multiple choice continuations to not start with a space
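
As a quick illustration of how a `base` config like the one above could be consumed, here is a minimal sketch that reads the YAML with PyYAML and flattens it into the `pretrained=...,dtype=...` string that `--model_args` already accepts. The function name, file path, and flattening rules are illustrative assumptions, not lighteval's actual loader.

```python
# Minimal sketch (not lighteval's loader): read a "base" model config and
# flatten it into a `--model_args`-style string. Names here are illustrative.
import yaml


def base_config_to_model_args(config_path: str) -> str:
    with open(config_path) as f:
        model_cfg = yaml.safe_load(f)["model"]

    if model_cfg["type"] != "base":
        raise ValueError("this sketch only handles type: base")

    base_params = model_cfg["base_params"]
    model_args = base_params["model_args"]  # e.g. "pretrained=...,revision=main"
    if base_params.get("dtype"):
        model_args += f",dtype={base_params['dtype']}"  # append dtype if set
    return model_args


if __name__ == "__main__":
    print(base_config_to_model_args("path/to/your_base_model_config.yaml"))
```
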
New file in `examples/model_configs/` (inference-endpoint configuration; the exact filename is not captured in this extract)

Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
+model:
+  type: "endpoint" # can be base, tgi, or endpoint
+  base_params:
+    endpoint_name: "llama-2-7B-lighteval" # needs to be lower case without special characters
+    model: "meta-llama/Llama-2-7b-hf"
+    revision: "main"
+    dtype: "float16" # can be any of "awq", "eetq", "gptq", "4bit' or "8bit" (will use bitsandbytes), "bfloat16" or "float16"
+    reuse_existing: false # if true, ignore all params in instance
+  instance:
+    accelerator: "gpu"
+    region: "eu-west-1"
+    vendor: "aws"
+    instance_size: "medium"
+    instance_type: "g5.2xlarge"
+    framework: "pytorch"
+    endpoint_type: "protected"
+  generation:
+    add_special_tokens: true
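
For the `reuse_existing` case, here is a rough sketch of connecting to an endpoint already deployed under the `endpoint_name` above. It uses `huggingface_hub.get_inference_endpoint`, the same call this commit's `endpoint_model.py` makes; the `wait()`/`url` usage and the YAML handling are assumptions, not this commit's code.

```python
# Illustrative sketch only: connect to an existing Inference Endpoint named in
# the config above when `reuse_existing: true` is set.
import yaml
from huggingface_hub import get_inference_endpoint


def connect_to_existing_endpoint(config_path: str, token: str | None = None) -> str:
    with open(config_path) as f:
        model_cfg = yaml.safe_load(f)["model"]

    if model_cfg["type"] != "endpoint":
        raise ValueError("this sketch only handles type: endpoint")

    endpoint = get_inference_endpoint(
        name=model_cfg["base_params"]["endpoint_name"], token=token
    )
    endpoint.wait()      # block until the endpoint reports it is running
    return endpoint.url  # base URL that generation requests can be sent to
```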

examples/model_configs/tgi_model.yaml

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+model:
+  type: "tgi" # can be base, tgi, or endpoint
+  instance:
+    inference_server_address: ""
+    inference_server_auth: null
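
Since the TGI config only carries a server address and an auth value, a small smoke test like the sketch below can confirm that `inference_server_address` points at a live text-generation-inference server before running an evaluation. The `/generate` route, payload shape, and auth header are assumptions about a standard TGI deployment, not lighteval code.

```python
# Illustrative sketch only: send one request to a TGI server's /generate route.
import requests


def tgi_smoke_test(server_address: str, auth_token: str | None = None) -> str:
    headers = {"Authorization": f"Bearer {auth_token}"} if auth_token else {}
    response = requests.post(
        f"{server_address}/generate",
        json={"inputs": "The capital of France is", "parameters": {"max_new_tokens": 5}},
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["generated_text"]


if __name__ == "__main__":
    print(tgi_smoke_test("http://localhost:8080"))
```
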
File renamed without changes.
File renamed without changes.
File renamed without changes.

run_evals_accelerate.py

Lines changed: 7 additions & 45 deletions

@@ -34,54 +34,16 @@ def get_parser():
     group = parser.add_mutually_exclusive_group(required=True)
     task_type_group = parser.add_mutually_exclusive_group(required=True)

-    # Model type 1) Base model
-    weight_type_group = parser.add_mutually_exclusive_group()
-    weight_type_group.add_argument(
-        "--delta_weights",
-        action="store_true",
-        default=False,
-        help="set to True of your model should be merged with a base model, also need to provide the base model name",
-    )
-    weight_type_group.add_argument(
-        "--adapter_weights",
-        action="store_true",
-        default=False,
-        help="set to True of your model has been trained with peft, also need to provide the base model name",
-    )
-    parser.add_argument(
-        "--base_model", type=str, default=None, help="name of the base model to be used for delta or adapter weights"
-    )
-
+    # Model type: either use a config file or simply the model name
+    task_type_group.add_argument("--model_config_path")
     task_type_group.add_argument("--model_args")
-    parser.add_argument("--model_dtype", type=str, default=None)
-    parser.add_argument(
-        "--multichoice_continuations_start_space",
-        action="store_true",
-        help="Whether to force multiple choice continuations to start with a space",
-    )
-    parser.add_argument(
-        "--no_multichoice_continuations_start_space",
-        action="store_true",
-        help="Whether to force multiple choice continuations to not start with a space",
-    )
-    parser.add_argument("--use_chat_template", default=False, action="store_true")
-    parser.add_argument("--system_prompt", type=str, default=None)
-    # Model type 2) TGI
-    task_type_group.add_argument("--inference_server_address", type=str)
-    parser.add_argument("--inference_server_auth", type=str, default=None)
-    # Model type 3) Inference endpoints
-    task_type_group.add_argument("--endpoint_model_name", type=str)
-    parser.add_argument("--revision", type=str)
-    parser.add_argument("--accelerator", type=str, default=None)
-    parser.add_argument("--vendor", type=str, default=None)
-    parser.add_argument("--region", type=str, default=None)
-    parser.add_argument("--instance_size", type=str, default=None)
-    parser.add_argument("--instance_type", type=str, default=None)
-    parser.add_argument("--reuse_existing", default=False, action="store_true")
+
     # Debug
     parser.add_argument("--max_samples", type=int, default=None)
+    parser.add_argument("--override_batch_size", type=int, default=-1)
     parser.add_argument("--job_id", type=str, help="Optional Job ID for future reference", default="")
     # Saving
+    parser.add_argument("--output_dir", required=True)
     parser.add_argument("--push_results_to_hub", default=False, action="store_true")
     parser.add_argument("--save_details", action="store_true")
     parser.add_argument("--push_details_to_hub", default=False, action="store_true")
@@ -95,8 +57,8 @@ def get_parser():
         help="Hub organisation where you want to store the results. Your current token must have write access to it",
     )
     # Common parameters
-    parser.add_argument("--output_dir", required=True)
-    parser.add_argument("--override_batch_size", type=int, default=-1)
+    parser.add_argument("--use_chat_template", default=False, action="store_true")
+    parser.add_argument("--system_prompt", type=str, default=None)
     parser.add_argument("--dataset_loading_processes", type=int, default=1)
     parser.add_argument(
         "--custom_tasks",

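The parser above now takes either `--model_config_path` or `--model_args` (mutually exclusive). A minimal sketch of how such input might be normalised into one config dict keyed on the YAML `type` field, purely illustrative and not the actual lighteval implementation:

```python
# Illustrative sketch only: dispatch --model_config_path / --model_args into a
# single model-config dict keyed on the YAML `type` field.
import yaml


def build_model_config(args) -> dict:
    if getattr(args, "model_config_path", None):
        with open(args.model_config_path) as f:
            model_cfg = yaml.safe_load(f)["model"]
        if model_cfg["type"] not in ("base", "tgi", "endpoint"):
            raise ValueError(f"Unknown model type: {model_cfg['type']}")
        return model_cfg

    # Fall back to the plain --model_args string, treated as a base model.
    return {"type": "base", "base_params": {"model_args": args.model_args}}
```
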
src/lighteval/main_accelerate.py

Lines changed: 8 additions & 11 deletions

@@ -131,18 +131,15 @@ def main(args):
     final_dict = evaluation_tracker.generate_final_dict()

     with htrack_block("Cleaninp up"):
-        if args.delta_weights:
-            tmp_weights_dir = f"{evaluation_tracker.general_config_logger.model_name}-delta-applied"
-            hlog(f"Removing {tmp_weights_dir}")
-            shutil.rmtree(tmp_weights_dir)
-        if args.adapter_weights:
-            tmp_weights_dir = f"{evaluation_tracker.general_config_logger.model_name}-adapter-applied"
-            hlog(f"Removing {tmp_weights_dir}")
-            shutil.rmtree(tmp_weights_dir)
+        for weights in ["delta", "adapter"]:
+            try:
+                tmp_weights_dir = f"{evaluation_tracker.general_config_logger.model_name}-{weights}-applied"
+                hlog(f"Removing {tmp_weights_dir}")
+                shutil.rmtree(tmp_weights_dir)
+            except OSError:
+                pass

     print(make_results_table(final_dict))

-    if not args.reuse_existing:
-        model.cleanup()
-
+    model.cleanup()
     return final_dict

src/lighteval/models/base_model.py

Lines changed: 3 additions & 3 deletions

@@ -42,7 +42,7 @@
     LoglikelihoodReturn,
     LoglikelihoodSingleTokenReturn,
 )
-from lighteval.models.utils import _get_dtype, _get_precision, _simplify_name, batched
+from lighteval.models.utils import _get_dtype, _simplify_name, batched
 from lighteval.tasks.requests import (
     GreedyUntilMultiTurnRequest,
     GreedyUntilRequest,
@@ -88,7 +88,7 @@ def __init__(
         self.multichoice_continuations_start_space = config.multichoice_continuations_start_space

         # We are in DP (and launch the script with `accelerate launch`)
-        if not config.model_parallel and not config.load_in_4bit and not config.load_in_8bit:
+        if not config.model_parallel and config.quantization_config is None:
             # might need to use accelerate instead
             # self.model = config.accelerator.prepare(self.model)
             hlog(f"Using Data Parallelism, putting model on device {self._device}")
@@ -97,7 +97,7 @@ def __init__(
         self.model_name = _simplify_name(config.pretrained)
         self.model_sha = config.get_model_sha()

-        self.precision = _get_precision(config, model_auto_config=self._config)
+        self.precision = _get_dtype(config.dtype, config=self._config)

     @property
     def tokenizer(self):
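
The data-parallel check now keys off `config.quantization_config` instead of explicit `load_in_4bit`/`load_in_8bit` flags. Below is a plausible sketch of how such a quantization config could be derived from the YAML `dtype` values, with `"4bit"`/`"8bit"` going through bitsandbytes; `BitsAndBytesConfig` is a real `transformers` class, but this mapping is an assumption rather than code from this commit.

```python
# Illustrative sketch only: derive a quantization_config from the config file's
# dtype field, so "4bit"/"8bit" flow through bitsandbytes while float dtypes
# are left to _get_dtype.
from transformers import BitsAndBytesConfig


def quantization_config_from_dtype(dtype: str | None) -> BitsAndBytesConfig | None:
    if dtype == "4bit":
        return BitsAndBytesConfig(load_in_4bit=True)
    if dtype == "8bit":
        return BitsAndBytesConfig(load_in_8bit=True)
    return None  # e.g. "float16"/"bfloat16" are handled as plain torch dtypes
```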

src/lighteval/models/endpoint_model.py

Lines changed: 2 additions & 1 deletion

@@ -63,6 +63,7 @@ class InferenceEndpointModel(LightevalModel):
     def __init__(
         self, config: Union[InferenceEndpointModelConfig, InferenceModelConfig], env_config: EnvConfig
     ) -> None:
+        self.reuse_existing = getattr(config, "should_reuse_existing", True)
         if isinstance(config, InferenceEndpointModelConfig):
             if config.should_reuse_existing:
                 self.endpoint = get_inference_endpoint(name=config.name, token=env_config.token)
@@ -130,7 +131,7 @@ def disable_tqdm(self) -> bool:
         False  # no accelerator = this is the main process

     def cleanup(self):
-        if self.endpoint is not None:
+        if self.endpoint is not None and not self.reuse_existing:
             self.endpoint.delete()
             hlog_warn(
                 "You deleted your endpoint after using it. You'll need to create it again if you need to reuse it."
