
merged RL model performed badly #347


Closed
vpegasus opened this issue May 8, 2023 · 9 comments

Comments

@vpegasus

vpegasus commented May 8, 2023

Hi @younesbelkada, I downloaded your trained model for some testing.

I want to get the final RL model for inference via the following steps:

Step 1: Download Meta's LLaMA model and convert it to Transformers format with the conversion script; call this the hf model.
Step 2: Download the se model and the rl model from https://huggingface.co/trl-lib/llama-7b-se-peft and https://huggingface.co/trl-lib/llama-7b-se-rl-peft, respectively.
Step 3: Merge the hf model and the se model via the following script:

from peft import PeftConfig, PeftModel
from transformers import LlamaForCausalLM

peft_model_id = 'se model pth'
config = PeftConfig.from_pretrained(peft_model_id)
model = LlamaForCausalLM.from_pretrained(hf_model_pth)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.merge_and_unload()
model = model.base_model
model.save_pretrained('./merged_se')

Step 4: Merge the merged_se model and the rl model to get the final RL model via the following script:

peft_model_id='rl model pth'
config = PeftConfig.from_pretrained(peft_model_id)
model = LlamaForCausalLM.from_pretrained(merged_se_model_pth)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.merge_and_unload()
model = model.base_model
model.save_pretrained('./merged_rl')

However, the response of the merged rl model, generated via the following script:

prompt = "Question: CUDA runtime error (59) : device-side assert triggered\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(input_ids = inputs.input_ids,max_new_tokens=200)
resp = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(resp)

is very bad:

Question: CUDA runtime error (59) : device-side assert triggered

Answer:ocachestra7 buttons died alcuni Wolf jeunesDefinition września Komm Komm Komm Komm Komm Komm occ 
headquarters Komm occ Komm occ Bibliografía Amb garden Außerdem Komm occ Bibliografía clean Außerdemscan Amb 
Außerdem Außerdem Außerdemscan Religion Amb equality Bibliografía Amb Bibliografía Bibliografía Bibliografía Bibliografía 
Religion Bibliografía equality Bibliografía string Amb Bibliografía Bibliografía string Amb PAcolog clean Religion Bibliografía string 
arr Amb Religion Bibliografía Bibliografía Bibliografía Religion Religion Religion Religion Religion Religion Religion 
cinqscanquartersquartersscanquarters Entrerimin Amb equality cinqной secret Z Business худоfb gauge Лі это Лі government 
Ліpaste худо pop NC Ліpastepastepaste Unless худо government NC NC NC NC NC NC NCzat худо NC NCili Amb朱 NC NC NC 
NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC 
NC NC NC NC NC NC NCzatinctadamente NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC 
NC NC NC NC NC

But when only merging the rl model and the hf model, the response seems normal:

Question: CUDA runtime error (59) : device-side assert triggered

Answer: I'm not sure if this is the same problem, but I had a similar problem.

I was using a CUDA kernel to do some matrix multiplication. The kernel was working fine, but when I tried to run it on a different GPU, I got the same error.

I found that the problem was that the kernel was using a matrix that was too large for the GPU.

I solved the problem by using a smaller matrix.

Comment: I'm not sure if this is the same problem, but I had a similar problem.

I was using a CUDA kernel to do some matrix multiplication. The kernel was working fine, but when I tried to run it on a different GPU, I got the same error.

I found that the problem was that the kernel was using a matrix that was too large for the GPU.

I solved the problem by using a smaller matrix.

Comment: I'm not

Please help me figure out which step I did wrong. Thanks.

@BIGPPWONG

I encountered the same problem. Although the response seems normal when merging only the rl model with the hf model, it is irrelevant to the question, and there is a significant degradation compared to the web demo.

@BIGPPWONG

After switching from "LlamaTokenizer" to "LlamaTokenizerFast", I think the problem is solved.
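
Concretely, something like this (the model path is a placeholder for wherever your converted base model lives):

from transformers import LlamaTokenizerFast

model_path = "path/to/converted/llama-7b"  # placeholder, adjust to your setup

# before: tokenizer = LlamaTokenizer.from_pretrained(model_path)   (slow, pure-Python implementation)
# after: the fast, Rust-backed implementation, which seems to fix it for me
tokenizer = LlamaTokenizerFast.from_pretrained(model_path)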

@vpegasus
Author

vpegasus commented May 9, 2023

After switching from "LlamaTokenizer" to "LlamaTokenizerFast", I think the problem is solved.

Hi @BIGPPWONG, thanks for your kind help. But that approach didn't work for me; I don't know if I missed something important.

Previously, I used AutoTokenizer (as in the llama scripts). Following your advice, I switched from AutoTokenizer to LlamaTokenizerFast. The response is still bad (hf model --> se model --> rl model):

Question: Cuda asserted error.

Answer:hos Pen turno accuracy Section mystery kitchenCastro Castro accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
...

To dig into your advice, I ran some tests:

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoTokenizer, LlamaForCausalLM, LlamaTokenizerFast, LlamaTokenizer

# load tokenizers
AutoT = AutoTokenizer.from_pretrained(hf_pth)
LlamaT = LlamaTokenizer.from_pretrained(hf_pth)
LlamaFastT = LlamaTokenizerFast.from_pretrained(hf_pth)
prompt = "hello, world!"
AutoT(prompt, return_tensors="pt")

{'input_ids': tensor([[    1, 22172, 29892,  3186, 29991]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0]], 
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}
LlamaT(prompt, return_tensors="pt")

{'input_ids': tensor([[    1, 22172, 29892,  3186, 29991]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], 
device='cuda:0')}
LlamaFastT(prompt, return_tensors="pt")

{'input_ids': tensor([[    1, 22172, 29892,  3186, 29991]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0]], 
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}

From the test above, we can draw a rough conclusion: AutoTokenizer calls LlamaTokenizerFast under the hood. What's more, the token ids are the same across these tokenizers.

In addition, I rechecked the difference between the tokenizer and the fast tokenizer on the tokenizer documentation page; the main difference between them is a full Python implementation versus a “Fast” implementation based on the Rust library.

@BIGPPWONG So, could you please tell me if I have misunderstood something? Thanks!

And @younesbelkada, I'm also looking forward to your help. Thanks.

PS: I found that the tokenizer was updated a few days ago in 22402; is that update relevant to my issue?

@BIGPPWONG

I did the following to make StackLLaMA perform normally (a rough sketch of the steps is below):

  1. uninstall transformers 4.29.0dev0
  2. install transformers 4.28.1 stable and protobuf 3.20.1
  3. change the tokenizer from LlamaTokenizer to AutoTokenizer, so the tokenizer is loaded as LlamaTokenizerFast
  4. test again

FYI, I merged the hf model with the se model and then with the rl model, step by step.
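
Roughly, what I mean (the version pins are the ones listed above; the model path is just a placeholder):

# Steps 1-2, run in a shell:
#   pip uninstall -y transformers
#   pip install transformers==4.28.1 protobuf==3.20.1

# Step 3: load through AutoTokenizer so the fast (Rust-backed) class is picked up
from transformers import AutoTokenizer

model_path = "path/to/converted/llama-7b"  # placeholder, adjust to your setup
tokenizer = AutoTokenizer.from_pretrained(model_path)
print(type(tokenizer).__name__, tokenizer.is_fast)  # should report the fast tokenizer class and True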

@younesbelkada
Contributor

Hi @BIGPPWONG @vpegasus,
Thanks very much for digging into this. Can you try to follow the steps suggested by @BIGPPWONG?
If that works, then huggingface/transformers#22402 might be the culprit, but I'm not sure.

@vpegasus
Author

Hi @younesbelkada @BIGPPWONG, many thanks to both of you! Your kind and detailed replies encouraged me to find out where I went wrong.

First of all, I followed the procedure @BIGPPWONG shared (mainly reinstalling transformers, going from 4.29 back to 4.28.1), but it didn't work for me either.

Having run out of other places to look, I finally found that the culprit was my careless use of LoraModel.merge_and_unload().

As shown in my first description of this issue, I used a script like:

config = PeftConfig.from_pretrained(peft_model_id)
model = LlamaForCausalLM.from_pretrained(hf_model_pth)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.merge_and_unload() # this line troubled me.
model = model.base_model

This script can cause trouble.

Now, I just use merge_peft_adapter.py as intended:

merge the hf model with the se model to get merged_se_model, and then merge merged_se_model with the rl model to get merged_rl_model.
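
For reference, my rough understanding of what merge_peft_adapter.py does at each stage is sketched below (the paths are placeholders); note that there is no extra .base_model step like the one I had added:

from peft import PeftModel
from transformers import LlamaForCausalLM

def merge_adapter(base_model_path, adapter_path, output_path):
    """Fold a LoRA adapter into its base model and save the merged weights."""
    model = LlamaForCausalLM.from_pretrained(base_model_path)
    model = PeftModel.from_pretrained(model, adapter_path)
    model = model.merge_and_unload()  # returns the base causal LM with the adapter weights merged in
    model.save_pretrained(output_path)

# stage 1: hf model + se adapter -> merged_se
merge_adapter("path/to/hf_model", "trl-lib/llama-7b-se-peft", "./merged_se")
# stage 2: merged_se + rl adapter -> merged_rl
merge_adapter("./merged_se", "trl-lib/llama-7b-se-rl-peft", "./merged_rl")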

With this, the outputs are no longer gibberish.

Below is an example of the output:

Question: PyTorch: How to get the shape of a Tensor as a list of int

Answer: You can use `torch.tensor(shape).to_numpy()` or `torch.from_numpy(np.array([1,2])).shape`

Thanks again @BIGPPWONG @younesbelkada

@younesbelkada
Contributor

Awesome! This is great to hear!
Feel free to close the issue!

@vpegasus
Author

Ok, thanks again!

@prasad4fun

Awesome! This is great to hear! Feel free to close the issue!

Hi @younesbelkada, I used the code below for inference with the trained model.

from transformers import LlamaTokenizer
from trl import AutoModelForCausalLMWithValueHead
from trl.core import respond_to_batch

# rl_model_pth, lora_config, and device are defined elsewhere in my script
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    rl_model_pth,
    load_in_8bit=True,
    peft_config=lora_config,
).to(device)
tokenizer = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
query_txt = "<random question here>"
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").to(device)

response_tensor = respond_to_batch(model, query_tensor)
response_txt = tokenizer.decode(response_tensor[0, :])

The responses seem to be mostly short, whereas in the stack-llama rl_training.py I can see ppo_trainer.generate being used in combination with a length sampler for text generation.
Also, in the gradio demo there are additional params (temperature, max new tokens, ...) to control generation.

Could you please suggest how to use these additional params? respond_to_batch seems to be limited.
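
For what it's worth, I assume the usual generation kwargs can be passed through model.generate (as far as I can tell, the value-head wrapper forwards generate to the wrapped causal LM), something like the sketch below, reusing model, tokenizer, and query_tensor from the snippet above; but I'd like to confirm that this is the intended approach:

generation_kwargs = {
    "max_new_tokens": 200,
    "do_sample": True,
    "temperature": 0.7,
    "top_k": 0,
    "top_p": 0.9,
    "pad_token_id": tokenizer.eos_token_id,
}
# model, tokenizer, and query_tensor come from the snippet above
response_tensor = model.generate(query_tensor, **generation_kwargs)
# drop the prompt tokens and decode only the newly generated part
response_txt = tokenizer.decode(response_tensor[0, query_tensor.shape[1]:])
print(response_txt)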
