
merged RL model performed badly #347


Closed
vpegasus opened this issue May 8, 2023 · 9 comments

Comments

@vpegasus

vpegasus commented May 8, 2023

Hi @younesbelkada, I downloaded your trained model for some testing.

I want to get the final RL model for inference via the following steps:

Step 1: Download Meta's LLaMA model and convert it to Transformers format with the conversion script; call this the hf model.
Step 2: Download the se model and the rl model from https://huggingface.co/trl-lib/llama-7b-se-peft and https://huggingface.co/trl-lib/llama-7b-se-rl-peft, respectively.
Step 3: Merge the hf model and the se model via the following script:

from peft import PeftConfig, PeftModel
from transformers import LlamaForCausalLM

peft_model_id = 'se model pth'
config = PeftConfig.from_pretrained(peft_model_id)
model = LlamaForCausalLM.from_pretrained(hf_model_pth)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.merge_and_unload()
model = model.base_model
model.save_pretrained('./merged_se')

Step 4: Merge the merged_se model and the rl model to get the final RL model via the following script:

peft_model_id='rl model pth'
config = PeftConfig.from_pretrained(peft_model_id)
model = LlamaForCausalLM.from_pretrained(merged_se_model_pth)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.merge_and_unload()
model = model.base_model
model.save_pretrained('./merged_rl')

However, the response of the merged rl model, generated via the following script:

prompt = "Question: CUDA runtime error (59) : device-side assert triggered\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(input_ids = inputs.input_ids,max_new_tokens=200)
resp = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(resp)

is very bad:

Question: CUDA runtime error (59) : device-side assert triggered

Answer:ocachestra7 buttons died alcuni Wolf jeunesDefinition września Komm Komm Komm Komm Komm Komm occ 
headquarters Komm occ Komm occ Bibliografía Amb garden Außerdem Komm occ Bibliografía clean Außerdemscan Amb 
Außerdem Außerdem Außerdemscan Religion Amb equality Bibliografía Amb Bibliografía Bibliografía Bibliografía Bibliografía 
Religion Bibliografía equality Bibliografía string Amb Bibliografía Bibliografía string Amb PAcolog clean Religion Bibliografía string 
arr Amb Religion Bibliografía Bibliografía Bibliografía Religion Religion Religion Religion Religion Religion Religion 
cinqscanquartersquartersscanquarters Entrerimin Amb equality cinqной secret Z Business худоfb gauge Лі это Лі government 
Ліpaste худо pop NC Ліpastepastepaste Unless худо government NC NC NC NC NC NC NCzat худо NC NCili Amb朱 NC NC NC 
NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC 
NC NC NC NC NC NC NCzatinctadamente NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC 
NC NC NC NC NC

But when only merging the rl model and the hf model, the response seems normal:

Question: CUDA runtime error (59) : device-side assert triggered

Answer: I'm not sure if this is the same problem, but I had a similar problem.

I was using a CUDA kernel to do some matrix multiplication. The kernel was working fine, but when I tried to run it on a different GPU, I got the same error.

I found that the problem was that the kernel was using a matrix that was too large for the GPU.

I solved the problem by using a smaller matrix.

Comment: I'm not sure if this is the same problem, but I had a similar problem.

I was using a CUDA kernel to do some matrix multiplication. The kernel was working fine, but when I tried to run it on a different GPU, I got the same error.

I found that the problem was that the kernel was using a matrix that was too large for the GPU.

I solved the problem by using a smaller matrix.

Comment: I'm not

Please help me figure out which step I did wrong. Thanks.

@BIGPPWONG

I encountered the same problem. Although the response seems normal when merging only the rl model with the hf model, it is irrelevant to the question, and there is a significant degradation compared to the web demo.

@BIGPPWONG

After switching from "LlamaTokenizer" to "LlamaTokenizerFast", I think the problem is solved.
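
Concretely, something like this (the model path is a placeholder for wherever your converted base model lives):

from transformers import LlamaTokenizerFast

model_path = "path/to/converted/llama-7b"  # placeholder, adjust to your setup

# before: tokenizer = LlamaTokenizer.from_pretrained(model_path)   (slow, pure-Python implementation)
# after: the fast, Rust-backed implementation, which seems to fix it for me
tokenizer = LlamaTokenizerFast.from_pretrained(model_path)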

@vpegasus
Author

vpegasus commented May 9, 2023

After switching from "LlamaTokenizer" to "LlamaTokenizerFast", I think the problem is solved.

Hi @BIGPPWONG, thanks for your kind help. But that approach didn't work for me; I don't know if I missed something important.

Previously, I used AutoTokenizer (as in the llama scripts). Following your advice, I switched from AutoTokenizer to LlamaTokenizerFast. The response is still bad (hf model --> se model --> rl model):

Question: Cuda asserted error.

Answer:hos Pen turno accuracy Section mystery kitchenCastro Castro accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy accuracy 
...

To dig into your advice, I ran some tests:

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoTokenizer, LlamaForCausalLM, LlamaTokenizerFast, LlamaTokenizer

# load tokenizers
AutoT = AutoTokenizer.from_pretrained(hf_pth)
LlamaT = LlamaTokenizer.from_pretrained(hf_pth)
LlamaFastT = LlamaTokenizerFast.from_pretrained(hf_pth)
prompt = "hello, world!"
AutoT(prompt, return_tensors="pt")

{'input_ids': tensor([[    1, 22172, 29892,  3186, 29991]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0]], 
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}
LlamaT(prompt, return_tensors="pt")

{'input_ids': tensor([[    1, 22172, 29892,  3186, 29991]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], 
device='cuda:0')}
LlamaFastT(prompt, return_tensors="pt")

{'input_ids': tensor([[    1, 22172, 29892,  3186, 29991]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0]], 
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1]], device='cuda:0')}

From the test above, we can draw a rough conclusion: AutoTokenizer calls LlamaTokenizerFast under the hood. What's more, the token ids are the same across these tokenizers.

In addition, I rechecked the difference between the tokenizer and the fast tokenizer on the tokenizer documentation page; the main difference between them is a full Python implementation versus a “Fast” implementation based on the Rust library.

@BIGPPWONG So, could you please tell me if I have misunderstood something? Thanks!

And @younesbelkada, I'm also looking forward to your help. Thanks.

PS: I found that the tokenizer was updated a few days ago in 22402; is that update relevant to my issue?

@BIGPPWONG

I did the following to make StackLLaMA perform normally (a rough sketch of the steps is below):

  1. uninstall transformers 4.29.0dev0
  2. install transformers 4.28.1 stable and protobuf 3.20.1
  3. change the tokenizer from LlamaTokenizer to AutoTokenizer, so the tokenizer is loaded as LlamaTokenizerFast
  4. test again

FYI, I merged the hf model with the se model and then with the rl model, step by step.
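
Roughly, what I mean (the version pins are the ones listed above; the model path is just a placeholder):

# Steps 1-2, run in a shell:
#   pip uninstall -y transformers
#   pip install transformers==4.28.1 protobuf==3.20.1

# Step 3: load through AutoTokenizer so the fast (Rust-backed) class is picked up
from transformers import AutoTokenizer

model_path = "path/to/converted/llama-7b"  # placeholder, adjust to your setup
tokenizer = AutoTokenizer.from_pretrained(model_path)
print(type(tokenizer).__name__, tokenizer.is_fast)  # should report the fast tokenizer class and True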

@younesbelkada
Contributor

Hi @BIGPPWONG @vpegasus,
Thanks very much for digging into this. Can you try to follow the steps suggested by @BIGPPWONG?
If that works, then huggingface/transformers#22402 might be the culprit, but I'm not sure.

@vpegasus
Author

Hi @younesbelkada @BIGPPWONG, many thanks to both of you! Your kind and detailed replies encouraged me to find out where I went wrong.

First of all, I followed the procedure @BIGPPWONG shared (mainly reinstalling transformers, going from 4.29 back to 4.28.1), but it didn't work for me either.

Having run out of other places to look, I finally found that the culprit was my careless use of LoraModel.merge_and_unload().

As shown in my first description of this issue, I used a script like:

config = PeftConfig.from_pretrained(peft_model_id)
model = LlamaForCausalLM.from_pretrained(hf_model_pth)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.merge_and_unload() # this line troubled me.
model = model.base_model

This script can cause trouble.

Now, I just use merge_peft_adapter.py as intended:

merge the hf model with the se model to get merged_se_model, and then merge merged_se_model with the rl model to get merged_rl_model.
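
For reference, my rough understanding of what merge_peft_adapter.py does at each stage is sketched below (the paths are placeholders); note that there is no extra .base_model step like the one I had added:

from peft import PeftModel
from transformers import LlamaForCausalLM

def merge_adapter(base_model_path, adapter_path, output_path):
    """Fold a LoRA adapter into its base model and save the merged weights."""
    model = LlamaForCausalLM.from_pretrained(base_model_path)
    model = PeftModel.from_pretrained(model, adapter_path)
    model = model.merge_and_unload()  # returns the base causal LM with the adapter weights merged in
    model.save_pretrained(output_path)

# stage 1: hf model + se adapter -> merged_se
merge_adapter("path/to/hf_model", "trl-lib/llama-7b-se-peft", "./merged_se")
# stage 2: merged_se + rl adapter -> merged_rl
merge_adapter("./merged_se", "trl-lib/llama-7b-se-rl-peft", "./merged_rl")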

With this, the outputs are no longer gibberish.

Below is an example of the output:

Question: PyTorch: How to get the shape of a Tensor as a list of int

Answer: You can use `torch.tensor(shape).to_numpy()` or `torch.from_numpy(np.array([1,2])).shape`

Thanks again @BIGPPWONG @younesbelkada

@younesbelkada
Contributor

Awesome! This is great to hear!
Feel free to close the issue!

@vpegasus
Author

Ok, thanks again!

@prasad4fun

Awesome! This is great to hear! Feel free to close the issue!

Hi @younesbelkada, I used the code below for inference with the trained model.

from transformers import LlamaTokenizer
from trl import AutoModelForCausalLMWithValueHead
from trl.core import respond_to_batch

# rl_model_pth, lora_config, and device are defined elsewhere in my script
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    rl_model_pth,
    load_in_8bit=True,
    peft_config=lora_config,
).to(device)
tokenizer = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
query_txt = "<random question here>"
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").to(device)

response_tensor = respond_to_batch(model, query_tensor)
response_txt = tokenizer.decode(response_tensor[0, :])

The responses seem to be mostly short, whereas in the stack-llama rl_training.py I can see ppo_trainer.generate being used in combination with a length sampler for text generation.
Also, in the gradio demo there are additional params (temperature, max new tokens, ...) to control generation.

Could you please suggest how to use these additional params? respond_to_batch seems to be limited.
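
For what it's worth, I assume the usual generation kwargs can be passed through model.generate (as far as I can tell, the value-head wrapper forwards generate to the wrapped causal LM), something like the sketch below, reusing model, tokenizer, and query_tensor from the snippet above; but I'd like to confirm that this is the intended approach:

generation_kwargs = {
    "max_new_tokens": 200,
    "do_sample": True,
    "temperature": 0.7,
    "top_k": 0,
    "top_p": 0.9,
    "pad_token_id": tokenizer.eos_token_id,
}
# model, tokenizer, and query_tensor come from the snippet above
response_tensor = model.generate(query_tensor, **generation_kwargs)
# drop the prompt tokens and decode only the newly generated part
response_txt = tokenizer.decode(response_tensor[0, query_tensor.shape[1]:])
print(response_txt)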
