Ideal RoPE for CodeLlama 2 based models differs vastly from Llama 2 #426
Thanks for the info. My copy of Airoboros c34b has been more intelligent since applying your RoPE settings to it. I was wondering why it was a bit dopey. Hopefully the auto-rope can be tweaked to properly handle 34b, especially as we use 16k+ contexts and aim for that 100,000 context someday. EDIT: I posted your name and quote onto the LlamaCPP repository, in case the issue affects that project too.
The auto rope should be handled correctly in the latest version (1.43). What value do you see for n_train_ctx? It applies a secondary scaling to the final rope value; what value do you see being used for the automatic rope?
Using automatic RoPE scaling (scale:1.000, base:26000.0)
Question: Is the value "1.0e-05" in my log correct? There is a LlamaCPP thread where Slaren said this:
Here is my entire log.
Starting Kobold HTTP Server on port 5001
Ah okay, I see what you mean. The correct base for this scenario should be 10000, so I probably have a bug that needs to be fixed.
@LostRuins: About bugs to be fixed, there's one just fixed for you 👍 @SabinStargem: You are welcome. And thank you for all your inquiries and comments all around Llama. I spotted you a while ago, and when I have a question, I usually check whether you asked it before me, and have been doing so for weeks already! ^^
According to Kerfluffle at LlamaCPP, I misunderstood the correlation between ems_rope and what Slaren said. Good. One less thing to fix.
@SabinStargem: here's something for you to try. As for the epsilon values, they are not to be confused with the theta value (the "initial rope"). Here's a read about it: ggml-org#2384
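To make the distinction concrete: the theta / rope base value sets the rotary frequencies that encode positions, while the 1.0e-05 seen in the log is rms_norm_eps, an unrelated numerical-stability constant in RMSNorm. Here is a minimal sketch of how the base enters RoPE, using the standard formula (variable names are mine, not llama.cpp's):

```python
def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Per-pair rotary frequencies: theta_i = base^(-2i/d), i = 0..d/2-1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

# The angle applied to pair i at position p is p * theta_i, so a larger
# base (CodeLlama's 1e6 vs Llama 2's 1e4) makes the low-frequency pairs
# rotate far more slowly, stretching the usable context window.
print(rope_frequencies(128, base=10000.0)[-1])    # ≈ 1.2e-04
print(rope_frequencies(128, base=1000000.0)[-1])  # ≈ 1.2e-06
```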
So what's the correct ropeconfig for CodeLlama 2? KoboldCpp's autodetection sets ropeconfig=[0.0, 10000.0] while the metadata says:
So I'm manually loading it with the rope values set by hand. By the way, the 0.0 instead of 1.0 for the autodetected frequency scale looks weird to me as well. (That's why I always set it manually.) It would be great if KoboldCpp set contextsize and ropeconfig properly according to the GGUF metadata by default.
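For pulling those numbers out of a model file programmatically, here is a sketch assuming the gguf Python package's GGUFReader (the filename is hypothetical, and the field-access pattern may differ between gguf versions):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("codellama-34b.Q4_K_M.gguf")  # hypothetical filename

# Scalar metadata values live in field.parts, indexed via field.data.
for key in ("llama.context_length", "llama.rope.freq_base"):
    field = reader.fields.get(key)
    if field is not None:
        print(key, "=", field.parts[field.data[0]][0])
```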
For CodeLlama 2, it's 1 100000, not 1 1000000, and this for up to 16384 ctx. I might run more tests at higher contexts to see at which point the rope base frequency needs to be raised beyond 100000, towards the theta value (1000000, which is a ceiling on CL2, not a floor like the 10000 of Llama 1 and Llama 2). As for the zero scale, I didn't test it, but I'd suggest putting 1 as you are already doing, no matter what zero means for KoboldCPP: 1 will always be 1, while zero can mean either zero or "no factor" (hence 1); I didn't check the code to see. ^^
The auto rope scale has been improved in v1.44, which has just been released. The training context of the model should now be applied correctly as a scale to the expected rope base.
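For intuition about what applying the training context as a scale to the rope base can look like, here is the common NTK-aware scaling heuristic; this is a sketch of the general technique, not necessarily KoboldCpp's exact formula (the exponent and default head_dim are assumptions):

```python
def auto_rope_base(requested_ctx: int, train_ctx: int,
                   base: float = 10000.0, head_dim: int = 128) -> float:
    """NTK-aware scaling: grow the base instead of compressing positions,
    so short-range position resolution is preserved."""
    if requested_ctx <= train_ctx:
        return base
    alpha = requested_ctx / train_ctx
    # The d/(d-2) exponent keeps the highest-frequency pair unchanged.
    return base * alpha ** (head_dim / (head_dim - 2))

# A model trained at 4096 ctx launched at 8192 ctx:
print(auto_rope_base(8192, 4096))  # ≈ 20221 (same ballpark as the 26000.0 logged above)
```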
@LostRuins Thanks, not having to specify the RoPE scale anymore makes things much easier. But the auto-detected value for CodeLlama 2 is 1000000.0, which is different from what @Nexesenex claimed in the post above yours. So which is the proper value?
A while back, someone pointed me to the official Llama repositories; 34b is indeed 1,000,000.
CodeLlama 2 models are loaded with an automatic rope base frequency similar to Llama 2's when the rope is not specified in the command-line launch.
But the initial base rope frequency for CL2 is 1000000, not 10000.
I couldn't find nor figure out the formula to calculate a proper rope base frequency for CL2 according to context length (if you have some ideas..), I'm lame in algebra, but from empirical perplexity tests, the best base rope frequency seems to revolve around 100000 if the rope scale is left at 1, up to a context of 12288.
I observed that the variance between bases of 10000, 100000, and 1000000 forms a curve with about 0.2 perplexity of amplitude at 512 ctx and 0.02 around 12288 ctx, with 100000 having the lowest perplexity.
I could run more tests on a 7b model with a proper command/script that logs the perplexities llama.cpp finds with different rope base frequency/scale configs up to 32768 ctx or even higher, as some developers on the ggerganov reddit seem to use, but I couldn't find the script (and I'm on Windows).
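For the missing script, a minimal sweep can shell out to llama.cpp's perplexity tool (the flags -m, -f, -c, --rope-freq-base, and --rope-freq-scale are real llama.cpp options; the binary path, test file, and output-parsing regex are assumptions, and on Windows the binary would be perplexity.exe):

```python
import re
import subprocess

MODEL = "models/codellama-7b.Q4_K_M.gguf"  # assumed path
TESTFILE = "wiki.test.raw"                 # assumed perplexity corpus

for base in (10000.0, 100000.0, 1000000.0):
    out = subprocess.run(
        ["./perplexity", "-m", MODEL, "-f", TESTFILE, "-c", "12288",
         "--rope-freq-base", str(base), "--rope-freq-scale", "1.0"],
        capture_output=True, text=True,
    ).stdout
    # llama.cpp prints a line like "Final estimate: PPL = 5.4007 +/- ..."
    m = re.search(r"PPL\s*=\s*([\d.]+)", out)
    print(f"base={base:>9.0f}  ppl={m.group(1) if m else 'parse failed'}")
```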
Once Johannes Gaessler's PR about quantizing the KV cache to q8_0 is accepted, we can probably test up to 100,000 ctx on 7b with a single 24GB graphics card.