
Ideal Rope for CodeLLama 2 based models differs vastly from LLama 2. #426

Closed
Nexesenex opened this issue Sep 8, 2023 · 12 comments

@Nexesenex

Nexesenex commented Sep 8, 2023

CodeLlama 2 models are loaded with an automatic rope base frequency similar to Llama 2's when the rope is not specified on the command line at launch.
But the initial base rope frequency for CL2 is 1000000, not 10000.

I couldn't find nor figure out the formula to calculate a proper rope base frequency for CL2 according to context length (if you have some ideas...), as I'm poor at algebra. But from empirical perplexity tests, the best base rope frequency seems to revolve around 100000 if the rope scale is left at 1, up to a context of 12288.
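For what it's worth, one commonly cited NTK-aware heuristic scales the base by the context-extension ratio raised to d/(d-2), where d is the head dimension (n_rot = 128 here). A minimal sketch, purely illustrative and not validated against these measurements:

```python
# Minimal sketch of the commonly cited NTK-aware heuristic for choosing a RoPE
# base from a context-extension ratio. Shown only as a starting point: it is
# not the formula llama.cpp or KoboldCpp uses, and it does not claim to
# reproduce the ~100000 optimum found empirically above.

def ntk_rope_base(base: float, ctx_target: int, ctx_train: int, head_dim: int = 128) -> float:
    alpha = max(ctx_target / ctx_train, 1.0)  # extension factor; never shrink the base
    return base * alpha ** (head_dim / (head_dim - 2))

# Llama 2 (native base 10000, trained at 4096) stretched to 12288 ctx:
print(ntk_rope_base(10000.0, 12288, 4096))     # ~30500
# CodeLlama (native base 1000000, trained at 16384) at 16384 ctx: unchanged.
print(ntk_rope_base(1000000.0, 16384, 16384))  # 1000000.0
```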

I observed that the variance between 10000, 100000 and 1000000 follows a curve with a 0.2 perplexity amplitude at 512 ctx and 0.02 around 12288, with 100000 having the lowest perplexity.

I could run more tests on a 7b model with a proper command/script to log, in llama.cpp, the perplexities found with different rope base frequency/scale configurations up to 32768 ctx or even higher, as some developers seem to do on the ggerganov reddit, but I didn't find the script (and I'm on Windows).
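If someone wants to script such a sweep, here is a rough, Windows-friendly sketch driving llama.cpp's perplexity tool; the binary, model and text paths, plus the output parsing, are assumptions to adapt:

```python
# Rough sketch of a RoPE-base sweep using llama.cpp's perplexity tool.
# The binary/model/text paths are placeholders, and the output parsing is an
# assumption -- adapt the regex to whatever your build actually prints.
import re
import subprocess

PERPLEXITY_BIN = r"C:\llama.cpp\perplexity.exe"              # assumed path
MODEL = r"C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf"  # model under test
TEXT = r"C:\data\wiki.test.raw"                              # assumed eval text

for base in (10000.0, 26000.0, 100000.0, 1000000.0):
    cmd = [
        PERPLEXITY_BIN, "-m", MODEL, "-f", TEXT,
        "-c", "12288",
        "--rope-freq-base", str(base),
        "--rope-freq-scale", "1.0",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    # Take the last floating-point number reported, assumed to be the final
    # perplexity estimate.
    numbers = re.findall(r"\d+\.\d+", output)
    print(f"base={base:>9.0f}  ppl={numbers[-1] if numbers else 'n/a'}")
```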

Once Johannes Gaessler's PR for the q8_0-quantized KV cache is accepted, we can probably test up to 100,000 ctx on 7b with a single 24GB graphics card.

@Nexesenex Nexesenex changed the title Ideal Rope for CodeLLama 2 based models Ideal Rope for CodeLLama 2 based models differs vastly from LLama 2. Sep 8, 2023
@SabinStargem

SabinStargem commented Sep 9, 2023

Thanks for the info. My copy of Airoboros c34b has been more intelligent after applying your RoPE settings to it. I was wondering why it was a bit dopey. Hopefully the auto-rope can be tweaked to properly handle 34b, especially as we use 16k+ and aim for that 100,000 context someday.

EDIT: I posted a copy of your name and quote onto the LlamaCPP repository, just in case the issue affects that project.

@LostRuins
Owner

The auto rope should be handled correctly in the latest version (1.43). What value do you see for n_ctx_train? It applies a secondary scaling to the final rope value; what value do you see being used for the automatic rope?

@SabinStargem

SabinStargem commented Sep 9, 2023

Using automatic RoPE scaling (scale:1.000, base:26000.0)
llm_load_print_meta: n_ctx_train = 16384

Question: Is the value "1.0e-05" in my log correct? There is a LlamaCPP thread where Slaren said this:

Slaren

The CodeLlama models can now be converted to gguf using convert.py, but to operate properly they require the parameter --rope-freq-base 1e6. This parameter needs to be added to the gguf model file metadata.

Here is my entire log.


Welcome to KoboldCpp - Version 1.43
For command line arguments, please refer to --help


Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Overriding thread count, using 6 threads instead.
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=6, config=None, contextsize=16384, debugmode=False, forceversion=0, gpulayers=0, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/airoboros-c34b-2.1.Q6_K.gguf', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=True, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=6, unbantokens=False, useclblast=None, usecublas=['normal', '0', 'mmq'], usemirostat=None, usemlock=True)

Loading model: C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf
[Threads: 6, BlasThreads: 6, SmartContext: False]


Identified as LLAMA model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling (scale:1.000, base:26000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf (version GGUF V2 (latest))
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: freq_base = 26000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model size = 33.74 B
llm_load_print_meta: general.name = jondurbin_airoboros-c34b-2.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.14 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 26400.83 MB (+ 3072.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/51 layers to GPU
llm_load_tensors: VRAM used: 0 MB
....................................................................................................
llama_new_context_with_model: kv self size = 3072.00 MB
llama_new_context_with_model: compute buffer total size = 8385.48 MB
llama_new_context_with_model: VRAM scratch buffer: 8384.01 MB
Load Model OK: True
Embedded Kobold Lite loaded.

Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
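As an aside, one way to double-check what the .gguf file itself carries (rather than what the loader prints) is to read its metadata with the gguf-py package; a minimal sketch, assuming its GGUFReader API:

```python
# Hedged sketch: check what rope values a .gguf file actually carries in its
# metadata, using the gguf-py package that ships with llama.cpp
# (pip install gguf). The exact GGUFReader field layout is an assumption and
# may differ between package versions.
from gguf import GGUFReader

reader = GGUFReader(r"C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf")
for name, field in reader.fields.items():
    if "rope" in name or "context_length" in name:
        # For scalar metadata, the value is assumed to live in the part
        # indexed by field.data[0].
        print(name, field.parts[field.data[0]][0])
```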

@LostRuins
Owner

Ah okay, I see what you mean. The correct base for this scenario should be 10000, so I probably have a bug that needs to be fixed.

@Nexesenex
Author

@LostRuins : About bugs to be fixed, there's one just fixed for you 👍

#374 (comment)

@SabinStargem : You are welcome. And thank you for all your inquiries and comments all around Llama. I spotted you a while ago, and when I have a question, I usually check whether you asked it before me, and that's been the case for weeks already! ^^

@SabinStargem

According to Kerfluffle at LlamaCPP, I misunderstood the correlation between ems_rope and what Slaren said. Good. One less thing to be fixed.

@Nexesenex
Author

@SabinStargem : here's something to try for you.
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.43.b1204e
It includes a small fix for CUDA MMQ, so you can fit more context length on your GPU.

As for the epsilon values, they are not to be confused with the theta value (the "initial rope"). Here's a read about it: ggml-org#2384

@WolframRavenwolf

WolframRavenwolf commented Sep 10, 2023

So what's the correct ropeconfig for CodeLlama 2?

KoboldCpp's autodetection sets ropeconfig=[0.0, 10000.0] while the metadata says:

  • llm_load_print_meta: n_ctx_train = 16384
  • llm_load_print_meta: freq_base = 1000000.0
  • llm_load_print_meta: freq_scale = 1

So I'm manually loading it with --contextsize 16384 --ropeconfig 1 1000000 - is that right?

By the way, the 0.0 instead of 1.0 for autodetected frequency scale looks weird to me as well. (That's why I always manually set --contextsize 4096 --ropeconfig 1 10000 with the other 4K Llama 2 models.)

It would be great if KoboldCpp set contextsize and ropeconfig properly by default according to the GGUF metadata.

@Nexesenex
Author

Nexesenex commented Sep 10, 2023

For CodeLlama 2, it's 1 100000, not 1 1000000, and this holds up to 16384 ctx. I might run more tests at higher contexts to see at which point the rope base frequency needs to be raised beyond 100000 towards the theta value (1000000, which is a ceiling on CL2, not a floor like the 10000 of L1 and L2).

As for the zero scale, I didn't test it, but I'd suggest putting 1 as you are already doing, no matter what zero means for KoboldCPP: 1 will always be 1, while zero can mean either zero or "no factor" (hence 1); I didn't check the code to see. ^^
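For what it's worth, a sentinel like that is usually just normalised before use; a hypothetical illustration, not KoboldCpp's code:

```python
# Hypothetical illustration (not KoboldCpp's actual code): treat a configured
# scale of 0 as "unset" and fall back to the neutral factor 1.0, so 0 and 1
# end up behaving the same while explicit non-zero values are respected.
def effective_rope_scale(configured: float) -> float:
    return configured if configured > 0.0 else 1.0

print(effective_rope_scale(0.0))  # 1.0 -> no scaling
print(effective_rope_scale(0.5))  # 0.5 -> explicit linear scaling
```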

@LostRuins
Owner

The auto rope scale has been improved in v1.44, which has just been released. The training context of the model should now be applied correctly as a scale to the expected rope base.
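Presumably the idea is along these lines (a hypothetical sketch of the behaviour being described, not KoboldCpp's actual code):

```python
# Hypothetical sketch (not KoboldCpp's actual code): keep the GGUF's native
# freq_base whenever the requested context fits within the training context,
# and only stretch it (NTK-style) when the context is extended beyond that.
def auto_rope_base(native_base: float, n_ctx: int, n_ctx_train: int, head_dim: int = 128) -> float:
    if n_ctx <= n_ctx_train:
        return native_base                 # e.g. 10000 for Llama 2, 1e6 for CodeLlama
    alpha = n_ctx / n_ctx_train            # how far past the training context we go
    return native_base * alpha ** (head_dim / (head_dim - 2))

# 16k on a model trained at 16k: the base passes through untouched.
print(auto_rope_base(1000000.0, 16384, 16384))  # 1000000.0
```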

@WolframRavenwolf

WolframRavenwolf commented Sep 20, 2023

@LostRuins Thanks, not having to specify the RoPE scale anymore makes things much easier.

But the auto-detected value for Code Llama 2 is 1000000.0, which is different from what @Nexesenex claimed in the post above yours. So which is the proper value?

@SabinStargem

A while back someone pointed me to the official Llama repositories; for 34b it is indeed 1,000,000.
