
Ideal Rope for CodeLLama 2 based models differs vastly from LLama 2. #426

Closed
Nexesenex opened this issue Sep 8, 2023 · 12 comments

@Nexesenex

Nexesenex commented Sep 8, 2023

CodeLlama 2 models are loaded with an automatic rope base frequency similar to Llama 2's when the rope is not specified on the command line at launch.
But the initial base rope frequency for CL2 is 1000000, not 10000.

I couldn't find nor figure out the formula to calculate a proper rope base frequency for CL2 according to context length (if you have some ideas...), as I'm poor at algebra. But from empirical perplexity tests, the best base rope frequency seems to revolve around 100000 if the rope scale is left at 1, up to a context of 12288.
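For what it's worth, one commonly cited NTK-aware heuristic scales the base by the context-extension ratio raised to d/(d-2), where d is the head dimension (n_rot = 128 here). A minimal sketch, purely illustrative and not validated against these measurements:

```python
# Minimal sketch of the commonly cited NTK-aware heuristic for choosing a RoPE
# base from a context-extension ratio. Shown only as a starting point: it is
# not the formula llama.cpp or KoboldCpp uses, and it does not claim to
# reproduce the ~100000 optimum found empirically above.

def ntk_rope_base(base: float, ctx_target: int, ctx_train: int, head_dim: int = 128) -> float:
    alpha = max(ctx_target / ctx_train, 1.0)  # extension factor; never shrink the base
    return base * alpha ** (head_dim / (head_dim - 2))

# Llama 2 (native base 10000, trained at 4096) stretched to 12288 ctx:
print(ntk_rope_base(10000.0, 12288, 4096))     # ~30500
# CodeLlama (native base 1000000, trained at 16384) at 16384 ctx: unchanged.
print(ntk_rope_base(1000000.0, 16384, 16384))  # 1000000.0
```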

I observed that the variance between 10000, 100000 and 1000000 follows a curve with a 0.2 perplexity amplitude at 512 ctx and 0.02 around 12288, with 100000 having the lowest perplexity.

I could run more tests on a 7b model with a proper command/script to log, in llama.cpp, the perplexities found with different rope base frequency/scale configurations up to 32768 ctx or even higher, as some developers seem to do on the ggerganov reddit, but I didn't find the script (and I'm on Windows).
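If someone wants to script such a sweep, here is a rough, Windows-friendly sketch driving llama.cpp's perplexity tool; the binary, model and text paths, plus the output parsing, are assumptions to adapt:

```python
# Rough sketch of a RoPE-base sweep using llama.cpp's perplexity tool.
# The binary/model/text paths are placeholders, and the output parsing is an
# assumption -- adapt the regex to whatever your build actually prints.
import re
import subprocess

PERPLEXITY_BIN = r"C:\llama.cpp\perplexity.exe"              # assumed path
MODEL = r"C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf"  # model under test
TEXT = r"C:\data\wiki.test.raw"                              # assumed eval text

for base in (10000.0, 26000.0, 100000.0, 1000000.0):
    cmd = [
        PERPLEXITY_BIN, "-m", MODEL, "-f", TEXT,
        "-c", "12288",
        "--rope-freq-base", str(base),
        "--rope-freq-scale", "1.0",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    # Take the last floating-point number reported, assumed to be the final
    # perplexity estimate.
    numbers = re.findall(r"\d+\.\d+", output)
    print(f"base={base:>9.0f}  ppl={numbers[-1] if numbers else 'n/a'}")
```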

Once Johannes Gaessler's PR for the q8_0-quantized KV cache is accepted, we can probably test up to 100,000 ctx on 7b with a single 24GB graphics card.

@Nexesenex Nexesenex changed the title Ideal Rope for CodeLLama 2 based models Ideal Rope for CodeLLama 2 based models differs vastly from LLama 2. Sep 8, 2023
@SabinStargem

SabinStargem commented Sep 9, 2023

Thanks for the info. My copy of Airoboros c34b has been more intelligent after applying your RoPE settings to it. I was wondering why it was a bit dopey. Hopefully the auto-rope can be tweaked to properly handle 34b, especially as we use 16k+ and aim for that 100,000 context someday.

EDIT: I posted a copy of your name and quote onto the LlamaCPP repository, just in case the issue affects that project.

@LostRuins
Owner

The auto rope should be handled correctly in the latest version (1.43). What value do you see for n_ctx_train? It applies a secondary scaling to the final rope value; what value do you see being used for the automatic rope?

@SabinStargem

SabinStargem commented Sep 9, 2023

Using automatic RoPE scaling (scale:1.000, base:26000.0)
llm_load_print_meta: n_ctx_train = 16384

Question: Is the value "1.0e-05" in my log correct? There is a LlamaCPP thread where Slaren said this:

Slaren

The CodeLlama models can now be converted to gguf using convert.py, but to operate properly they require the parameter --rope-freq-base 1e6. This parameter needs to be added to the gguf model file metadata.

Here is my entire log.


Welcome to KoboldCpp - Version 1.43
For command line arguments, please refer to --help


Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Overriding thread count, using 6 threads instead.
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=6, config=None, contextsize=16384, debugmode=False, forceversion=0, gpulayers=0, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/airoboros-c34b-2.1.Q6_K.gguf', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=True, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=6, unbantokens=False, useclblast=None, usecublas=['normal', '0', 'mmq'], usemirostat=None, usemlock=True)

Loading model: C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf
[Threads: 6, BlasThreads: 6, SmartContext: False]


Identified as LLAMA model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling (scale:1.000, base:26000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf (version GGUF V2 (latest))
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: freq_base = 26000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model size = 33.74 B
llm_load_print_meta: general.name = jondurbin_airoboros-c34b-2.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.14 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 26400.83 MB (+ 3072.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/51 layers to GPU
llm_load_tensors: VRAM used: 0 MB
....................................................................................................
llama_new_context_with_model: kv self size = 3072.00 MB
llama_new_context_with_model: compute buffer total size = 8385.48 MB
llama_new_context_with_model: VRAM scratch buffer: 8384.01 MB
Load Model OK: True
Embedded Kobold Lite loaded.

Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
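As an aside, one way to double-check what the .gguf file itself carries (rather than what the loader prints) is to read its metadata with the gguf-py package; a minimal sketch, assuming its GGUFReader API:

```python
# Hedged sketch: check what rope values a .gguf file actually carries in its
# metadata, using the gguf-py package that ships with llama.cpp
# (pip install gguf). The exact GGUFReader field layout is an assumption and
# may differ between package versions.
from gguf import GGUFReader

reader = GGUFReader(r"C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf")
for name, field in reader.fields.items():
    if "rope" in name or "context_length" in name:
        # For scalar metadata, the value is assumed to live in the part
        # indexed by field.data[0].
        print(name, field.parts[field.data[0]][0])
```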

@LostRuins
Owner

Ah okay, I see what you mean. The correct base for this scenario should be 10000, so I probably have a bug that needs to be fixed.

@Nexesenex
Author

@LostRuins : About bugs to be fixed, there's one just fixed for you 👍

#374 (comment)

@SabinStargem : You are welcome. And thank you for all your inquiries and comments all around Llama. I spotted you a while ago, and when I have a question, I usually check whether you asked it before me, and that's been the case for weeks already! ^^

@SabinStargem

According to Kerfluffle at LlamaCPP, I misunderstood the correlation between ems_rope and what Slaren said. Good. One less thing to be fixed.

@Nexesenex
Author

@SabinStargem : here's something to try for you.
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.43.b1204e
It includes a small fix for CUDA MMQ, so you can fit more context length on your GPU.

As for the epsilon values, they are not to be confused with the theta value (the "initial rope"). Here's a read about it: ggml-org#2384

@WolframRavenwolf

WolframRavenwolf commented Sep 10, 2023

So what's the correct ropeconfig for CodeLlama 2?

KoboldCpp's autodetection sets ropeconfig=[0.0, 10000.0] while the metadata says:

  • llm_load_print_meta: n_ctx_train = 16384
  • llm_load_print_meta: freq_base = 1000000.0
  • llm_load_print_meta: freq_scale = 1

So I'm manually loading it with --contextsize 16384 --ropeconfig 1 1000000 - is that right?

By the way, the 0.0 instead of 1.0 for autodetected frequency scale looks weird to me as well. (That's why I always manually set --contextsize 4096 --ropeconfig 1 10000 with the other 4K Llama 2 models.)

It would be great if KoboldCpp set contextsize and ropeconfig properly by default according to the GGUF metadata.

@Nexesenex
Author

Nexesenex commented Sep 10, 2023

For CodeLlama 2, it's 1 100000, not 1 1000000, and this holds up to 16384 ctx. I might run more tests at higher contexts to see at which point the rope base frequency needs to be raised beyond 100000 towards the theta value (1000000, which is a ceiling on CL2, not a floor like the 10000 of L1 and L2).

As for the zero scale, I didn't test it, but I'd suggest putting 1 as you are already doing, no matter what zero means for KoboldCPP: 1 will always be 1, while zero can mean either zero or "no factor" (hence 1); I didn't check the code to see. ^^
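For what it's worth, a sentinel like that is usually just normalised before use; a hypothetical illustration, not KoboldCpp's code:

```python
# Hypothetical illustration (not KoboldCpp's actual code): treat a configured
# scale of 0 as "unset" and fall back to the neutral factor 1.0, so 0 and 1
# end up behaving the same while explicit non-zero values are respected.
def effective_rope_scale(configured: float) -> float:
    return configured if configured > 0.0 else 1.0

print(effective_rope_scale(0.0))  # 1.0 -> no scaling
print(effective_rope_scale(0.5))  # 0.5 -> explicit linear scaling
```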

@LostRuins
Owner

The auto rope scale has been improved in v1.44, which has just been released. The training context of the model should now be applied correctly as a scale to the expected rope base.
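Presumably the idea is along these lines (a hypothetical sketch of the behaviour being described, not KoboldCpp's actual code):

```python
# Hypothetical sketch (not KoboldCpp's actual code): keep the GGUF's native
# freq_base whenever the requested context fits within the training context,
# and only stretch it (NTK-style) when the context is extended beyond that.
def auto_rope_base(native_base: float, n_ctx: int, n_ctx_train: int, head_dim: int = 128) -> float:
    if n_ctx <= n_ctx_train:
        return native_base                 # e.g. 10000 for Llama 2, 1e6 for CodeLlama
    alpha = n_ctx / n_ctx_train            # how far past the training context we go
    return native_base * alpha ** (head_dim / (head_dim - 2))

# 16k on a model trained at 16k: the base passes through untouched.
print(auto_rope_base(1000000.0, 16384, 16384))  # 1000000.0
```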

@WolframRavenwolf

WolframRavenwolf commented Sep 20, 2023

@LostRuins Thanks, not having to specify the RoPE scale anymore makes things much easier.

But the auto-detected value for Code Llama 2 is 1000000.0, which is different from what @Nexesenex claimed in the post above yours. So which is the proper value?

@SabinStargem

A while back someone pointed me to the official Llama repositories; for 34b it is indeed 1,000,000.
