Add support for bitnet2b_2501 model #337
Conversation
I fetched the model from https://huggingface.co/microsoft/bitnet-b1.58-2B-4T. When I try to run it, I get an error. And after noticing that the architecture is now "BitNetForCausalLM" instead of "BitnetForCausalLM" and fixing that, I get another error.
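For reference, a minimal sketch (a hypothetical helper, not the converter's actual code; the canonical spelling it maps to is an assumption) of reading the architecture string from config.json and folding the new spelling into the old one:

```python
import json
from pathlib import Path

def read_architecture(model_dir: str) -> str:
    """Read the HF architecture name from config.json and normalize the two
    spellings seen in the wild ("BitnetForCausalLM" vs. the newer
    "BitNetForCausalLM") to a single form the converter already knows."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    arch = config["architectures"][0]
    if arch == "BitNetForCausalLM":  # spelling used by bitnet-b1.58-2B-4T
        arch = "BitnetForCausalLM"   # older spelling expected downstream
    return arch

# e.g. read_architecture("bitnet-b1.58-2B-4T") -> "BitnetForCausalLM"
```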
I can reproduce the issue with the safetensors conversion, but using the method outlined in #169 I was able to get it running.
Full log inside
I even ran it with the same prompt that you ran on the other BitNet models.
Full log inside
They seem to have a separate script in the PR that converts the model, but I'm having issues using that script with it placed in ik_llama.cpp, as it hooks into gguf-py. (Well, first I had to comment out the torch compile on line 948, which did not work since I have CPU-only triton on that system.) It hit this error.
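If it helps, here is a small sketch (a hypothetical wrapper, not the script's actual code) of guarding that torch.compile call so the script can fall back to eager mode instead of needing the line commented out:

```python
import torch

def maybe_compile(fn):
    """Hypothetical guard around the script's torch.compile call: only
    compile when a CUDA device is present (a CPU-only triton setup like the
    one described above is what broke it here); otherwise, or if compilation
    itself raises, return the eager function unchanged."""
    if torch.cuda.is_available():
        try:
            return torch.compile(fn)
        except Exception:
            pass  # compilation failed; fall back to eager
    return fn
```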
For now maybe we can just have GGUF-only support, relying on external tools to convert from safetensors, just like Gemma3?

Edit: Peak speed for me is at 24 threads; I would be curious to see it on your machines since you have a lot of comparative numbers.

Edit 2: Pushed the Python fix for the new name even though that file still doesn't work. I don't see a point in pushing the standalone file since I still can't get that to work either. If they are going to have a standalone file, we may as well tell people to grab a GGUF (I could even upload one for this model; it's small enough).

Edit 3: Even higher speeds with the R4 variant.

Edit 4: Using bench to see some numbers for both, where now 48 threads seems better again; showing the best result for both the R4 and normal variants. Very informal testing, with no dropping of caches or other precautions taken.

Edit 5: It is available on huggingface here.

Edit 6: Another informal benchmark, this time sweep bench.
Yes, I got it running by converting the
Is
Here
It does seem to have an issue with using EOS tokens and stopping generation, so there is still a problem there.
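A quick way to check what the checkpoint itself declares (a minimal sketch, assuming the tokenizer for this model loads with stock transformers; not part of this PR) is to print the tokenizer's EOS token and render its chat template:

```python
from transformers import AutoTokenizer

# Inspect what the HF checkpoint declares as EOS and what its chat
# template actually renders; model id taken from the link above.
tok = AutoTokenizer.from_pretrained("microsoft/bitnet-b1.58-2B-4T")
print("eos_token:", tok.eos_token, "id:", tok.eos_token_id)

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]
print(tok.apply_chat_template(messages, tokenize=False))
```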
Here are the results of the official Microsoft BitNet implementation (build a8ac7072, just pulled).
BitNet is a
I think we can merge like this. It is fine to just use I2_S GGUFs. We can sort out the pre-tokenizer issue later.
Okay, I'll make an issue. I tested the model more: it is coherent and can even do multi-turn conversation. It just never uses an EOS token, so it never stops its own generation; it will just continue until I stop it. And I still don't really understand its chat template:
I couldn't get flash attention running; it would always just exit with
Something is missing in the logic for your number of threads. The model has an unusual number of attention heads: 20 in total, with 5 KV heads. I'm working on a better strategy for distributing the work between the threads.
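To illustrate (a toy sketch, not the actual ik_llama.cpp scheduling code), a naive ceil-division split of 20 heads leaves many threads idle or unevenly loaded at common thread counts:

```python
# Toy illustration of splitting n_head = 20 attention heads across
# different thread counts with naive ceil-division chunking.
def split_heads(n_head: int, n_threads: int) -> list[int]:
    per_thread = -(-n_head // n_threads)  # ceil division
    return [max(0, min(per_thread, n_head - i * per_thread))
            for i in range(n_threads)]

for t in (16, 24, 32, 48):
    counts = split_heads(20, t)
    busy = sum(1 for c in counts if c > 0)
    print(f"{t:2d} threads -> {busy} busy, chunk sizes: {sorted(set(counts), reverse=True)}")
```

With 16 threads only 10 of them get any heads under this naive split (2 heads each), and with 48 threads 28 of them sit idle, which is consistent with throughput varying oddly with the thread count.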
I see. Yes, I can get it working with 16 and 32 threads, but I can't give performance numbers now as I can't drop my caches right now.
A very direct port of microsoft/BitNet#167, more specifically this commit: Eddie-Wang1120/llama.cpp@a8ac707.
I had to make some minor additional fixes; it now compiles.
I have not run the model yet.