
Commit 5c2fffc

Merge branch 'ggml-org:master' into master
2 parents: da6d8ba + 2434535


72 files changed (+4911, -3056 lines)

.github/workflows/build.yml (+3, -2)

@@ -601,8 +601,9 @@ jobs:
           -DGGML_SYCL_F16=ON
         cmake --build build --config Release -j $(nproc)

-  build-linux-cross:
-    uses: ./.github/workflows/build-linux-cross.yml
+  # Disabled for now due to sporadic issue syncing.
+  # build-linux-cross:
+  #  uses: ./.github/workflows/build-linux-cross.yml

   macOS-latest-cmake-ios:
     runs-on: macos-latest

README.md (+1)

@@ -16,6 +16,7 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)

 ## Hot topics

+- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli` and `gemma3-cli` https://github.com/ggml-org/llama.cpp/pull/13012, `libllava` will be deprecated
 - **How to use [MTLResidencySet](https://developer.apple.com/documentation/metal/mtlresidencyset?language=objc) to keep the GPU memory active?** https://github.com/ggml-org/llama.cpp/pull/11427
 - **VS Code extension for FIM completions:** https://github.com/ggml-org/llama.vscode
 - Universal [tool call support](./docs/function-calling.md) in `llama-server` https://github.com/ggml-org/llama.cpp/pull/9639
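
For orientation, a minimal invocation of the new unified binary in the style of the docs touched below; a sketch only, with placeholder paths, not part of this diff:

```sh
# sketch: run the new unified multimodal CLI (model/projector paths are placeholders)
./llama-mtmd-cli -m model.gguf --mmproj mmproj-model-f16.gguf
# add --image some.jpg -p "What is in the image?" for a single-turn run
```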

common/arg.cpp (+2, -3)

@@ -976,14 +976,13 @@ static void common_params_print_completion(common_params_context & ctx_arg) {
         "llama-gritlm",
         "llama-imatrix",
         "llama-infill",
-        "llama-llava-cli",
+        "llama-mtmd-cli",
         "llama-llava-clip-quantize-cli",
         "llama-lookahead",
         "llama-lookup",
         "llama-lookup-create",
         "llama-lookup-merge",
         "llama-lookup-stats",
-        "llama-minicpmv-cli",
         "llama-parallel",
         "llama-passkey",
         "llama-perplexity",
@@ -2726,7 +2725,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         [](common_params & params, const std::string & value) {
             params.chat_template = value;
         }
-    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_CHAT_TEMPLATE"));
+    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_LLAVA}).set_env("LLAMA_ARG_CHAT_TEMPLATE"));
     add_opt(common_arg(
         {"--chat-template-file"}, "JINJA_TEMPLATE_FILE",
         string_format(
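
The second hunk registers `--chat-template` for `LLAMA_EXAMPLE_LLAVA`, which is what lets the multimodal CLI take a template on the command line. A hedged sketch of such a call, reusing the template name and flag pattern from the docs below (paths are placeholders):

```sh
# sketch: pass a chat template to the multimodal CLI (paths are placeholders)
./llama-mtmd-cli -m ggml-model-f16.gguf --mmproj mmproj-model-f16.gguf --chat-template vicuna
```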

convert_hf_to_gguf.py (+363, -271)

(Large diff not rendered by default.)

convert_lora_to_gguf.py (+3, -3)

@@ -24,7 +24,7 @@
 import gguf

 # reuse model definitions from convert_hf_to_gguf.py
-from convert_hf_to_gguf import LazyTorchTensor, Model
+from convert_hf_to_gguf import LazyTorchTensor, ModelBase

 logger = logging.getLogger("lora-to-gguf")

@@ -340,11 +340,11 @@ def load_hparams_from_hf(hf_model_id: str) -> dict[str, Any]:
         sys.exit(1)
     else:
         logger.info(f"Loading base model: {dir_base_model.name}")
-        hparams = Model.load_hparams(dir_base_model)
+        hparams = ModelBase.load_hparams(dir_base_model)

     with torch.inference_mode():
         try:
-            model_class = Model.from_model_architecture(hparams["architectures"][0])
+            model_class = ModelBase.from_model_architecture(hparams["architectures"][0])
         except NotImplementedError:
             logger.error(f"Model {hparams['architectures'][0]} is not supported")
             sys.exit(1)

examples/llava/MobileVLM-README.md renamed to docs/multimodal/MobileVLM.md (+13, -13)

@@ -9,15 +9,15 @@ The implementation is based on llava, and is compatible with llava and mobileVLM
 Notice: The overall process of model inference for both **MobileVLM** and **MobileVLM_V2** models is the same, but the process of model conversion is a little different. Therefore, using **MobileVLM-1.7B** as an example, the different conversion step will be shown.

 ## Usage
-Build with cmake or run `make llama-llava-cli` to build it.

-After building, run: `./llama-llava-cli` to see the usage. For example:
+Build the `llama-mtmd-cli` binary.
+
+After building, run: `./llama-mtmd-cli` to see the usage. For example:

 ```sh
-./llama-llava-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
+./llama-mtmd-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
     --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
-    --image path/to/an/image.jpg \
-    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:"
+    --chat-template deepseek
 ```

 ## Model conversion
@@ -82,7 +82,7 @@ refer to `android/adb_run.sh`, modify resources' `name` and `path`
 ### case 1
 **input**
 ```sh
-/data/local/tmp/llama-llava-cli \
+/data/local/tmp/llama-mtmd-cli \
     -m /data/local/tmp/ggml-model-q4_k.gguf \
     --mmproj /data/local/tmp/mmproj-model-f16.gguf \
     -t 4 \
@@ -102,7 +102,7 @@ llama_print_timings: total time = 34731.93 ms
 ### case 2
 **input**
 ```sh
-/data/local/tmp/llama-llava-cli \
+/data/local/tmp/llama-mtmd-cli \
     -m /data/local/tmp/ggml-model-q4_k.gguf \
     --mmproj /data/local/tmp/mmproj-model-f16.gguf \
     -t 4 \
@@ -123,10 +123,10 @@ llama_print_timings: total time = 34570.79 ms

 ## Some result on Android with `Snapdragon 778G` chip
 ### MobileVLM-1.7B case
-#### llava-cli release-b2005
+#### mtmd-cli release-b2005
 **input**
 ```sh
-/data/local/tmp/llama-llava-cli \
+/data/local/tmp/llama-mtmd-cli \
     -m /data/local/tmp/ggml-model-q4_k.gguf \
     --mmproj /data/local/tmp/mmproj-model-f16.gguf \
     -t 4 \
@@ -147,7 +147,7 @@ llama_print_timings: prompt eval time = 8119.49 ms / 191 tokens ( 42.51 m
 llama_print_timings: eval time = 1005.75 ms / 14 runs ( 71.84 ms per token, 13.92 tokens per second)
 llama_print_timings: total time = 28038.34 ms / 205 tokens
 ```
-#### llava-cli latest-version
+#### mtmd-cli latest-version
 **input**

 Just the same as above.
@@ -169,7 +169,7 @@ llama_print_timings: eval time = 43894.02 ms / 13 runs ( 3376.46 m
 llama_print_timings: total time = 865441.76 ms / 204 tokens
 ```
 ### MobileVLM_V2-1.7B case
-#### llava-cli release-2005b
+#### mtmd-cli release-2005b
 **input**

 Just the same as above.
@@ -200,7 +200,7 @@ make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 GGML_CUDA_F16=1 -j 32
 ### case 1
 **input**
 ```sh
-./llama-llava-cli \
+./llama-mtmd-cli \
     -m /data/local/tmp/ggml-model-q4_k.gguf \
     --mmproj /data/local/tmp/mmproj-model-f16.gguf \
     --image /data/local/tmp/demo.jpeg \
@@ -224,7 +224,7 @@ llama_print_timings: total time = 1352.63 ms / 252 tokens
 ### case 2
 **input**
 ```sh
-./llama-llava-cli \
+./llama-mtmd-cli \
     -m /data/local/tmp/ggml-model-q4_k.gguf \
     --mmproj /data/local/tmp/mmproj-model-f16.gguf \
     -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:" \

examples/llava/README-gemma3.md renamed to docs/multimodal/gemma3.md (+4, -3)

@@ -26,11 +26,12 @@ llama-gemma3-cli -hf ggml-org/gemma-3-27b-it-GGUF

 ## How to get mmproj.gguf?

+Simply to add `--mmproj` in when converting model via `convert_hf_to_gguf.py`:
+
 ```bash
 cd gemma-3-4b-it
-python ../llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py .
-
-# output file is mmproj.gguf
+python ../llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 --mmproj .
+# output file: mmproj-model.gguf
 ```

 ## How to run it?
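
A plausible follow-up, not shown in this diff and given only as an assumption: run the two converted files with the new CLI, mirroring the pattern used in the other multimodal docs in this commit:

```sh
# assumption: model.gguf and mmproj-model.gguf are the outputs of the conversion above
./build/bin/llama-mtmd-cli -m model.gguf --mmproj mmproj-model.gguf
```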

examples/llava/README-glmedge.md renamed to docs/multimodal/glmedge.md (+3, -3)

@@ -3,12 +3,12 @@
 Currently this implementation supports [glm-edge-v-2b](https://huggingface.co/THUDM/glm-edge-v-2b) and [glm-edge-v-5b](https://huggingface.co/THUDM/glm-edge-v-5b).

 ## Usage
-Build with cmake or run `make llama-llava-cli` to build it.
+Build the `llama-mtmd-cli` binary.

-After building, run: `./llama-llava-cli` to see the usage. For example:
+After building, run: `./llama-mtmd-cli` to see the usage. For example:

 ```sh
-./llama-llava-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf --image img_path/image.jpg -p "<|system|>\n system prompt <image><|user|>\n prompt <|assistant|>\n"
+./llama-mtmd-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf
 ```

 **note**: A lower temperature like 0.1 is recommended for better quality. add `--temp 0.1` to the command to do so.

examples/llava/README-granitevision.md renamed to docs/multimodal/granitevision.md (+2, -6)

@@ -176,15 +176,11 @@ Note that currently you cannot quantize the visual encoder because granite visio


 ### 5. Running the Model in Llama cpp
-Build llama cpp normally; you should have a target binary named `llama-llava-cli`, which you can pass two binaries to. As an example, we pass the the llama.cpp banner.
+Build llama cpp normally; you should have a target binary named `llama-mtmd-cli`, which you can pass two binaries to. As an example, we pass the the llama.cpp banner.

 ```bash
-$ ./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
+$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \
     --mmproj $VISUAL_GGUF_PATH \
-    --image ./media/llama0-banner.png \
     -c 16384 \
-    -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n\<image>\nWhat does the text in this image say?\n<|assistant|>\n" \
     --temp 0
 ```
-
-Sample output: `The text in the image reads "LLAMA C++ Can it run DOOM Llama?"`

docs/multimodal/llava.md (new file, +143 lines)

# LLaVA

Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants,
as well as llava-1.6 [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants.

The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
models are available.
For llava-1.6 a variety of prepared gguf models are available as well [7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf)

After API is confirmed, more models will be supported / uploaded.

## Usage
Build the `llama-mtmd-cli` binary.

After building, run: `./llama-mtmd-cli` to see the usage. For example:

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna
```

**note**: A lower temperature like 0.1 is recommended for better quality. add `--temp 0.1` to the command to do so.
**note**: For GPU offloading ensure to use the `-ngl` flag just like usual

## LLaVA 1.5

1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:

```sh
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b

git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```

2. Install the required Python packages:

```sh
pip install -r examples/llava/requirements.txt
```

3. Use `llava_surgery.py` to split the LLaVA model to LLaMA and multimodel projector constituents:

```sh
python ./examples/llava/llava_surgery.py -m ../llava-v1.5-7b
```

4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:

```sh
python ./examples/llava/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
```

5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:

```sh
python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
```

Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.

## LLaVA 1.6 gguf conversion
1) First clone a LLaVA 1.6 model:
```console
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
```

2) Install the required Python packages:

```sh
pip install -r examples/llava/requirements.txt
```

3) Use `llava_surgery_v2.py` which also supports llava-1.5 variants pytorch as well as safetensor models:
```console
python examples/llava/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
```
- you will find a llava.projector and a llava.clip file in your model directory

4) Copy the llava.clip file into a subdirectory (like vit), rename it to pytorch_model.bin and add a fitting vit configuration to the directory:
```console
mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
```

5) Create the visual gguf model:
```console
python ./examples/llava/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
```
- This is similar to llava-1.5, the difference is that we tell the encoder that we are working with the pure vision model part of CLIP

6) Then convert the model to gguf format:
```console
python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
```

7) And finally we can run the llava cli using the 1.6 model version:
```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf
```

**note** llava-1.6 needs more context than llava-1.5, at least 3000 is needed (just run it at -c 4096)

**note** llava-1.6 greatly benefits from batched prompt processing (defaults work)

**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way handle the LLM model conversion is to load the model in transformers, and export only the LLM from the llava next model.

```python
import os
import transformers

model_path = ...
llm_export_path = ...

tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)

tokenizer.save_pretrained(llm_export_path)
model.language_model.save_pretrained(llm_export_path)
```

Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.

## Chat template

For llava-1.5 and llava-1.6, you need to use `vicuna` chat template. Simply add `--chat-template vicuna` to activate this template.


## How to know if you are running in llava-1.5 or llava-1.6 mode

When running llava-cli you will see a visual information right before the prompt is being processed:

**Llava-1.5:**
`encode_image_with_clip: image embedding created: 576 tokens`

**Llava-1.6 (anything above 576):**
`encode_image_with_clip: image embedding created: 2880 tokens`


Alternatively just pay notice to how many "tokens" have been used for your prompt, it will also show 1000+ tokens for llava-1.6
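
For the fallback path described at the end of the new file (export the LLM with transformers, then convert), a hedged sketch of the final conversion step; the export directory name and the output flags are assumptions patterned on the other conversion commands in this commit:

```sh
# assumption: llm_export_path in the Python snippet above was set to ../llava-v1.6-vicuna-7b-llm
python ./convert_hf_to_gguf.py ../llava-v1.6-vicuna-7b-llm --outfile ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --outtype f16
```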

examples/llava/README-minicpmo2.6.md renamed to docs/multimodal/minicpmo2.6.md (+4, -4)

@@ -40,9 +40,9 @@ python ./convert_hf_to_gguf.py ../MiniCPM-o-2_6/model

 Inference on Linux or Mac
 ```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-o-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-o-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"

-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf
 ```

examples/llava/README-minicpmv2.5.md renamed to docs/multimodal/minicpmv2.5.md (+4, -4)

@@ -39,9 +39,9 @@ python ./convert_hf_to_gguf.py ../MiniCPM-Llama3-V-2_5/model

 Inference on Linux or Mac
 ```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-Llama3-V-2_5/model/model-8B-F16.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-Llama3-V-2_5/model/model-8B-F16.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"

-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-Llama3-V-2_5/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-Llama3-V-2_5/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf
 ```

examples/llava/README-minicpmv2.6.md renamed to docs/multimodal/minicpmv2.6.md (+4, -4)

@@ -39,9 +39,9 @@ python ./convert_hf_to_gguf.py ../MiniCPM-V-2_6/model

 Inference on Linux or Mac
 ```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-V-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-V-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"

-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf
 ```
