# LLaVA

Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants,
as well as [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants.

The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
models are available.
For llava-1.6, a variety of prepared gguf models are available as well, e.g. [7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf).

After the API is confirmed, more models will be supported / uploaded.

## Usage
Build the `llama-mtmd-cli` binary.

After building, run `./llama-mtmd-cli` to see the usage. For example:

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna
```

**note**: A lower temperature like 0.1 is recommended for better quality. Add `--temp 0.1` to the command to do so.
**note**: For GPU offloading, make sure to use the `-ngl` flag just as usual.
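
For example, combining both notes with the command above (the `-ngl` value here is only an example; adjust it to your GPU):

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna \
    --temp 0.1 -ngl 99
```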

## LLaVA 1.5

1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:

```sh
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b

git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```

2. Install the required Python packages:

```sh
pip install -r examples/llava/requirements.txt
```

3. Use `llava_surgery.py` to split the LLaVA model into its LLaMA and multimodal projector constituents:

```sh
python ./examples/llava/llava_surgery.py -m ../llava-v1.5-7b
```
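
If the surgery succeeded, the extracted projector now sits next to the model weights; step 4 below expects this exact path (optional check):

```sh
ls ../llava-v1.5-7b/llava.projector
```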

4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:

```sh
python ./examples/llava/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
```

5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:

```sh
python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
```

Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.

## LLaVA 1.6 gguf conversion
1) First clone a LLaVA 1.6 model:
```console
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
```

2) Install the required Python packages:

```sh
pip install -r examples/llava/requirements.txt
```

3) Use `llava_surgery_v2.py`, which also supports llava-1.5 variants, in both PyTorch and safetensors format:
```console
python examples/llava/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
```
- you will find a `llava.projector` and a `llava.clip` file in your model directory
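
You can optionally confirm that both files were produced before moving on:

```console
ls ../llava-v1.6-vicuna-7b/llava.projector ../llava-v1.6-vicuna-7b/llava.clip
```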

4) Copy the `llava.clip` file into a subdirectory (like `vit`), rename it to `pytorch_model.bin` and add a fitting vit configuration to the directory:
```console
mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
```

5) Create the visual gguf model:
```console
python ./examples/llava/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
```
- This is similar to llava-1.5; the difference is that we tell the encoder that we are working with the pure vision model part of CLIP

6) Then convert the model to gguf format:
```console
python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
```

7) And finally we can run `llama-mtmd-cli` using the 1.6 model version:
```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf
```

**note** llava-1.6 needs more context than llava-1.5; at least 3000 tokens are needed (just run it at `-c 4096`)

**note** llava-1.6 greatly benefits from batched prompt processing (the defaults work)
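
Putting the notes together, a typical llava-1.6 invocation looks like this (`-ngl 99` simply offloads as many layers as your GPU allows; adjust to your hardware):

```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf -c 4096 -ngl 99 --temp 0.1
```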

**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way to handle the LLM conversion is to load the model in transformers and export only the LLM part of the llava-next model.

```python
import transformers

model_path = ...
llm_export_path = ...

# load the full llava-next checkpoint (tokenizer, vision tower and language model)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)

# save only the tokenizer and the language-model part
tokenizer.save_pretrained(llm_export_path)
model.language_model.save_pretrained(llm_export_path)
```

Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.
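
A minimal sketch of that conversion, assuming the snippet above exported the LLM to a hypothetical `../llava-v1.6-llm` directory and writing the output where step `7)` expects it:

```console
python ./convert_hf_to_gguf.py ../llava-v1.6-llm --outfile ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --outtype f16
```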

## Chat template

For llava-1.5 and llava-1.6, you need to use the `vicuna` chat template. Simply add `--chat-template vicuna` to activate this template.
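
For example, appended to the llava-1.6 command from step `7)` above:

```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf --chat-template vicuna
```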


## How to know if you are running in llava-1.5 or llava-1.6 mode

When running `llama-mtmd-cli` you will see visual information printed right before the prompt is processed:

**Llava-1.5:**
`encode_image_with_clip: image embedding created: 576 tokens`

**Llava-1.6 (anything above 576):**
`encode_image_with_clip: image embedding created: 2880 tokens`


Alternatively, just note how many "tokens" have been used for your prompt; llava-1.6 will also show 1000+ tokens.