
[Model] Qwen 2.5 VL compat #1490

Closed
@HarrisDePerceptron

Description


Describe the bug

Qwen2.5-VL-32B-Instruct raises an error when trying to quantize:

Traceback (most recent call last):
  File "/home/aipc/workspace/ai/qwen-vl/gptq_quant.py", line 90, in <module>
    main()
  File "/home/aipc/workspace/ai/qwen-vl/gptq_quant.py", line 42, in main
    model = GPTQModel.load(pretrained_model_id, quantize_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 247, in load
    return cls.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 275, in from_pretrained
    model_type = check_and_get_model_type(model_id_or_path, trust_remote_code)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 184, in check_and_get_model_type
    raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: qwen2_5_vl isn't supported yet.
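
The check that raises this error reads the model_type field from the checkpoint's config.json and compares it against the architectures GPTQModel knows how to load; gptqmodel 2.1.0 has no entry for qwen2_5_vl. A minimal way to confirm the architecture string involved, using only transformers (a sketch; the model id is the one used in the reproduction script below):

from transformers import AutoConfig

# Read the architecture string that GPTQModel's check_and_get_model_type()
# ends up rejecting; for this checkpoint it is "qwen2_5_vl".
cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")
print(cfg.model_type)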

GPU Info

Output of nvidia-smi:

Sat Mar 29 23:09:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0  On |                  N/A |
|  0%   43C    P8             23W /  350W |     413MiB /  24576MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2438      G   /usr/lib/xorg/Xorg                            100MiB |
|    0   N/A  N/A      2651      G   /usr/bin/gnome-shell                           75MiB |
|    0   N/A  N/A      3210      G   ...irefox/5947/usr/lib/firefox/firefox        198MiB |
+-----------------------------------------------------------------------------------------+

Software Info

Operating System/Version + Python Version

Output of pip show gptqmodel torch transformers accelerate triton:

Name: gptqmodel
Version: 2.1.0
Summary: Production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: accelerate, datasets, device-smi, hf_transfer, huggingface_hub, logbar, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by: 
---
Name: torch
Version: 2.6.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, autoawq, compressed-tensors, gptqmodel, outlines, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.51.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: autoawq, compressed-tensors, gptqmodel, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.5.2
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: autoawq, gptqmodel
---
Name: triton
Version: 3.2.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: 
Required-by: autoawq, torch

If you are reporting an inference bug of a post-quantized model, please post the content of config.json and quantize_config.json.

To Reproduce

import os

from gptqmodel import GPTQModel, QuantizeConfig, get_best_device
from transformers import AutoTokenizer

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

pretrained_model_id = "Qwen/Qwen2.5-VL-32B-Instruct" # "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quantized_model_id = "Qwen2.5-VL-32B-Instruct-GPTQ"


def main():
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id, use_fast=True)
    calibration_dataset = [
        tokenizer(
            "gptqmodel is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
        )
    ]

    quantize_config = QuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
    )

    # load the un-quantized model; by default, the model is loaded into CPU memory
    model = GPTQModel.load(pretrained_model_id, quantize_config)

    # quantize the model; calibration_dataset should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
    model.quantize(calibration_dataset)

    # save quantized model
    model.save(quantized_model_id)

    # push quantized model to Hugging Face Hub.
    # to use use_auth_token=True, log in first via huggingface-cli login.
    # or pass an explicit token with: use_auth_token="hf_xxxxxxx"
    # (uncomment the following three lines to enable this feature)
    # repo_id = f"YourUserName/{quantized_model_id}"
    # commit_message = f"GPTQModel model for {pretrained_model_id}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

    # alternatively you can save and push at the same time
    # (uncomment the following three lines to enable this feature)
    # repo_id = f"YourUserName/{quantized_model_dir}"
    # commit_message = f"GPTQModel model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, save_dir=quantized_model_dir, commit_message=commit_message, use_auth_token=True)

    # save quantized model using safetensors
    model.save(quantized_model_id)

    # load quantized model to the first GPU
    device = get_best_device()
    model = GPTQModel.load(quantized_model_id, device=device)

    # load quantized model to CPU with the IPEX linear kernel.
    # model = GPTQModel.from_quantized(quantized_model_id, device="cpu")

    # download quantized model from Hugging Face Hub and load to the first GPU
    # model = GPTQModel.from_quantized(repo_id, device="cuda:0",)

    # inference with model.generate
    print(tokenizer.decode(model.generate(**tokenizer("gptqmodel is", return_tensors="pt").to(model.device))[0]))


if __name__ == "__main__":
    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        level=logging.INFO,
        datefmt="%Y-%m-%d %H:%M:%S",
    )

    main()
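
For triage, the failure can be reproduced without running the full script by calling the same helper the traceback points at (a sketch based purely on the traceback above; check_and_get_model_type is an internal function, not documented public API):

from gptqmodel.models.auto import check_and_get_model_type

# Raises: TypeError: qwen2_5_vl isn't supported yet.
check_and_get_model_type("Qwen/Qwen2.5-VL-32B-Instruct", False)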

Expected behavior
The 32B VL model quantizes successfully.

Model/Datasets

Make sure your model/dataset is downloadable (on HF for example) so we can reproduce your issue.
The model is downloadable (Qwen/Qwen2.5-VL-32B-Instruct on the Hugging Face Hub).

Screenshots

If applicable, add screenshots to help explain your problem.
The error output is included above.
Additional context

Qwen2.5-VL-32B-Instruct is new, but the README mentions Qwen-VL support, so I expected it to work.
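
As a sanity check that the rejection is specific to the qwen2_5_vl model type rather than to VL models in general, the same flow can be pointed at the previous-generation Qwen2-VL checkpoint (a hedged sketch; it assumes the installed release maps the qwen2_vl model type, which the README suggests but which I have not re-verified on 2.1.0):

from gptqmodel import GPTQModel, QuantizeConfig

quantize_config = QuantizeConfig(bits=4, group_size=128)
# Qwen2-VL reports model_type "qwen2_vl"; if that type is mapped,
# this load should pass the gate that rejects qwen2_5_vl.
model = GPTQModel.load("Qwen/Qwen2-VL-2B-Instruct", quantize_config)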
