
[Model] Qwen 2.5 VL compat #1490

Closed
@HarrisDePerceptron

Description


Describe the bug

Qwen2.5-VL-32B-Instruct raises an error when trying to quantize:

Traceback (most recent call last):
  File "/home/aipc/workspace/ai/qwen-vl/gptq_quant.py", line 90, in <module>
    main()
  File "/home/aipc/workspace/ai/qwen-vl/gptq_quant.py", line 42, in main
    model = GPTQModel.load(pretrained_model_id, quantize_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 247, in load
    return cls.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 275, in from_pretrained
    model_type = check_and_get_model_type(model_id_or_path, trust_remote_code)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 184, in check_and_get_model_type
    raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: qwen2_5_vl isn't supported yet.
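
The check that raises this error reads the model_type field from the checkpoint's config.json and compares it against the architectures GPTQModel knows how to load; gptqmodel 2.1.0 has no entry for qwen2_5_vl. A minimal way to confirm the architecture string involved, using only transformers (a sketch; the model id is the one used in the reproduction script below):

from transformers import AutoConfig

# Read the architecture string that GPTQModel's check_and_get_model_type()
# ends up rejecting; for this checkpoint it is "qwen2_5_vl".
cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")
print(cfg.model_type)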

GPU Info

Output of nvidia-smi:

Sat Mar 29 23:09:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0  On |                  N/A |
|  0%   43C    P8             23W /  350W |     413MiB /  24576MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2438      G   /usr/lib/xorg/Xorg                            100MiB |
|    0   N/A  N/A      2651      G   /usr/bin/gnome-shell                           75MiB |
|    0   N/A  N/A      3210      G   ...irefox/5947/usr/lib/firefox/firefox        198MiB |
+-----------------------------------------------------------------------------------------+

Software Info

Operating System/Version + Python Version

Output of pip show gptqmodel torch transformers accelerate triton:

Name: gptqmodel
Version: 2.1.0
Summary: Production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: accelerate, datasets, device-smi, hf_transfer, huggingface_hub, logbar, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by: 
---
Name: torch
Version: 2.6.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, autoawq, compressed-tensors, gptqmodel, outlines, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.51.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: autoawq, compressed-tensors, gptqmodel, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.5.2
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: autoawq, gptqmodel
---
Name: triton
Version: 3.2.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: 
Required-by: autoawq, torch

If you are reporting an inference bug of a post-quantized model, please post the content of config.json and quantize_config.json.

To Reproduce

import os

from gptqmodel import GPTQModel, QuantizeConfig, get_best_device
from transformers import AutoTokenizer

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

pretrained_model_id = "Qwen/Qwen2.5-VL-32B-Instruct" # "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quantized_model_id = "Qwen2.5-VL-32B-Instruct-GPTQ"


def main():
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id, use_fast=True)
    calibration_dataset = [
        tokenizer(
            "gptqmodel is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
        )
    ]

    quantize_config = QuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
    )

    # load the un-quantized model; by default, the model is loaded into CPU memory
    model = GPTQModel.load(pretrained_model_id, quantize_config)

    # quantize the model; calibration_dataset should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
    model.quantize(calibration_dataset)

    # save quantized model
    model.save(quantized_model_id)

    # push quantized model to Hugging Face Hub.
    # to use use_auth_token=True, log in first via huggingface-cli login.
    # or pass an explicit token with: use_auth_token="hf_xxxxxxx"
    # (uncomment the following three lines to enable this feature)
    # repo_id = f"YourUserName/{quantized_model_id}"
    # commit_message = f"GPTQModel model for {pretrained_model_id}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

    # alternatively you can save and push at the same time
    # (uncomment the following three lines to enable this feature)
    # repo_id = f"YourUserName/{quantized_model_dir}"
    # commit_message = f"GPTQModel model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, save_dir=quantized_model_dir, commit_message=commit_message, use_auth_token=True)

    # save quantized model using safetensors
    model.save(quantized_model_id)

    # load quantized model to the first GPU
    device = get_best_device()
    model = GPTQModel.load(quantized_model_id, device=device)

    # load quantized model to CPU with the IPEX linear kernel.
    # model = GPTQModel.from_quantized(quantized_model_id, device="cpu")

    # download quantized model from Hugging Face Hub and load to the first GPU
    # model = GPTQModel.from_quantized(repo_id, device="cuda:0",)

    # inference with model.generate
    print(tokenizer.decode(model.generate(**tokenizer("gptqmodel is", return_tensors="pt").to(model.device))[0]))


if __name__ == "__main__":
    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        level=logging.INFO,
        datefmt="%Y-%m-%d %H:%M:%S",
    )

    main()
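
For triage, the failure can be reproduced without running the full script by calling the same helper the traceback points at (a sketch based purely on the traceback above; check_and_get_model_type is an internal function, not documented public API):

from gptqmodel.models.auto import check_and_get_model_type

# Raises: TypeError: qwen2_5_vl isn't supported yet.
check_and_get_model_type("Qwen/Qwen2.5-VL-32B-Instruct", False)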

Expected behavior
The 32B VL model quantizes successfully.

Model/Datasets

Make sure your model/dataset is downloadable (on HF for example) so we can reproduce your issue.
The model is downloadable (Qwen/Qwen2.5-VL-32B-Instruct on the Hugging Face Hub).

Screenshots

If applicable, add screenshots to help explain your problem.
The error output is included above.
Additional context

Qwen2.5-VL-32B-Instruct is new, but the README mentions Qwen-VL support, so I expected it to work.
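
As a sanity check that the rejection is specific to the qwen2_5_vl model type rather than to VL models in general, the same flow can be pointed at the previous-generation Qwen2-VL checkpoint (a hedged sketch; it assumes the installed release maps the qwen2_vl model type, which the README suggests but which I have not re-verified on 2.1.0):

from gptqmodel import GPTQModel, QuantizeConfig

quantize_config = QuantizeConfig(bits=4, group_size=128)
# Qwen2-VL reports model_type "qwen2_vl"; if that type is mapped,
# this load should pass the gate that rejects qwen2_5_vl.
model = GPTQModel.load("Qwen/Qwen2-VL-2B-Instruct", quantize_config)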
