Describe the bug
Getting an error when trying to quantize Qwen2.5-VL-32B-Instruct:
Traceback (most recent call last):
  File "/home/aipc/workspace/ai/qwen-vl/gptq_quant.py", line 90, in <module>
    main()
  File "/home/aipc/workspace/ai/qwen-vl/gptq_quant.py", line 42, in main
    model = GPTQModel.load(pretrained_model_id, quantize_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 247, in load
    return cls.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 275, in from_pretrained
    model_type = check_and_get_model_type(model_id_or_path, trust_remote_code)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 184, in check_and_get_model_type
    raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: qwen2_5_vl isn't supported yet.
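The failure comes from the model_type check in gptqmodel/models/auto.py. A minimal way to see the value it trips over, using only transformers (the checkpoint name is the one from this report):

# Sketch: read the checkpoint config the same way check_and_get_model_type does,
# to confirm which model_type string GPTQModel 2.1.0 rejects.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")
print(config.model_type)  # "qwen2_5_vl" with transformers 4.51.0.dev0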
GPU Info
Show output of:
nvidia-smi
Sat Mar 29 23:09:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 On | N/A |
| 0% 43C P8 23W / 350W | 413MiB / 24576MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2438 G /usr/lib/xorg/Xorg 100MiB |
| 0 N/A N/A 2651 G /usr/bin/gnome-shell 75MiB |
| 0 N/A N/A 3210 G ...irefox/5947/usr/lib/firefox/firefox 198MiB |
+-----------------------------------------------------------------------------------------+
Software Info
Operating System/Version + Python Version
Show output of:
pip show gptqmodel torch transformers accelerate triton
Name: gptqmodel
Version: 2.1.0
Summary: Production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: accelerate, datasets, device-smi, hf_transfer, huggingface_hub, logbar, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by:
---
Name: torch
Version: 2.6.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, autoawq, compressed-tensors, gptqmodel, outlines, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.51.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: autoawq, compressed-tensors, gptqmodel, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.5.2
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: autoawq, gptqmodel
---
Name: triton
Version: 3.2.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License:
Location: /home/aipc/anaconda3/envs/autoawq/lib/python3.11/site-packages
Requires:
Required-by: autoawq, torch
If you are reporting an inference bug of a post-quantized model, please post the content of config.json and quantize_config.json.
To Reproduce
import os

from gptqmodel import GPTQModel, QuantizeConfig, get_best_device
from transformers import AutoTokenizer

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

pretrained_model_id = "Qwen/Qwen2.5-VL-32B-Instruct"  # "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quantized_model_id = "Qwen2.5-VL-32B-Instruct-GPTQ"


def main():
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id, use_fast=True)
    calibration_dataset = [
        tokenizer(
            "gptqmodel is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
        )
    ]

    quantize_config = QuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
    )

    # load un-quantized model; by default, the model will always be loaded into CPU memory
    model = GPTQModel.load(pretrained_model_id, quantize_config)

    # quantize model; the calibration_dataset should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
    model.quantize(calibration_dataset)

    # save quantized model
    model.save(quantized_model_id)

    # push quantized model to Hugging Face Hub.
    # to use use_auth_token=True, login first via huggingface-cli login,
    # or pass an explicit token with: use_auth_token="hf_xxxxxxx"
    # (uncomment the following three lines to enable this feature)
    # repo_id = f"YourUserName/{quantized_model_id}"
    # commit_message = f"GPTQModel model for {pretrained_model_id}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

    # alternatively you can save and push at the same time
    # (uncomment the following three lines to enable this feature)
    # repo_id = f"YourUserName/{quantized_model_id}"
    # commit_message = f"GPTQModel model for {pretrained_model_id}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, save_dir=quantized_model_id, commit_message=commit_message, use_auth_token=True)

    # save quantized model using safetensors
    model.save(quantized_model_id)

    # load quantized model to the first GPU
    device = get_best_device()
    model = GPTQModel.load(quantized_model_id, device=device)

    # load quantized model to CPU with IPEX kernel linear.
    # model = GPTQModel.from_quantized(quantized_model_id, device="cpu")

    # download quantized model from Hugging Face Hub and load to the first GPU
    # model = GPTQModel.from_quantized(repo_id, device="cuda:0")

    # inference with model.generate
    print(tokenizer.decode(model.generate(**tokenizer("gptqmodel is", return_tensors="pt").to(model.device))[0]))


if __name__ == "__main__":
    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        level=logging.INFO,
        datefmt="%Y-%m-%d %H:%M:%S",
    )

    main()
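For reference, the script itself should run end to end when pointed at a model type that GPTQModel 2.1.0 already recognizes, e.g. the TinyLlama checkpoint left in the comment above. A minimal swap (the output directory name below is just a placeholder, not from the original script):

# Sanity-check assumption: only the checkpoint and output name are swapped, the rest of the script is unchanged.
pretrained_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quantized_model_id = "TinyLlama-1.1B-Chat-v1.0-GPTQ"  # placeholder output dir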
Expected behavior
The 32B VL model quantizes without error.
Model/Datasets
Make sure your model/dataset is downloadable (on HF for example) so we can reproduce your issue.
It is downloadable.
Screenshots
If applicable, add screenshots to help explain your problem.
Included the error output above.
Additional context
Qwen2.5-VL-32B-Instruct is new, but the README mentions Qwen-VL as supported, so I expected it to work.
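To illustrate the gap, the helper that raised the error can be probed directly; the import path is the one shown in the traceback, and the assumption that the older Qwen2-VL checkpoint passes the check is based only on the README's Qwen-VL claim:

# Probes GPTQModel's own model_type check (the function from the traceback above).
# Assumption: qwen2_vl is among the supported types per the README; qwen2_5_vl is not in 2.1.0.
from gptqmodel.models.auto import check_and_get_model_type

for model_id in ("Qwen/Qwen2-VL-2B-Instruct", "Qwen/Qwen2.5-VL-32B-Instruct"):
    try:
        print(model_id, "->", check_and_get_model_type(model_id, True))
    except TypeError as err:
        print(model_id, "-> unsupported:", err)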