PR #660 contains new IPEX benchmark code. However, I see a regression in IPEX model loading when testing with a larger model such as quantized Qwen 2.5 Coder 32B.
Sample code below. The model never finishes loading and gets stuck during the IPEX model weight conversion. The 1B model loads fast, but the 32B load slows down exponentially, which is not right. Please check: @jiqing-feng
Something is causing the IPEX weight conversion to be non-linear.
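For reference, a minimal sketch of the load call from examples/benchmark/ipex.py, reconstructed from the traceback below; the checkpoint path is a placeholder, not the exact model id used:

```python
# Minimal sketch of the benchmark load path, reconstructed from the traceback
# below; the checkpoint path is a placeholder for the quantized 32B model.
from gptqmodel import BACKEND, GPTQModel

model_id = "path/to/Qwen2.5-Coder-32B-GPTQ"  # placeholder
model = GPTQModel.load(model_id, backend=BACKEND.IPEX)
```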
```
^CTraceback (most recent call last):
  File "/root/GPTQModel-LRL/examples/benchmark/ipex.py", line 34, in <module>
    model = GPTQModel.load(ars.model, backend=BACKEND.IPEX)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 147, in load
    return cls.from_quantized(
    ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 205, in from_quantized
    return quant_func(
    ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/loader.py", line 516, in from_quantized
    model = gptqmodel_post_init(model, use_act_order=quantize_config.desc_act, quantize_config=quantize_config)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/utils/model.py", line 455, in gptqmodel_post_init
    submodule.post_init()
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/nn_modules/qlinear/qlinear_ipex.py", line 124, in post_init
    self.ipex_linear = WeightOnlyQuantizedLinear.from_weight(self.qweight, self.scales, self.qzeros, \
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/modules/weight_only_quantization.py", line 389, in from_weight
    return cls.from_int4_weight(
    ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/modules/weight_only_quantization.py", line 319, in from_int4_weight
    qweight, scales, zero_points = _convert_optimum_format_to_desired(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/utils/_model_convert.py", line 217, in _convert_optimum_format_to_desired
    zp[:, index] = data.type(zp_dtype)
    ~~^^^^^^^^^^
```
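The hang points at `_convert_optimum_format_to_desired` inside IPEX's `WeightOnlyQuantizedLinear.from_weight`. A rough timing sketch to confirm the per-layer slowdown, assuming that function is bound in the `weight_only_quantization` module namespace as the traceback frames suggest (run this before `GPTQModel.load`):

```python
# Rough timing sketch (assumption: _convert_optimum_format_to_desired is
# resolvable on the weight_only_quantization module, as the traceback
# suggests). Prints the duration of each per-layer conversion; a growing
# per-call time would confirm the non-linear behavior described above.
import time
import intel_extension_for_pytorch.nn.modules.weight_only_quantization as woq

_orig_convert = woq._convert_optimum_format_to_desired

def _timed_convert(*args, **kwargs):
    start = time.perf_counter()
    result = _orig_convert(*args, **kwargs)
    print(f"_convert_optimum_format_to_desired took {time.perf_counter() - start:.3f}s")
    return result

woq._convert_optimum_format_to_desired = _timed_convert
```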
Update: on an Intel Xeon 4th gen server, the 32B model load time is ~8s. On a local Intel Xeon 5th gen server, the same quantized model's load time is ~70s (fastest run).