[BUG] IPEX model loading time is exponential vs expected linear #665


Closed
Qubitium opened this issue Nov 25, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Qubitium
Collaborator

Qubitium commented Nov 25, 2024

PR #660 contains new IPEX benchmark code. However, I see a regression in IPEX model loading when testing with a larger model such as a quantized Qwen 2.5 Coder 32B.

Sample code below. The model never finishes loading and gets stuck during IPEX model weight conversion. The 1B model loads fast, but the 32B load slows down exponentially, which is not right. Please check: @jiqing-feng

python ipex.py --model ModelCloud/Qwen2.5-Coder-32B-Instruct-gptqmodel-4bit-vortex-v1 --cores 8 --batch 1 --backend ipex

Something is causing the IPEX weight conversion to be non-linear.

^CTraceback (most recent call last):
  File "/root/GPTQModel-LRL/examples/benchmark/ipex.py", line 34, in <module>
    model = GPTQModel.load(ars.model, backend=BACKEND.IPEX)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 147, in load
    return cls.from_quantized(
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 205, in from_quantized
    return quant_func(
           ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/loader.py", line 516, in from_quantized
    model = gptqmodel_post_init(model, use_act_order=quantize_config.desc_act, quantize_config=quantize_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/utils/model.py", line 455, in gptqmodel_post_init
    submodule.post_init()
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/nn_modules/qlinear/qlinear_ipex.py", line 124, in post_init
    self.ipex_linear = WeightOnlyQuantizedLinear.from_weight(self.qweight, self.scales, self.qzeros, \
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/modules/weight_only_quantization.py", line 389, in from_weight
    return cls.from_int4_weight(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/modules/weight_only_quantization.py", line 319, in from_int4_weight
    qweight, scales, zero_points = _convert_optimum_format_to_desired(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/utils/_model_convert.py", line 217, in _convert_optimum_format_to_desired
    zp[:, index] = data.type(zp_dtype)
    ~~^^^^^^^^^^
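From the traceback, the time sink is the per-column zero-point assignment in `_convert_optimum_format_to_desired` (`zp[:, index] = ...`). A minimal NumPy sketch (hypothetical shapes and helper names, not IPEX's actual code) of why unpacking packed 4-bit zero-points one column at a time scales much worse than a single vectorized shift:

```python
import numpy as np

def unpack_loop(qzeros):
    # qzeros: (rows, cols) int32; each int32 packs eight 4-bit zero-points.
    rows, cols = qzeros.shape
    zp = np.empty((rows, cols * 8), dtype=np.uint8)
    for i in range(cols * 8):
        # One output column per iteration: many small strided writes,
        # analogous to the zp[:, index] = ... loop in the traceback.
        zp[:, i] = (qzeros[:, i // 8] >> (4 * (i % 8))) & 0xF
    return zp

def unpack_vectorized(qzeros):
    # Apply all eight 4-bit shifts at once via broadcasting, then flatten.
    shifts = 4 * np.arange(8, dtype=np.int32)      # (8,)
    zp = (qzeros[:, :, None] >> shifts) & 0xF      # (rows, cols, 8)
    return zp.reshape(qzeros.shape[0], -1).astype(np.uint8)

rng = np.random.default_rng(0)
qz = rng.integers(0, 2**31, size=(64, 128), dtype=np.int32)
assert np.array_equal(unpack_loop(qz), unpack_vectorized(qz))
```

Both produce identical output; the loop version's cost grows with the number of columns times per-assignment overhead, which becomes painful at 32B-scale layer counts.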
@Qubitium Qubitium added the bug Something isn't working label Nov 25, 2024
@Qubitium
Collaborator Author

Qubitium commented Nov 27, 2024

Update: on an Intel Xeon 4th gen server, the 32B model load time is ~8s. On a local Intel Xeon 5th gen server, the same quantized model's load time is ~70s (fastest run).

@jiqing-feng
Collaborator

Update: on an Intel Xeon 4th gen server, the 32B model load time is ~8s. On a local Intel Xeon 5th gen server, the same quantized model's load time is ~70s (fastest run).

Do you have DDR in your instance?

@Qubitium
Collaborator Author

Update: on an Intel Xeon 4th gen server, the 32B model load time is ~8s. On a local Intel Xeon 5th gen server, the same quantized model's load time is ~70s (fastest run).

Do you have DDR in your instance?

What is DDR?

@jiqing-feng
Collaborator

sudo dmidecode -t memory will show your DDR configuration. Load efficiency is usually tied to memory, not the CPU.

DDR is short for DDR SDRAM (Double Data Rate Synchronous Dynamic Random-Access Memory), i.e. your system's main memory.
