[BUG] IPEX model loading time is exponential vs expected linear #665


Closed
Qubitium opened this issue Nov 25, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Qubitium
Collaborator

Qubitium commented Nov 25, 2024

PR #660 contains new IPEX benchmark code. However, I see a regression in IPEX model loading when testing with a larger model such as a quantized Qwen 2.5 Coder 32B.

Sample code below. The model never finishes loading and gets stuck during IPEX model weight conversion. The 1B model loads fast, but the 32B load slows down exponentially, which is not right. Please check: @jiqing-feng

python ipex.py --model ModelCloud/Qwen2.5-Coder-32B-Instruct-gptqmodel-4bit-vortex-v1 --cores 8 --batch 1 --backend ipex

Something is causing the IPEX weight conversion to be non-linear.

^CTraceback (most recent call last):
  File "/root/GPTQModel-LRL/examples/benchmark/ipex.py", line 34, in <module>
    model = GPTQModel.load(ars.model, backend=BACKEND.IPEX)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 147, in load
    return cls.from_quantized(
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/auto.py", line 205, in from_quantized
    return quant_func(
           ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/models/loader.py", line 516, in from_quantized
    model = gptqmodel_post_init(model, use_act_order=quantize_config.desc_act, quantize_config=quantize_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/utils/model.py", line 455, in gptqmodel_post_init
    submodule.post_init()
  File "/root/miniconda3/lib/python3.11/site-packages/gptqmodel/nn_modules/qlinear/qlinear_ipex.py", line 124, in post_init
    self.ipex_linear = WeightOnlyQuantizedLinear.from_weight(self.qweight, self.scales, self.qzeros, \
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/modules/weight_only_quantization.py", line 389, in from_weight
    return cls.from_int4_weight(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/modules/weight_only_quantization.py", line 319, in from_int4_weight
    qweight, scales, zero_points = _convert_optimum_format_to_desired(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_pytorch/nn/utils/_model_convert.py", line 217, in _convert_optimum_format_to_desired
    zp[:, index] = data.type(zp_dtype)
    ~~^^^^^^^^^^
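From the traceback, the time sink is the per-column zero-point assignment in `_convert_optimum_format_to_desired` (`zp[:, index] = ...`). A minimal NumPy sketch (hypothetical shapes and helper names, not IPEX's actual code) of why unpacking packed 4-bit zero-points one column at a time scales much worse than a single vectorized shift:

```python
import numpy as np

def unpack_loop(qzeros):
    # qzeros: (rows, cols) int32; each int32 packs eight 4-bit zero-points.
    rows, cols = qzeros.shape
    zp = np.empty((rows, cols * 8), dtype=np.uint8)
    for i in range(cols * 8):
        # One output column per iteration: many small strided writes,
        # analogous to the zp[:, index] = ... loop in the traceback.
        zp[:, i] = (qzeros[:, i // 8] >> (4 * (i % 8))) & 0xF
    return zp

def unpack_vectorized(qzeros):
    # Apply all eight 4-bit shifts at once via broadcasting, then flatten.
    shifts = 4 * np.arange(8, dtype=np.int32)      # (8,)
    zp = (qzeros[:, :, None] >> shifts) & 0xF      # (rows, cols, 8)
    return zp.reshape(qzeros.shape[0], -1).astype(np.uint8)

rng = np.random.default_rng(0)
qz = rng.integers(0, 2**31, size=(64, 128), dtype=np.int32)
assert np.array_equal(unpack_loop(qz), unpack_vectorized(qz))
```

Both produce identical output; the loop version's cost grows with the number of columns times per-assignment overhead, which becomes painful at 32B-scale layer counts.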
@Qubitium Qubitium added the bug Something isn't working label Nov 25, 2024
@Qubitium
Collaborator Author

Qubitium commented Nov 27, 2024

Update: on an Intel Xeon 4th gen server, the 32B model load time is ~8s. On a local Intel Xeon 5th gen server, the same quantized model's load time is ~70s (fastest run).

@jiqing-feng
Collaborator

Update: on an Intel Xeon 4th gen server, the 32B model load time is ~8s. On a local Intel Xeon 5th gen server, the same quantized model's load time is ~70s (fastest run).

Do you have DDR in your instance?

@Qubitium
Collaborator Author

Update: on an Intel Xeon 4th gen server, the 32B model load time is ~8s. On a local Intel Xeon 5th gen server, the same quantized model's load time is ~70s (fastest run).

Do you have DDR in your instance?

What is DDR?

@jiqing-feng
Collaborator

sudo dmidecode -t memory will show your DDR configuration. Load efficiency is usually tied to memory, not the CPU.

DDR is short for DDR SDRAM (Double Data Rate Synchronous Dynamic Random-Access Memory), i.e. your system's main memory.
