Description
LocalAI version:
Docker image: localai/localai:v2.9.0-cublas-cuda12-core
with the autogptq external backend added
Environment, CPU architecture, OS, and Version:
# nvidia-smi
Fri Mar  8 05:21:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                          |                      |               MIG M. |
|==========================================+======================+======================|
|   0  NVIDIA A10                      On  | 00000000:F0:00.0 Off |                    0 |
|  0%   29C    P8               15W / 150W |      2MiB / 23028MiB |      0%      Default |
|                                          |                      |                  N/A |
+------------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10                      On  | 00000000:F1:00.0 Off |                    0 |
|  0%   29C    P8               15W / 150W |      2MiB / 23028MiB |      0%      Default |
|                                          |                      |                  N/A |
+------------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|========================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
Describe the bug
Trying to start the Qwen-VL-Chat-Int4 model, but it fails because autogptq cannot find config.json
in the model folder.
To Reproduce
- Build a Docker image with the following Dockerfile:
FROM localai/localai:v2.9.0-cublas-cuda12-core

# Base build tooling
RUN apt-get update -y && apt-get install -y curl gcc libxml2 libxml2-dev
RUN apt install -y wget git && \
    apt clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# Install Miniconda, which the python backend build uses
ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"
RUN wget \
    https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && mkdir .conda \
    && bash Miniconda3-latest-Linux-x86_64.sh -b \
    && rm -f Miniconda3-latest-Linux-x86_64.sh
RUN conda init bash

# Build the autogptq backend and register it as an external gRPC backend
RUN PATH=$PATH:/opt/conda/bin make -C backend/python/autogptq
ENV EXTERNAL_GRPC_BACKENDS="autogptq:/build/backend/python/autogptq/run.sh"
ENV BUILD_TYPE="cublas"
- Download the model files to a local drive:
huggingface-cli download --resume-download Qwen/Qwen-VL-Chat-Int4 --local-dir qwen-vl-chat-int4 --local-dir-use-symlinks False
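As a quick sanity check on the download (a sketch; the folder name matches the --local-dir argument above), the files any transformers/auto-gptq loader will look for can be verified with:
# Check that the download produced a local model folder with config.json in it;
# a from_pretrained-style loader will refuse the folder without that file.
from pathlib import Path

model_dir = Path("qwen-vl-chat-int4")  # same folder as --local-dir above
print("directory exists:", model_dir.is_dir())
print("config.json present:", (model_dir / "config.json").is_file())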
- Create the qwen-vl.yaml file:
# Model name.
# The model name is used to identify the model in the API calls.
- name: gpt-4-vision-preview
  # Default model parameters.
  # These options can also be specified in the API calls
  parameters:
    model: qwen-vl-chat-int4
    temperature: 0.7
    top_k: 85
    top_p: 0.7
  # Default context size
  context_size: 4096
  # Default number of threads
  threads: 16
  backend: autogptq
  # define chat roles
  roles:
    user: "user:"
    assistant: "assistant:"
    system: "system:"
  template:
    chat: &template |
      Instruct: {{.Input}}
      Output:
    # Modify the prompt template here ^^^ as per your requirements
    completion: *template
  # Enable F16 if backend supports it
  f16: true
  embeddings: false
  # Enable debugging
  debug: true
  # GPU Layers (only used when built with cublas)
  gpu_layers: -1
  # Diffusers/transformers
  cuda: true
- Run the model:
docker run -p 8080:8080 -v $PWD/models:/opt/models -e MODELS_PATH=/opt/models localai:v2.9.0-autogptq --config-file /opt/models/qwen-vl.yaml
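For reference, with the volume mount and MODELS_PATH above, the model files should end up at /opt/models/qwen-vl-chat-int4 inside the container. Assuming the autogptq backend resolves the model: value relative to the models path (an assumption about LocalAI's behaviour, not verified against its source), that is the directory that would have to contain config.json; a hypothetical check from inside the container:
# Hypothetical check run inside the container; the paths follow the docker run
# command above (-v $PWD/models:/opt/models, MODELS_PATH=/opt/models) and the
# model: qwen-vl-chat-int4 entry in qwen-vl.yaml.
import os

resolved = os.path.join("/opt/models", "qwen-vl-chat-int4")
print(resolved, "is a directory:", os.path.isdir(resolved))
print("config.json present:", os.path.isfile(os.path.join(resolved, "config.json")))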
- Call the API:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4-vision-preview",
"messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
Expected behavior
The API responds with an answer describing the image.
Logs
{
"error": {
"code": 500,
"message": "could not load model (no success): Unexpected err=OSError(\"We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like qwen-vl-chat-int4 is not the path to a directory containing a file named config.json.\\nCheckout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.\"), type(err)=<class 'OSError'>",
"type": ""
}
}
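For context, this OSError is the generic message transformers raises when the string it receives is neither an existing local directory nor a reachable Hub repo id. A minimal sketch that reproduces the same error outside LocalAI, assuming the autogptq backend ultimately hands the model: value to a from_pretrained-style loader (AutoConfig is used here purely for illustration):
from transformers import AutoConfig

# "qwen-vl-chat-int4" is the model: value from qwen-vl.yaml. If that string is not
# an existing directory relative to the backend's working directory, transformers
# treats it as a Hugging Face Hub repo id and tries to fetch config.json remotely,
# which fails with the OSError quoted above when the Hub is not reachable.
try:
    AutoConfig.from_pretrained("qwen-vl-chat-int4", trust_remote_code=True)
except OSError as err:
    print(err)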