Description
The server example has been growing in functionality and unfortunately I feel it is not very stable at the moment, and there are some important features that are still missing. Creating this issue to keep track of some of these points and try to draw more attention from the community. Some of the tasks are relatively big and would require significant effort to complete.

- **Support chat templates**
  We need to have a separation between the user input and the special tokens, so that the tokenization is performed correctly. See the following comments / commits for more context:
  - Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160 (comment)
  - c544fae
  - Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160 (comment)

  We already support extracting meta information from the GGUF model files that can provide the chat template for the specific model:
  - gguf-py : export chat templates #4125
  - Support chat template for `/v1/chat/completions`: Server: use llama_chat_apply_template #5593
  - List of supported templates: view on wiki

  Supporting this in `server` would require changes both in the backend and the frontend.
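
  As an illustration only, the backend side could format an OAI-style message list with `llama_chat_apply_template` (using the signature introduced in #5593) roughly as in the sketch below; the buffer sizing, error handling, and example messages are assumptions, not the actual `server` code:

  ```cpp
  // Minimal sketch, assuming `model` was loaded with llama_load_model_from_file().
  // Passing tmpl = nullptr uses the chat template stored in the GGUF metadata.
  #include "llama.h"

  #include <cstring>
  #include <string>
  #include <vector>

  static std::string apply_chat_template(const llama_model * model,
                                         const std::vector<llama_chat_message> & chat) {
      // rough upper bound for the size of the formatted prompt
      size_t alloc_size = 256;
      for (const auto & msg : chat) {
          alloc_size += strlen(msg.role) + 2*strlen(msg.content);
      }

      std::vector<char> buf(alloc_size);
      int32_t res = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                              /*add_ass=*/true, buf.data(), (int32_t) buf.size());
      if (res > (int32_t) buf.size()) {
          // the buffer was too small - retry with the size reported by the API
          buf.resize(res);
          res = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                          /*add_ass=*/true, buf.data(), (int32_t) buf.size());
      }
      return res < 0 ? std::string() : std::string(buf.data(), res);
  }

  // usage (hypothetical messages):
  //   std::vector<llama_chat_message> chat = {
  //       {"system", "You are a helpful assistant."},
  //       {"user",   "Hello!"},
  //   };
  //   const std::string prompt = apply_chat_template(model, chat);
  ```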
- **Likely redundant logic for OpenAI (OAI) compatibility that should be removed**
  - server : OAI API compatibility #4198 (comment)
- **Use multiple mount points for the OAI API**
  https://github.com/ggerganov/llama.cpp/blob/af19d3573481d409b3c4e55494810eb1f65a9aae/examples/server/server.cpp#L2682-L2684
  - Add "/chat/completions" as alias for "/v1/chat/completions" #5722
- **Return meaningful errors on KV cache overflow**
  - update_slots : failed to decode the batch #4185 (comment)
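
  As a sketch only (the helper name, checks, and messages are made up, not existing server code), the request handler could validate a request against the slot's context size up front and return a descriptive error instead of failing later inside `update_slots`:

  ```cpp
  // Hypothetical pre-flight check: reject requests that cannot fit in the slot's
  // context window with a human-readable error, instead of failing during decode.
  #include <cstdint>
  #include <string>

  struct kv_check_result {
      bool        ok;
      std::string error; // message that would be returned to the client
  };

  static kv_check_result check_kv_capacity(int32_t n_ctx_slot, int32_t n_prompt, int32_t n_predict) {
      if (n_prompt >= n_ctx_slot) {
          return { false, "prompt (" + std::to_string(n_prompt) +
                          " tokens) does not fit in the slot context (" +
                          std::to_string(n_ctx_slot) + " tokens)" };
      }
      if (n_predict > 0 && n_prompt + n_predict > n_ctx_slot) {
          return { false, "prompt + n_predict (" + std::to_string(n_prompt + n_predict) +
                          " tokens) exceed the slot context (" +
                          std::to_string(n_ctx_slot) + " tokens) - reduce n_predict or the prompt size" };
      }
      return { true, "" };
  }
  ```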
- **Refactor the code**
  With the recent additions of parallel decoding support for multiple clients and LLaVA, I feel the code base has become very cumbersome and there is a lot of room for refactoring and improving the code. Some effort should be dedicated to cleaning things up and simplifying the code.
  - Server: try to refactor server.cpp #5065
  - Server: Improve work queue stability #5710
- **Batched decoding endpoint?**
  Although we added parallel decoding support via "slots", we are still lacking batched decoding where a single client could pass an array of prompts to be completed, or alternatively generate multiple completions for a single prompt. It would be useful to support this use case (see the sketch below).
  - llama : add batched inference endpoint to server #3478 (comment)
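
  For reference, a rough sketch of the kind of batching such an endpoint could build on, using one sequence id per prompt in a single `llama_batch`; the helper name and error handling are illustrative, not a proposed implementation:

  ```cpp
  // Sketch only: evaluate several already-tokenized prompts in one llama_decode call
  // by giving each prompt its own sequence id. Assumes `ctx` was created with a large
  // enough context and with n_seq_max >= prompts.size().
  #include "llama.h"
  #include "common.h"

  #include <vector>

  static bool decode_prompts_batched(llama_context * ctx,
                                     const std::vector<std::vector<llama_token>> & prompts) {
      int32_t n_tokens_total = 0;
      for (const auto & p : prompts) {
          n_tokens_total += (int32_t) p.size();
      }

      llama_batch batch = llama_batch_init(n_tokens_total, 0, (int32_t) prompts.size());

      for (size_t s = 0; s < prompts.size(); ++s) {
          const auto & prompt = prompts[s];
          for (size_t i = 0; i < prompt.size(); ++i) {
              // request logits only for the last token of each prompt
              const bool need_logits = (i == prompt.size() - 1);
              llama_batch_add(batch, prompt[i], (llama_pos) i, { (llama_seq_id) s }, need_logits);
          }
      }

      const bool ok = llama_decode(ctx, batch) == 0;
      llama_batch_free(batch);
      return ok;
  }
  ```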
- **Tool calls (function calling)**
  Support for the MeetKai/functionary model by implementing OpenAI-compatible tool calls in the chat endpoint.
  - Server: add support for "tool_calls" (MeetKai/functionary model) #5695
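
  For reference only, this is roughly what an OAI-style `tool_calls` response body looks like when assembled with nlohmann::json (the JSON library the server already uses); the field names follow the public OpenAI chat-completions format and the `get_weather` function is made up - none of this reflects an agreed-upon llama.cpp design:

  ```cpp
  // Sketch of an OAI-compatible response in which the model requests a tool call.
  #include <nlohmann/json.hpp>

  #include <iostream>

  using json = nlohmann::json;

  int main() {
      const json response = {
          {"object", "chat.completion"},
          {"choices", json::array({
              {
                  {"index", 0},
                  {"finish_reason", "tool_calls"},
                  {"message", {
                      {"role", "assistant"},
                      {"content", nullptr},
                      {"tool_calls", json::array({
                          {
                              {"id", "call_0"},
                              {"type", "function"},
                              {"function", {
                                  {"name", "get_weather"},                  // hypothetical tool
                                  {"arguments", "{\"location\":\"Paris\"}"} // JSON-encoded string, per the OAI format
                              }}
                          }
                      })}
                  }}
              }
          })}
      };

      std::cout << response.dump(2) << std::endl;
      return 0;
  }
  ```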
- **Multimodal support**
  Support has been temporarily dropped in server : refactor #5882; before working on `server`, we should improve `llava-cli` and the API for using LLaVA.
  - server: Bring back multimodal support #8010
  - llava-cli: improve llava-cli and the API for using LLaVA #6027
  - server : refactor #5882 (comment)
  - server : refactor #5882 (comment)
  - server: multimodal - fix misreported prompt and num prompt tokens #5896
  - llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592
  - Unable to assign mmproj value when running docker #6226
- **Prompt processing improvement**
- **Server production readiness**

This is likely not a complete list of things - if you think some feature is important to be improved or supported, drop a comment. Also have a look at the issues labelled with server/webui.