Description
The server example has been growing in functionality and unfortunately I feel it is not very stable at the moment, and there are some important features that are still missing. Creating this issue to keep track of some of these points and try to draw more attention from the community. Some of the tasks are relatively big and would require significant effort to complete.

- **Support chat templates**
  We need to have a separation between the user input and the special tokens, so that the tokenization is performed correctly. See the following comments / commits for more context:
  - Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160 (comment)
  - c544fae
  - Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160 (comment)

  We already support extracting meta information from the GGUF model files that can provide the chat template for the specific model:
  - gguf-py : export chat templates #4125
  - Support chat template for `/v1/chat/completions`: Server: use llama_chat_apply_template #5593
  - List of supported templates: view on wiki

  Supporting this in `server` would require changes both in the backend and the frontend.
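
  As an illustration only, the backend side could format an OAI-style message list with `llama_chat_apply_template` (using the signature introduced in #5593) roughly as in the sketch below; the buffer sizing, error handling, and example messages are assumptions, not the actual `server` code:

  ```cpp
  // Minimal sketch, assuming `model` was loaded with llama_load_model_from_file().
  // Passing tmpl = nullptr uses the chat template stored in the GGUF metadata.
  #include "llama.h"

  #include <cstring>
  #include <string>
  #include <vector>

  static std::string apply_chat_template(const llama_model * model,
                                         const std::vector<llama_chat_message> & chat) {
      // rough upper bound for the size of the formatted prompt
      size_t alloc_size = 256;
      for (const auto & msg : chat) {
          alloc_size += strlen(msg.role) + 2*strlen(msg.content);
      }

      std::vector<char> buf(alloc_size);
      int32_t res = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                              /*add_ass=*/true, buf.data(), (int32_t) buf.size());
      if (res > (int32_t) buf.size()) {
          // the buffer was too small - retry with the size reported by the API
          buf.resize(res);
          res = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                          /*add_ass=*/true, buf.data(), (int32_t) buf.size());
      }
      return res < 0 ? std::string() : std::string(buf.data(), res);
  }

  // usage (hypothetical messages):
  //   std::vector<llama_chat_message> chat = {
  //       {"system", "You are a helpful assistant."},
  //       {"user",   "Hello!"},
  //   };
  //   const std::string prompt = apply_chat_template(model, chat);
  ```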
- **Likely redundant logic for OpenAI (OAI) compatibility that should be removed**
  - server : OAI API compatibility #4198 (comment)
- **Use multiple mount points for the OAI API**
  https://github.com/ggerganov/llama.cpp/blob/af19d3573481d409b3c4e55494810eb1f65a9aae/examples/server/server.cpp#L2682-L2684
  - Add "/chat/completions" as alias for "/v1/chat/completions" #5722
- **Return meaningful errors on KV cache overflow**
  - update_slots : failed to decode the batch #4185 (comment)
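
  As a sketch only (the helper name, checks, and messages are made up, not existing server code), the request handler could validate a request against the slot's context size up front and return a descriptive error instead of failing later inside `update_slots`:

  ```cpp
  // Hypothetical pre-flight check: reject requests that cannot fit in the slot's
  // context window with a human-readable error, instead of failing during decode.
  #include <cstdint>
  #include <string>

  struct kv_check_result {
      bool        ok;
      std::string error; // message that would be returned to the client
  };

  static kv_check_result check_kv_capacity(int32_t n_ctx_slot, int32_t n_prompt, int32_t n_predict) {
      if (n_prompt >= n_ctx_slot) {
          return { false, "prompt (" + std::to_string(n_prompt) +
                          " tokens) does not fit in the slot context (" +
                          std::to_string(n_ctx_slot) + " tokens)" };
      }
      if (n_predict > 0 && n_prompt + n_predict > n_ctx_slot) {
          return { false, "prompt + n_predict (" + std::to_string(n_prompt + n_predict) +
                          " tokens) exceed the slot context (" +
                          std::to_string(n_ctx_slot) + " tokens) - reduce n_predict or the prompt size" };
      }
      return { true, "" };
  }
  ```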
- **Refactor the code**
  With the recent additions of parallel decoding support for multiple clients and LLaVA, I feel the code base has become very cumbersome and there is a lot of room for refactoring and improving the code. Some effort should be dedicated to cleaning things up and simplifying the code.
  - Server: try to refactor server.cpp #5065
  - Server: Improve work queue stability #5710
- **Batched decoding endpoint?**
  Although we added parallel decoding support via "slots", we are still lacking batched decoding where a single client could pass an array of prompts to be completed, or alternatively generate multiple completions for a single prompt. It would be useful to support this use case (see the sketch below).
  - llama : add batched inference endpoint to server #3478 (comment)
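
  For reference, a rough sketch of the kind of batching such an endpoint could build on, using one sequence id per prompt in a single `llama_batch`; the helper name and error handling are illustrative, not a proposed implementation:

  ```cpp
  // Sketch only: evaluate several already-tokenized prompts in one llama_decode call
  // by giving each prompt its own sequence id. Assumes `ctx` was created with a large
  // enough context and with n_seq_max >= prompts.size().
  #include "llama.h"
  #include "common.h"

  #include <vector>

  static bool decode_prompts_batched(llama_context * ctx,
                                     const std::vector<std::vector<llama_token>> & prompts) {
      int32_t n_tokens_total = 0;
      for (const auto & p : prompts) {
          n_tokens_total += (int32_t) p.size();
      }

      llama_batch batch = llama_batch_init(n_tokens_total, 0, (int32_t) prompts.size());

      for (size_t s = 0; s < prompts.size(); ++s) {
          const auto & prompt = prompts[s];
          for (size_t i = 0; i < prompt.size(); ++i) {
              // request logits only for the last token of each prompt
              const bool need_logits = (i == prompt.size() - 1);
              llama_batch_add(batch, prompt[i], (llama_pos) i, { (llama_seq_id) s }, need_logits);
          }
      }

      const bool ok = llama_decode(ctx, batch) == 0;
      llama_batch_free(batch);
      return ok;
  }
  ```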
- **Tool calls (function calling)**
  Support for the MeetKai/functionary model by implementing OpenAI-compatible tool calls in the chat endpoint.
  - Server: add support for "tool_calls" (MeetKai/functionary model) #5695
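
  For reference only, this is roughly what an OAI-style `tool_calls` response body looks like when assembled with nlohmann::json (the JSON library the server already uses); the field names follow the public OpenAI chat-completions format and the `get_weather` function is made up - none of this reflects an agreed-upon llama.cpp design:

  ```cpp
  // Sketch of an OAI-compatible response in which the model requests a tool call.
  #include <nlohmann/json.hpp>

  #include <iostream>

  using json = nlohmann::json;

  int main() {
      const json response = {
          {"object", "chat.completion"},
          {"choices", json::array({
              {
                  {"index", 0},
                  {"finish_reason", "tool_calls"},
                  {"message", {
                      {"role", "assistant"},
                      {"content", nullptr},
                      {"tool_calls", json::array({
                          {
                              {"id", "call_0"},
                              {"type", "function"},
                              {"function", {
                                  {"name", "get_weather"},                  // hypothetical tool
                                  {"arguments", "{\"location\":\"Paris\"}"} // JSON-encoded string, per the OAI format
                              }}
                          }
                      })}
                  }}
              }
          })}
      };

      std::cout << response.dump(2) << std::endl;
      return 0;
  }
  ```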
- **Multimodal support**
  Support has been temporarily dropped in server : refactor #5882; before working on `server`, we should improve `llava-cli` and the API for using LLaVA.
  - server: Bring back multimodal support #8010
  - llava-cli: improve llava-cli and the API for using LLaVA #6027
  - server : refactor #5882 (comment)
  - server : refactor #5882 (comment)
  - server: multimodal - fix misreported prompt and num prompt tokens #5896
  - llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592
  - Unable to assign mmproj value when running docker #6226
- **Prompt processing improvement**
- **Server production readiness**

This is likely not a complete list of things - if you think some feature is important to be improved or supported, drop a comment. Also have a look at the issues labelled with server/webui.