feat: add llamacpp params #221

Merged: 2 commits, Sep 12, 2024
src/chat_completion_request.h (38 additions, 0 deletions)
@@ -1,5 +1,6 @@
#pragma once
#include "json/value.h"
#include "sampling.h"

namespace llama::inferences {
struct ChatCompletionRequest {
@@ -12,10 +13,29 @@ struct ChatCompletionRequest {
Json::Value stop = Json::Value(Json::arrayValue);
Json::Value messages = Json::Value(Json::arrayValue);
std::string model_id;

int seed = -1;
float dynatemp_range = 0.0f;
float dynatemp_exponent = 1.0f;
int top_k = 40;
float min_p = 0.05f;
float tfs_z = 1.0f;
float typ_p = 1.0f;
int repeat_last_n = 64;
float penalty_repeat = 1.0f;
bool mirostat = false;
float mirostat_tau = 5.0f;
float mirostat_eta = 0.1f;
bool penalize_nl = false;
bool ignore_eos = false;
int n_probs = 0;
int min_keep = 0;
std::string grammar;
};

inline ChatCompletionRequest fromJson(std::shared_ptr<Json::Value> jsonBody) {
ChatCompletionRequest completion;
gpt_sampler_params default_params;
if (jsonBody) {
completion.stream = (*jsonBody).get("stream", false).asBool();
completion.max_tokens = (*jsonBody).get("max_tokens", 500).asInt();
@@ -28,6 +48,24 @@ inline ChatCompletionRequest fromJson(std::shared_ptr<Json::Value> jsonBody) {
completion.messages = (*jsonBody)["messages"];
completion.stop = (*jsonBody)["stop"];
completion.model_id = (*jsonBody).get("model", {}).asString();

completion.seed = (*jsonBody).get("seed", -1).asInt();

I notice this PR defines default values twice:

  • struct definition (above)
  • JSON parsing defaults

Are we able to define them once?

DRY principle: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself

@nguyenhoangthuan99 (Contributor, Author) on Sep 11, 2024:

I followed the previous implementation, e.g. https://github.com/janhq/cortex.llamacpp/blob/main/src/chat_completion_request.h#L8; maybe some weird bug in the past forced us to do it this way. For example, in the PR where I fixed the race condition, even though we had checked everything in the code (mutex, only return a slot if available, ...), the error still popped up, so I had to add another check for if (slot == null) before the issue was resolved.

We are using a third-party lib for JSON, so I think there is no harm in double-checking to make sure it works well. If it's necessary I'll change it, but then we need to test more to make sure it won't break anything.
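
For reference, a minimal sketch of declaring each default only once: construct a default ChatCompletionRequest and reuse its member initializers as the JSON fallbacks. This is illustrative only and not part of this PR; fromJsonOnce is a hypothetical name, only a subset of fields is shown, and it assumes the jsoncpp Json::Value::get calls used above.

// Hypothetical sketch, not the PR's code: each default lives only in the
// struct's member initializers and is reused as the JSON fallback.
inline ChatCompletionRequest fromJsonOnce(std::shared_ptr<Json::Value> jsonBody) {
  ChatCompletionRequest c;  // members already carry the struct defaults
  if (jsonBody) {
    const Json::Value& j = *jsonBody;
    c.top_k = j.get("top_k", c.top_k).asInt();
    c.min_p = j.get("min_p", c.min_p).asFloat();
    c.penalty_repeat = j.get("repeat_penalty", c.penalty_repeat).asFloat();
    c.mirostat_tau = j.get("mirostat_tau", c.mirostat_tau).asFloat();
    // ...remaining fields follow the same pattern...
  }
  return c;
}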

completion.dynatemp_range = (*jsonBody).get("dynatemp_range", 0.0f).asFloat();
completion.dynatemp_exponent = (*jsonBody).get("dynatemp_exponent", 0.0f).asFloat();
completion.top_k = (*jsonBody).get("top_k", 40).asInt();
completion.min_p = (*jsonBody).get("min_p", 0.05f).asFloat();
completion.tfs_z = (*jsonBody).get("tfs_z", 1.0f).asFloat();
completion.typ_p = (*jsonBody).get("typ_p", 1.0f).asFloat();
completion.repeat_last_n = (*jsonBody).get("repeat_last_n", 64).asInt();
completion.penalty_repeat = (*jsonBody).get("repeat_penalty", 1.1f).asFloat();
completion.mirostat = (*jsonBody).get("mirostat", false).asBool();
completion.mirostat_tau = (*jsonBody).get("mirostat_tau", 5.0f).asFloat();
completion.mirostat_eta = (*jsonBody).get("mirostat_eta", 0.1f).asFloat();
completion.penalize_nl = (*jsonBody).get("penalize_nl", true).asBool();
completion.ignore_eos = (*jsonBody).get("ignore_eos", false).asBool();
completion.n_probs = (*jsonBody).get("n_probs", 0).asInt();
completion.min_keep = (*jsonBody).get("min_keep", 0).asInt();
completion.grammar = (*jsonBody).get("grammar", "").asString();
}
return completion;
}
src/llama_engine.cc (29 additions, 2 deletions)
@@ -480,6 +480,10 @@ bool LlamaEngine::LoadModelImpl(std::shared_ptr<Json::Value> json_body) {
if (!params.use_mmap) {
LOG_DEBUG << "Disabled mmap";
}
params.n_predict = json_body->get("n_predict", -1).asInt();
params.prompt = json_body->get("prompt", "").asString();
params.conversation = json_body->get("conversation", false).asBool();
params.special = json_body->get("special", false).asBool();

server_map_[model_id].caching_enabled =
json_body->get("caching_enabled", true).asBool();
@@ -599,6 +603,24 @@ void LlamaEngine::HandleInferenceImpl(
data["temperature"] = completion.temperature;
data["frequency_penalty"] = completion.frequency_penalty;
data["presence_penalty"] = completion.presence_penalty;
data["seed"] = completion.seed;
data["dynatemp_range"] = completion.dynatemp_range;
data["dynatemp_exponent"] = completion.dynatemp_exponent;
data["top_k"] = completion.top_k;
data["min_p"] = completion.min_p;
data["tfs_z"] = completion.tfs_z;
data["typical_p"] = completion.typ_p;
data["repeat_last_n"] = completion.repeat_last_n;
data["repeat_penalty"] = completion.penalty_repeat;

Woah, is there a way for us to align our penalty_repeat param with the original llama.cpp repeat_penalty?

  • This is the sort of thing that trips an intern up a year from now
  • If we align all params, is there a more elegant way to copy aligned k-v pairs from one struct to another? (llama3.1 tells me std::copy)

@nguyenhoangthuan99 (Contributor, Author):

I think that's impossible, because data is a JSON datatype while completion is our custom struct type; the json library doesn't overload operator= to assign from our completion struct.

@nguyenhoangthuan99 (Contributor, Author):

About penalty_repeat vs. repeat_penalty: it's the same situation as the earlier frequency_penalty implementation, https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L445. I think it's a way to keep a uniform parameter naming interface for the API.
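
As a hedged sketch of the "more elegant way" question, one option is a single helper that owns the struct-to-JSON mapping, so the llama.cpp-facing names live in one place. FillSamplingParams is a hypothetical function, not code from this PR, and JsonT stands in for whichever JSON type (Json::Value or nlohmann::json) receives the parameters.

// Hypothetical helper: the rename from penalty_repeat to repeat_penalty (and
// typ_p to typical_p) is written exactly once, so callers cannot drift apart.
template <typename JsonT>
void FillSamplingParams(JsonT& data, const llama::inferences::ChatCompletionRequest& c) {
  data["top_k"] = c.top_k;
  data["min_p"] = c.min_p;
  data["typical_p"] = c.typ_p;                // struct field: typ_p
  data["repeat_penalty"] = c.penalty_repeat;  // struct field: penalty_repeat
  data["mirostat"] = c.mirostat;
  data["mirostat_tau"] = c.mirostat_tau;
  data["mirostat_eta"] = c.mirostat_eta;
  data["grammar"] = c.grammar;
  // ...remaining fields follow the same pattern...
}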

data["mirostat"] = completion.mirostat;
data["mirostat_tau"] = completion.mirostat_tau;
data["mirostat_eta"] = completion.mirostat_eta;
data["penalize_nl"] = completion.penalize_nl;
data["ignore_eos"] = completion.ignore_eos;
data["n_probs"] = completion.n_probs;
data["min_keep"] = completion.min_keep;
data["grammar"] = completion.grammar;
int n_probs = completion.n_probs;
const Json::Value& messages = completion.messages;

if (!si.grammar_file_content.empty()) {
@@ -717,12 +739,17 @@ void LlamaEngine::HandleInferenceImpl(
auto state = CreateInferenceState(si.ctx);

// Queued task
si.q->runTaskInQueue([cb = std::move(callback), state, data, request_id, n_probs]() {
state->task_id = state->llama.RequestCompletion(data, false, false, -1);
while (state->llama.model_loaded_external) {
TaskResult result = state->llama.NextResult(state->task_id);
if (!result.error) {
std::string to_send;
if (n_probs > 0){

Can I verify my understanding about n_probs:

  • From the llama.cpp server docs: if n_probs > 0, the response contains probabilities for the top N tokens
  • However, we don't seem to send the content back to the user in this case?
  • Or: do we send both content, and also completion_probabilities?

We should align with the conventions in llama.cpp's server, as much as possible

@nguyenhoangthuan99 (Contributor, Author):

Our implementation can return the form described in ggml-org/llama.cpp#4088 (comment): both the content and a list of (token, probability) pairs for each generated token.
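
For illustration only, a sketch of how a client might read the streamed chunk when n_probs > 0, i.e. the dumped completion_probabilities array. It is not part of this PR; the content, probs, tok_str, and prob field names are assumptions taken from the linked llama.cpp discussion and should be checked against the actual server output.

#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Hypothetical client-side helper; field names are assumptions, not verified.
void PrintCompletionProbs(const std::string& chunk) {
  const auto arr = nlohmann::json::parse(chunk);  // the dumped completion_probabilities array
  for (const auto& entry : arr) {
    std::cout << entry.value("content", std::string{}) << "\n";
    if (!entry.contains("probs")) continue;
    for (const auto& cand : entry["probs"]) {
      std::cout << "  " << cand.value("tok_str", std::string{})
                << "  p=" << cand.value("prob", 0.0) << "\n";
    }
  }
}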

to_send = result.result_json["completion_probabilities"].dump();
}else{
to_send = result.result_json["content"];
}
// trim the leading space if it is the first token
if (std::exchange(state->is_first_token, false)) {
llama_utils::ltrim(to_send);
src/llama_server_context.cc (16 additions, 2 deletions)
@@ -1,5 +1,5 @@
#include "llama_server_context.h"

#include "sampling.h"
namespace {
const std::string base64_chars =
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
@@ -458,6 +458,15 @@ bool LlamaServerContext::LaunchSlotWithData(LlamaClientSlot*& slot, json data) {
slot->params.seed = json_value(data, "seed", default_params.seed);
slot->sparams.grammar = json_value(data, "grammar", default_sparams.grammar);
slot->sparams.n_probs = json_value(data, "n_probs", default_sparams.n_probs);
slot->sparams.min_keep =
json_value(data, "min_keep", default_sparams.min_keep);
slot->sparams.seed = json_value(data, "seed", default_sparams.seed);
slot->sparams.dynatemp_range =
json_value(data, "dynatemp_range", default_sparams.dynatemp_range);
slot->sparams.dynatemp_exponent =
json_value(data, "dynatemp_exponent", default_sparams.dynatemp_exponent);
slot->sparams.ignore_eos =
json_value(data, "ignore_eos", default_sparams.ignore_eos);

Can I check my understanding:

  • This code filters for top N tokens given sampling settings
  • Fills out the completion_probabilities key-value pair

@nguyenhoangthuan99 (Contributor, Author) on Sep 11, 2024:

Actually, returning the n probs happens in several places in the codebase.
At this point, every inference step appends the n token probabilities to the result: https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L1675

In stream mode, this line builds the returned JSON: https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L919
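
To make the "top N tokens per step" idea concrete, a generic, self-contained sketch is shown below; it is illustrative only, not the actual cortex.llamacpp code, and TokenProb and TopNProbs are made-up names.

#include <algorithm>
#include <vector>

struct TokenProb {
  int tok;     // token id
  float prob;  // probability assigned by the sampler
};

// Illustrative only: keep the n_probs most likely candidates of a single step.
std::vector<TokenProb> TopNProbs(std::vector<TokenProb> candidates, int n_probs) {
  const size_t keep =
      std::min(candidates.size(), static_cast<size_t>(std::max(n_probs, 0)));
  std::partial_sort(candidates.begin(), candidates.begin() + keep, candidates.end(),
                    [](const TokenProb& a, const TokenProb& b) { return a.prob > b.prob; });
  candidates.resize(keep);
  return candidates;
}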


// infill
if (data.count("input_prefix") != 0) {
@@ -969,8 +978,13 @@ void LlamaServerContext::SendFinalResponse(LlamaClientSlot& slot) {
slot.generated_token_probs.begin(),
slot.generated_token_probs.begin() + slot.sent_token_probs_index);
}
if (!slot.params.stream) {
res.result_json["completion_probabilities"] =
probs_vector_to_json(ctx, probs);
} else {
res.result_json["completion_probabilities"] = json();
}
}

if (slot.oaicompat) {