Consider re-working the vLLM Gauge exposing the currently active LoRAs #354

Open
markmc opened this issue Feb 18, 2025 · 5 comments

Comments

@markmc

markmc commented Feb 18, 2025

In vLLM, we are finishing an effort (known as "V1") to re-architect the core LLM engine - see vllm-project/vllm#10582

One of the last remaining feature gaps versus V0 is metrics - see vllm-project/vllm#10582

As we carry over the V0 metrics into V1, we are re-evaluating what makes sense and considering deprecating or re-working some metrics - see vllm-project/vllm#12745

The vllm:lora_requests_info gauge added by vllm-project/vllm#9477 looks particularly unorthodox - its value is a timestamp, and the status of adapters is encoded as comma-separated strings in labels:

```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```
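
To illustrate why this shape is awkward for consumers, here is a rough sketch of what recovering the running adapters from this gauge involves - this is not the gateway-api-inference-extension's actual code, and the /metrics URL is an assumption:

```python
# Sketch only: recover the current running adapters from vllm:lora_requests_info.
import requests
from prometheus_client.parser import text_string_to_metric_families

def running_loras(metrics_url="http://localhost:8000/metrics"):
    text = requests.get(metrics_url).text
    latest_ts, labels = 0.0, {}
    for family in text_string_to_metric_families(text):
        if family.name != "vllm:lora_requests_info":
            continue
        for sample in family.samples:
            # The gauge value is a timestamp, so only the series with the
            # largest value reflects the current adapter state.
            if sample.value > latest_ts:
                latest_ts, labels = sample.value, sample.labels
    # Adapter names arrive as a comma-separated string inside a label.
    return [name for name in labels.get("running_lora_adapters", "").split(",") if name]
```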

At first blush, it seems like this would achieve the same thing in a more standard representation:

```
vllm:num_lora_requests_running{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
vllm:num_lora_requests_waiting{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
```

This sort of thing is separately proposed in vllm-project/vllm#11091. There would also be a lora_config Info metric that exposes max_lora.
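
For concreteness, here is a minimal sketch of how that representation might be exposed with prometheus_client - illustrative only, not the actual vLLM implementation; the metric and label names are taken from the examples above:

```python
# Sketch only: per-adapter gauges plus an Info metric for the LoRA config.
from prometheus_client import Gauge, Info

labelnames = ["model_name", "lora_name"]

num_lora_requests_running = Gauge(
    "vllm:num_lora_requests_running",
    "Number of requests currently running on a given LoRA adapter.",
    labelnames=labelnames,
)
num_lora_requests_waiting = Gauge(
    "vllm:num_lora_requests_waiting",
    "Number of requests currently waiting on a given LoRA adapter.",
    labelnames=labelnames,
)
# prometheus_client exposes this as vllm:lora_config_info.
lora_config = Info("vllm:lora_config", "LoRA configuration of this vLLM instance.")

# Example updates, e.g. once per scheduler iteration:
model = "meta-llama/Llama-3.1-8B-Instruct"
num_lora_requests_running.labels(model_name=model, lora_name="test-lora").set(8)
num_lora_requests_waiting.labels(model_name=model, lora_name="test-lora").set(7)
lora_config.info({"max_lora": "1"})
```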

Feel free to discuss either here or in vllm-project/vllm#13303. Thanks!

markmc added a commit to markmc/vllm that referenced this issue Feb 18, 2025
As discussed in vllm-project#13303, this metric perhaps isn't the most ideal
solution for the use case but, given there is an existing user,
we should retain compatibility in V1 and deprecate it if we
replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <[email protected]>
markmc added a commit to markmc/vllm that referenced this issue Feb 18, 2025
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

As discussed in vllm-project#13303, this metric perhaps isn't the most ideal
solution for the use case but, given there is an existing user,
we should retain compatibility in V1 and deprecate it if we
replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <[email protected]>
@liu-cong
Contributor

The metrics are currently used in the gateway-api-inference-extension project. While I agree the current implementation is not ideal, I'd like to make sure the new implementation doesn't break the dependency.

How do we get the same running_lora_adapters information from the newly proposed metric? Will a LoRA adapter that's no longer running show up with a value of 0, like so: vllm:num_lora_requests_running{lora_name="non-running-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0, or will it not show up at all? The latter is preferred, since by iterating over all the lora_name labels we can essentially get the equivalent.
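
For concreteness, iterating over the lora_name labels against the proposed metric might look roughly like this - a sketch in Python rather than the extension's actual Go code, assuming series for inactive adapters are dropped from the scrape and that the /metrics URL is a placeholder:

```python
# Sketch only: derive running/waiting adapter sets from the proposed metrics.
import requests
from prometheus_client.parser import text_string_to_metric_families

def lora_sets(metrics_url="http://localhost:8000/metrics"):
    running, waiting = set(), set()
    text = requests.get(metrics_url).text
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name == "vllm:num_lora_requests_running" and sample.value > 0:
                running.add(sample.labels["lora_name"])
            elif sample.name == "vllm:num_lora_requests_waiting" and sample.value > 0:
                waiting.add(sample.labels["lora_name"])
    return running, waiting
```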

@markmc
Author

markmc commented Feb 19, 2025

> The metrics are currently used in the gateway-api-inference-extension project. While I agree the current implementation is not ideal, I'd like to make sure the new implementation doesn't break the dependency.

See #13504 - V1 will have a compatible implementation. However, it is quite likely we will deprecate and replace it sooner or later.

> How do we get the same running_lora_adapters information from the newly proposed metric? Will a LoRA adapter that's no longer running show up with a value of 0, like so: vllm:num_lora_requests_running{lora_name="non-running-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0, or will it not show up at all? The latter is preferred, since by iterating over all the lora_name labels we can essentially get the equivalent.

Good question - the proposal needs more work. But see the further discussion in #13303 - the current metric isn't really doing what you want right now either, AFAICT. If a LoRA adapter had no scheduled requests in the most recent batch, its weights may still be loaded in GPU memory, and the vLLM instance should still be considered a good target for requests.

@liu-cong
Contributor

> the current metric isn't really doing what you want right now either, AFAICT.

The current implementation provides a good proxy in an environment where most LoRA adapters are actively serving traffic.

But I agree with you. If we can get the "real" active LoRA metric in V1, that will be awesome.

@kfswain
Collaborator

kfswain commented Feb 19, 2025

CC: @kfswain @kaushikmitr

@liu-cong
Contributor

Great discussion here and in #13303!

@courageJ pointed out that using a timestamp as the value and some metadata as labels is an existing pattern in k8s; see the kube_pod_container_status_last_terminated_timestamp metric example in https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md

And the semantics are "this is the last timestamp at which the server observed this state". I think we can clarify this with better documentation.
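
A minimal sketch of that pattern on the producer side - illustrative only, not kube-state-metrics or vLLM code:

```python
# Sketch only: the "timestamp as value, state as labels" pattern.
import time
from prometheus_client import Gauge

# One series per observed adapter-state combination; the value records when
# that combination was last seen, so consumers take the series with the
# largest (most recent) value.
lora_requests_info = Gauge(
    "vllm:lora_requests_info",
    "Timestamp at which the labelled LoRA adapter state was last observed.",
    labelnames=["max_lora", "running_lora_adapters", "waiting_lora_adapters"],
)

def observe_state(max_lora, running, waiting):
    lora_requests_info.labels(
        max_lora=str(max_lora),
        running_lora_adapters=",".join(sorted(running)),
        waiting_lora_adapters=",".join(sorted(waiting)),
    ).set(time.time())
```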

@markmc I saw vllm-project/vllm#13504 reimplements this in V1 - is this the current plan? If so, then I think we should be fine. Keep us in the loop if you'd like to pursue further refactoring. And thanks again for finding us as one of the consumers and bringing this to our attention!

markmc added a commit to markmc/vllm that referenced this issue Feb 24, 2025