Consider re-working the vLLM Gauge exposing the currently active LoRAs #354

Open
markmc opened this issue Feb 18, 2025 · 5 comments

Comments

@markmc

markmc commented Feb 18, 2025

In vLLM, we are finishing an effort (known as "V1") to re-architect the core LLM engine - see vllm-project/vllm#10582

One of the last remaining feature gaps versus V0 is metrics - see vllm-project/vllm#10582

As we carry over the V0 metrics into V1, we are re-evaluating what makes sense and considering deprecating or re-working some metrics - see vllm-project/vllm#12745

The vllm:lora_requests_info gauge added by vllm-project/vllm#9477 looks particularly unorthodox - its value is a timestamp, and the status of adapters is encoded as comma-separated strings in labels:

```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```
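
To illustrate why this shape is awkward for consumers, here is a rough sketch of what recovering the running adapters from this gauge involves - this is not the gateway-api-inference-extension's actual code, and the /metrics URL is an assumption:

```python
# Sketch only: recover the current running adapters from vllm:lora_requests_info.
import requests
from prometheus_client.parser import text_string_to_metric_families

def running_loras(metrics_url="http://localhost:8000/metrics"):
    text = requests.get(metrics_url).text
    latest_ts, labels = 0.0, {}
    for family in text_string_to_metric_families(text):
        if family.name != "vllm:lora_requests_info":
            continue
        for sample in family.samples:
            # The gauge value is a timestamp, so only the series with the
            # largest value reflects the current adapter state.
            if sample.value > latest_ts:
                latest_ts, labels = sample.value, sample.labels
    # Adapter names arrive as a comma-separated string inside a label.
    return [name for name in labels.get("running_lora_adapters", "").split(",") if name]
```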

At first blush, it seems like this would achieve the same thing in a more standard representation:

```
vllm:num_lora_requests_running{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
vllm:num_lora_requests_waiting{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
```

This sort of thing is separately proposed in vllm-project/vllm#11091. There would also be a lora_config Info metric that exposes max_lora.
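
For concreteness, here is a minimal sketch of how that representation might be exposed with prometheus_client - illustrative only, not the actual vLLM implementation; the metric and label names are taken from the examples above:

```python
# Sketch only: per-adapter gauges plus an Info metric for the LoRA config.
from prometheus_client import Gauge, Info

labelnames = ["model_name", "lora_name"]

num_lora_requests_running = Gauge(
    "vllm:num_lora_requests_running",
    "Number of requests currently running on a given LoRA adapter.",
    labelnames=labelnames,
)
num_lora_requests_waiting = Gauge(
    "vllm:num_lora_requests_waiting",
    "Number of requests currently waiting on a given LoRA adapter.",
    labelnames=labelnames,
)
# prometheus_client exposes this as vllm:lora_config_info.
lora_config = Info("vllm:lora_config", "LoRA configuration of this vLLM instance.")

# Example updates, e.g. once per scheduler iteration:
model = "meta-llama/Llama-3.1-8B-Instruct"
num_lora_requests_running.labels(model_name=model, lora_name="test-lora").set(8)
num_lora_requests_waiting.labels(model_name=model, lora_name="test-lora").set(7)
lora_config.info({"max_lora": "1"})
```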

Feel free to discuss either here or in vllm-project/vllm#13303. Thanks!

markmc added a commit to markmc/vllm that referenced this issue Feb 18, 2025
As discussed in vllm-project#13303, this metric perhaps isn't the most ideal
solution for the use case but, given there is an existing user,
we should retain compatibility in V1 and deprecate it if we
replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <[email protected]>
markmc added a commit to markmc/vllm that referenced this issue Feb 18, 2025
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

As discussed in vllm-project#13303, this metric perhaps isn't the most ideal
solution for the use case but, given there is an existing user,
we should retain compatibility in V1 and deprecate it if we
replace it with a different metric.

See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <[email protected]>
@liu-cong
Contributor

The metrics are currently used in the gateway-api-inference-extension project. While I agree the current implementation is not ideal, I'd like to make sure the new implementation doesn't break the dependency.

How do we get the same running_lora_adapters information from the newly proposed metric? Will a LoRA adapter that's no longer running show up with a value of 0, like so: vllm:num_lora_requests_running{lora_name="non-running-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0, or will it not show up at all? The latter is preferred, since by iterating over all the lora_name labels we can essentially get the equivalent.
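
For concreteness, iterating over the lora_name labels against the proposed metric might look roughly like this - a sketch in Python rather than the extension's actual Go code, assuming series for inactive adapters are dropped from the scrape and that the /metrics URL is a placeholder:

```python
# Sketch only: derive running/waiting adapter sets from the proposed metrics.
import requests
from prometheus_client.parser import text_string_to_metric_families

def lora_sets(metrics_url="http://localhost:8000/metrics"):
    running, waiting = set(), set()
    text = requests.get(metrics_url).text
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name == "vllm:num_lora_requests_running" and sample.value > 0:
                running.add(sample.labels["lora_name"])
            elif sample.name == "vllm:num_lora_requests_waiting" and sample.value > 0:
                waiting.add(sample.labels["lora_name"])
    return running, waiting
```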

@markmc
Author

markmc commented Feb 19, 2025

> The metrics are currently used in the gateway-api-inference-extension project. While I agree the current implementation is not ideal, I'd like to make sure the new implementation doesn't break the dependency.

See #13504 - V1 will have a compatible implementation. However, it is quite likely we will deprecate and replace it sooner or later.

> How do we get the same running_lora_adapters information from the newly proposed metric? Will a LoRA adapter that's no longer running show up with a value of 0, like so: vllm:num_lora_requests_running{lora_name="non-running-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0, or will it not show up at all? The latter is preferred, since by iterating over all the lora_name labels we can essentially get the equivalent.

Good question - the proposal needs more work. But see the further discussion in #13303 - the current metric isn't really doing what you want right now either, AFAICT. If a LoRA adapter had no scheduled requests in the most recent batch, its weights may still be loaded in GPU memory, and the vLLM instance should still be considered a good target for requests.

@liu-cong
Contributor

> the current metric isn't really doing what you want right now either, AFAICT.

The current implementation provides a good proxy in an environment where most LoRA adapters are actively serving traffic.

But I agree with you. If we can get the "real" active LoRA metric in V1, that will be awesome.

@kfswain
Collaborator

kfswain commented Feb 19, 2025

CC: @kfswain @kaushikmitr

@liu-cong
Contributor

Great discussion here and in #13303!

@courageJ pointed out that using a timestamp as the value and some metadata as labels is an existing pattern in k8s; see the kube_pod_container_status_last_terminated_timestamp metric example in https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md

And the semantics are "this is the last timestamp at which the server observed this state". I think we can clarify this with better documentation.
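
A minimal sketch of that pattern on the producer side - illustrative only, not kube-state-metrics or vLLM code:

```python
# Sketch only: the "timestamp as value, state as labels" pattern.
import time
from prometheus_client import Gauge

# One series per observed adapter-state combination; the value records when
# that combination was last seen, so consumers take the series with the
# largest (most recent) value.
lora_requests_info = Gauge(
    "vllm:lora_requests_info",
    "Timestamp at which the labelled LoRA adapter state was last observed.",
    labelnames=["max_lora", "running_lora_adapters", "waiting_lora_adapters"],
)

def observe_state(max_lora, running, waiting):
    lora_requests_info.labels(
        max_lora=str(max_lora),
        running_lora_adapters=",".join(sorted(running)),
        waiting_lora_adapters=",".join(sorted(waiting)),
    ).set(time.time())
```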

@markmc I saw vllm-project/vllm#13504 reimplements this in V1 - is this the current plan? If so, then I think we should be fine. Keep us in the loop if you'd like to pursue further refactoring. And thanks again for finding us as one of the consumers and bringing this to our attention!

markmc added a commit to markmc/vllm that referenced this issue Feb 24, 2025