I would like to know the right way to measure first-token latency and steady-state token latency using llama-batched-bench. I prefer llama-batched-bench over llama-bench since batched-bench can run multiple parallel streams.
My understanding is that:

- Steady-state latency: the average time it takes to produce each token once the model and data pipeline are fully warmed up and running continuously.
- First-token latency: the time from triggering a model inference to seeing the very first token of its output, including any one-time setup costs.
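For reference, my assumption (please correct me if this is wrong) is that these map onto the llama-batched-bench output roughly as follows: per stream, first-token latency ≈ T_PP (the prompt-processing time, since the first output token can only appear once the prompt has been processed), and steady-state per-token latency ≈ T_TG / TG (text-generation time divided by the number of generated tokens). As a purely hypothetical illustration, a run with ntg = 128 reporting T_PP = 0.80 s and T_TG = 3.20 s would read as ≈ 0.80 s to the first token and ≈ 3.20 s / 128 ≈ 25 ms per token at steady state.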
I used the code snippet below, which produces 6 permutations (npp × ntg) across 4 parallel streams (npl). Is this the right way to validate the model's latency figures?
I would also appreciate help deriving the per-token latency from the output; based on that latency I will choose the right LLM model for my application.
```bash
TIME_FORMAT_USAGE="Memory: %M: kilobytes, CPU: %P: percent"
OUTPUT_FILE="llama-batched-bench-log.txt"

# EXEC, MODEL_PATH and THREADS are assumed to be set earlier in the script
/usr/local/bin/time -f "$TIME_FORMAT_USAGE" \
    "$EXEC" \
    -m "$MODEL_PATH" \
    -pps --file prompts.txt \
    -npp 128,256,512 \
    -ntg 128,256 \
    -npl 4 \
    -t "$THREADS" \
    --temp 0.7 --repeat_penalty 1.1 \
    --output-format jsonl \
    -v >> "$OUTPUT_FILE" 2>&1

echo "completed all runs - check the log file $OUTPUT_FILE"
```