High Token Usage, Interruption Threshold Logic, and Audio Stream Mixing #168
Comments
hey, we send the entire prompt and append each conversation turn too, to give the best context to the models. The interruption scenario you're facing is likely due to something else (can you share the agent payload?). Regarding the ambient noise - this is something in the backlog. We released this feature as a beta only - but it needs major improvements. This might take some time as we are swamped in work. Happy to help if you or anyone else could contribute.
Got it, thanks for the clarification. A few follow-ups remain:
2nd: Interruption Threshold
3rd: Ambient Audio Contribution
4th: LLM Hangup Question
Thanks again!
hey @abdulrahmanmajid sorry for the late reply (I thought I had hit enter to comment, but it was stuck in draft). You're facing the issue with interruptions because you're using ambient_noise.
Would love to have contributions. I understand that it's currently quite messy (primarily because there's only one person working on it). I've started to clean up the code file-by-file, hoping to eventually restructure and break down task_manager.py (which is the main and largest file).
For ambient noise: it starts a background task which takes audio from a file and streams it to the output. This isn't the correct approach IMHO. We should mix the ambient noise with the agent's audio and then stream that to the output.
To your LLM hangup question: this can definitely be done, but I'll need to check how to pass control to trigger end_call from the main prompt itself. Is this something you can try out?
Hey @prateeksachan no worries at all, appreciate the reply.
So, I did some digging and tested with ambient_noise: false, but unfortunately the issue with interruptions still persists. After going through the code, I noticed the current logic doesn't check whether the agent is actively speaking; it just applies the number_of_words_for_interruption threshold unconditionally. This means even when the agent is silent and waiting for input, the user still needs to speak a minimum number of words to trigger an LLM response. I'll work on refining that logic so that when the agent is expecting a response (e.g. it is silent, or the agent-is-speaking flag is off), the interruption word threshold doesn't apply. That should make interactions feel much more natural.
As for the ambient noise, I'll try to modify this. I agree with your take: instead of running it as a separate task streaming audio, it makes more sense to mix it with the agent's speech before output (or maybe make it part of the base synth?). I'll explore that direction and see what I can put together.
For the LLM hangup issue, I think function calling might let us trigger end_call directly from within the main LLM flow. I'll experiment with that and open a PR once I've got something working. Thanks again for the guidance. I look forward to contributing!
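The function-calling idea above could be sketched along these lines, using an OpenAI-style tool definition. The tool name, schema, and dispatch helper here are illustrative assumptions, not Bolna's actual API:

```python
# Hypothetical OpenAI-style tool definition the agent could expose to the model,
# so the LLM itself can decide when to hang up.
end_call_tool = {
    "type": "function",
    "function": {
        "name": "end_call",
        "description": "Hang up the call once the conversation has naturally concluded.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string", "description": "Why the call is ending."}
            },
            "required": ["reason"],
        },
    },
}

def handle_tool_call(tool_name: str, hangup_callback) -> bool:
    """Dispatch a tool call emitted by the model; True means the call should end."""
    if tool_name == "end_call":
        hangup_callback()
        return True
    return False
```

The main loop would register end_call_tool with the LLM request and, when the model emits a matching tool call, invoke whatever hangup routine the pipeline already has.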
And in _listen_transcriber() we check for both the interim transcript and the final transcript.
Ah, you're right. I rechecked the logic and it does check whether the agent is still speaking, thanks for that. I gave ambient_noise: false another try and it's working as expected now. That said, I think the flag could be improved: maybe instead of checking whether any audio is being sent to the user, we should specifically check whether the agent is speaking. That way, ambient noise or backchanneling doesn't interfere with the interruption logic. I'll go ahead and make that change and open a PR for review.
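The gating described above could look something like this (a minimal sketch with illustrative names; only number_of_words_for_interruption comes from the actual config):

```python
def should_forward_to_llm(transcript: str,
                          agent_is_speaking: bool,
                          number_of_words_for_interruption: int = 2) -> bool:
    """Apply the word threshold only while the agent itself is speaking."""
    words = len(transcript.split())
    if not agent_is_speaking:
        # Agent is silent and waiting: any non-empty utterance goes through.
        return words > 0
    # Agent is mid-utterance: require the threshold before treating it as an interruption.
    return words >= number_of_words_for_interruption

print(should_forward_to_llm("okay", agent_is_speaking=False))  # True
print(should_forward_to_llm("okay", agent_is_speaking=True))   # False
```

Keying this off an agent-is-speaking flag rather than "any audio going out" is what keeps ambient noise and backchannels from suppressing single-word replies.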
hey @abdulrahmanmajid as part of the recent release https://github.com/bolna-ai/bolna/releases/tag/0.9.6, I've fixed the hangup issue and it won't get cut off midway now. Thanks for pointing this out.
@prateeksachan Hey, thanks for this. I've been working on the fix for ambient noise interfering with the interruption task: I've modified the is_audio_being_played_to_user flag to ignore streams or chunks with a sequence number below 0 (e.g. -1), so system audio is ignored. I'll create a PR soon and let you know. Thanks.
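That change might look roughly like this (the chunk layout and field names are assumptions for illustration; only is_audio_being_played_to_user and the sequence convention come from the discussion):

```python
def update_is_audio_being_played(chunk: dict, current_flag: bool) -> bool:
    """Only real agent chunks (sequence >= 0) mark audio as playing to the user."""
    sequence = chunk.get("meta_info", {}).get("sequence", -1)
    if sequence < 0:
        # System/ambient audio (e.g. sequence -1): leave the flag untouched.
        return current_flag
    return True
```

With this, a background ambient stream tagged with sequence -1 never flips the flag, so the interruption logic only engages while genuine agent speech is going out.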
Hi, I’ve noticed a few issues while using OpenAI LLMs (GPT-4o and GPT-4o mini) and wanted to report them:
Unusually high input token usage:
In the usage analytics, every conversation shows at least 150k input tokens and around 5k output tokens, even for short interactions. I'm curious why the input token count is so high. Is this expected behavior or a potential issue? (This is without function calls and with just a regular, average-length system prompt.)
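One plausible explanation (consistent with the maintainers' note elsewhere in the thread that the entire prompt plus all prior turns are re-sent each turn) is that aggregate input tokens grow quadratically with the number of turns. A rough back-of-the-envelope sketch, with entirely made-up token counts:

```python
def cumulative_input_tokens(system_tokens: int, turn_tokens: int, turns: int) -> int:
    """Total billed input tokens when the full history is re-sent on every LLM turn."""
    total = 0
    history = system_tokens
    for _ in range(turns):
        total += history          # this turn's input = system prompt + all prior turns
        history += turn_tokens    # the new exchange is appended to the history
    return total

# e.g. a 1,000-token system prompt and ~200 tokens per exchange:
print(cumulative_input_tokens(1000, 200, 50))  # → 295000
```

So even modest per-turn sizes can plausibly reach the 150k+ aggregate figures seen in the analytics over a longer call.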
Interruption threshold behavior:
When the interruption threshold is set to 2 words and the agent is silent, if the user says just one word (like "okay" or "proceed"), the agent doesn't respond and the LLM isn't given that input. It seems the threshold still applies even when the agent isn't speaking.
Ideally, the threshold should only be enforced while the agent is actively speaking to avoid accidental interruptions, not when the agent is silent. Could this behavior be improved to give it a more natural feel?
Ambient audio and TTS mixing:
Currently, when the agent speaks or TTS is triggered, all existing audio in the stream is cleared, including any ambient audio. This creates an unnatural experience, especially compared to other providers that seem to mix audio streams.
Would it be possible to add a function to mix multiple audio streams, so that ambient audio continues to play in the background even when the agent speaks, similar to how it works in real life where background noise doesn't stop when a person talks?
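A minimal sketch of the kind of mixing described here, assuming 16-bit signed PCM chunks (the sample format, gain value, and function name are assumptions, not the library's implementation):

```python
import array

def mix_pcm16(agent: bytes, ambient: bytes, ambient_gain: float = 0.2) -> bytes:
    """Mix an attenuated ambient bed under the agent's speech instead of clearing it."""
    a = array.array("h", agent)                    # 16-bit signed agent samples
    b = array.array("h", ambient[: len(agent)])    # trim ambient to the agent chunk
    out = array.array("h", a)
    for i in range(min(len(a), len(b))):
        mixed = a[i] + int(b[i] * ambient_gain)    # attenuate ambient, then sum
        out[i] = max(-32768, min(32767, mixed))    # clamp to the int16 range
    return out.tobytes()
```

The mixed bytes would then be written to the output stream in place of the raw TTS chunk, so the ambient bed keeps playing under the agent's speech rather than being flushed when TTS starts.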
Thanks in advance.