High Token Usage, Interruption Threshold Logic, and Audio Stream Mixing #168

abdulrahmanmajid opened this issue Apr 12, 2025 · 8 comments

@abdulrahmanmajid

Hi, I’ve noticed a few issues while using OpenAI LLMs (GPT-4o and GPT-4o mini) and wanted to report them:

Unusually high input token usage:

In the usage analytics, every conversation shows at least 150k input tokens and around 5k output tokens, even for short interactions. I'm curious why the input token count is so high. Is this expected behavior or a potential issue? (This is without function calls and with just a regular, average-length system prompt.)

Interruption threshold behavior:

When the interruption threshold is set to 2 words and the agent is silent, if the user says just one word (like "okay" or "proceed"), the agent doesn’t respond and the LLM isn’t given that input. It seems the threshold still applies even when the agent isn’t speaking.
Ideally, the threshold should only be enforced while the agent is actively speaking (to avoid accidental interruptions), not when the agent is silent. Could this behavior be improved to make it feel more natural?

Ambient audio and TTS mixing:

Currently, when the agent speaks or TTS is triggered, all existing audio in the stream is cleared, including any ambient audio. This creates an unnatural experience, especially compared to other providers that seem to mix audio streams.
Would it be possible to add a function to mix multiple audio streams, so that ambient audio continues to play in the background even when the agent speaks, similar to how it works in real life where background noise doesn't stop when a person talks?

Thanks in advance.

@prateeksachan
Member

hey, we send the entire prompt and append each conversation turn to it, to give the best context to the models.
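For illustration, here's a minimal sketch of that pattern (not Bolna's actual code; the prompt and function names are made up): the whole history is re-sent on every request, so prompt tokens add up quickly over a long call.

```python
# Minimal sketch (not Bolna's actual code): the full system prompt plus
# every prior turn is re-sent on each request, so resp.usage.prompt_tokens
# grows with conversation length.
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a helpful voice agent."  # stand-in prompt
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

def next_turn(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    reply = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    # includes the system prompt and all earlier turns, every single time
    print("prompt tokens this turn:", resp.usage.prompt_tokens)
    return reply
```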

The interruption behavior you're facing is likely due to something else (can you share the agent payload?).

Regarding the ambient noise - this is something in the backlog. We released this feature as a beta only, but it needs major improvements. This might take some time as we are swamped with work. Happy to help if you or anyone else could contribute.

@abdulrahmanmajid
Author

abdulrahmanmajid commented Apr 12, 2025

Got it, thanks for the clarification.

2nd Interruption Threshold
Here’s the full payload I’m currently using. Let me know if anything stands out that could be causing the unexpected interruptions:

"tasks": [
  {
    "task_type": "conversation",
    "toolchain": {
      "execution": "parallel",
      "pipelines": [
        [
          "transcriber",
          "llm",
          "synthesizer"
        ]
      ]
    },
    "tools_config": {
      "input": {
        "format": "wav",
        "provider": "twilio"
      },
      "llm_agent": {
        "agent_type": "simple_llm_agent",
        "agent_flow_type": "streaming",
        "routes": null,
        "llm_config": {
          "provider": "openai",
          "family": "openai",
          "max_tokens": 250,
          "temperature": 0.7,
          "top_p": 0.5,
          "presence_penalty": 0,
          "frequency_penalty": 0
        }
      },
      "output": {
        "format": "wav",
        "provider": "twilio"
      },
      "synthesizer": {
        "provider": "azuretts",
        "stream": true,
        "buffer_size": 500,
        "sampling_rate": 16000,
        "caching": true,
        "provider_config": {
          "voice": "AndrewMultilingual",
          "language": "en-US",
          "model": "Neural"
        }
      },
      "transcriber": {
        "provider": "deepgram",
        "model": "nova-3",
        "language": "en",
        "stream": true,
        "endpointing": 100
      }
    },
    "task_config": {
      "optimize_latency": true,
      "ambient_noise": true,
      "ambient_noise_track": "call-center",
      "incremental_delay": 0,
      "interruption_backoff_period": 50,
      "backchanneling": true,
      "backchanneling_message_gap": "5",
      "backchanneling_start_delay": "1",
      "use_fillers": false,
      "number_of_words_for_interruption": 2,
      "hangup_after_LLMCall": true
    }
  }
]
```

3rd Ambient Audio Contribution
I’d love to contribute to improving the ambient audio functionality. If you could share some direction, technical context, or a rough idea of how you envision it working, I’d be happy to take a shot at building it or prototyping something helpful. Also, if there’s anything else I could assist with, I’d be excited to help out where possible.

4th LLM Hangup Question
One question I had: currently, with hangup_after_LLMCall enabled, it seems like a separate LLM might be handling that detection. If the main LLM is prompted to persuade the user to stay on the call (convince the user not to hang up), it starts generating a response, but the hangup LLM cuts in before it finishes, ending the call. Is this how it's supposed to work? Is there a way to give this decision-making to the main LLM?

Thanks again!

@prateeksachan
Member

hey @abdulrahmanmajid sorry for the late reply (I thought I had hit enter to comment, but it was stuck in draft).

you're facing the issue with interruptions because you're using ambient_noise (this feature has certain issues which meddle with other capabilities). can you try interruptions with ambient_noise: false once?

Would love to have contributions. I understand that currently it's quite messy (primarily because there's only 1 person working on it). I've started to clean up the code file-by-file, hoping to eventually restructure and break down task_manager.py (which is the main and largest file).

For ambient noise - it starts a background task which takes audio from a file and streams it to the output.

This isn't the correct approach IMHO. We should mix the ambient noise with the agent's audio and then stream the mixed result to the output.
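Something along these lines (a rough sketch of the idea, not current Bolna code; it assumes both tracks are 16-bit mono PCM at the same sampling rate, and mix_chunks / ambient_gain are made-up names):

```python
# Rough sketch of the mixing approach (not current Bolna code); assumes
# both tracks are 16-bit mono PCM at the same sampling rate.
import numpy as np

def mix_chunks(agent_chunk: bytes, ambient_chunk: bytes,
               ambient_gain: float = 0.2) -> bytes:
    agent = np.frombuffer(agent_chunk, dtype=np.int16).astype(np.int32)
    ambient = np.frombuffer(ambient_chunk, dtype=np.int16).astype(np.int32)
    n = min(len(agent), len(ambient))
    # attenuate the ambient track, sum, and clip back to the int16 range
    mixed = agent[:n] + (ambient[:n] * ambient_gain).astype(np.int32)
    return np.clip(mixed, -32768, 32767).astype(np.int16).tobytes()
```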

To your LLM hangup question: this can definitely be done, but we'll need to check how to pass control so that end_call can be triggered from the main prompt itself. Is this something you could try out?

@abdulrahmanmajid
Author

Hey @prateeksachan no worries at all, appreciate the reply.

So, I did some digging and tested with ambient_noise: false, but unfortunately the interruption issue still persists. After going through the code, I noticed the current logic doesn't check whether the agent is actively speaking; it just applies the number_of_words_for_interruption threshold unconditionally. This means even when the agent is silent and waiting for input, the user still needs to speak a minimum number of words to trigger an LLM response.

I’ll work on refining that logic so that when the agent is expecting a response (e.g. it's silent, or the agent-is-speaking flag is off), the interruption word threshold doesn’t apply, roughly like the sketch below. That should make interactions feel much more natural.
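```python
# Rough sketch of the gating I have in mind (illustrative names, not actual
# Bolna identifiers): only enforce the word threshold while the agent speaks.
def should_forward_transcript(num_words: int, agent_is_speaking: bool,
                              words_for_interruption: int) -> bool:
    if not agent_is_speaking:
        # agent is silent and waiting for input: any utterance counts
        return num_words > 0
    # agent is mid-utterance: require the configured minimum to interrupt
    return num_words >= words_for_interruption
```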

As for the ambient noise, I’ll try to modify this. I agree with your take: instead of running it as a separate task streaming audio, it makes more sense to mix it with the agent’s speech before output (or maybe make it part of the base synthesizer?). I’ll explore that direction and see what I can put together.

For the LLM hangup issue, I think function calling might let us trigger end_call directly from within the main LLM flow, something like the sketch below. I’ll experiment with that and open a PR once I’ve got something working.
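```python
# Hypothetical tool definition (not an existing Bolna API): expose end_call
# as a function so the main LLM itself decides when to hang up.
tools = [{
    "type": "function",
    "function": {
        "name": "end_call",
        "description": "End the phone call once the conversation has "
                       "genuinely concluded and the user no longer needs help.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Short reason for ending the call.",
                },
            },
            "required": ["reason"],
        },
    },
}]
# Pass tools=tools to chat.completions.create; when the model returns a
# tool_call named "end_call", trigger the hangup instead of synthesizing it.
```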

Thanks again for the guidance. I look forward to contributing!

@prateeksachan
Member

prateeksachan commented Apr 16, 2025

number_of_words_for_interruption does check whether the AI is still speaking (that logic is there precisely to tell if the agent is mid-utterance).

And in _listen_transcriber() we check for the interim transcript and the final transcript.

@abdulrahmanmajid
Author

ah, you're right, I rechecked the logic and it does check whether the agent is still speaking. Thanks for that.

I gave ambient_noise: false another try and it’s working as expected now.

That said, I think the flag could be improved: maybe instead of checking whether any audio is being sent to the user, we should specifically check whether the agent is speaking. That way, ambient noise or backchanneling doesn’t interfere with the interruption logic. I’ll go ahead and make that change and open a PR for review.

@prateeksachan
Member

hey @abdulrahmanmajid as part of the recent release https://github.com/bolna-ai/bolna/releases/tag/0.9.6, I've fixed the hangup issue and it won't get cut midway now. Thanks for pointing this out.

@abdulrahmanmajid
Author

@prateeksachan Hey, thanks for this. I’ve been working on the fix for ambient noise interfering with the interruption logic: I’ve modified the is_audio_being_played_to_user check to ignore streams or chunks with a sequence number below 0 (e.g. -1), so system audio is ignored, roughly like the sketch below. I’ll create a PR soon and let you know.
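```python
# Rough sketch of the change (illustrative, not the exact diff): chunks with
# a negative sequence number are system audio (ambient noise, backchannels)
# and shouldn't mark the agent as "speaking".
def is_agent_audio(chunk: dict) -> bool:
    sequence = chunk.get("meta_info", {}).get("sequence", 0)
    # sequence < 0 (e.g. -1) is reserved for system audio; ignore it
    return sequence >= 0
```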

Thanks.
