
Whisper large v3 model repeats a lot #1507


Open
sindresorhus opened this issue Nov 17, 2023 · 12 comments
Labels
question Further information is requested

Comments

@sindresorhus
Contributor

I have gotten many reports that the large v3 model repeats sentences much more often than v2. I'm not sure if there's anything Whisper.cpp can do about this.

@ggerganov
Member

If we can determine in some way that the problem is whisper.cpp related, then I'll look into it more.
But so far, my analysis indicates that the problem lies in the v3 model itself, as I observe similar issues with the OpenAI implementation.

@sindresorhus
Contributor Author

Yeah, the problem seems to be the model itself: openai/whisper#1762 (comment)

The problem seems to occur during silence, so maybe Whisper.cpp could remove silence from the audio?

@ggerganov
Member

Removing silence from the audio is outside the scope of whisper.cpp because AFAIK there are many different algorithms to achieve this. It's better to leave it to the 3rd party to decide which one to use for their specific case.
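Since silence removal is left to the caller, a minimal caller-side sketch might look like the following: an energy-based trim using a fixed RMS threshold over 10 ms frames of 16 kHz mono PCM. The function names and the threshold value are made up for this illustration; production code would more likely use a real VAD such as WebRTC's or Silero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive energy-based VAD: a frame is "speech" when its RMS energy exceeds
// a fixed threshold. frame_size is in samples (10 ms at 16 kHz = 160).
std::vector<bool> naive_vad(const std::vector<float>& pcm,
                            size_t frame_size = 160,
                            float rms_threshold = 0.01f) {
    std::vector<bool> is_speech;
    for (size_t i = 0; i + frame_size <= pcm.size(); i += frame_size) {
        double energy = 0.0;
        for (size_t j = i; j < i + frame_size; ++j) {
            energy += pcm[j] * pcm[j];
        }
        const float rms = std::sqrt(energy / frame_size);
        is_speech.push_back(rms > rms_threshold);
    }
    return is_speech;
}

// Keep only the frames classified as speech, dropping silence before
// handing the samples to whisper_full().
std::vector<float> trim_silence(const std::vector<float>& pcm,
                                size_t frame_size = 160,
                                float rms_threshold = 0.01f) {
    const auto mask = naive_vad(pcm, frame_size, rms_threshold);
    std::vector<float> out;
    for (size_t f = 0; f < mask.size(); ++f) {
        if (mask[f]) {
            out.insert(out.end(),
                       pcm.begin() + f * frame_size,
                       pcm.begin() + (f + 1) * frame_size);
        }
    }
    return out;
}
```

Hard cuts at frame boundaries like this can clip word onsets; real VADs add hangover (keeping a few frames of padding around detected speech) for that reason.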

@sindresorhus
Contributor Author

Repetition on silences is a big problem with Whisper in general; large v3 just made it even worse. Building it into Whisper.cpp would improve the general Whisper quality for all consumers, instead of every consumer having to implement a custom solution. It could be opt-in. I think it's worth considering.

In my case, I haven't found any good solutions for it that are not Python-based. I need it to be C++/C/Swift.

@ggerganov
Member

We can add a naive VAD as an optional pre-processing step, but I'm doubtful that it will help much, because the samples that I see failing with v3 do not contain silences.

Here are some strategies that I've observed to reduce repetition and hallucinations:

  • Use 5 beams
  • Increase the entropy threshold from the default 2.4 to, for example, 2.8. A higher threshold rejects repetitive text and falls back to sampling with a higher temperature
  • Reduce the maximum context size (--max-context). By default it is 224. Setting it to 64 or 32 can reduce the repetitions significantly. Setting it to 0 will most likely eliminate all repetitions, but transcription quality can suffer because the context from the previous transcript is lost
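The strategies above map onto flags of the `main` example's CLI. The flag names below are as found in whisper.cpp builds around this time; verify them against `./main --help` for your version:

```shell
# Beam search with 5 beams, raised entropy threshold, reduced context
./main -m models/ggml-large-v3.bin -f audio.wav \
    --beam-size 5 \
    --entropy-thold 2.8 \
    --max-context 64
```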

@bobqianic
Collaborator

> Repetition on silences is a big problem with Whisper in general

I think it's possible for someone to fine-tune the model on silent audio. Even the largest Whisper model, at 1.5 B parameters, is relatively small by the standards of today's LLMs :)

@bobqianic bobqianic added the question Further information is requested label Nov 18, 2023
@dubefab

dubefab commented Nov 20, 2023

> We can add a naive VAD as an optional pre-processing step, but I'm doubtful that it will help much, because the samples that I see failing with v3 do not contain silences.
>
> Here are some strategies that I've observed to reduce repetition and hallucinations:
>
>   • Use 5 beams
>   • Increase the entropy threshold from the default 2.4 to, for example, 2.8. A higher threshold rejects repetitive text and falls back to sampling with a higher temperature
>   • Reduce the maximum context size (--max-context). By default it is 224. Setting it to 64 or 32 can reduce the repetitions significantly. Setting it to 0 will most likely eliminate all repetitions, but transcription quality can suffer because the context from the previous transcript is lost

I tried this with large-v2 and it made it even better!

@jxy
Contributor

jxy commented Nov 25, 2023

There is the no-speech token, which whisper.cpp currently ignores:

https://github.com/ggerganov/whisper.cpp/blob/447d49530c9af41fe24f2ae510f452903dba330d/whisper.cpp#L4592

Actually implementing a no-speech threshold similar to openai/whisper might help.
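For illustration, the gating rule that openai/whisper applies per segment could be sketched as follows. The function name is hypothetical; the default thresholds 0.6 and -1.0 correspond to openai/whisper's `no_speech_threshold` and `logprob_threshold` defaults:

```cpp
#include <cassert>

// Sketch of openai/whisper-style no-speech gating (illustrative names).
// A segment is skipped only when the <|nospeech|> token probability is high
// AND the decode is also low-confidence overall (low average log-probability).
bool should_skip_segment(float no_speech_prob, float avg_logprob,
                         float no_speech_threshold = 0.6f,
                         float logprob_threshold = -1.0f) {
    return no_speech_prob > no_speech_threshold &&
           avg_logprob   < logprob_threshold;
}
```

The AND condition matters: a confident decode is kept even when the no-speech probability is high, which is what keeps the gate from dropping real speech that merely resembles silence to the encoder.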

@ex3ndr

ex3ndr commented Dec 3, 2023

I am trying to work around this problem, and VAD is not useful: if there is even a small silence interval, the model will emit something. whisper.cpp ignores the "no speech" token, which is crucial here, and it seems impossible to make this work without it.

@ex3ndr

ex3ndr commented Dec 3, 2023

I have opened a PR to return the nosp token: #1588

@itsthisjustin

> Reduce the maximum context size (--max-context). By default it is 224. Setting it to 64 or 32 can reduce the repetitions significantly. Setting it to 0 will most likely eliminate all repetitions, but transcription quality can suffer because the context from the previous transcript is lost

I don't see this as a param I can use in the Swift Package. What am I missing?

@aiyinyuedejustin

Same thing here...
[screenshot attached]

8 participants