Quality and speed concerns/test for Turkish language #621

Open
emircanerkul opened this issue Jan 17, 2025 · 1 comment
Comments

@emircanerkul

Hi, I'm testing this model with Turkish. I used one of the scripts shown in the docs, modified slightly to run on the GPU, but the output wasn't good, and it got worse with each run.

from transformers import AutoProcessor, BarkModel
import gradio as gr
import torch
import pandas as pd

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Read voice presets from CSV
df = pd.read_csv('data.csv', header=None)
voice_presets = {f"{row[0]} ({row[2]})": row[1] for _, row in df.iterrows()}

def bark(text, voice_preset):
    # Process text with padding and attention mask
    inputs = processor(
        text, voice_preset
    )

    # Move inputs to the selected device (GPU if available, otherwise CPU)
    inputs = {
        k: v.to(device) if hasattr(v, 'to') else v
        for k, v in inputs.items()
    }

    # Get attention mask from the moved inputs
    # attention_mask = inputs["attention_mask"]

    audio_array = model.generate(
        input_ids=inputs["input_ids"],
        pad_token_id=processor.tokenizer.pad_token_id
    )
    audio_array = audio_array.cpu().numpy().squeeze()
    sample_rate = model.generation_config.sample_rate
    return (sample_rate, audio_array)

interface = gr.Interface(
    fn=bark,
    inputs=[
        gr.Textbox(label="Text to speak", placeholder="Enter text here..."),
        gr.Dropdown(
            choices=list(voice_presets.items()),
            value=list(voice_presets.items())[0][1],  # Set first value as default
            label="Voice"
        )
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="Bark Text-to-Speech",
    description="Generate speech from text using different voices",
)

if __name__ == "__main__":
    interface.launch()

Then I cloned the Hugging Face Space. The results were better but still far from good for Turkish, and generation became much slower, taking around 1 minute.

What do you suggest?

@emircanerkul
Author

Also, from time to time, it:

  • changes the speaker
  • adds unrelated words at the end of the audio
  • inserts long silences (sometimes as long as 10 s)

With this version, Turkish is not usable.
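
One thing that may be worth checking in the script above: its call to `model.generate` passes only `input_ids`, so anything else the processor returns is not forwarded. As far as I know, the Bark processor returns the selected voice under a `history_prompt` key, which would mean the chosen preset never reaches the model; that could be one reason the speaker changes between runs. The sketch below forwards all processor outputs instead. `synthesize` and `move_to_device` are hypothetical helper names, and `processor` and `model` stand for the objects created in the script. This is a sketch under those assumptions, not a confirmed fix for the Turkish quality or speed problems.

```python
def move_to_device(inputs, device):
    """Move every value that has a .to() method (tensors, nested batches) to `device`."""
    return {
        key: value.to(device) if hasattr(value, "to") else value
        for key, value in inputs.items()
    }


def synthesize(processor, model, text, voice_preset, device="cpu"):
    """Hypothetical helper: generate audio while forwarding all processor outputs."""
    # Passing voice_preset by keyword makes the intent explicit.
    inputs = processor(text, voice_preset=voice_preset)
    inputs = move_to_device(inputs, device)

    # Forward everything the processor returned, not only input_ids, so that
    # attention_mask and history_prompt (if present) also reach generate().
    audio = model.generate(**inputs)

    sample_rate = model.generation_config.sample_rate
    return sample_rate, audio.cpu().numpy().squeeze()
```

In the Gradio callback, this would replace the body of `bark` with `return synthesize(processor, model, text, voice_preset, device)`. If the speaker still drifts with the preset forwarded, the cause is likely elsewhere.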
