yrom/finetune-index-tts

IndexTTS Fine-tuning Demo

Chinese version (中文说明)

This project demonstrates how to fine-tune IndexTTS so that it can synthesize speech from text containing additional special tags (such as <GIGGLES>), enabling non-textual elements like laughter in the output.

Goals

  • Show how to fine-tune IndexTTS's text tokenizer (BPE) and its autoregressive (AR) part (GPT2).
  • Support additional special tags like <GIGGLES> in the input text to generate laughter.
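
To make the second goal concrete, here is a minimal sketch of what "supporting a special tag" means at the text level: the tag is kept as one unit with its own token id instead of being broken into ordinary subword pieces. This is not the project's tokenizer code. `BASE_VOCAB_SIZE`, `tag_ids`, and `split_on_tags` are hypothetical names, the vocabulary size is a made-up number, and the case-insensitive matching is an assumption (the example sentences in this README write the tag in lowercase).

```python
import re

# Tags named in this README; extend the list for other non-verbal sounds.
SPECIAL_TAGS = ["<GIGGLES>"]

# Hypothetical size of the original vocabulary (assumption, not the real value).
BASE_VOCAB_SIZE = 12000


def tag_ids(tags, base_vocab_size):
    """Assign each new tag an id appended after the existing vocabulary."""
    return {tag: base_vocab_size + i for i, tag in enumerate(tags)}


def split_on_tags(text, tags):
    """Split text into ("text", ...) and ("tag", ...) segments.

    Tags are matched case-insensitively (assumption) and normalized to
    uppercase so that each one maps to exactly one id.
    """
    pattern = "(" + "|".join(re.escape(t) for t in tags) + ")"
    segments = []
    for part in re.split(pattern, text, flags=re.IGNORECASE):
        if not part.strip():
            continue  # skip empty or whitespace-only pieces
        if part.upper() in tags:
            segments.append(("tag", part.upper()))
        else:
            segments.append(("text", part.strip()))
    return segments


ids = tag_ids(SPECIAL_TAGS, BASE_VOCAB_SIZE)
segments = split_on_tags(
    "Seriously? <giggles> That's the cutest thing I've ever heard!", SPECIAL_TAGS
)
```

In this sketch only the "text" segments would go through the regular subword tokenizer, while each "tag" segment is replaced directly by its id from `ids`.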

Fine-tuning Dataset

🤗 MrDragonFox/Elise (Modelscope mirror)

Fine-tuning Experiment Results Example

| Reference Audio | Text | Synthesized Speech |
| --- | --- | --- |
| Female-1 | Seriously? <giggles> That's the cutest thing I've ever heard! | Synthesized Speech |
| Female-1 | 真的吗? <giggles> 这也太可爱了吧! | Synthesized Speech |
| Male-1 | Wha—? Cute? <giggles> You think I'm cute?! Well, uh, thanks, I guess? | Synthesized Speech |
| Male-1 | 哎呀! 忘了他还在那等我们呢!<giggles> 我们两个动作得快点了! | Synthesized Speech |

IndexTTS Architecture Overview

flowchart TD
    D("Reference Transcript") -->BPE[[**BPE**]] --> T(Text Token IDs)
    A("Reference Audio") --> M(Mel-Spectrogram) --> VAE[[*DiscreteVAE*]]--> B(Mel-Spec Code Ids)
    A -->CE[[*Conformer Encoder*]] --> Pe[[*Perceiver Resampler*]] --> CA(Audio Context Vector) -->|Conditioning| C
    B --> C
    T --> C[[**GPT2**]]
    C --> L("Latent Speech Representation")
    L --> V[["*BigVGAN*
    (Generator)"]]
    A --> SP[[*ECAPA-TDNN*]]--> S(Speaker Embedding)
    S --> V
    V -->|Synthesis| PCM("Waveform (PCM)") --> W("Synthesized Speech")
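
One way to read the flowchart above is as a dependency graph. The sketch below transcribes its edges into a Python dictionary and uses the standard-library `graphlib` module (Python 3.9+) to derive a valid execution order and to list which nodes sit downstream of the two modules this project fine-tunes (BPE and GPT2). It only models the diagram; the helper name `upstream` and the variable names are illustrative, and nothing here calls IndexTTS itself.

```python
from graphlib import TopologicalSorter

# node -> nodes it depends on, transcribed from the flowchart edges.
DEPENDS_ON = {
    "BPE": ["Reference Transcript"],
    "Text Token IDs": ["BPE"],
    "Mel-Spectrogram": ["Reference Audio"],
    "DiscreteVAE": ["Mel-Spectrogram"],
    "Mel-Spec Code Ids": ["DiscreteVAE"],
    "Conformer Encoder": ["Reference Audio"],
    "Perceiver Resampler": ["Conformer Encoder"],
    "Audio Context Vector": ["Perceiver Resampler"],
    "GPT2": ["Text Token IDs", "Mel-Spec Code Ids", "Audio Context Vector"],
    "Latent Speech Representation": ["GPT2"],
    "ECAPA-TDNN": ["Reference Audio"],
    "Speaker Embedding": ["ECAPA-TDNN"],
    "BigVGAN": ["Latent Speech Representation", "Speaker Embedding"],
    "Waveform (PCM)": ["BigVGAN"],
    "Synthesized Speech": ["Waveform (PCM)"],
}

# Modules this project fine-tunes, per the section that follows.
FINE_TUNED = {"BPE", "GPT2"}


def upstream(node):
    """Return every node that `node` depends on, directly or indirectly."""
    seen, stack = set(), [node]
    while stack:
        for parent in DEPENDS_ON.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# A valid order in which the stages can run (dependencies come first).
order = list(TopologicalSorter(DEPENDS_ON).static_order())

# Nodes whose output can change once BPE and GPT2 are fine-tuned.
affected = [n for n in order if n in FINE_TUNED or upstream(n) & FINE_TUNED]
```

Under this reading, the speaker-embedding branch (ECAPA-TDNN) and the audio-conditioning branch (Conformer Encoder, Perceiver Resampler) are not downstream of either fine-tuned module, while everything from GPT2 to the final waveform is.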

Modules Fine-tuned in This Project

  • BPE: The tokenizer is actually a sentencepiece model. This project shows how to add new special tags such as <GIGGLES> to the text tokenizer. See the preprocess_mel_dataset.ipynb notebook for details (it can be opened in Colab).
  • GPT2: The autoregressive part of the model. It is fine-tuned with LoRA using the 🤗 peft library so that it can generate speech latents for text containing special tags. See the fine_tune_indextts.ipynb notebook for details (it can be opened in Colab).
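
For readers new to LoRA, the following is a minimal pure-Python sketch of the idea behind it, not the notebook's code and not how peft is implemented internally. A frozen weight matrix `W` is left untouched, and a low-rank update `A @ B` scaled by `alpha / rank` is trained instead. The dimensions, the seed, and the helper names `matmul` and `lora_forward` are illustrative assumptions. In peft, the rank and scaling correspond to the `r` and `lora_alpha` parameters of `LoraConfig`.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha):
    """Compute x @ W + (alpha / rank) * (x @ A @ B).

    W is the frozen pretrained weight; only A and B would be trained.
    """
    rank = len(B)  # B has shape (rank, d_out)
    scale = alpha / rank
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    return [
        [b + scale * u for b, u in zip(b_row, u_row)]
        for b_row, u_row in zip(base, update)
    ]


random.seed(0)
d_in, d_out, rank, alpha = 8, 8, 2, 4  # toy sizes, chosen for illustration

W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]    # frozen
A = [[random.gauss(0, 0.02) for _ in range(rank)] for _ in range(d_in)]  # trainable
B = [[0.0] * d_out for _ in range(rank)]                                 # trainable, zero-initialized
x = [[random.gauss(0, 1) for _ in range(d_in)]]

frozen_params = d_in * d_out               # parameters in W
trainable_params = rank * (d_in + d_out)   # parameters in A and B
```

Because `B` starts at zero, the adapted layer initially produces exactly the same output as the pretrained one, and training only moves it away from that starting point through the much smaller `A` and `B` matrices.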

Disclaimer

The reference audio files and the datasets used in this project are licensed under CC BY-NC-SA 4.0. They are used only for the research and demonstration purposes of this project and are not intended for any commercial use. The synthesized audio files generated by this project are likewise not intended for commercial use.

License

This project is licensed under the MIT License.