IndexTTS Fine-tuning Demo

This project demonstrates how to fine-tune IndexTTS to generate speech from text that contains additional special tags (such as <GIGGLES>), enabling the synthesis of non-textual elements like laughter.

Goals

Show how to fine-tune IndexTTS's text tokenizer (BPE) and its autoregressive (AR) component (GPT2).
Support additional special tags like <GIGGLES> in the input text to generate laughter.
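To make the tokenizer goal concrete, here is a minimal, standard-library-only sketch of what "adding a special tag" means: the tag gets its own new ID appended to the vocabulary and is never split into smaller pieces. This is not the sentencepiece API and not the code from the notebooks. The toy vocabulary, the `extend_vocab` and `encode` helpers, and the case-insensitive tag matching are all assumptions made for illustration.

```python
import re

# Hypothetical toy vocabulary; a real BPE model has thousands of pieces.
BASE_VOCAB = {"<unk>": 0, "▁hello": 1, "▁world": 2}
SPECIAL_TAGS = ["<GIGGLES>"]  # the tag named in this README


def extend_vocab(base_vocab, tags):
    """Return a copy of the vocab with each new tag appended as the next ID."""
    vocab = dict(base_vocab)
    for tag in tags:
        if tag not in vocab:
            vocab[tag] = len(vocab)  # assumes IDs are contiguous from 0
    return vocab


def encode(text, vocab, tags):
    """Split on tags first so each tag maps to exactly one ID."""
    pattern = "(" + "|".join(re.escape(t) for t in tags) + ")"
    ids = []
    # Assumption: tags are matched case-insensitively and stored upper-case.
    for chunk in re.split(pattern, text, flags=re.IGNORECASE):
        if chunk.upper() in tags:
            ids.append(vocab[chunk.upper()])
        else:
            # Stand-in for real subword segmentation: one piece per word.
            for word in chunk.split():
                ids.append(vocab.get("▁" + word.lower(), vocab["<unk>"]))
    return ids


vocab = extend_vocab(BASE_VOCAB, SPECIAL_TAGS)
ids = encode("hello <giggles> world", vocab, SPECIAL_TAGS)
print(ids)  # [1, 3, 2]
```

Because the new IDs lie outside the original vocabulary, the model's text embedding table has to grow by the same number of rows, which is one reason the tokenizer and GPT2 are fine-tuned together.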

Fine-tuning Dataset

🤗 MrDragonFox/Elise (Modelscope mirror)

Fine-tuning Experiment Results Example

Reference Audio	Text	Synthesized Speech
Female-1	Seriously? <giggles> That's the cutest thing I've ever heard!	Synthesized Speech
Female-1	真的吗？ <giggles> 这也太可爱了吧！	Synthesized Speech
Male-1	Wha—? Cute? <giggles> You think I'm cute?! Well, uh, thanks, I guess?	Synthesized Speech
Male-1	哎呀! 忘了他还在那等我们呢！<giggles> 我们两个动作得快点了！	Synthesized Speech

IndexTTS Architecture Overview

flowchart TD
    D("Reference Transcript") -->BPE[[**BPE**]] --> T(Text Token IDs)
    A("Reference Audio") --> M(Mel-Spectrogram) --> VAE[[*DiscreteVAE*]]--> B(Mel-Spec Code Ids)
    A -->CE[[*Conformer Encoder*]] --> Pe[[*Perceiver Resampler*]] --> CA(Audio Context Vector) -->|Conditioning| C
    B --> C
    T --> C[[**GPT2**]]
    C --> L("Latent Speech Representation")
    L --> V[["*BigVGAN*
    (Generator)"]]
    A --> SP[[*ECAPA-TDNN*]]--> S(Speaker Embedding)
    S --> V
    V -->|Synthesis| PCM("Waveform (PCM)") --> W("Synthesized Speech")

Modules Fine-tuned in This Project

BPE: The text tokenizer, which is actually a sentencepiece model. This project shows how to add new special tags such as <GIGGLES> to it. See the preprocess_mel_dataset.ipynb notebook for details.
GPT2: The autoregressive part of the model. It is fine-tuned with LoRA using the 🤗 peft library so that it can generate speech latents for text containing special tags. See the fine_tune_indextts.ipynb notebook for details.
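For readers new to LoRA, the idea behind the GPT2 fine-tuning step can be sketched in plain Python: the pretrained weight W stays frozen, and only a low-rank update B·A, scaled by alpha / r, is trained. This is an illustration of the technique, not the peft API and not the notebook's code. The matrix sizes, the rank, and the `matmul` and `lora_weight` helpers are assumptions chosen for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * (B @ A), where r is the LoRA rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / r
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d_out, d_in, rank, alpha = 8, 8, 2, 4.0  # hypothetical sizes
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero init

# With B initialised to zero, the adapted layer starts out identical to W.
W_eff = lora_weight(W, A, B, alpha)

full_params = d_out * d_in           # 64 weights if W itself were trained
lora_params = rank * (d_in + d_out)  # 32 weights in A and B
print(full_params, lora_params)
```

The saving grows with layer size: for a square layer of width d, full fine-tuning trains d² weights while LoRA trains 2·r·d, so a small rank keeps the number of trainable parameters low.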

Disclaimer

The reference audio files and the datasets used in this project are licensed under CC BY-NC-SA 4.0. They are used solely for the research and demonstration purposes of this project and are not intended for any commercial use. The synthesized audio files generated by this project are likewise not intended for commercial use.

License

This project is licensed under the MIT License.

yrom/finetune-index-tts
