This project demonstrates fine-tuning IndexTTS to generate speech from text containing additional special tags (such as `<GIGGLES>`), enabling the synthesis of non-textual elements like laughter.
- Shows how to fine-tune IndexTTS's text tokenizer (BPE) and autoregressive (AR) part (GPT2).
- Supports additional special tags like `<GIGGLES>` in text to generate laughter.
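
To illustrate what "supporting a special tag" means at the text level, here is a minimal sketch, not the project's actual code. It assumes tags are matched case-insensitively, that each tag maps to a single ID appended after the base vocabulary, and that everything else goes to an ordinary subword encoder. `SPECIAL_TAGS`, `BASE_VOCAB_SIZE`, `split_on_tags`, `encode`, and `encode_plain` are hypothetical names chosen for this example.

```python
import re

# Hypothetical tag inventory; the project itself only names <GIGGLES>.
SPECIAL_TAGS = ["<GIGGLES>"]

# Assumption: a toy base vocabulary size. New tags are appended after it,
# so existing token IDs keep their meaning.
BASE_VOCAB_SIZE = 1000
TAG_TO_ID = {tag: BASE_VOCAB_SIZE + i for i, tag in enumerate(SPECIAL_TAGS)}

# The capturing group makes re.split keep the matched tags in its output.
_TAG_PATTERN = re.compile(
    "(" + "|".join(re.escape(tag) for tag in SPECIAL_TAGS) + ")",
    re.IGNORECASE,
)


def split_on_tags(text):
    """Split text into plain-text chunks and upper-cased special tags."""
    pieces = []
    for chunk in _TAG_PATTERN.split(text):
        if not chunk.strip():
            continue  # drop empty or whitespace-only chunks
        upper = chunk.upper()
        pieces.append(upper if upper in TAG_TO_ID else chunk.strip())
    return pieces


def encode(text, encode_plain):
    """Encode text, giving each special tag exactly one token ID.

    encode_plain is a stand-in for the real subword encoder: any callable
    that maps a string to a list of integer IDs.
    """
    ids = []
    for piece in split_on_tags(text):
        if piece in TAG_TO_ID:
            ids.append(TAG_TO_ID[piece])
        else:
            ids.extend(encode_plain(piece))
    return ids


if __name__ == "__main__":
    # Toy encoder for demonstration only: one ID per whitespace-separated word.
    toy_encoder = lambda s: [len(word) for word in s.split()]
    print(split_on_tags("Seriously? <giggles> That's the cutest thing I've ever heard!"))
    print(encode("Hi <giggles> there", toy_encoder))
```

The point of keeping a tag atomic is that the autoregressive model then sees one dedicated token it can learn to associate with laughter, rather than a sequence of punctuation and letter pieces.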
🤗 MrDragonFox/Elise (Modelscope mirror)
Reference Audio | Text | Synthesized Speech |
---|---|---|
Female-1 | Seriously? <giggles> That's the cutest thing I've ever heard! | Synthesized Speech |
Female-1 | 真的吗? <giggles> 这也太可爱了吧! (English: "Really? <giggles> That's just too cute!") | Synthesized Speech |
Male-1 | Wha—? Cute? <giggles> You think I'm cute?! Well, uh, thanks, I guess? | Synthesized Speech |
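Male-1 | 哎呀! 忘了他还在那等我们呢!<giggles> 我们两个动作得快点了! (English: "Oops! I forgot he's still waiting for us there! <giggles> The two of us need to hurry up!") | Synthesized Speech |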
```mermaid
flowchart TD
D("Reference Transcript") -->BPE[[**BPE**]] --> T(Text Token IDs)
A("Reference Audio") --> M(Mel-Spectrogram) --> VAE[[*DiscreteVAE*]]--> B(Mel-Spec Code Ids)
A -->CE[[*Conformer Encoder*]] --> Pe[[*Perceiver Resampler*]] --> CA(Audio Context Vector) -->|Conditioning| C
B --> C
T --> C[[**GPT2**]]
C --> L("Latent Speech Representation")
L --> V[["*BigVGAN*
(Generator)"]]
A --> SP[[*ECAPA-TDNN*]]--> S(Speaker Embedding)
S --> V
V -->|Synthesis| PCM("Waveform (PCM)") --> W("Synthesized Speech")
```
- BPE: Actually `sentencepiece`. This project shows how to add new special tags such as `<GIGGLES>` to the text tokenizer. See the preprocess_mel_dataset.ipynb notebook for details.
- GPT2: The autoregressive model part, fine-tuned with `LoRA` using the 🤗 peft library so that it can generate speech latents for text containing special tags. See the fine_tune_indextts.ipynb notebook for details.
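
For readers unfamiliar with LoRA, the general formulation is sketched below. This is the standard method description, not this project's specific configuration; the rank, scaling factor, and target modules used in the notebook are not stated here and are not assumed. A frozen pretrained weight $W_0$ is left untouched, and only two small matrices $A$ and $B$ are trained:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\quad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

Because $r$ is small, the number of trainable parameters is $r(d + k)$ per adapted layer instead of $dk$, which is what makes this kind of fine-tuning affordable.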
The reference audio files and the datasets used in this project are licensed under CC BY-NC-SA 4.0. They are used only for the research and demonstration purposes of this project and are not intended for any commercial use. The synthesized audio files generated by this project are likewise not intended for commercial use.
This project is licensed under the MIT License.