Voice anonymization question #169


Open
nilslacroix opened this issue Apr 21, 2025 · 2 comments

Comments

@nilslacroix

I found these code pieces regarding voice anonymization:

```python
if anonymization_only:
    chunk_ar_cond = self.ar_length_regulator(chunk[None])[0]
    chunk_ar_out = self.ar.generate(chunk_ar_cond, torch.zeros([1, 0]).long().to(device),
                                    compiled_decode_fn=self.compiled_decode_fn,
                                    top_p=top_p, temperature=temperature,
                                    repetition_penalty=repetition_penalty)
```

```python
vc_mel = self.cfm.inference(
    cat_condition,
    torch.LongTensor([original_len]).to(device),
    target_mel, target_style, diffusion_steps,
    inference_cfg_rate=[intelligebility_cfg_rate, similarity_cfg_rate],
    random_voice=anonymization_only,
)
```

Do you mind elaborating on how exactly this works, or what your thought process was compared to normal voice conversion? From my understanding, you just generate a random tensor in the first code piece and pass it as a kind of random voice embedding to the inference? Would you mind going into detail about how this works, whether it is a truly random voice, and how you came to that conclusion?

Would be really helpful <3

@Plachtaa
Owner

Voice anonymization is achieved by not conditioning generation on any timbre prompt. At first I expected the generated timbre to be random, but it turns out that instead of a random voice, anonymization turns all source speeches into the same voice, which may be some kind of "average voice" of the training set. This is also a good sign, indicating that the source speaker identity is completely removed by the speech tokenizer.
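
For illustration only, a minimal sketch of the contrast described above, assuming placeholder names (`run_vc`, `extract_timbre_prompt`, and `generate` are not the project's actual API): the only difference between normal conversion and anonymization is whether a timbre prompt is supplied at all.

```python
import torch

def run_vc(model, content_tokens: torch.Tensor,
           reference_wav: torch.Tensor | None = None,
           anonymization_only: bool = False) -> torch.Tensor:
    """Hypothetical wrapper contrasting normal conversion with anonymization.

    Normal VC: generation is conditioned on a timbre prompt taken from a
    reference utterance. Anonymization: a zero-length prompt is passed instead
    (mirroring torch.zeros([1, 0]) in the excerpt above), so nothing ties the
    output timbre to any speaker and the model falls back to its prior, which
    in practice comes out as one shared "average" training voice.
    """
    if anonymization_only or reference_wav is None:
        timbre_prompt = torch.zeros([1, 0], dtype=torch.long)  # empty prompt: no timbre conditioning
    else:
        timbre_prompt = model.extract_timbre_prompt(reference_wav)  # hypothetical helper
    return model.generate(content_tokens, timbre_prompt)
```

Since every anonymized utterance is decoded under the same (empty) conditioning, they all converge to the same voice rather than each getting a random one.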

@nilslacroix
Author

Interesting, thanks for the clarification. Maybe manipulating this "random average" voice could be a way to create artificial voices. I think that's kind of novel; I don't know of any model where you can create an artificial voice.
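
Purely as a sketch of the idea floated here (nothing like this exists in the repository; the function and its parameters are hypothetical), one way to "manipulate" the timbre conditioning would be to interpolate between, or perturb, speaker style embeddings:

```python
import torch

def make_artificial_style(style_a: torch.Tensor, style_b: torch.Tensor,
                          alpha: float = 0.5, noise_scale: float = 0.0) -> torch.Tensor:
    """Blend two speaker style embeddings and optionally perturb the result.

    style_a, style_b: timbre/style embeddings extracted from two real
    reference speakers. alpha interpolates between them and noise_scale adds
    a small random offset, yielding a timbre that belongs to neither speaker.
    """
    style = (1.0 - alpha) * style_a + alpha * style_b
    if noise_scale > 0.0:
        style = style + noise_scale * torch.randn_like(style)
    return style
```

Feeding such a blended embedding in place of `target_style` in the `cfm.inference` call quoted above would be one way to probe for voices that match no training speaker, though how natural the result would sound is an open question.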
