
Noisy outputs when running LJSpeech checkpoint on Tacotron mel spectrograms #4

Open
patrickvonplaten opened this issue Jun 20, 2022 · 10 comments

patrickvonplaten commented Jun 20, 2022

Hey @Rongjiehuang,

Thanks a lot for open-sourcing the checkpoint for the FastDiff vocoder for LJSpeech!

I played around with the code a bit, but I'm only getting quite noisy generations when decoding Tacotron 2 mel spectrograms with FastDiff's vocoder.

Here is the code to reproduce:

#!/usr/bin/env python3
import torch
from modules.FastDiff.module.FastDiff_model import FastDiff
from utils import audio
from modules.FastDiff.module.util import compute_hyperparams_given_schedule, sampling_given_noise_schedule

HOP_SIZE = 256  # for 22050 frequency

# download checkpoint to this folder
state_dict = torch.load("./checkpoints/LJSpeech/model_ckpt_steps_500000.ckpt")["state_dict"]["model"]
model = FastDiff().cuda()
model.load_state_dict(state_dict)

train_noise_schedule = noise_schedule = torch.linspace(1e-06, 0.01, 1000)
diffusion_hyperparams = compute_hyperparams_given_schedule(noise_schedule)

# load noise schedule for 200 sampling steps
#noise_schedule = torch.linspace(0.0001, 0.02, 200).cuda()
# load noise schedule for 4 sampling steps
noise_schedule = torch.FloatTensor([3.2176e-04, 2.5743e-03, 2.5376e-02, 7.0414e-01]).cuda()

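# generate a mel spectrogram from text with NVIDIA's pretrained Tacotron 2 (via torch.hub)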
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to("cuda").eval()

text = "Hello world, I missed you so much."
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mels, _, _ = tacotron2.infer(sequences, lengths)

audio_length = mels.shape[-1] * HOP_SIZE
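# run FastDiff's reverse diffusion with the 4-step noise schedule, conditioned on the Tacotron mel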
pred_wav = sampling_given_noise_schedule(
    model, (1, 1, audio_length), diffusion_hyperparams, noise_schedule,
    condition=mels, ddim=False, return_sequence=False)

pred_wav = pred_wav / pred_wav.abs().max()
audio.save_wav(pred_wav.view(-1).cpu().float().numpy(), './test.wav', 22050)

Listening to test.wav, one can identify the correct sentence, but the output is extremely noisy. Any idea what the reason could be? Are any of the hyper-parameters set incorrectly? Or does FastDiff only work with a certain type of mel spectrogram?

It would be very nice if you could take a quick look to check whether I have messed up some part of the code 😅

patrickvonplaten (Author) commented:

@Rongjiehuang it would be great if you could take a look :-)

patrickvonplaten changed the title from "Running the new checkpoint for inference" to "Noisy outputs when running LJSpeech checkpoint on Tacotron mel spectrograms" on Jun 20, 2022
Rongjiehuang (Owner) commented:

@patrickvonplaten Hi, the demo code has been updated in egs/, which I hope is helpful. Besides, I found that the noisy output is due to a mel-preprocessing mismatch between the acoustic model and the vocoder, so the mel output from this Tacotron 2 cannot be properly vocoded.

For text-to-speech synthesis, PortaSpeech (using this implementation) + FastDiff is more likely to produce reasonable results.

Rongjiehuang (Owner) commented:

To vocode spectrograms generated from Tacotron, we need to retrain the FastDiff model with spectrograms which are processed in the same way.
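
As a rough illustration of "processed in the same way": the mel-extraction settings of the acoustic model and of the vocoder have to agree exactly. Below is a minimal sketch of the kind of parameters involved; the names, values, and helper function are illustrative assumptions (roughly a common 22.05 kHz LJSpeech-style recipe), not the exact settings of FastDiff or Tacotron 2.

# Sketch only: parameter names/values are illustrative, not taken from this repo.
# A vocoder can only decode mel spectrograms extracted with the same settings
# (and normalization) it was trained on.
MEL_CONFIG = {
    "sampling_rate": 22050,
    "n_fft": 1024,
    "hop_size": 256,
    "win_size": 1024,
    "num_mels": 80,
    "fmin": 0,
    "fmax": 8000,             # some recipes use the Nyquist frequency instead
    "log_base": "e",          # natural log vs. log10/dB changes the value range
    "normalization": "none",  # e.g. clamp+log vs. dB scaling into a fixed range
}

def preprocessing_mismatches(acoustic_cfg: dict, vocoder_cfg: dict) -> list:
    """Return the keys on which two preprocessing configs disagree."""
    keys = set(acoustic_cfg) | set(vocoder_cfg)
    return sorted(k for k in keys if acoustic_cfg.get(k) != vocoder_cfg.get(k))

If this list is non-empty (it is typically the log/normalization convention that differs), the vocoder is conditioned on values outside the distribution it was trained on, which is consistent with the noisy output reported above.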

patrickvonplaten (Author) commented:

Hey @Rongjiehuang,

Thanks for answering so quickly! It seems the example only shows how to vocode mel spectrograms derived from the ground truth (i.e. the audio itself). Do you think it would be possible to also include an example of which text-to-mel model should be used, so users can see how well FastDiff performs on text-to-speech?

Rongjiehuang (Owner) commented:

Hi, the TTS example has been included; please refer to https://github.com/Rongjiehuang/FastDiff#inference-for-text-to-speech-synthesis


payymann commented Aug 9, 2022

> To vocode spectrograms generated from Tacotron, we need to retrain the FastDiff model with spectrograms which are processed in the same way.

@Rongjiehuang
Thank you for sharing your work. What do you mean by "processed in the same way"? Is there any configuration to set?

the6thsense commented:

I guess they mean that the pre-processing should be the same for both models. I am currently working on a similar task and am possibly lost in the pre-processing steps of two different model repositories. Is there any documentation of the preprocessing requirements?

Rongjiehuang (Owner) commented:

@payymann e.g., the normalization stage in preprocessing.
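
To make the normalization point concrete, here are two log-mel conventions commonly seen in TTS codebases. This is a sketch with typical default values, not the exact preprocessing of FastDiff; Convention A resembles the natural-log compression in NVIDIA's Tacotron 2 recipe, while Convention B resembles the dB-plus-normalization style used in several FastSpeech/Tacotron-style pipelines.

import torch

def log_mel_natural(mel_linear: torch.Tensor) -> torch.Tensor:
    # Convention A: natural-log dynamic range compression of the linear mel.
    return torch.log(torch.clamp(mel_linear, min=1e-5))

def log_mel_db_normalized(mel_linear: torch.Tensor,
                          min_level_db: float = -100.0,
                          ref_level_db: float = 20.0) -> torch.Tensor:
    # Convention B: dB scale followed by normalization into roughly [0, 1].
    mel_db = 20.0 * torch.log10(torch.clamp(mel_linear, min=1e-5)) - ref_level_db
    return torch.clamp((mel_db - min_level_db) / -min_level_db, 0.0, 1.0)

A vocoder trained on one convention receives values in a very different range when fed mels produced under the other, which by itself is enough to degrade the output into noise.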


payymann commented Aug 14, 2022

> @payymann e.g., the normalization stage in preprocessing.

Is it possible to train FastDiff with mel files (instead of wav files) as input using this code base? I mean, are the required code changes large, or can it be done with some tweaking?

Rongjiehuang (Owner) commented:

Sorry for the late reply; I have been working on this over the past few days. The LJSpeech checkpoint for neural vocoding of Tacotron 2 output and the corresponding script have been provided; please refer to https://github.com/Rongjiehuang/FastDiff/#using-tacotron. If you want to train FastDiff (Tacotron) yourself, use this config: modules/FastDiff/config/FastDiff_tacotron.yaml
