Parallel WaveGAN review

Parallel WaveGAN
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
June 2, 2020
A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
Ryuichi Yamamoto (LINE Corp.), Eunwoo Song (NAVER Corp.), Jae-Min Kim (NAVER Corp.)
ICASSP 2020 (2020.05.04 ~ 2020.05.08)

Abstract
• They propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a
generative adversarial network.
• In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution
spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the
realistic speech waveform.
• As their method does not require density distillation used in the conventional teacher-student framework, the entire
model can be easily trained.
• Furthermore, their model is able to generate high-fidelity speech even with its compact architecture.
• In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech
waveform 28.68 times faster than real-time on a single GPU environment.
• Perceptual listening test results verify that their proposed method achieves a mean opinion score of 4.16 within a
Transformer-based text-to-speech framework [1], which is comparable to the best distillation-based Parallel
WaveNet system.
[1] Li, Naihan, et al. "Neural speech synthesis with transformer network." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
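To put the reported speed in concrete terms, the arithmetic can be sketched as follows. The 24 kHz sample rate and the 28.68 real-time factor are taken from the abstract; the function name `synthesis_throughput` is hypothetical, and the values it returns are simple derived quantities, not additional measurements.

```python
def synthesis_throughput(sample_rate_hz: float, realtime_factor: float):
    """Return (samples generated per wall-clock second,
    wall-clock seconds needed per second of audio)."""
    if sample_rate_hz <= 0 or realtime_factor <= 0:
        raise ValueError("sample rate and real-time factor must be positive")
    samples_per_second = sample_rate_hz * realtime_factor
    seconds_per_audio_second = 1.0 / realtime_factor
    return samples_per_second, seconds_per_audio_second


# Values from the abstract: 24 kHz speech, 28.68 times faster than real time.
samples_per_second, latency = synthesis_throughput(24_000, 28.68)
print(f"{samples_per_second:.0f} samples/s, "
      f"{latency * 1000:.1f} ms per second of audio")
```

Under these figures, the model produces roughly 688,000 samples per wall-clock second, so one second of speech takes about 35 ms to generate.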

Contribution
• They propose, for the first time, a technique that trains a vocoder model with an adversarial loss and a
multi-resolution STFT loss together.
• Because the proposed vocoder does not use the teacher-student framework, training and synthesis time are
significantly reduced.

What is a vocoder?
• The conventional method [2] uses the MFB (Mel-frequency filter bank, i.e., Mel-spectrogram) outputs $M_1, M_2, \dots, M_n$ as
the input to a Seq2Seq-based model and obtains the output waveform through a vocoder.
• The Seq2Seq encoder takes all of the temporal information in its input into account.
• The decoder predicts $\gamma$ frames of MFB at each step, which reduces the number of decoder steps to $n/\gamma$, where $\gamma$ is the
reduction factor.
• Post-processing into the linear-scale spectrum $F$ is performed by a CBHG (1D convolution bank, highway network,
bidirectional gated recurrent unit) module, which yields $F_1, F_2, \dots, F_n$.
• The vocoder is needed to convert $F$ into a waveform, expressed as $S'_1, S'_2, \dots, S'_n$.
• The method uses a conventional autoregressive vocoder, which predicts the current step from the previous output: once
$S'_1$ is obtained, it is used to predict $S'_2$, and so on up to $S'_n$.
[2] Wang, Yuxuan, et al., “Tacotron: Toward end-to-end speech synthesis.”, arXiv preprint arXiv:1703.10135 (2017).
[3] June-Woo Kim, Ho-Young Jung, and Minho Lee. "Vocoder-free End-to-End Voice Conversion with Transformer Network." arXiv preprint arXiv:2002.03808 (2020).
Fig. 1. Conventional TTS method using MFB and vocoder [3]
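The effect of the reduction factor on the number of decoder steps can be sketched with a small calculation. This is an illustration only: the helper name `decoder_steps` is hypothetical, and rounding up when the frame count is not a multiple of the reduction factor is an assumption, since the slide only states n/γ.

```python
def decoder_steps(n_frames: int, reduction_factor: int) -> int:
    """Number of decoder steps when each step emits `reduction_factor` frames.

    Assumption: a final partial group of frames still costs one full step.
    """
    if n_frames < 0 or reduction_factor < 1:
        raise ValueError("n_frames must be >= 0 and reduction_factor >= 1")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-n_frames // reduction_factor)


# Example: 100 MFB frames with an illustrative reduction factor of 5.
print(decoder_steps(100, 5))
```

With a reduction factor of 1 the decoder runs once per frame; larger factors shorten the sequential decoding loop proportionally.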

What is the Teacher-student Framework?
In general, more data and deeper neural networks usually give better performance.
• Ensemble methods are another example.
• However, the forward and backward passes then take a very long time.
Therefore, various methods for compressing a large, complex model into a small one have been studied.
The teacher-student framework is one of them.
• Machines teach machines.
• Train a good, large teacher network first.
• The teacher network then teaches the student network.

What is the Teacher-student Framework? (2)
• The teacher-student framework is also referred to as Knowledge Distillation; it allows the student network to perform
better than when it is trained on its own.
• Much research has gone into reducing model size in speech recognition, and the framework has also been applied to
noise removal for robust speech recognition.
• The teacher-student framework is also used in reinforcement learning approaches [4].
• In the natural language processing domain, TinyBERT [5] also applies Knowledge Distillation to accelerate
inference and reduce model size while maintaining accuracy.
• The knowledge of a large “teacher” BERT can be transferred well to a small “student” TinyBERT.
[4] Lisa Torrey and Matthew E. Taylor., “Teaching on a budget: Agents advising agents in reinforcement learning.”, In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2013.
[5] Jiao, Xiaoqi, et al. "Tinybert: Distilling bert for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).

There are two methods of Knowledge Distillation
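• The first is to transfer the class probabilities, i.e., the output of the softmax output layer [6].
• $L_{KD} = \mathrm{KL}\left(\mathrm{softmax}\left(\frac{f_T(x)}{\tau}\right),\ \mathrm{softmax}\left(\frac{f_S(x)}{\tau}\right)\right)$
• Here $\mathrm{KL}(\cdot)$ is the Kullback-Leibler divergence, $f_T$ and $f_S$ are the teacher and student networks, and $\tau$ is the
temperature; this loss is used together with the cross-entropy loss function.
[6] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.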
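The soft-target distillation loss above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in [6]: the function names, the example logits, and the temperature value are all hypothetical, and a real system would compute this on batched tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student class probabilities."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


# Hypothetical logits for a three-class problem.
print(kd_loss([2.0, 1.0, 0.0], [1.5, 1.0, 0.5]))
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two distributions diverge.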

• The second method is to transfer the output of a hidden layer [7].
• Since the hidden-layer sizes of the teacher and student models may be different, a regressor function $r$ is used,
as shown in the following equation:
• $L_{HT} = \| f_T(x) - r(f_S(x)) \|_2^2$.
[7] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets, ” in Proc. Int. Conf. Learn. Representations, 2015.
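The hidden-layer (hint) loss can be sketched as follows. This is a minimal illustration: modelling the regressor r as a single linear layer is an assumption made here for brevity, and the function names, weights, and activations are hypothetical example values.

```python
def regressor(student_hidden, weights, bias):
    """Hypothetical linear regressor r: maps the student's hidden vector
    to the dimensionality of the teacher's hidden vector."""
    return [
        sum(w * h for w, h in zip(row, student_hidden)) + b
        for row, b in zip(weights, bias)
    ]


def hint_loss(teacher_hidden, student_hidden, weights, bias):
    """Squared L2 distance between the teacher's hidden output and the
    regressed student hidden output."""
    projected = regressor(student_hidden, weights, bias)
    return sum((t - p) ** 2 for t, p in zip(teacher_hidden, projected))


# Teacher hidden size 3, student hidden size 2 (illustrative values only).
teacher_hidden = [1.0, 2.0, 3.0]
student_hidden = [1.0, 1.0]
weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 2.0]]
bias = [1.0, 0.0, 0.0]
print(hint_loss(teacher_hidden, student_hidden, weights, bias))
```

In training, the regressor parameters would be learned jointly with the student so that the projected student features approach the teacher's.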

Raw waveform generation: Autoregressive (AR) vs. non-AR
Autoregressive models.
• Good: High-fidelity speech generation (e.g., WaveNet [8]).
• Bad: Generation is too slow.
Non-autoregressive models.
• Teacher-student-based methods (Parallel WaveNet [9], ClariNet [10]).
• Good: Real-time generation.
• Bad: Complicated two-stage training using probability density distillation.
[8] A. van den Oord et al., “WaveNet: A generative model for raw audio”, arXiv preprint arXiv:1609.03499, 2016.
[9] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis”, in Proc. ICML, 2018.
[10] W. Ping, et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech”, in Proc. ICLR, 2019.
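The speed difference between the two families comes from their dependency structure, which the toy sketch below tries to make visible. Everything here is a hypothetical stand-in: `toy_step` and `toy_sample` replace the neural networks, and the per-sample parallel function ignores the receptive field a real non-autoregressive WaveNet would have over neighbouring inputs.

```python
import math


def generate_autoregressive(conditioning, step_fn, initial=0.0):
    """Sequential generation: sample t needs sample t-1, so the loop
    cannot be parallelized across time."""
    samples = []
    previous = initial
    for feature in conditioning:
        previous = step_fn(previous, feature)
        samples.append(previous)
    return samples


def generate_parallel(conditioning, noise, sample_fn):
    """Non-autoregressive generation: every sample depends only on noise and
    conditioning, never on earlier outputs, so all can be computed at once."""
    return [sample_fn(z, feature) for z, feature in zip(noise, conditioning)]


# Hypothetical toy stand-ins for the trained networks.
def toy_step(previous, feature):
    return math.tanh(0.5 * previous + feature)


def toy_sample(z, feature):
    return math.tanh(z + feature)


conditioning = [0.1, 0.2, 0.3, 0.4]
noise = [0.0, 0.5, -0.5, 0.25]
ar_samples = generate_autoregressive(conditioning, toy_step)
parallel_samples = generate_parallel(conditioning, noise, toy_sample)
print(ar_samples)
print(parallel_samples)
```

In the autoregressive case a change to the first sample propagates to every later one; in the parallel case the outputs have no such chain, which is what allows real-time generation on a GPU.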

Their approach: GANs for waveform generation
Parallel WaveGAN (Parallel inference + WaveNet + GAN)
• Distillation-free: a distillation-free, fast waveform generation method that combines a multi-resolution STFT loss
with an adversarial loss.
• Fast: training and inference are 4.82 and 1.96 times faster, respectively, than the conventional distillation-based
Parallel WaveNet (i.e., ClariNet).
• High-quality: their model achieves a MOS of 4.16 (in Transformer-based TTS), which is competitive with the best
distillation-based ClariNet.
GAN-based methods can be good alternatives to distillation-based methods.
STFT: Short-time Fourier transform
MOS: Mean-opinion score

Parallel WaveGAN: WaveNet-based generator
Architecture
• Generator architecture is almost the same as WaveNet
Conditional waveform generation
• 80-dim mel-spectrogram as auxiliary features
Model comparison between WaveNet and theirs

STFT loss: Spectral convergence (SC) [11]
[11] Arık, Sercan Ö., Heewoo Jun, and Gregory Diamos. "Fast spectrogram inversion using multi-head convolutional neural networks." IEEE Signal Processing Letters 26.1 (2018): 94-98.
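The spectral convergence loss of [11] is usually written as the Frobenius norm of the difference between the reference and generated STFT magnitudes, divided by the Frobenius norm of the reference magnitude. A minimal sketch, assuming the magnitudes are already available as lists of frames (the function name, the `eps` guard, and the example values are hypothetical):

```python
import math


def spectral_convergence(mag_ref, mag_est, eps=1e-12):
    """|| |STFT(x)| - |STFT(x_hat)| ||_F / || |STFT(x)| ||_F.

    mag_ref, mag_est: lists of frames, each frame a list of bin magnitudes.
    eps is an assumed guard against division by zero for silent references.
    """
    diff_sq = 0.0
    ref_sq = 0.0
    for frame_ref, frame_est in zip(mag_ref, mag_est):
        for a, b in zip(frame_ref, frame_est):
            diff_sq += (a - b) ** 2
            ref_sq += a * a
    return math.sqrt(diff_sq) / max(math.sqrt(ref_sq), eps)


# Illustrative 2-frame, 2-bin magnitude spectrograms.
reference = [[3.0, 4.0], [0.0, 0.0]]
estimate = [[1.5, 2.0], [0.0, 0.0]]
print(spectral_convergence(reference, estimate))
```

Because of the normalization, the loss is dominated by large-magnitude spectral components, so it mainly penalizes errors in spectral peaks.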

STFT loss: Log-scale STFT magnitude loss [11]
[11] Arık, Sercan Ö., Heewoo Jun, and Gregory Diamos. "Fast spectrogram inversion using multi-head convolutional neural networks." IEEE Signal Processing Letters 26.1 (2018): 94-98.
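The log-scale STFT magnitude loss is usually written as the L1 distance between the log magnitudes of the reference and generated signals, averaged over all time-frequency bins. A minimal sketch, assuming precomputed magnitudes; the function name, the clamping constant `eps`, and the example values are hypothetical:

```python
import math


def log_stft_magnitude_loss(mag_ref, mag_est, eps=1e-7):
    """Mean absolute difference between log magnitudes.

    mag_ref, mag_est: lists of frames, each frame a list of bin magnitudes.
    Magnitudes are clamped to eps (an assumption) so that log(0) never occurs.
    """
    total = 0.0
    count = 0
    for frame_ref, frame_est in zip(mag_ref, mag_est):
        for a, b in zip(frame_ref, frame_est):
            total += abs(math.log(max(a, eps)) - math.log(max(b, eps)))
            count += 1
    if count == 0:
        raise ValueError("empty spectrogram")
    return total / count


# Illustrative 1-frame, 2-bin magnitude spectrograms.
print(log_stft_magnitude_loss([[1.0, math.e]], [[math.e, 1.0]]))
```

Working in the log domain makes the loss sensitive to relative errors, so low-energy components contribute as much as spectral peaks.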

Parallel WaveGAN: Training overview
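A minimal sketch of how the two loss families might be combined during training. Several things here are assumptions rather than details stated on the slides: the least-squares form of the adversarial terms, the weighting `lambda_adv`, averaging the STFT loss over resolutions, the toy (FFT size, hop, window length) triples, and all function names. The naive DFT is for readability only; a real implementation would use an FFT library.

```python
import cmath
import math


def stft_magnitude(signal, fft_size, hop, win_length):
    """Naive Hann-windowed STFT; returns a flat list of bin magnitudes."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_length)
              for n in range(win_length)]
    magnitudes = []
    for start in range(0, len(signal) - win_length + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(win_length)]
        for k in range(fft_size // 2 + 1):
            # Direct DFT of the (implicitly zero-padded) windowed frame.
            value = sum(v * cmath.exp(-2j * math.pi * k * n / fft_size)
                        for n, v in enumerate(frame))
            magnitudes.append(abs(value))
    return magnitudes


def stft_loss(reference, generated, fft_size, hop, win_length, eps=1e-7):
    """Spectral convergence plus mean log-magnitude L1 at one resolution."""
    ref = stft_magnitude(reference, fft_size, hop, win_length)
    gen = stft_magnitude(generated, fft_size, hop, win_length)
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, gen)))
    ref_norm = math.sqrt(sum(a * a for a in ref))
    log_l1 = sum(abs(math.log(max(a, eps)) - math.log(max(b, eps)))
                 for a, b in zip(ref, gen)) / len(ref)
    return diff_norm / max(ref_norm, eps) + log_l1


def multi_resolution_stft_loss(reference, generated, resolutions):
    """Average of the single-resolution STFT losses (assumed aggregation)."""
    return sum(stft_loss(reference, generated, *r)
               for r in resolutions) / len(resolutions)


def generator_loss(reference, generated, d_fake, resolutions, lambda_adv):
    """Auxiliary STFT loss plus weighted least-squares adversarial term (assumed form)."""
    adversarial = sum((1.0 - d) ** 2 for d in d_fake) / len(d_fake)
    return (multi_resolution_stft_loss(reference, generated, resolutions)
            + lambda_adv * adversarial)


def discriminator_loss(d_real, d_fake):
    """Least-squares discriminator loss (assumed form): real -> 1, fake -> 0."""
    real_term = sum((1.0 - d) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term


# Toy resolutions and signals; d_fake and d_real are made-up discriminator outputs.
resolutions = [(16, 4, 12), (32, 8, 24), (8, 2, 6)]
reference = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
generated = [0.5 * s for s in reference]
print(generator_loss(reference, generated, [0.3, 0.4], resolutions, lambda_adv=4.0))
print(discriminator_loss([0.9, 0.8], [0.3, 0.4]))
```

Using several STFT resolutions keeps the generator from overfitting to one particular time-frequency trade-off, while the adversarial term pushes the output toward the distribution of real waveforms.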

Experiments
1) Analysis/synthesis
2) Text-to-speech

Experimental conditions
Data & features
Vocoder model comparison
• Single Gaussian WaveNet
• ClariNet (single / three STFT losses)
• ClariNet-GAN (single / three STFT losses) [12]
• Parallel WaveGAN (single / three STFT losses)
Listening tests
• Mean-opinion score (MOS) listening test on quality and naturalness
• 18 native Japanese speakers / 20 random utterances for each model
[12] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation," in Proc. INTERSPEECH, 2019.

1) Analysis/synthesis: Effects of multi-resolution STFT loss
• Using the multi-resolution STFT loss substantially improved perceptual quality for both ClariNet and Parallel WaveGAN.

Training/inference time and model size comparison
All training was conducted on a server with two NVIDIA Tesla V100 GPUs.
All inference tests were conducted on a server with a single NVIDIA Tesla V100 GPU.

2) Text-to-speech: Perceptual quality evaluation
Their model achieved a MOS of 4.16, which is competitive with the best distillation-based ClariNet.

Conclusion
Goal
• Fast, high-quality and simple waveform generation for text-to-speech (TTS).
Proposed method
• Parallel WaveGAN, a distillation-free, fast waveform generation method that combines a multi-resolution STFT loss
with an adversarial loss.
Results
• Perceptual quality comparable to the best distillation-based method (MOS 4.16 in Transformer-based TTS), with
faster inference and training.
Take-home message: GAN-based methods can be good alternatives to distillation-based methods.

References
[1] Li, Naihan, et al. "Neural speech synthesis with transformer network." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019.
[2] Wang, Yuxuan, et al. "Tacotron: Toward end-to-end speech synthesis." arXiv preprint arXiv:1703.10135, 2017.
[3] Kim, June-Woo, Ho-Young Jung, and Minho Lee. "Vocoder-free End-to-End Voice Conversion with Transformer Network." arXiv preprint arXiv:2002.03808, 2020.
[4] Torrey, Lisa, and Matthew E. Taylor. "Teaching on a budget: Agents advising agents in reinforcement learning." International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2013.
[5] Jiao, Xiaoqi, et al. "TinyBERT: Distilling BERT for natural language understanding." arXiv preprint arXiv:1909.10351, 2019.
[6] Hinton, G. E., O. Vinyals, and J. Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531, 2015.
[7] Romero, A., N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. "FitNets: Hints for thin deep nets." Proc. Int. Conf. Learn. Representations, 2015.
[8] van den Oord, A., et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499, 2016.
[9] van den Oord, A., et al. "Parallel WaveNet: Fast high-fidelity speech synthesis." Proc. ICML, 2018.
[10] Ping, W., et al. "ClariNet: Parallel wave generation in end-to-end text-to-speech." Proc. ICLR, 2019.
[11] Arık, Sercan Ö., Heewoo Jun, and Gregory Diamos. "Fast spectrogram inversion using multi-head convolutional neural networks." IEEE Signal Processing Letters 26.1 (2018): 94-98.
[12] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation." Proc. INTERSPEECH, 2019.
