Parallel WaveGAN
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
June 2, 2020
A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
Ryuichi Yamamoto (LINE Corp.), Eunwoo Song (NAVER Corp.), Jae-Min Kim (NAVER Corp.)
ICASSP 2020 (2020.05.04 ~ 2020.05.08)
Abstract
• They propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a
generative adversarial network.
• In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution
spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the
realistic speech waveform.
• As their method does not require density distillation used in the conventional teacher-student framework, the entire
model can be easily trained.
• Furthermore, their model is able to generate high-fidelity speech even with its compact architecture.
• In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech
waveform 28.68 times faster than real-time on a single GPU environment.
• Perceptual listening test results verify that their proposed method achieves a mean opinion score of 4.16 within a
Transformer-based text-to-speech framework [1], which is comparable to the best distillation-based Parallel
WaveNet system.
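The speed figure in the abstract can be turned into concrete numbers. The sketch below assumes that "N times faster than real-time" means N seconds of audio are generated per second of wall-clock time; the function names are illustrative and not from the paper.

```python
def samples_per_second(sample_rate_hz: int, speedup: float) -> float:
    """Audio samples generated per wall-clock second at a given real-time speedup."""
    return sample_rate_hz * speedup


def synthesis_time(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of audio."""
    return audio_seconds / speedup


# Figures quoted in the abstract: 24 kHz speech, 28.68x faster than real-time.
rate = samples_per_second(24_000, 28.68)
t10 = synthesis_time(10.0, 28.68)
print(f"{rate:.0f} samples/s, {t10:.3f} s to synthesize 10 s of speech")
```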
[1] Li, Naihan, et al. "Neural speech synthesis with transformer network." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
Contribution
• They are the first to propose training a vocoder model with adversarial loss and multi-resolution STFT loss
together.
• By proposing a vocoder that does not use the teacher-student framework, they significantly reduce training and
synthesis time.
What is vocoder?
• The conventional method [2] uses the output of the MFB (Mel-frequency filter banks = Mel-spectrogram) $M_1, M_2, \dots, M_n$ as
the input to a Seq2Seq-based model and obtains the output through the vocoder.
• The encoder input in the Seq2Seq model considers all the temporal information.
• The decoder predicts $\gamma$ frames of MFB at each step, thereby reducing the number of decoder steps to $n/\gamma$, where $\gamma$ is the
reduction factor.
• Post-processing of the linear-scale spectrum $F$ is performed using a CBHG (1D convolution bank, highway network,
bidirectional gated recurrent unit) module, which results in $F_1, F_2, \dots, F_n$.
• The vocoder is essential to convert $F$ into a waveform expressed as $S'_1, S'_2, \dots, S'_n$.
• The method uses a conventional autoregressive vocoder, which predicts the current step based on the previous output. Once
$S'_1$ is obtained, it is used to predict $S'_2$, and so on until $S'_n$.
[2] Wang, Yuxuan, et al., “Tacotron: Toward end-to-end speech synthesis.”, arXiv preprint arXiv:1703.10135 (2017).
[3] June-Woo Kim, Ho-Young Jung, and Minho Lee. "Vocoder-free End-to-End Voice Conversion with Transformer Network." arXiv preprint arXiv:2002.03808 (2020).
Fig. 1. Conventional TTS method using MFB and vocoder [3]
What is Teacher-student Framework?
In general, more data and deeper neural networks usually yield better performance.
• Example: ensemble methods.
• But the forward and backward passes then take a very long time.
Therefore, various methods for shrinking a large, complex model into a small one have been studied.
The teacher-student framework is one of them.
• Machines teach machines.
• Train a good and large teacher network first.
• The teacher network then teaches the student network.
What is Teacher-student Framework? (2)
• The teacher-student framework is also called knowledge distillation; it lets the student network perform
better than when it is trained on its own.
• Much research has applied it to reducing model size in speech recognition, and it has also
been applied to noise removal for robust speech recognition.
• The teacher-student framework has also been used in reinforcement learning [4].
• In the natural language processing domain, TinyBERT [5] also applies knowledge distillation to accelerate
inference and reduce model size while maintaining accuracy.
• The knowledge of a large “teacher” BERT can be transferred well to a small “student” TinyBERT.
[4] Lisa Torrey and Matthew E. Taylor., “Teaching on a budget: Agents advising agents in reinforcement learning.”, In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2013.
[5] Jiao, Xiaoqi, et al. "Tinybert: Distilling bert for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).
What is Teacher-student Framework? (3)
There are two methods of Knowledge Distillation
• The first is to transfer the class probability value, which is the output of the Softmax output layer. [6]
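• $L_{KD} = \mathrm{KL}\left(\mathrm{softmax}\left(\frac{f_T(x)}{\tau}\right),\ \mathrm{softmax}\left(\frac{f_S(x)}{\tau}\right)\right)$
• Here $\mathrm{KL}(\cdot,\cdot)$ is the Kullback-Leibler divergence, $f_T(x)$ and $f_S(x)$ are the teacher and student outputs (logits), and $\tau$ is the softmax temperature. This loss is used together with the cross-entropy loss.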
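The distillation loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from [6]: the function names, the temperature value, and the toy logits are all assumptions.

```python
import numpy as np


def softmax(logits: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax over the last axis."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kd_loss(teacher_logits, student_logits, tau=2.0, eps=1e-12):
    """KL(softmax(f_T/tau) || softmax(f_S/tau)), averaged over the batch."""
    p = softmax(np.asarray(teacher_logits, dtype=float), tau)
    q = softmax(np.asarray(student_logits, dtype=float), tau)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())


# Toy logits (hypothetical values): batch of 2 examples, 3 classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[2.5, 1.5, 0.5], [0.5, 2.0, 0.3]])
loss_same = kd_loss(teacher, teacher)   # identical distributions
loss_diff = kd_loss(teacher, student)   # student deviates from teacher
print(loss_same, loss_diff)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge; a larger `tau` flattens both distributions before they are compared.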
[6] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
What is Teacher-student Framework? (4)
• The second method is to transfer the output value of the hidden layer [7]
• Since the hidden layers of the teacher and student models may differ in size, a regressor function $r$ is applied to the student output,
as shown in the following equation:
• $L_{HT} = \left\| f_T(x) - r(f_S(x)) \right\|_2^2$.
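A minimal sketch of this hidden-layer loss follows. It assumes the regressor $r$ is a plain linear map `h @ W + b`, which is a simplification chosen for illustration and not necessarily the regressor used in [7]; the layer widths and random features are hypothetical.

```python
import numpy as np


def hint_loss(teacher_hidden, student_hidden, W, b):
    """Squared L2 distance between f_T(x) and r(f_S(x)), with r(h) = h @ W + b."""
    regressed = student_hidden @ W + b       # map student width to teacher width
    diff = teacher_hidden - regressed
    return float(np.sum(diff ** 2, axis=-1).mean())


rng = np.random.default_rng(0)
student_hidden = rng.normal(size=(4, 8))     # batch of 4, student width 8
W_true = rng.normal(size=(8, 16))
teacher_hidden = student_hidden @ W_true     # toy teacher features, width 16

W0 = np.zeros((8, 16))
b0 = np.zeros(16)
loss_before = hint_loss(teacher_hidden, student_hidden, W0, b0)      # untrained regressor
loss_perfect = hint_loss(teacher_hidden, student_hidden, W_true, b0)  # ideal regressor
print(loss_before, loss_perfect)
```

In practice `W` and `b` would be trained jointly with the student, so that the student's hidden features become predictive of the teacher's.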
[7] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets, ” in Proc. Int. Conf. Learn. Representations, 2015.
Raw waveform generation: Autoregressive (AR) vs. non-AR
Autoregressive models.
• Good: High-fidelity speech generation (e.g., WaveNet [8]).
• Bad: Generation is too slow.
Non-autoregressive models.
• Teacher-student-based methods (Parallel WaveNet [9], ClariNet [10]).
• Good: Real-time generation.
• Bad: Complicated two-stage training using probability density distillation.
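The structural difference between the two families can be seen in a toy example. The sketch below uses a linear first-order recurrence purely as a stand-in: a real WaveNet is nonlinear and cannot be unrolled this way, which is why non-AR models need a separately trained noise-to-waveform generator. The function names and the coefficient `a` are assumptions.

```python
import numpy as np


def generate_autoregressive(noise: np.ndarray, a: float = 0.9) -> np.ndarray:
    """Toy AR model: each sample needs the previously generated sample."""
    out = np.zeros_like(noise)
    prev = 0.0
    for t in range(len(noise)):          # T strictly sequential steps
        prev = a * prev + noise[t]
        out[t] = prev
    return out


def generate_parallel(noise: np.ndarray, a: float = 0.9) -> np.ndarray:
    """Toy non-AR model: every sample is computed from the noise alone."""
    kernel = a ** np.arange(len(noise))  # fixed feed-forward filter
    return np.convolve(noise, kernel)[: len(noise)]


rng = np.random.default_rng(0)
z = rng.normal(size=256)
x_ar = generate_autoregressive(z)
x_par = generate_parallel(z)
print(np.allclose(x_ar, x_par))
```

The AR version must wait for sample t-1 before it can compute sample t, whereas the feed-forward version has no dependency on its own outputs and can be evaluated for all time steps at once on parallel hardware.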
[8] A. van den Oord et al., “WaveNet: A generative model for raw audio”, arXiv preprint arXiv:1609.03499, 2016.
[9] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis”, in Proc. ICML, 2018.
[10] W. Ping, et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech”, in Proc. ICLR, 2019.
Their approach: GANs for waveform generation
Parallel WaveGAN (Parallel inference + WaveNet + GAN)
• Distillation-free: a fast waveform generation method that needs no distillation, combining multi-resolution STFT loss and adversarial
loss.
• Fast: training and inference are 4.82 and 1.96 times faster, respectively, than the conventional Parallel WaveNet (i.e.,
ClariNet).
• High-quality: their model achieves a MOS of 4.16 (in Transformer-based TTS), which is competitive with the best
distillation-based ClariNet.
GAN-based methods can be good alternatives to distillation-based methods.
STFT: Short-time Fourier transform
MOS: Mean-opinion score
Parallel WaveGAN: WaveNet-based generator
Architecture
• Generator architecture is almost the same as
WaveNet
Conditional waveform generation
• 80-dim mel-spectrogram as auxiliary features
Model comparison between WaveNet and theirs
STFT loss: Spectral convergence (SC) [11]
[11] Arık, Sercan Ö., Heewoo Jun, and Gregory Diamos. "Fast spectrogram inversion using multi-head convolutional neural networks." IEEE Signal Processing Letters 26.1 (2018): 94-98.
STFT loss: Log-scale STFT magnitude loss [11]
Multi-resolution STFT loss
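A minimal NumPy sketch of the three pieces named on these slides: spectral convergence (Frobenius norm of the magnitude difference divided by the Frobenius norm of the reference magnitude), the log-scale STFT magnitude loss (mean absolute difference of log magnitudes), and their average over several STFT settings. The helper names, the Hann window, and the `RESOLUTIONS` triples of (FFT size, frame shift, window length) are assumptions for illustration, not the authors' configuration.

```python
import numpy as np


def stft_magnitude(x, fft_size, hop, win_length):
    """Magnitude STFT with a Hann window; one row per frame."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_length] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=-1))


def spectral_convergence(mag_ref, mag_gen):
    """Frobenius norm of the difference, relative to the reference magnitude."""
    return np.linalg.norm(mag_ref - mag_gen) / np.linalg.norm(mag_ref)


def log_magnitude_loss(mag_ref, mag_gen, eps=1e-7):
    """Mean absolute difference between log magnitudes."""
    return np.mean(np.abs(np.log(mag_ref + eps) - np.log(mag_gen + eps)))


def multi_resolution_stft_loss(x_ref, x_gen, resolutions):
    """Average of (SC + log-magnitude) losses over several STFT settings."""
    total = 0.0
    for fft_size, hop, win_length in resolutions:
        ref = stft_magnitude(x_ref, fft_size, hop, win_length)
        gen = stft_magnitude(x_gen, fft_size, hop, win_length)
        total += spectral_convergence(ref, gen) + log_magnitude_loss(ref, gen)
    return total / len(resolutions)


# Assumed (fft_size, hop, win_length) triples, in samples.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]

t = np.arange(24_000) / 24_000                      # one second at 24 kHz
clean = np.sin(2 * np.pi * 220.0 * t)               # toy reference signal
noisy = clean + 0.05 * np.random.default_rng(0).normal(size=t.shape)
loss_same = multi_resolution_stft_loss(clean, clean, RESOLUTIONS)
loss_noisy = multi_resolution_stft_loss(clean, noisy, RESOLUTIONS)
print(loss_same, loss_noisy)
```

Using several window lengths trades time resolution against frequency resolution, so a generator cannot fit one fixed STFT representation while sounding wrong at another.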
Parallel WaveGAN: Training overview
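As a rough sketch of the joint objective described in the abstract (multi-resolution STFT loss plus adversarial loss), the snippet below assumes a least-squares GAN formulation and a weighting factor `lambda_adv`. The function names, the default weight of 4.0, and the toy discriminator outputs are assumptions for illustration.

```python
import numpy as np


def generator_adversarial_loss(d_fake):
    """Least-squares GAN generator term: push D(G(z)) towards 1."""
    return float(np.mean((1.0 - d_fake) ** 2))


def discriminator_loss(d_real, d_fake):
    """Least-squares GAN discriminator loss: real -> 1, generated -> 0."""
    return float(np.mean((1.0 - d_real) ** 2) + np.mean(d_fake ** 2))


def generator_loss(stft_loss, d_fake, lambda_adv=4.0):
    """Joint objective: auxiliary STFT loss plus weighted adversarial loss."""
    return stft_loss + lambda_adv * generator_adversarial_loss(d_fake)


# Hypothetical discriminator outputs for a batch of two waveforms.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(stft_loss=0.5, d_fake=d_fake)
print(loss_d, loss_g)
```

In a training loop the generator and discriminator would be updated alternately, with the generator minimizing `generator_loss` and the discriminator minimizing `discriminator_loss`.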
Experiments
1) Analysis/synthesis
2) Text-to-speech
Experimental conditions
Data & features
Vocoder model comparison
• Single Gaussian WaveNet
• ClariNet (single / three STFT losses)
• ClariNet-GAN (single / three STFT losses) [12]
• Parallel WaveGAN (single / three STFT losses)
Listening tests
• Mean-opinion score (MOS) listening test on quality and naturalness
• 18 native Japanese speakers / 20 random utterances for each model
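A MOS is the mean of the listeners' ratings for a system. The sketch below assumes a 1-to-5 rating scale and a normal-approximation 95% confidence interval; the ratings are synthetic placeholders sized to match 18 listeners times 20 utterances, not data from the paper.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of a normal-approximation 95% confidence interval)."""
    mos = statistics.fmean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Synthetic ratings for illustration only: 18 listeners x 20 utterances = 360 scores.
ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4] * 36
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f} over {len(ratings)} ratings")
```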
[12] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation." In Proc. INTERSPEECH, 2019.
1) Analysis/synthesis: Effects of multi-resolution STFT loss
• Using the multi-resolution STFT loss substantially improved perceptual quality for both ClariNet and Parallel WaveGAN.
Training/inference time and model size comparison
All training was conducted on a server with two NVIDIA Tesla V100 GPUs.
All inference tests were conducted on a server with a single NVIDIA Tesla V100 GPU.
2) Text-to-speech: Perceptual quality evaluation
Their model achieved a MOS of 4.16, competitive with the best distillation-based ClariNet.
Conclusion
Goal
• Fast, high-quality and simple waveform generation for text-to-speech (TTS).
Proposed method
• Parallel WaveGAN, a fast, distillation-free waveform generation method that combines multi-resolution STFT loss and
adversarial loss.
Results
• Perceptual quality comparable to the best distillation-based method (MOS 4.16 in Transformer-based TTS), with
faster inference and training.
Take-home message: GAN-based methods can be good alternatives to distillation-based methods.
Reference
• Li, Naihan, et al. "Neural speech synthesis with transformer network." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
• Wang, Yuxuan, et al. "Tacotron: Toward end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
• June-Woo Kim, Ho-Young Jung, and Minho Lee. "Vocoder-free End-to-End Voice Conversion with Transformer Network." arXiv preprint arXiv:2002.03808 (2020).
• Lisa Torrey and Matthew E. Taylor. "Teaching on a budget: Agents advising agents in reinforcement learning." In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2013.
• Jiao, Xiaoqi, et al. "TinyBERT: Distilling BERT for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).
• G. E. Hinton, O. Vinyals, and J. Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531, 2015.
• A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. "FitNets: Hints for thin deep nets." In Proc. Int. Conf. Learn. Representations, 2015.
• A. van den Oord et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499, 2016.
• A. van den Oord et al. "Parallel WaveNet: Fast high-fidelity speech synthesis." In Proc. ICML, 2018.
• W. Ping et al. "ClariNet: Parallel wave generation in end-to-end text-to-speech." In Proc. ICLR, 2019.
• Arık, Sercan Ö., Heewoo Jun, and Gregory Diamos. "Fast spectrogram inversion using multi-head convolutional neural networks." IEEE Signal Processing Letters 26.1 (2018): 94-98.
• Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation." In Proc. INTERSPEECH, 2019.
Thank you!

 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 

Parallel WaveGAN review

  • 1. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. Ryuichi Yamamoto (LINE Corp.), Eunwoo Song (NAVER Corp.), Jae-Min Kim (NAVER Corp.). ICASSP 2020 (2020.05.04 ~ 2020.05.08). Presented by: June-Woo Kim, Artificial Brain Research Lab., School of Sensor and Display, Kyungpook National University, 2 June 2020.
  • 2. Abstract • They propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. • In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. • As their method does not require the density distillation used in the conventional teacher-student framework, the entire model can be easily trained. • Furthermore, their model is able to generate high-fidelity speech even with its compact architecture. • In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real time on a single GPU. • Perceptual listening test results verify that their proposed method achieves a mean opinion score of 4.16 within a Transformer-based text-to-speech framework [1], which is comparable to the best distillation-based Parallel WaveNet system. [1] Li, Naihan, et al. "Neural speech synthesis with transformer network." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
  • 3. Contribution • They are the first to propose training a vocoder model with an adversarial loss and a multi-resolution STFT loss together. • By proposing a vocoder that does not use the teacher-student framework, they significantly reduce training and synthesis time.
  • 4. What is a vocoder? • The conventional method [2] feeds the MFB output (Mel-frequency filter banks = mel-spectrogram) $M_1, M_2, \dots, M_n$ into a Seq2Seq-based model and obtains the output through a vocoder. • The encoder input in the Seq2Seq model considers all the temporal information. • The decoder predicts several MFB frames at once, thereby reducing the number of decoder steps for $n$ frames to $n/\gamma$, where $\gamma$ is the reduction factor. • Post-processing of the linear-scale spectrum $F$ is performed using a CBHG (1D convolution bank, highway network, bidirectional gated recurrent unit) module, which results in $F_1, F_2, \dots, F_n$. • The vocoder is essential to convert $F$ into a waveform expressed as $S'_1, S'_2, \dots, S'_n$. • The method uses a conventional autoregressive vocoder, which predicts the current step from the previous output: once $S'_1$ is obtained, it is used to predict $S'_2$, and so on up to $S'_n$. [2] Wang, Yuxuan, et al. "Tacotron: Toward end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017). [3] June-Woo Kim, Ho-Young Jung, and Minho Lee. "Vocoder-free End-to-End Voice Conversion with Transformer Network." arXiv preprint arXiv:2002.03808 (2020). Fig. 1. Conventional TTS method using MFB and vocoder [3]
  • 5. What is the Teacher-student Framework? In general, more data and a deeper neural network usually give better performance. • Ensemble methods help as well. • But such models take a long time in the forward and backward passes. Therefore, various methods for shrinking a large, complex model have been studied. The teacher-student framework is one of them. • Machines teach machines. • Train a good, large teacher network first. • The teacher network then teaches the student network.
  • 6. What is the Teacher-student Framework? (2) • The teacher-student framework is also called Knowledge Distillation; it lets the student perform better than when the student network is trained on its own. • Much research has used it to reduce model size in speech recognition, and it has also been applied to noise removal for robust speech recognition. • The teacher-student framework is also used in reinforcement learning [4]. • In natural language processing, TinyBERT [5] also applied Knowledge Distillation to accelerate inference and reduce model size while maintaining accuracy. • The knowledge of a large “teacher” BERT can be transferred well to a small “student” TinyBERT. [4] Lisa Torrey and Matthew E. Taylor. “Teaching on a budget: Agents advising agents in reinforcement learning.” In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2013. [5] Jiao, Xiaoqi, et al. "TinyBERT: Distilling BERT for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).
  • 7. What is the Teacher-student Framework? (3) There are two methods of Knowledge Distillation. • The first is to transfer the class probabilities, which are the output of the softmax output layer [6]: • $L_{KD} = KL\left(\mathrm{softmax}\left(f_T(x)/\tau\right),\ \mathrm{softmax}\left(f_S(x)/\tau\right)\right)$ • Here $KL(\cdot)$ is the Kullback-Leibler divergence, $\tau$ is the temperature, and the loss is used together with the cross-entropy loss function. [6] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • 8. What is the Teacher-student Framework? (4) • The second method is to transfer the output of a hidden layer [7]. • Since the hidden layers of the teacher and student models may differ in size, a regressor function $r$ is used, as in the following equation: • $L_{HT} = \left\| f_T(x) - r\left(f_S(x)\right) \right\|_2^2$. [7] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for thin deep nets,” in Proc. Int. Conf. Learn. Representations, 2015.
  • 9. Raw waveform generation: Autoregressive (AR) vs. non-AR. Autoregressive models: • Good: High-fidelity speech generation (e.g., WaveNet [8]). • Bad: Generation is too slow. Non-autoregressive models: • Teacher-student-based methods (Parallel WaveNet [9], ClariNet [10]). • Good: Real-time generation. • Bad: Complicated two-stage training using probability density distillation. [8] A. van den Oord et al., “WaveNet: A generative model for raw audio”, arXiv preprint arXiv:1609.03499, 2016. [9] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis”, in Proc. ICML, 2018. [10] W. Ping et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech”, in Proc. ICLR, 2019.
  • 10. Their approach: GANs for waveform generation. Parallel WaveGAN (Parallel inference + WaveNet + GAN) • Distillation-free: fast waveform generation without distillation, combining a multi-resolution STFT loss and an adversarial loss. • Fast: Training and inference are 4.82 and 1.96 times faster, respectively, than the conventional Parallel WaveNet (i.e. ClariNet). • High-quality: Their model achieves a MOS of 4.16 (in Transformer-based TTS), which is competitive with the best distillation-based ClariNet. GAN-based methods can be good alternatives to distillation-based methods. STFT: short-time Fourier transform. MOS: mean opinion score.
  • 11. Parallel WaveGAN: WaveNet-based generator. Architecture • The generator architecture is almost the same as WaveNet. Conditional waveform generation • 80-dim mel-spectrogram as auxiliary features. [Table: Model comparison between WaveNet and theirs]
  • 12. STFT loss: Spectral convergence (SC) [11] [11] Arık, Sercan Ö., Heewoo Jun, and Gregory Diamos. "Fast spectrogram inversion using multi-head convolutional neural networks." IEEE Signal Processing Letters 26.1 (2018): 94-98.
  • 13. STFT loss: Log-scale STFT magnitude loss [11] [11] Arık, Sercan Ö., Heewoo Jun, and Gregory Diamos. "Fast spectrogram inversion using multi-head convolutional neural networks." IEEE Signal Processing Letters 26.1 (2018): 94-98.
  • 17. Experimental conditions. Data & features. Vocoder model comparison • Single Gaussian WaveNet • ClariNet (single / three STFT losses) • ClariNet-GAN (single / three STFT losses) [12] • Parallel WaveGAN (single / three STFT losses). Listening tests • Mean opinion score (MOS) listening test on quality and naturalness • 18 native Japanese speakers / 20 random utterances for each model. [12] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation." In Proc. INTERSPEECH, 2019.
  • 18. 1)Analysis/synthesis: Effects of multi-resolution STFT loss • Using multi-resolution STFT loss largely improved perceptual quality for both ClariNet and Parallel WaveGAN.
  • 19. Training/inference time and model size comparison. All training was conducted on a server with two NVIDIA Tesla V100 GPUs. All inference tests were conducted on a server with a single NVIDIA Tesla V100 GPU.
  • 20. 2) Text-to-Speech: Perceptual quality evaluation. Their model achieved a MOS of 4.16, competitive with the best distillation-based ClariNet.
  • 21. Conclusion. Goal • Fast, high-quality, and simple waveform generation for text-to-speech (TTS). Proposed method • Parallel WaveGAN, a distillation-free fast waveform generation method combining a multi-resolution STFT loss and an adversarial loss. Results • Perceptual quality comparable to the best distillation-based method (MOS 4.16 in Transformer-based TTS) while improving inference and training speed. Take-home message: GAN-based methods can be good alternatives to distillation-based methods.
  • 22. References • [1] Li, Naihan, et al. "Neural speech synthesis with transformer network." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019. • [2] Wang, Yuxuan, et al. "Tacotron: Toward end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017). • [3] June-Woo Kim, Ho-Young Jung, and Minho Lee. "Vocoder-free End-to-End Voice Conversion with Transformer Network." arXiv preprint arXiv:2002.03808 (2020). • [4] Lisa Torrey and Matthew E. Taylor. "Teaching on a budget: Agents advising agents in reinforcement learning." In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2013. • [5] Jiao, Xiaoqi, et al. "TinyBERT: Distilling BERT for natural language understanding." arXiv preprint arXiv:1909.10351 (2019). • [6] G. E. Hinton, O. Vinyals, and J. Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531, 2015. • [7] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. "FitNets: Hints for thin deep nets." In Proc. Int. Conf. Learn. Representations, 2015. • [8] A. van den Oord et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499, 2016. • [9] A. van den Oord et al. "Parallel WaveNet: Fast high-fidelity speech synthesis." In Proc. ICML, 2018. • [10] W. Ping et al. "ClariNet: Parallel wave generation in end-to-end text-to-speech." In Proc. ICLR, 2019. • [11] Arık, Sercan Ö., Heewoo Jun, and Gregory Diamos. "Fast spectrogram inversion using multi-head convolutional neural networks." IEEE Signal Processing Letters 26.1 (2018): 94-98. • [12] Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation." In Proc. INTERSPEECH, 2019.
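
The two knowledge-distillation losses on slides 7 and 8 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not code from the paper or the slides: the function names, the default temperature `tau=2.0`, and the use of a plain linear map as the regressor `r` are all choices made here for the example.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, tau=2.0):
    """L_KD = KL(softmax(f_T(x)/tau), softmax(f_S(x)/tau)), averaged over the batch."""
    p = softmax(teacher_logits, tau)  # teacher class probabilities
    q = softmax(student_logits, tau)  # student class probabilities
    eps = 1e-12                       # guards against log(0)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))

def hint_loss(teacher_hidden, student_hidden, regressor_weight):
    """L_HT = ||f_T(x) - r(f_S(x))||_2^2, with r assumed to be a linear map."""
    projected = student_hidden @ regressor_weight  # r(f_S(x)): match the teacher width
    return float(np.mean(np.sum((teacher_hidden - projected) ** 2, axis=-1)))
```

A higher `tau` flattens both distributions, so the student also sees the relative probabilities the teacher assigns to the non-target classes. In practice the regressor weights would be learned jointly with the student.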

Editor's Notes

  1. Hello everyone, I am June-Woo Kim from ABR Lab. I will be presenting the paper "Parallel WaveGAN".
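  2. This paper proposes Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of realistic speech waveforms. Because the method does not require the density distillation used in the conventional teacher-student framework, the entire model can be trained easily. The model can also generate high-fidelity speech even with its compact architecture. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real time on a single GPU. Perceptual listening test results verify that the proposed method achieves a mean opinion score of 4.16 within a Transformer-based text-to-speech framework [1], which is comparable to the best distillation-based Parallel WaveNet system.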
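  3. They are the first to propose training a vocoder model with an adversarial loss and a multi-resolution STFT loss together. By proposing a vocoder that does not use the teacher-student framework, they significantly reduce training and synthesis time.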
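  4. The conventional method [2] feeds the MFB output (Mel-frequency filter banks = mel-spectrogram) $M_1, M_2, \dots, M_n$ into a Seq2Seq-based model and obtains the output through a vocoder. The encoder input in the Seq2Seq model considers all the temporal information. The decoder predicts several MFB frames at once, reducing the number of decoder steps for $n$ frames to $n/\gamma$, where $\gamma$ is the reduction factor. Post-processing of the linear-scale spectrum $F$ is performed with a CBHG (1D convolution bank, highway network, bidirectional gated recurrent unit) module, which results in $F_1, F_2, \dots, F_n$. The vocoder is essential to convert $F$ into a waveform expressed as $S'_1, S'_2, \dots, S'_n$. The method uses a conventional autoregressive vocoder, which predicts the current step from the previous output: once $S'_1$ is obtained, it is used to predict $S'_2$, and so on up to $S'_n$.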
  5. In general, for problems with a large amount of training data or high-dimensional features, larger and deeper neural networks perform better. Ensemble methods, which build several network architectures and combine their outputs, are also widely used to improve performance. However, these large, deep, or ensembled models are computationally expensive, so the forward and backward passes take a long time. Therefore, various methods for making a large, complex model smaller have been studied, and teacher-student learning is one of them. A good teacher network is trained first, and the student network is then trained with guidance from the teacher so that the student can imitate the teacher.
  6. The teacher-student learning method is also called knowledge distillation; with it, the student can perform better than when the student network is trained on its own. Much of this research has aimed at reducing model size in speech recognition, and some studies have applied it to noise-robust speech recognition. The teacher-student framework is also used in reinforcement learning [4]: following Torrey and Taylor's framework, an agent (the "teacher") advises another one (the "student") by suggesting actions the latter should take while learning a specific task in a sequential decision problem; the teacher is limited by a "budget" (the number of times such advice can be given). In natural language processing, TinyBERT [5] also applied knowledge distillation to accelerate inference and reduce model size while maintaining accuracy; the knowledge of a large "teacher" BERT can be transferred well to a small "student" TinyBERT. Moreover, its authors introduced a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This model achieves more than 96% of the performance of the teacher BERT_base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference.
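  7. There are two methods of knowledge distillation. The first transfers the class probabilities, which are the output of the softmax output layer [6]. The logits of the teacher and student networks are normalized by a temperature $\tau$, and the class probabilities of the normalized logits are then transferred: $L_{KD} = KL\left(\mathrm{softmax}\left(f_T(x)/\tau\right),\ \mathrm{softmax}\left(f_S(x)/\tau\right)\right)$. Here $KL(\cdot)$ is the Kullback-Leibler divergence, and the loss is used together with the cross-entropy loss function. The KL divergence measures how similar two distributions A and B are: the smaller its value, the more similar the two distributions, and a value of 0 means they are the same distribution. The KL divergence can be written in terms of cross-entropy: it is the cross-entropy, the amount of information needed to describe $p$ using $q$, minus the entropy of $p$, the amount of information needed for $p$ to describe itself.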
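  8. The second method transfers the output of a hidden layer [7]. Because the hidden layers of the teacher and student models may differ in size, a regressor function is used to match the sizes of the two hidden-layer outputs before the transfer. Since the teacher network is generally wider than the student network, the selected hint layer may have more outputs than the guided layer. The paper [7] therefore adds a regressor on top of the guided layer so that its output matches the size of the hint layer. $r$ is the regressor function on top of the guided layer, and it takes the student output $f_S(x)$ as input.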
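  9. Now let me explain the background. For raw speech waveform generation there are two main approaches: one uses autoregressive models and the other uses non-autoregressive models. As many of you know, autoregressive models for raw audio such as WaveNet have achieved high-quality speech waveform generation. However, inference is slow because of the autoregressive sampling process, so real-time waveform generation is not possible. The state-of-the-art non-autoregressive methods are Parallel WaveNet and ClariNet, a simplified version of Parallel WaveNet. However, these models are often hard to train because of the complicated two-stage training based on the teacher-student framework. This training difficulty has been reported in many settings.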
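  10. This paper proposes an alternative, Parallel WaveGAN, a GAN-based waveform generation model. The benefits of the approach are summarized here. First, the method is distillation-free: instead of using distillation, it explores GANs, combining a multi-resolution STFT loss with an adversarial loss. GANs are sometimes difficult to train, but the multi-resolution STFT loss greatly helps the GAN training process. Second, the model is fast: thanks to the simple training procedure and generator design, training and inference are about 5 times and 2 times faster, respectively, than the conventional Parallel WaveNet. Finally, the model achieves high perceptual quality: in the listening tests it obtained a mean opinion score of 4.16 in Transformer-based text-to-speech, which is competitive with the best distillation-based ClariNet. GAN-based methods can be good alternatives to distillation-based methods.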
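  11. Here I will explain the design of the generator. As the name Parallel WaveGAN suggests, the generator architecture is almost the same as WaveNet. For conditional waveform generation as a vocoder, the model is conditioned on the same mel-spectrogram as the WaveNet vocoder. The model consists of multiple residual convolution blocks, and dilated convolutions are used to learn waveform-level correlations efficiently. The differences between WaveNet and the model are summarized in the table (pointing at the table). Unlike WaveNet, Parallel WaveGAN takes random noise, rather than previous waveform samples, as input and outputs all time steps directly. In addition, because the model is not autoregressive, it can use non-causal convolutions that exploit both future and past information.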
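  12. So how can Parallel WaveGAN be trained? The paper highlights two STFT loss functions as important. The first is spectral convergence, which was originally proposed for spectrogram inversion (ISTFT). The top-left figure shows the STFT magnitude of natural speech. The bottom-left figure shows the STFT magnitude of generated speech. The right figure shows the absolute difference between the two spectrograms. As shown here (pointing at the bottom of the right figure), it emphasizes large-magnitude components, particularly in the low-frequency band. Spectral convergence therefore helps the model learn the dominant frequency components of the speech signal.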
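  13. The second STFT loss is the log-scale STFT magnitude loss. The left figures show the log-scale STFT magnitude spectrograms of real and generated speech. As the right figure shows, the absolute difference between the log-scale spectrograms is spread widely over all time-frequency bins (circling with the mouse). Thanks to the log function, the log-scale STFT loss emphasizes small-magnitude components. In particular, it helps the model learn detailed spectral structure from the low to the high frequency bands.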
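  14. To train Parallel WaveGAN, the paper proposes combining spectral convergence and the log-scale STFT magnitude loss into a multi-resolution STFT loss. The paper simply gives each STFT loss the same weight. What is minimized is a linear combination of STFT loss functions computed with different analysis parameters such as FFT size, window size, and frame shift. In STFT-based time-frequency analysis there is a trade-off between frequency resolution and temporal resolution. For example, as the right figure shows (pointing at the figure), increasing the window size gives higher frequency resolution at the cost of temporal resolution. As the left figure shows (pointing at the left figure), a smaller window size gives higher temporal resolution while losing some frequency resolution. By combining multiple STFT losses, the approach can learn the time-frequency characteristics of speech effectively. It also prevents the model from overfitting to a fixed STFT representation, which could otherwise lead to suboptimal performance in the waveform domain.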
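  15. Here is an overview of the training procedure. The generator (red circle) takes random noise and a mel-spectrogram as input and outputs a raw waveform. The generator is trained by minimizing the sum of the STFT loss functions (pointing at the STFT loss L_S (M)) together with the adversarial loss. The adversarial loss is derived from a convolution-based discriminator, which takes natural and generated speech as input and learns to classify them correctly. Through this adversarial training of the generator and the discriminator, the generator learns the distribution of realistic speech waveforms so that it can fool the discriminator.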
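  16. To evaluate the proposed method, two experiments were conducted: analysis/synthesis and text-to-speech. In analysis/synthesis, mel-spectrograms were first extracted from the raw waveforms, and the waveforms were then regenerated by the neural vocoder. In the text-to-speech scenario, a Transformer-based acoustic model predicted mel-spectrograms from phoneme sequences, and the vocoder then generated the output speech waveform from the predicted mel-spectrograms.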
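  17. The experimental conditions are as follows. About 24 hours of data recorded by a female professional Japanese speaker were used. 80-dimensional mel-spectrograms were used as the input features. For comparison, the vocoder models tested were WaveNet, ClariNet, ClariNet-GAN, and the proposed Parallel WaveGAN. ClariNet is the baseline model that uses distillation. ClariNet-GAN is a hybrid approach that uses distillation together with a GAN. The multi-resolution STFT loss was investigated for the ClariNet-based models and Parallel WaveGAN. For subjective evaluation, a mean opinion score listening test was conducted in which 18 native Japanese speakers rated the speech quality.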
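  18. Here are the analysis/synthesis results. The figure shows the mean opinion score listening test results for ClariNet, Parallel WaveGAN, and the reference audio. As you can see, using the multi-resolution STFT loss greatly improved the perceptual quality of both ClariNet and Parallel WaveGAN. The improvement is especially large for Parallel WaveGAN. Let me play audio samples for each method: first the reference audio, then the samples from ClariNet with the single-resolution loss through to Parallel WaveGAN with the multi-resolution STFT loss. You may have noticed that the quality of Parallel WaveGAN with a single STFT loss is not very good. Indeed, Parallel WaveGAN with a single STFT loss obtained a poor score, as shown here (clicking the poor result), because of the difficulty of training without density distillation. With the multi-resolution STFT loss, however, Parallel WaveGAN achieved a mean opinion score of 4.06, comparable to the distillation-based ClariNet.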
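  19. This slide looks at the training and inference time of each model, as reported in the paper. All experiments were run on NVIDIA GPUs. For training time, as you can see here (pointing at the training time), the simple training procedure made training about 5 times faster than ClariNet; specifically, about two weeks was shortened to about three days. In addition, thanks to the simple WaveNet-based design, inference was about 2 times faster than ClariNet (rightmost columns), and the number of parameters was kept small.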
  20. The bars show the mean opinion scores of WaveNet, ClariNet, ClariNet-GAN, and Parallel WaveGAN, and of the real recordings. The multi-resolution STFT loss was used for the ClariNet-based models and Parallel WaveGAN. All methods used the same Transformer network as the acoustic model. The figure shows that WaveNet obtained the lowest mean opinion score, 3.33, while ClariNet, ClariNet-GAN, and Parallel WaveGAN achieved high scores. This shows the effectiveness of the multi-resolution STFT loss. Importantly, Parallel WaveGAN achieved the highest mean opinion score, 4.16, as shown here (pointing). The results show that GAN-based methods can perform comparably to distillation-based methods.
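  21. The conclusions are as follows. For fast, high-quality, and simple waveform generation, the paper proposed Parallel WaveGAN, a distillation-free method that combines a multi-resolution STFT loss with an adversarial loss. The experiments demonstrated that the model achieves quality comparable to the best distillation-based method while improving inference and training speed. GAN-based methods can be good alternatives to distillation-based methods. If you are interested, more audio samples are available at the link. Do you have any questions? Feel free to contact me. Thank you for watching.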
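
The losses described in notes 12 to 15 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the Hann window, the hand-rolled framing, and the `1e-7` clamp are choices made here. The three (FFT size, window size, frame shift) settings and the adversarial weight `lambda_adv=4.0` are the values recalled from the Parallel WaveGAN paper and should be checked against it before use.

```python
import numpy as np

# (FFT size, window size, frame shift) per resolution; assumed values, see the lead-in.
RESOLUTIONS = [(512, 240, 50), (1024, 600, 120), (2048, 1200, 240)]

def stft_magnitude(x, fft_size, win_length, hop):
    """Magnitude STFT of a 1-D signal: Hann-windowed frames, zero-padded to fft_size."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack([x[i * hop:i * hop + win_length] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=fft_size, axis=-1)
    return np.maximum(np.abs(spec), 1e-7)  # clamp so that the log below is finite

def stft_loss(natural, generated, fft_size, win_length, hop):
    """Spectral convergence plus log-scale STFT magnitude loss at one resolution."""
    mag_nat = stft_magnitude(natural, fft_size, win_length, hop)
    mag_gen = stft_magnitude(generated, fft_size, win_length, hop)
    # Spectral convergence: relative Frobenius error, dominated by large magnitudes.
    sc = np.linalg.norm(mag_nat - mag_gen) / np.linalg.norm(mag_nat)
    # Log magnitude: mean absolute log difference, sensitive to small magnitudes.
    log_mag = np.mean(np.abs(np.log(mag_nat) - np.log(mag_gen)))
    return float(sc + log_mag)

def multi_resolution_stft_loss(natural, generated, resolutions=RESOLUTIONS):
    """Equal-weight average of the single-resolution STFT losses."""
    return float(np.mean([stft_loss(natural, generated, *r) for r in resolutions]))

def generator_loss(aux_loss, disc_scores_on_generated, lambda_adv=4.0):
    """STFT auxiliary loss plus a weighted least-squares adversarial term."""
    adv = np.mean((1.0 - np.asarray(disc_scores_on_generated)) ** 2)
    return float(aux_loss + lambda_adv * adv)
```

Averaging over several window sizes means no single time-frequency trade-off dominates the objective, which is the motivation given in note 14. In a real training loop these operations would be written in a framework with automatic differentiation so that gradients reach the generator.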