Poster: SC-GlowTTS, Interspeech 2021
SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Junior,
Anderson da Silva Soares, Sandra Maria Aluisio, Moacir Antonelli Ponti
1. Introduction
1.1 Motivation
– Recently, normalizing flows have been successfully applied in the TTS field: the flow-based models Flowtron (Valle et
al., 2020) and Glow-TTS (Kim et al., 2020) achieved state-of-the-art results. Despite this, current zero-shot multi-speaker
TTS models are still heavily based on the Tacotron 2 model.
1.2 Highlights
– As far as we know, this is the first work to explore flow-based models in a zero-shot multi-speaker TTS scenario.
– We show that fine-tuning a GAN-based vocoder with the Mel-spectrograms predicted by the TTS model for the training
speakers can significantly improve speech similarity and quality for new speakers.
– Our approach achieves promising results using only 11 speakers for training.
2. Methodology: Proposed Method and Dataset
2.1 Speaker Encoder
– Stack of 3 LSTM layers with a linear output layer.
– Trained using the Angular Prototypical loss function with approximately 25k speakers.
– Training datasets: LibriSpeech, VoxCeleb 1 and 2, the English portion of Common Voice, and VCTK.
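A minimal sketch of the Angular Prototypical criterion named above (numpy only; the batch layout and the scale/offset values are illustrative assumptions, not the authors' exact setup): each speaker's query utterance embedding is compared via scaled cosine similarity against the centroid (prototype) of every speaker's remaining utterances, and a cross-entropy loss pulls it toward its own prototype.

```python
import numpy as np

def angular_prototypical_loss(emb, w=10.0, b=-5.0):
    """emb: (n_speakers, n_utterances, dim) embeddings.
    The last utterance of each speaker is the query; the rest form the prototype.
    w and b are a learnable scale and offset in the real loss."""
    queries = emb[:, -1, :]                        # (S, D)
    protos = emb[:, :-1, :].mean(axis=1)           # (S, D) centroids
    # cosine similarity between every query and every prototype
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    pn = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    logits = w * (qn @ pn.T) + b                   # (S, S)
    # cross-entropy with the matching speaker (the diagonal) as the target
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Embeddings that cluster tightly per speaker yield a low loss; unstructured embeddings yield a loss near log(n_speakers).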
2.2 Vocoder: HiFi-GAN V2
− VCTK dataset for training and validation.
− Fine-tuning with Mel-spectrograms predicted by the TTS models (HiFi-GAN-FT).
2.3 SC-GlowTTS Model: Glow-TTS based
− Phonemes instead of graphemes as input.
− Explores 3 different encoders:
The original transformer-based encoder;
A residual convolutional encoder;
A gated convolutional encoder.
− External speaker embeddings condition:
The affine coupling layers in all decoder blocks;
The duration predictor input.
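A sketch of the speaker-conditioned affine coupling idea (numpy, with a frozen random linear map standing in for the real coupling network; shapes and the conditioning scheme are illustrative assumptions): half of the channels pass through unchanged, a network of the untouched half plus the speaker embedding produces a scale and shift for the other half, and the transform remains exactly invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, SPK = 8, 4                        # channels per half / speaker-embedding size
W = rng.normal(scale=0.1, size=(DIM + SPK, 2 * DIM))  # stand-in coupling network

def coupling_forward(x, spk):
    """x: (batch, 2*DIM); spk: (SPK,) speaker embedding."""
    xa, xb = x[:, :DIM], x[:, DIM:]
    h = np.concatenate([xa, np.broadcast_to(spk, (x.shape[0], SPK))], axis=1) @ W
    log_s, t = h[:, :DIM], h[:, DIM:]
    yb = xb * np.exp(log_s) + t        # affine transform of the second half only
    return np.concatenate([xa, yb], axis=1)

def coupling_inverse(y, spk):
    ya, yb = y[:, :DIM], y[:, DIM:]
    h = np.concatenate([ya, np.broadcast_to(spk, (y.shape[0], SPK))], axis=1) @ W
    log_s, t = h[:, :DIM], h[:, DIM:]
    xb = (yb - t) * np.exp(-log_s)     # exact inverse of the affine transform
    return np.concatenate([ya, xb], axis=1)
```

Because the scale and shift depend only on the untouched half and the speaker embedding, the inverse is exact, which is what lets a flow-based decoder compute likelihoods during training and synthesize by running the flow backwards at inference.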
2.4 Dataset: VCTK
− Training: composed of 97 speakers.
− Development: composed of samples from the 97 training speakers.
− Test: composed of 11 speakers not present in the training set.
[Architecture diagram] Input Text → Phonemizer → Encoder → Duration Predictor and Conv Projection (conditioned on the Speaker Embedding) → Alignment Generation (Ceil) → Flow-Based Decoder (Squeeze; 12× ActNorm, Invertible 1x1 Conv, Affine Coupling Layer; UnSqueeze) → Predicted Mel spectrogram → HiFi-GAN → Waveform.
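At inference time the predicted durations drive the alignment: each encoder output is repeated ceil(d) times to build the decoder input. A minimal sketch of that expansion (numpy; function and variable names are illustrative):

```python
import numpy as np

def expand_by_duration(enc, durations):
    """enc: (T_text, D) encoder outputs; durations: (T_text,) predicted
    per-phoneme durations in frames. Returns (T_mel, D) expanded states."""
    reps = np.ceil(durations).astype(int)    # the 'Ceil' step in the diagram
    return np.repeat(enc, reps, axis=0)      # hard, monotonic alignment
```

The ceiling guarantees every phoneme receives at least one frame whenever its predicted duration is positive, so no input symbol is skipped.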
3. Experiments: Setup and Results
3.1 Proposed Experiments
1. Tacotron 2 baseline following Jia et al. (2018) and Cooper et al. (2020);
2. SC-GlowTTS with transformer based encoder;
3. SC-GlowTTS with residual convolutional based encoder;
4. SC-GlowTTS with gated convolutional based encoder.
3.2 Experiments Setup
– All experiments were implemented in Coqui TTS:
github.com/coqui-ai/TTS
– Coqui TTS is an open-source TTS framework. Contributions are welcome.
– Audio samples and checkpoints of all experiments are available at:
github.com/Edresson/SC-GlowTTS
3.3 Results
Table 1. Real-Time Factor (RTF), MOS and Sim-MOS with 95% confidence intervals, and SECS for all our experiments.

Experiment - Model                Vocoder      RTF (CPU - GPU)   SECS     MOS            Sim-MOS
Ground Truth                      –            –                 0.9236   4.12 ± 0.06    4.127 ± 0.06
Attentron ZS (Choi et al., 2020)  WaveRNN      –                 (0.731)  (3.86 ± 0.05)  (3.30 ± 0.06)
1 - Tacotron 2                    HiFi-GAN     0.5782 - 0.2485   0.7589   3.57 ± 0.08    3.867 ± 0.08
1 - Tacotron 2                    HiFi-GAN-FT  –                 0.7791   3.74 ± 0.08    3.951 ± 0.07
2 - SC-GlowTTS-Trans              HiFi-GAN     0.3612 - 0.1557   0.7641   3.65 ± 0.07    3.905 ± 0.07
2 - SC-GlowTTS-Trans              HiFi-GAN-FT  –                 0.8046   3.78 ± 0.07    3.999 ± 0.07
3 - SC-GlowTTS-Res                HiFi-GAN     0.3597 - 0.1545   0.7440   3.45 ± 0.09    3.828 ± 0.08
3 - SC-GlowTTS-Res                HiFi-GAN-FT  –                 0.7969   3.70 ± 0.07    3.916 ± 0.07
4 - SC-GlowTTS-Gated              HiFi-GAN     0.3474 - 0.1437   0.7432   3.55 ± 0.08    3.852 ± 0.08
4 - SC-GlowTTS-Gated              HiFi-GAN-FT  –                 0.7849   3.82 ± 0.07    3.952 ± 0.07
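The RTF values in Table 1 follow the usual definition: wall-clock synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. A trivial helper for reproducing such measurements (an illustrative sketch, not the authors' benchmarking script):

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """Measure RTF for a synthesis function that returns raw audio samples.
    RTF = wall-clock synthesis time / duration of the generated audio."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds
```

For stable numbers in practice, the measurement is usually averaged over many sentences after a warm-up run.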
4. SC-GlowTTS performance with few speakers
– To emulate a scenario with few speakers we selected 11 speakers from the training subset of the VCTK dataset.
– We first trained the SC-GlowTTS-Trans model on the single-speaker LJ Speech dataset, then continued training on the
dataset composed of 11 speakers, and computed the metrics on the test set.
– The model achieved a similarity MOS of 3.93±0.08 and a MOS of 3.71±0.07. These results are comparable to those achieved
by the Tacotron 2 baseline trained with 98 speakers, which achieved a similarity MOS of 3.95±0.07 and a MOS of 3.74±0.08.
– We believe that this is an important step forward, especially for zero-shot multi-speaker TTS in
low-resource languages.