3. Speech Synthesis / Text to Speech
Definition
• Synthesizing intelligible and natural speech from text.
• A research topic in natural language and speech processing.
• Requires knowledge about languages and human speech production.
• Involves multiple disciplines including linguistics, acoustics, digital signal
processing, and machine learning.
4. Speech Synthesis / Text to Speech
A Brief History
• Wolfgang von Kempelen had
constructed a speaking machine.
• Early methods: articulatory synthesis,
formant synthesis, and concatenative
synthesis.
• Later methods: statistical parametric
(spectrum, fundamental frequency, and
duration) speech synthesis (SPSS).
• From 2010s: neural network-based
speech synthesis.
5. Speech Synthesis / Text to Speech
Glossary
• Prosody: Intonation, stress, and rhythm.
• Phonemes: Units of sound.
• Part-of-Speech: nouns, pronouns, verbs, adjectives, adverbs, prepositions,
conjunctions, articles/determiners, interjections.
• Vocoder: Decodes from features to audio signals.
• Pitch/Fundamental Frequency – F0: lowest frequency of a periodic waveform.
• Alignment: Associating characters/graphemes with phonemes.
• Duration: Represents how long each speech sound lasts.
6. Speech Synthesis / Text to Speech
Glossary
• Mean Opinion Score (MOS): The most frequently used method to evaluate
the quality of generated speech. MOS ranges from 0 to 5; real human speech
typically scores between 4.5 and 4.8.
7. Speech Synthesis / Text to Speech
Sound Signal / Waveform
• Sampling rate: Sampling is the reduction of a continuous-time signal to a
discrete-time signal. The sampling rate is the number of samples per
second (typically 16 or 22 kHz for TTS).
• Sample Depth: The number of bits used to represent a sample’s value.
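A minimal sketch of what these two numbers imply for storage: samples per second times bits per sample gives the raw (uncompressed PCM) size. The 16 kHz / 16-bit values below are the slide's example rates, not a fixed standard.

```python
# Raw-audio storage cost from sampling rate and sample depth.
def raw_audio_bytes(seconds, sample_rate=16_000, bit_depth=16, channels=1):
    """Bytes needed to store uncompressed PCM audio."""
    samples = seconds * sample_rate * channels
    return samples * bit_depth // 8

# One second of 16 kHz / 16-bit mono audio:
print(raw_audio_bytes(1))                        # 32000 bytes
# Five seconds at 22 kHz:
print(raw_audio_bytes(5, sample_rate=22_000))    # 220000 bytes
```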
8. Speech Synthesis / Text to Speech
Spectrum of Sound Signal
• The spectrum shows the pitch (F0) and its harmonics.
• The human voice ranges between 125 Hz and 8 kHz.
• Typical F0: male ≈ 125 Hz, female ≈ 200 Hz, child ≈ 300 Hz.
9. Speech Synthesis / Text to Speech
MEL Spectrum
• MEL spectrum: The mel-frequency cepstrum (MFC) is a representation of
the short-term power spectrum of a sound, based on a linear cosine
transform of a log power spectrum on a nonlinear mel scale of frequency.
Usually 80 mel bands are used.
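The nonlinear mel scale mentioned above can be sketched with the common HTK formula mel = 2595 · log10(1 + f/700); the 80-band spacing and the 11 kHz upper edge below are illustrative choices, not fixed by the slide.

```python
import math

# Hz <-> mel conversion using the common HTK formula.
def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 80 mel bands (the "usually 80" above) spaced evenly on the mel scale,
# covering 0 Hz up to half of a 22 kHz sampling rate:
n_mels, f_max = 80, 11_000
edges_mel = [hz_to_mel(f_max) * i / (n_mels + 1) for i in range(n_mels + 2)]
edges_hz = [mel_to_hz(m) for m in edges_mel]
```

Because the scale is logarithmic above ~1 kHz, the band edges bunch together at low frequencies and spread out at high ones, mimicking human pitch perception.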
10. Speech Synthesis / Text to Speech
Key Components
* Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
11. Speech Synthesis / Text to Speech
Text Analysis
• Text normalization,
• Word segmentation,
• Part-of-speech(POS) tagging,
• Prosody prediction,
• Character/grapheme-to-phoneme conversion (alignment).
12. Speech Synthesis / Text to Speech
Acoustic Model
• Inputs: Linguistic features or directly from phonemes or characters.
• Outputs: Acoustic features.
• RNN-based, CNN-based, Transformer-based.
13. Speech Synthesis / Text to Speech
Vocoder
• Part of the system decoding from acoustic features to audio signals/waveform.
14. Speech Synthesis / Text to Speech
Different Structures
* Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
15. Speech Synthesis / Text to Speech
Different Choices
• Single or multi speaker,
• Single or multi language,
• Single or multi gender.
Text to speech (TTS), also known as speech synthesis, aims to synthesize intelligible and natural speech from text [346]. It has broad applications in human communication [1] and has long been a research topic in artificial intelligence, natural language processing, and speech processing.
Developing a TTS system requires knowledge about languages and human speech production, and involves multiple disciplines including linguistics [63], acoustics [170], digital signal processing [320], and machine learning.
In the second half of the 18th century, the Hungarian scientist Wolfgang von Kempelen constructed a speaking machine with a series of bellows, springs, bagpipes, and resonance boxes to produce some simple words and short sentences. The first computer-based speech synthesis systems came out in the latter half of the 20th century. Early computer-based speech synthesis methods include articulatory synthesis, formant synthesis, and concatenative synthesis.
Articulatory Synthesis: Articulatory synthesis produces speech by simulating the behavior of human articulators such as the lips, tongue, glottis, and moving vocal tract.
Formant Synthesis: Formant synthesis produces speech based on a set of rules that control a simplified source-filter model. These rules are usually developed by linguists to mimic the formant structure and other spectral properties of speech as closely as possible. The speech is synthesized by an additive synthesis module and an acoustic model with varying parameters like fundamental frequency, voicing, and noise levels.
Concatenative Synthesis: Concatenative synthesis relies on the concatenation of pieces of speech that are stored in a database. Usually, the database consists of speech units ranging from whole sentences down to syllables, recorded by voice actors.
Later, with the development of statistical machine learning, statistical parametric speech synthesis (SPSS) was proposed, which predicts parameters such as spectrum, fundamental frequency, and duration for speech synthesis.
Statistical Parametric Synthesis: To address the drawbacks of concatenative TTS, statistical parametric speech synthesis (SPSS) was proposed [416, 356, 415, 425, 357]. The basic idea is that instead of directly generating waveform through concatenation, we can first generate the acoustic parameters [82, 355, 156] that are necessary to produce speech and then recover speech from the generated acoustic parameters using some algorithms.
From the 2010s, neural network-based speech synthesis has gradually become the dominant method and achieved much better voice quality.
Neural Speech Synthesis: With the development of deep learning, neural network-based TTS (neural TTS for short) was proposed, which adopts (deep) neural networks as the model backbone for speech synthesis. Some early neural models were adopted in SPSS to replace the HMM for acoustic modeling. Later, WaveNet was proposed to directly generate waveform from linguistic features, and can be regarded as the first modern neural TTS model. Other models like Deep Voice 1/2 still follow the three components of statistical parametric synthesis, but upgrade them with corresponding neural network-based models. Furthermore, some end-to-end models (e.g., Tacotron 1/2, Deep Voice 3, and FastSpeech 1/2) were proposed to simplify text analysis modules, directly take character/phoneme sequences as input, and simplify acoustic features to mel-spectrograms. Later, fully end-to-end TTS systems were developed to directly generate waveform from text, such as ClariNet, FastSpeech 2s, and EATS. Compared to previous TTS systems based on concatenative synthesis and statistical parametric synthesis, the advantages of neural network-based speech synthesis include high voice quality in terms of both intelligibility and naturalness, and less requirement on human preprocessing and feature development.
Prosody: intonation, stress, and rhythm.
Phonemes: units of sound (e.g., the Turkish minimal pair kahır vs. ahır, distinguished by the initial /k/).
Part-of-speech: nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, articles/determiners, interjections.
Vocoder: Decodes from features to audio signals.
Pitch/Fundamental Frequency – F0: lowest frequency of a periodic waveform.
Alignment: Associating characters/graphemes with phonemes.
Duration: Represents how long each speech sound lasts.
Mean Opinion Score (MOS): The most frequently used method to evaluate the quality of generated speech. MOS ranges from 0 to 5; real human speech typically scores between 4.5 and 4.8.
Text normalization. The raw written text (non-standard words) should be converted into spoken-form words through text normalization, which makes the words easy to pronounce for TTS models. For example, the year “1989” is normalized into “nineteen eighty nine”, and “Jan. 24” is normalized into “January twenty-fourth”.
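A toy normalizer covering only the two examples above; the lookup tables and rules here are illustrative stand-ins for the large rule sets or neural models real systems use.

```python
import re

# Illustrative tables; a real normalizer would be far more complete.
NUMBER_WORDS = {"19": "nineteen", "89": "eighty nine", "24": "twenty-fourth"}
ABBREVIATIONS = {"Jan.": "January"}

def normalize(text):
    # Expand month abbreviations.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Years like 1989 -> "nineteen eighty nine" (read as two-digit chunks).
    def year(m):
        s = m.group(0)
        return NUMBER_WORDS.get(s[:2], s[:2]) + " " + NUMBER_WORDS.get(s[2:], s[2:])
    text = re.sub(r"\b19\d\d\b", year, text)
    # Day numbers after a month name -> ordinal words.
    text = re.sub(r"(January) (\d+)",
                  lambda m: m.group(1) + " " + NUMBER_WORDS.get(m.group(2), m.group(2)),
                  text)
    return text

print(normalize("Jan. 24, 1989"))
# January twenty-fourth, nineteen eighty nine
```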
Word segmentation. For character-based languages such as Chinese, word segmentation is necessary to detect word boundaries in the raw text.
Part-of-speech tagging. The part-of-speech (POS) of each word, such as noun, verb, preposition, is also important for grapheme-to-phoneme conversion and prosody prediction in TTS.
Prosody prediction. The prosody information, such as rhythm, stress, and intonation of speech, corresponds to the variations in syllable duration, loudness and pitch, which plays an important perceptual role in human speech communication.
Acoustic models, which generate acoustic features from linguistic features or directly from phonemes or characters. As TTS has developed, different kinds of acoustic models have been adopted: first the HMM and DNN based models in statistical parametric speech synthesis (SPSS), then sequence-to-sequence models based on the encoder-attention-decoder framework (including LSTM, CNN, and self-attention), and most recently feed-forward networks (CNN or self-attention) for parallel generation. Acoustic models aim to generate acoustic features that are further converted into waveform using vocoders.
RNN-based Models (e.g., Tacotron Series)
CNN-based Models (e.g., DeepVoice Series) DeepVoice [8] is actually an SPSS system enhanced with convolutional neural networks. After obtaining linguistic features through neural networks, DeepVoice leverages a WaveNet [254] based vocoder to generate waveform.
Transformer-based Models (e.g., FastSpeech Series)
Early neural vocoders such as WaveNet, Char2Wav, and WaveRNN directly take linguistic features as input and generate waveform. Later, Prenger et al., Kim et al., Kumar et al., and Yamamoto et al. take mel-spectrograms as input and generate waveform.
Fully end-to-end TTS models can generate speech waveform directly from a character or phoneme sequence, which has the following advantages: 1) it requires less human annotation and feature development (e.g., alignment information between text and speech); 2) joint, end-to-end optimization can avoid error propagation in cascaded models (e.g., Text Analysis + Acoustic Model + Vocoder); 3) it can also reduce the training, development, and deployment cost.
1) Simplifying text analysis module and linguistic features.
2) Simplifying acoustic features, where the complicated acoustic features are simplified into mel-spectrograms.
3) Replacing two or three modules with a single end-to-end model.
However, there are big challenges to train TTS models in an end-to-end way, mainly due to the different modalities between text and speech waveform, as well as the huge length mismatch between character/phoneme sequence and waveform sequence. For example, for a speech with a length of 5 seconds and about 20 words, the length of the phoneme sequence is just about 100, while the length of the waveform sequence is 110k (if the sample rate is 22kHz).
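The length mismatch in the paragraph above, worked out in numbers:

```python
# 5 seconds of speech at a 22 kHz sampling rate vs. ~100 phonemes.
sample_rate = 22_000    # samples per second
seconds = 5
phonemes = 100          # roughly 20 words of speech

waveform_len = sample_rate * seconds    # length of the waveform sequence
ratio = waveform_len / phonemes         # samples the model must emit per phoneme

print(waveform_len)     # 110000 samples ("110k" in the text)
print(ratio)            # 1100.0 samples per phoneme
```

Each phoneme must be mapped to on the order of a thousand waveform samples, which is why end-to-end models need explicit upsampling or alignment machinery.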
Some TTS systems explicitly model the speaker representations through a speaker lookup table or speaker encoder.
WaveNet is a deep generative model of raw audio waveforms. Its authors showed that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing text-to-speech systems, reducing the gap with human performance by over 50%.
- Autoregressive
- Causal convolutions
- Dilated convolutions
- Really slow for real-life applications.
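The dilated causal convolutions above exist to grow the receptive field exponentially with depth. A minimal sketch of that arithmetic, using the doubling dilation pattern WaveNet describes (the block count and kernel size below are illustrative):

```python
# Receptive field of a stack of dilated causal convolutions.
def receptive_field(kernel_size, dilations):
    # Each layer looks (kernel_size - 1) * dilation samples further back.
    return 1 + (kernel_size - 1) * sum(dilations)

# 3 repeated blocks of dilations 1, 2, 4, ..., 512 with kernel size 2:
dilations = [2 ** i for i in range(10)] * 3
rf = receptive_field(2, dilations)
print(rf)   # 3070 samples (~0.19 s of audio at 16 kHz)
```

Thirty layers cover ~3000 past samples, whereas thirty non-dilated layers of kernel size 2 would cover only 31, which is why dilation is essential for raw audio.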
WaveNet was inspired by PixelCNN and PixelRNN, which are able to generate very complex natural images.
Fast WaveNet
Parallel WaveNet
Deep Voice by Baidu consists of four different neural networks that together form an end-to-end pipeline.
A segmentation model that locates boundaries between phonemes. It is a hybrid CNN and RNN network that is trained to predict the alignment between vocal sounds and the target phoneme.
A model that converts graphemes to phonemes.
A model to predict phoneme durations and fundamental frequencies. The same phoneme might have different durations in different words, so duration must be predicted. The fundamental frequency gives the pitch of each phoneme.
A model to synthesize the final audio. Here the authors implemented a modified WaveNet.
As you can see, Deep Voice still follows the three components of statistical parametric synthesis, but upgrades them with corresponding neural network-based models.
Deep Voice 2 introduced speaker embeddings.
Deep Voice 3
A single model instead of four different ones. More specifically, the authors proposed a fully-convolutional character-to-spectrogram architecture, which is ideal for parallel computation, as opposed to RNN-based models. They also experimented with different waveform synthesis methods, with WaveNet achieving the best results once again.
Scales to over 2000 speakers.
Tacotron was released by Google in 2017 as an end-to-end system. It is basically a sequence-to-sequence model that follows the familiar encoder-decoder architecture. An attention mechanism was also utilized.
End2End
Faster than WaveNet
Character sequence => Audio Spectrogram => Synthesized Audio
The encoder’s goal is to extract robust sequential representations of text. It receives a character sequence represented as one-hot encodings and, through a stack of PreNets and CBHG modules, outputs the final representation. PreNet denotes the non-linear transformations applied to each embedding.
Content-based attention is used to pass the representation to the decoder, where a recurrent layer produces the attention query at each time step. The query is concatenated with the context vector and passed to a stack of GRU cells with residual connections. The output of the decoder is converted to the end waveform with a separate post-processing network, containing a CBHG module.
No support for multi-speaker.
Tacotron 2
Tacotron 2 improves and simplifies the original architecture. While there are no major differences, let’s see its key points:
The encoder now consists of 3 convolutional layers and a bidirectional LSTM, replacing the PreNet and CBHG modules
Location-sensitive attention improves the original additive attention mechanism
The decoder is now an autoregressive RNN formed by a PreNet, 2 uni-directional LSTMs, and a 5-layer convolutional PostNet
A modified WaveNet is used as the vocoder, following PixelCNN++ and Parallel WaveNet
Mel spectrograms are generated and passed to the vocoder as opposed to linear-scale spectrograms
WaveNet replaces the Griffin-Lim algorithm used in Tacotron 1
Through parallel mel-spectrogram generation, FastSpeech greatly speeds up the synthesis process.
The phoneme duration predictor ensures hard alignments between a phoneme and its mel-spectrograms, which is very different from the soft, automatic attention alignments in autoregressive models.
The length regulator (Figure 1c) is used to solve the problem of length mismatch between the phoneme and spectrogram sequences.
The length regulator can easily adjust voice speed (voice speed or prosody control)
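A minimal sketch of the length regulator idea: each phoneme's hidden state is repeated according to its predicted duration, and a speed factor (called alpha in FastSpeech) scales the durations for voice-speed control. The toy strings below stand in for hidden vectors.

```python
# FastSpeech-style length regulator (sketch).
def length_regulate(hidden_states, durations, alpha=1.0):
    """Expand each phoneme state to `round(duration * alpha)` frames."""
    out = []
    for h, d in zip(hidden_states, durations):
        out.extend([h] * max(0, round(d * alpha)))
    return out

phonemes = ["h", "e", "l", "o"]    # stand-ins for encoder hidden states
durations = [2, 3, 1, 4]           # predicted mel frames per phoneme

print(length_regulate(phonemes, durations))
# ['h', 'h', 'e', 'e', 'e', 'l', 'o', 'o', 'o', 'o']
print(length_regulate(phonemes, durations, alpha=0.5))  # roughly 2x faster speech
```

The expanded sequence now has spectrogram length, so the decoder can generate all frames in parallel instead of attending step by step.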
FastSpeech has two drawbacks: 1) the teacher-student distillation pipeline is complicated and time-consuming; 2) the duration extracted from the teacher model is not accurate enough.
FastSpeech 2/2s
Uses the same feed-forward Transformer (FFT) encoder as FastSpeech.
First, we remove the teacher-student distillation pipeline,
and directly use ground-truth mel-spectrograms as target for model training, which can avoid the
information loss in distilled mel-spectrograms and increase the upper bound of the voice quality.
Second, the variance adaptor consists of not only a duration predictor but also pitch and energy predictors.
WaveGlow by Nvidia is one of the most popular flow-based TTS models. It essentially tries to combine insights from Glow and WaveNet in order to achieve fast and efficient audio synthesis without auto-regression. Note that WaveGlow is used strictly to generate speech from mel spectrograms, replacing WaveNet.
The generator is a fully convolutional neural network. It uses a mel-spectrogram as input and
upsamples it through transposed convolutions until the length of the output sequence matches the
temporal resolution of raw waveforms.
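A quick sanity check on that upsampling: the product of the transposed-convolution strides must equal the spectrogram hop length so that the output length matches the raw waveform. The rates (8, 8, 2, 2) and hop length 256 below are the HiFi-GAN V1 configuration.

```python
# HiFi-GAN generator upsampling: strides must multiply to the hop length.
upsample_rates = [8, 8, 2, 2]   # V1 transposed-convolution strides
hop_length = 256                # mel-spectrogram hop length (samples per frame)

total = 1
for r in upsample_rates:
    total *= r

assert total == hop_length       # 8 * 8 * 2 * 2 == 256

mel_frames = 100
print(mel_frames * total)        # 25600 waveform samples from 100 mel frames
```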
Multi-Period Discriminator MPD is a mixture of sub-discriminators, each of which only accepts
equally spaced samples of an input audio
Multi-Scale Discriminator Because each sub-discriminator in MPD only accepts disjoint samples,
we add MSD to consecutively evaluate the audio sequence. The architecture of MSD is drawn from
that of MelGAN (Kumar et al., 2019). MSD is a mixture of three sub-discriminators operating on
different input scales: raw audio, ×2 average-pooled audio, and ×4 average-pooled audio.
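A sketch of how the two discriminator families slice the audio they judge: MPD sub-discriminators see equally spaced samples (one slice per period p; the real model stacks these slices into a 2D map for 2D convolutions), while MSD sub-discriminators see average-pooled copies of the waveform. The plain-list implementation below is purely illustrative.

```python
# Sketch of the audio views seen by HiFi-GAN's discriminators.
def mpd_view(audio, period):
    """Equally spaced samples: `period` interleaved slices of the waveform."""
    return [audio[i::period] for i in range(period)]

def avg_pool(audio, factor):
    """xN average pooling, as used for the MSD input scales."""
    return [sum(audio[i:i + factor]) / factor
            for i in range(0, len(audio) - factor + 1, factor)]

audio = list(range(12))
print(mpd_view(audio, 3))   # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
print(avg_pool(audio, 2))   # [0.5, 2.5, 4.5, 6.5, 8.5, 10.5]
```

The periodic slices expose repeating structure (pitch periods) that consecutive-sample views miss, which is why MPD and MSD are complementary.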
GAN Loss
Mel-Spectrogram Loss
Feature Matching Loss
HiFi-GAN demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. HiFi-GAN V1 achieves a MOS of about 4.3.