Bilgin Aksoy 18 Dec 2021
Intro to Text to Speech
Synthesis
Using Deep Learning
whoami
Bilgin Aksoy
• B.Sc. KHO 2003
• M.Sc. METU 2018
• 2003-2018 TAF
(Officer)
• 2018-2020 DataBoss
(Head of Data Science Department)
• 2020- ARINLABS
(Data Scientist)
• Linkedin: https://www.linkedin.com/in/bilgin-aksoy-a61a90110/
• Twitter: @blgnksy
Speech Synthesis / Text to Speech
Definition
• Synthesizing intelligible and natural speech from text.
• A research topic in natural language and speech processing.
• Requires knowledge about languages and human speech production.
• Involves multiple disciplines including linguistics, acoustics, digital signal
processing, and machine learning.
Speech Synthesis / Text to Speech
A Brief History
• In the late 18th century, Wolfgang von
Kempelen constructed a speaking machine.
• Early methods: articulatory synthesis,
formant synthesis, and concatenative
synthesis.
• Later methods: statistical parametric
(spectrum, fundamental frequency, and
duration) speech synthesis (SPSS).
• From 2010s: neural network-based
speech synthesis.
Speech Synthesis / Text to Speech
Glossary
• Prosody: Intonation, stress, and rhythm.
• Phonemes: Units of sound.
• Part-of-Speech: nouns, pronouns, verbs, adjectives, adverbs, prepositions,
conjunctions, articles/determiners, interjections.
• Vocoder: Decodes acoustic features into audio signals.
• Pitch/Fundamental Frequency (F0): The lowest frequency of a periodic waveform.
• Alignment: Associating characters/graphemes with phonemes.
• Duration: How long each speech sound lasts.
Speech Synthesis / Text to Speech
Glossary
• Mean Opinion Score (MOS): The most frequently used method for evaluating
the quality of generated speech. MOS ranges from 0 to 5; real human speech
typically scores between 4.5 and 4.8.
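As a rough illustration (function name hypothetical), MOS is simply the arithmetic mean of listeners' ratings on the 0-5 scale:

```python
def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings on the 0-5 MOS scale."""
    return sum(ratings) / len(ratings)

# Four hypothetical listener ratings for one synthesized utterance.
scores = [4.5, 4.0, 5.0, 4.5]
mos = mean_opinion_score(scores)  # 4.5 -- in the "human speech" range
```

In practice MOS tests average many listeners over many utterances under controlled listening conditions; this sketch only shows the arithmetic.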
Speech Synthesis / Text to Speech
Sound Signal / Waveform
• Sampling rate: Sampling is the reduction of a continuous-time signal to a
discrete-time signal; the sampling rate is the number of samples per
second (commonly 16 or 22 kHz in TTS).
• Sample depth: The number of bits used to represent each sample’s value.
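Sampling rate and sample depth together determine the storage cost of raw audio. A minimal sketch (function name hypothetical) for uncompressed PCM:

```python
def pcm_byte_rate(sample_rate_hz, bit_depth_bits, channels=1):
    # Uncompressed PCM: bytes/second = samples/second * bytes/sample * channels.
    return sample_rate_hz * (bit_depth_bits // 8) * channels

# 16 kHz, 16-bit mono -- a common TTS configuration.
rate = pcm_byte_rate(16000, 16)  # 32000 bytes per second
```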
Speech Synthesis / Text to Speech
Spectrum of Sound Signal
(Figure: spectrum of a sound signal, showing the pitch and its harmonics.)
• The human voice ranges from roughly 125 Hz to 8 kHz.
• Typical F0: male ≈ 125 Hz, female ≈ 200 Hz, child ≈ 300 Hz.
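One classic way to estimate F0 is to find the lag at which a signal best correlates with itself. A minimal pure-Python sketch (function name and the synthetic 200 Hz test tone are illustrative, not from the slides):

```python
import math

def estimate_f0(signal, sr, fmin=80.0, fmax=400.0):
    """Pick the autocorrelation peak inside the plausible F0 lag range."""
    lag_min = int(sr / fmax)  # shortest period (highest F0) to consider
    lag_max = int(sr / fmin)  # longest period (lowest F0) to consider
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        corr = sum(signal[i] * signal[i - lag] for i in range(lag, len(signal)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sr / best_lag

sr = 16000
# 100 ms of a synthetic 200 Hz tone (roughly a typical female F0).
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(sr // 10)]
f0 = estimate_f0(tone, sr)  # close to 200.0
```

Production pitch trackers (e.g. YIN-style methods) add windowing, normalization, and voicing decisions on top of this basic idea.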
Speech Synthesis / Text to Speech
MEL Spectrum
• Mel spectrogram: The short-term power spectrum of a sound mapped onto the
nonlinear mel scale of frequency; neural TTS models usually use 80 mel
bands. (The related mel-frequency cepstrum (MFC) additionally applies a
linear cosine transform to the log mel powers.)
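The mel scale itself is a simple nonlinear mapping of frequency. A sketch using one common formula (O'Shaughnessy's, as adopted by many mel filterbank implementations; some libraries use a slightly different variant):

```python
import math

def hz_to_mel(f_hz):
    # O'Shaughnessy's formula; calibrated so 1000 Hz maps to about 1000 mel.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Exact inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

m = hz_to_mel(1000.0)  # approximately 1000
```

An 80-band mel filterbank places 80 triangular filters evenly spaced on this mel axis between the minimum and maximum frequencies of interest.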
Speech Synthesis / Text to Speech
Key Components
* Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
Speech Synthesis / Text to Speech
Text Analysis
• Text normalization,
• Word segmentation,
• Part-of-speech (POS) tagging,
• Prosody prediction,
• Character/grapheme-to-phoneme conversion (alignment).
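The first of these steps, text normalization, expands non-standard written forms into speakable words. A deliberately naive sketch (the tiny tables and function name are illustrative; real systems use large lexicons and context-aware rules, e.g. reading "221" as "two hundred twenty-one"):

```python
import re

# Tiny illustrative tables -- real normalizers are far more elaborate.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    out = text.lower()
    # Expand known abbreviations (naive substring replacement).
    for abbr, expansion in ABBREVIATIONS.items():
        out = out.replace(abbr, expansion)
    # Spell digits out one by one.
    out = re.sub(r"\d", lambda m: " " + DIGITS[m.group(0)] + " ", out)
    return " ".join(out.split())

normalize("Dr. Smith lives at 221 St.")
# 'doctor smith lives at two two one street'
```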
Speech Synthesis / Text to Speech
Acoustic Model
• Inputs: Linguistic features, or phonemes/characters directly.
• Outputs: Acoustic features.
• RNN-based, CNN-based, Transformer-based.
Speech Synthesis / Text to Speech
Vocoder
• Part of the system decoding from acoustic features to audio signals/waveform.
Speech Synthesis / Text to Speech
Different Structures
* Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
Speech Synthesis / Text to Speech
Different Choices
• Single- or multi-speaker,
• Single- or multi-language,
• Single- or multi-gender.
Speech Synthesis / Text to Speech
WaveNet
Speech Synthesis / Text to Speech
DeepVoice 1/2/3
• Added speaker embeddings.
Speech Synthesis / Text to Speech
Tacotron 1/2
Speech Synthesis / Text to Speech
FastSpeech 1/2/2s
Speech Synthesis / Text to Speech
WaveGlow
Speech Synthesis / Text to Speech
HiFi-GAN
• GAN Architecture
• Generator: Fully Convolutional
• Discriminator:
• Multi-Period Discriminator
• Multi-Scale Discriminator
Speech Synthesis / Text to Speech
Other Models
• End-to-End Adversarial Text-to-Speech (EATS)
• WaveGAN
• MelGAN
• GAN-TTS
• Char2Wav
• ClariNet
• FastPitch
Speech Synthesis / Text to Speech
Datasets
• ARCTIC, VCTK, Blizzard-2011, Blizzard-2013, LJSpeech, LibriSpeech,
LibriTTS, VCC, HiFi-TTS, TED-LIUM, CALLHOME, RyanSpeech (English)
• CSMSC, HKUST, AISHELL-1, AISHELL-2, AISHELL-3, DiDiSpeech-1,
DiDiSpeech-2 (Mandarin)
• India Corpus, M-AILABS, MLS, CSS10, CommonVoice (Multilingual)
Speech Synthesis / Text to Speech
CommonVoice
Speech Synthesis / Text to Speech
CommonVoice
Speech Synthesis / Text to Speech
Resources
• DeepMind
• Google
• Microsoft
• Nvidia
• Coqui AI
• Mozilla TTS
• Nuance
Questions?

Editor's Notes

  1. Text to speech (TTS), also known as speech synthesis, aims to synthesize intelligible and natural speech from text. It has broad applications in human communication and has long been a research topic in artificial intelligence, natural language processing, and speech processing. Developing a TTS system requires knowledge about languages and human speech production, and involves multiple disciplines including linguistics, acoustics, digital signal processing, and machine learning.
  2. In the 2nd half of the 18th century, the Hungarian scientist, Wolfgang von Kempelen, had constructed a speaking machine with a series of bellows, springs, bagpipes and resonance boxes to produce some simple words and short sentences. The first speech synthesis system that built upon computer came out in the latter half of the 20th century. The early computer-based speech synthesis methods include articulatory synthesis, formant synthesis, and concatenative synthesis. Articulatory Synthesis: Articulatory synthesis produces speech by simulating the behavior of human articulator such as lips, tongue, glottis and moving vocal tract. Formant Synthesis: Formant synthesis produces speech based on a set of rules that control a simplified source-filter model. These rules are usually developed by linguists to mimic the formant structure and other spectral properties of speech as closely as possible. The speech is synthesized by an additive synthesis module and an acoustic model with varying parameters like fundamental frequency, voicing, and noise levels. Concatenative Synthesis: Concatenative synthesis relies on the concatenation of pieces of speech that are stored in a database. Usually, the database consists of speech units ranging from whole sentence to syllables that are recorded by voice actors. Later, as the development of statistics machine learning, statistical parametric speech synthesis (SPSS) is proposed which predicts parameters such as spectrum, fundamental frequency and duration for speech synthesis. Statistical Parametric SynthesisTo address the drawbacks of concatenative TTS, statistical para-metric speech synthesis (SPSS) is proposed [416,356,415,425,357]. 
The basic idea is that instead of directly generating the waveform through concatenation, we first generate the acoustic parameters [82, 355, 156] that are necessary to produce speech, and then recover speech from those parameters with dedicated algorithms. From the 2010s, neural network-based speech synthesis has gradually become the dominant method and achieved much better voice quality. Neural Speech Synthesis: with the development of deep learning, neural network-based TTS (neural TTS for short) was proposed, which adopts (deep) neural networks as the model backbone for speech synthesis. Some early neural models were adopted in SPSS to replace the HMM for acoustic modeling. Later, WaveNet was proposed to directly generate the waveform from linguistic features, and can be regarded as the first modern neural TTS model. Other models like Deep Voice 1/2 still follow the three components of statistical parametric synthesis, but upgrade them with corresponding neural network-based models. Furthermore, some end-to-end models (e.g., Tacotron 1/2, Deep Voice 3, and FastSpeech 1/2) were proposed to simplify the text analysis module, directly take character/phoneme sequences as input, and simplify the acoustic features to mel-spectrograms. Later, fully end-to-end TTS systems were developed to directly generate the waveform from text, such as ClariNet, FastSpeech 2s and EATS (DeepMind's end-to-end adversarial text-to-speech model). Compared to previous TTS systems based on concatenative synthesis and statistical parametric synthesis, the advantages of neural network-based speech synthesis include high voice quality in terms of both intelligibility and naturalness, and less reliance on human preprocessing and feature development.
  3. Prosody: intonation, stress, and rhythm. Phonemes: units of sound that distinguish one word from another (e.g., the Turkish pair "kahır" vs. "ahır"). Part-of-speech: nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, articles/determiners, interjections. Vocoder: decodes audio signals from acoustic features. Pitch/Fundamental Frequency (F0): the lowest frequency of a periodic waveform. Alignment: associating characters/graphemes with phonemes. Duration: represents how long each speech sound lasts.
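To make F0 concrete, here is a minimal sketch that estimates the fundamental frequency of a synthetic tone via autocorrelation (pure NumPy; the 22,050 Hz sample rate and search band are illustrative choices):

```python
import numpy as np

def estimate_f0(signal, sr, fmin=50.0, fmax=500.0):
    """Estimate F0 as the lag of the autocorrelation peak within [fmin, fmax]."""
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sr / best_lag

sr = 22050
t = np.arange(sr) / sr                  # 1 second of audio
tone = np.sin(2 * np.pi * 220.0 * t)    # 220 Hz sine (A3)
print(estimate_f0(tone, sr))            # a value close to 220
```

Real pitch trackers (e.g., YIN-style methods) refine this idea with normalization and interpolation, but the periodicity-peak intuition is the same.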
  4. Mean Opinion Score (MOS): The most frequently used method to evaluate the quality of the generated speech. MOS has a range from 0 to 5 where real human speech is between 4.5 to 4.8.
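MOS is just the arithmetic mean of listeners' ratings; a minimal sketch (the ratings below are hypothetical) that also reports a normal-approximation 95% confidence interval, as TTS papers commonly do:

```python
import math

def mos(ratings):
    """Mean Opinion Score with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)                       # 95% half-width
    return mean, half

ratings = [4, 5, 4, 4, 5, 3, 4, 5]   # hypothetical listener scores (0-5 scale)
score, ci = mos(ratings)
print(f"MOS = {score:.2f} +/- {ci:.2f}")  # MOS = 4.25 +/- 0.49
```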
  5. Text normalization: the raw written text (non-standard words) should be converted into spoken-form words through text normalization, which makes the words easy to pronounce for TTS models. For example, the year "1989" is normalized into "nineteen eighty nine", and "Jan. 24" is normalized into "January twenty-fourth". Word segmentation: for character-based languages such as Chinese, word segmentation is necessary to detect word boundaries in the raw text. Part-of-speech tagging: the part-of-speech (POS) of each word, such as noun, verb, or preposition, is also important for grapheme-to-phoneme conversion and prosody prediction in TTS. Prosody prediction: prosody information, such as the rhythm, stress, and intonation of speech, corresponds to variations in syllable duration, loudness and pitch, and plays an important perceptual role in human speech communication.
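Text normalization is often rule-based. A minimal sketch of one such rule, covering only four-digit years of the "1989" form used in the example above (the regex and helper names are invented for illustration; real normalizers handle many more categories):

```python
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TENS = ["", "ten", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]

def two_digits(n):
    """Spell out 0-99 in English."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def normalize_year(text):
    """Rewrite four-digit years like 1989 as 'nineteen eighty nine'."""
    def repl(m):
        y = m.group(0)
        return two_digits(int(y[:2])) + " " + two_digits(int(y[2:]))
    return re.sub(r"\b1[0-9]{3}\b", repl, text)

print(normalize_year("The wall fell in 1989."))
# The wall fell in nineteen eighty nine.
```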
  6. Acoustic models generate acoustic features from linguistic features, or directly from phonemes or characters. As TTS developed, different kinds of acoustic models were adopted: the early HMM- and DNN-based models in statistical parametric speech synthesis (SPSS), then sequence-to-sequence models based on the encoder-attention-decoder framework (including LSTM, CNN and self-attention), and the latest feed-forward networks (CNN or self-attention) for parallel generation. Acoustic models aim to generate acoustic features that are then converted into a waveform by a vocoder. RNN-based models (e.g., the Tacotron series). CNN-based models (e.g., the Deep Voice series): Deep Voice [8] is actually an SPSS system enhanced with convolutional neural networks; after obtaining linguistic features through neural networks, it leverages a WaveNet-based [254] vocoder to generate the waveform. Transformer-based models (e.g., the FastSpeech series).
  7. Early neural vocoders such as WaveNet, Char2Wav, and WaveRNN directly take linguistic features as input and generate the waveform. Later vocoders (Prenger et al., Kim et al., Kumar et al., Yamamoto et al.) take mel-spectrograms as input and generate the waveform.
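The mel-spectrogram that these later vocoders consume is a linear spectrogram projected through a mel filterbank. A minimal NumPy sketch of constructing such a filterbank (HTK-style mel formula; the sample rate, FFT size and band count are typical but illustrative values):

```python
import numpy as np

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=8000.0):
    """Triangular filters mapping |STFT|^2 bins to mel bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 equally spaced points on the mel scale -> FFT bin indices
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):            # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # (80, 513): 80 mel bands x 513 FFT bins
```

Multiplying this matrix with a magnitude spectrogram (shape `513 x frames`) yields the `80 x frames` mel-spectrogram the vocoder expects; libraries like librosa provide tuned versions of the same construction.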
  8. Fully end-to-end TTS models can generate the speech waveform directly from a character or phoneme sequence, which has the following advantages: 1) less human annotation and feature development is required (e.g., alignment information between text and speech); 2) joint, end-to-end optimization avoids error propagation across cascaded models (e.g., Text Analysis + Acoustic Model + Vocoder); 3) training, development and deployment costs are reduced. The progression toward end-to-end models: 1) simplifying the text analysis module and linguistic features; 2) simplifying the acoustic features, where complicated acoustic features are reduced to mel-spectrograms; 3) replacing two or three modules with a single end-to-end model. However, there are major challenges in training TTS models in an end-to-end way, mainly due to the different modalities of text and speech waveform, as well as the huge length mismatch between the character/phoneme sequence and the waveform sequence. For example, for 5 seconds of speech containing about 20 words, the phoneme sequence is only about 100 symbols long, while the waveform sequence is about 110k samples long (at a 22 kHz sample rate).
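The length mismatch from the example above, worked out in code (the hop size and per-word phoneme count are illustrative assumptions; the intermediate mel-spectrogram frame count shows why predicting spectrograms first makes the gap far more manageable):

```python
sr = 22050            # sample rate (Hz)
seconds = 5
hop = 256             # STFT hop size, a common choice

n_waveform = sr * seconds        # waveform samples to generate
n_frames = n_waveform // hop     # mel-spectrogram frames (intermediate target)
n_phonemes = 20 * 5              # ~20 words x ~5 phonemes each

print(n_waveform, n_frames, n_phonemes)
# 110250 430 100
```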
  9. Some TTS systems explicitly model speaker identity, either through a speaker lookup table or through a speaker encoder.
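A speaker lookup table is simply a learned embedding matrix indexed by speaker ID; a minimal NumPy sketch (dimensions and the random initialization are illustrative, and in a real system the table would be trained jointly with the model):

```python
import numpy as np

n_speakers, d_embed = 4, 8
rng = np.random.default_rng(0)
speaker_table = rng.normal(size=(n_speakers, d_embed))  # one row per speaker

def speaker_embedding(speaker_id):
    """Look up the vector that conditions the acoustic model / vocoder."""
    return speaker_table[speaker_id]

e = speaker_embedding(2)
print(e.shape)  # (8,)
```

A speaker encoder, by contrast, computes the embedding from a reference utterance, which allows generalization to speakers unseen at training time.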
  10. WaveNet: a deep generative model of raw audio waveforms. Its authors showed that WaveNets can generate speech which mimics any human voice and which sounds more natural than the best existing text-to-speech systems, reducing the gap with human performance by over 50%. Key properties: autoregressive; causal convolution; dilated convolution; really slow for real-life applications. WaveNet was inspired by PixelCNN and PixelRNN, which are able to generate very complex natural images. Follow-ups addressing the speed problem: Fast WaveNet and Parallel WaveNet.
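The two convolution variants named above can be sketched in NumPy: pad on the left only (causal, so no output depends on future samples) and space the filter taps `dilation` samples apart (the filter weights and input are toy values):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation=1):
    """y[t] = sum_i w[i] * x[t - i * dilation]; the output never sees the future."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])     # left padding keeps causality
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = causal_dilated_conv(x, w=[1.0, 1.0], dilation=2)   # computes x[t] + x[t-2]
print(y)  # [1. 2. 4. 6. 8.]
```

Stacking such layers with dilations 1, 2, 4, 8, ... is what gives WaveNet its exponentially growing receptive field at linear cost.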
  11. Deep Voice, by Baidu, consists of four different neural networks that together form an end-to-end pipeline: (1) a segmentation model that locates boundaries between phonemes, a hybrid CNN-RNN network trained to predict the alignment between vocal sounds and the target phonemes; (2) a model that converts graphemes to phonemes; (3) a model that predicts phoneme durations and fundamental frequencies; the same phoneme may have different durations in different words, so duration must be predicted, while the fundamental frequency gives the pitch of each phoneme; (4) a model that synthesizes the final audio, for which the authors implemented a modified WaveNet. As you can see, it still follows the three components of statistical parametric synthesis, but upgrades them with corresponding neural network-based models. Deep Voice 2 added speaker embeddings. Deep Voice 3 uses a single model instead of four different ones: a fully-convolutional character-to-spectrogram architecture, which is ideal for parallel computation, as opposed to RNN-based models. The authors also experimented with different waveform synthesis methods, with WaveNet achieving the best results once again, and scaled training to data from over 2,000 speakers.
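The grapheme-to-phoneme step in the pipeline above is, at its simplest, a pronunciation-dictionary lookup with a fallback for out-of-vocabulary words; a minimal sketch with a tiny hand-made lexicon (the entries are illustrative, in ARPAbet-style notation; real systems use dictionaries like CMUdict plus a trained seq2seq model for unseen words):

```python
# Tiny hand-made lexicon for illustration only.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "deep": ["D", "IY", "P"],
}

def g2p(word):
    """Dictionary lookup; naive letter-as-phoneme fallback for OOV words."""
    return LEXICON.get(word.lower(), list(word.upper()))

print(g2p("speech"))   # ['S', 'P', 'IY', 'CH']
print(g2p("tts"))      # ['T', 'T', 'S']  (naive fallback)
```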
  12. Tacotron was released by Google in 2017 as an end-to-end system: a sequence-to-sequence model following the familiar encoder-decoder architecture, with an attention mechanism. End-to-end; faster than WaveNet; character sequence => audio spectrogram => synthesized audio. The encoder's goal is to extract robust sequential representations of text. It receives a character sequence represented as one-hot encodings and, through a stack of PreNets and CBHG modules, outputs the final representation. PreNet describes the non-linear transformations applied to each embedding. Content-based attention is used to pass the representation to the decoder, where a recurrent layer produces the attention query at each time step. The query is concatenated with the context vector and passed to a stack of GRU cells with residual connections. The output of the decoder is converted into the final waveform by a separate post-processing network containing a CBHG module. No support for multi-speaker synthesis. Tacotron 2 improves and simplifies the original architecture. While there are no major differences, its key points are: the encoder now consists of 3 convolutional layers and a bidirectional LSTM, replacing the PreNets and CBHG modules; location-sensitive attention improves on the original additive attention mechanism; the decoder is now an autoregressive RNN formed by a PreNet, 2 unidirectional LSTMs, and a 5-layer convolutional Post-Net; a modified WaveNet, following PixelCNN++ and Parallel WaveNet, is used as the vocoder, replacing the Griffin-Lim algorithm used in Tacotron 1; mel-spectrograms are generated and passed to the vocoder instead of linear-scale spectrograms.
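The content-based attention mentioned above scores each encoder output against the decoder's query and sums the values by those scores; a minimal NumPy sketch with toy deterministic matrices (real Tacotron computes the scores with a small learned network rather than a plain dot product):

```python
import numpy as np

def attention(query, keys, values):
    """Content-based attention: softmax over query-key similarity scores,
    then a weighted sum of the values yields the context vector."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax
    return weights @ values, weights

keys = np.array([[1.0, 0.0, 0.0],              # one row per encoder step
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 1.0]])
values = np.arange(12.0).reshape(4, 3)
query = np.array([0.0, 5.0, 0.0])              # most similar to the 2nd key
context, weights = attention(query, keys, values)
print(weights.argmax())  # 1: attention focuses on the matching encoder step
```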
  13. Through parallel mel-spectrogram generation, FastSpeech greatly speeds up the synthesis process. A phoneme duration predictor ensures hard alignment between a phoneme and its mel-spectrogram frames, very different from the soft, automatic attention alignments of autoregressive models. The length regulator (Figure 1c) solves the length mismatch between the phoneme and spectrogram sequences, and can easily adjust voice speed (voice speed or prosody control). Drawbacks of FastSpeech: 1) the teacher-student distillation pipeline is complicated and time-consuming; 2) the duration extracted from the teacher model is not accurate enough. FastSpeech 2/2s keep the same Transformer (FFT-block) encoder. First, they remove the teacher-student distillation pipeline and directly use ground-truth mel-spectrograms as the training target, which avoids the information loss in distilled mel-spectrograms and raises the upper bound of voice quality. Second, the variance adaptor consists of not only a duration predictor but also pitch and energy predictors.
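The length regulator described above expands each phoneme's hidden state according to its predicted duration; a minimal NumPy sketch (the hidden states and integer durations are illustrative, and `alpha` is the speed-control factor from the paper):

```python
import numpy as np

def length_regulator(hidden, durations, alpha=1.0):
    """Repeat each phoneme's hidden state `duration` times; alpha rescales
    the durations to control voice speed."""
    scaled = np.maximum(1, np.round(np.asarray(durations) * alpha)).astype(int)
    return np.repeat(hidden, scaled, axis=0)

hidden = np.array([[1.0, 1.0],     # one row per phoneme
                   [2.0, 2.0],
                   [3.0, 3.0]])
out = length_regulator(hidden, durations=[2, 1, 3])
print(out.shape[0])   # 6 frames
slow = length_regulator(hidden, durations=[2, 1, 3], alpha=2.0)
print(slow.shape[0])  # 12 frames: doubled durations slow the speech down
```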
  14. WaveGlow, by Nvidia, is one of the most popular flow-based TTS models. It essentially tries to combine insights from Glow and WaveNet in order to achieve fast and efficient audio synthesis without autoregression. Note that WaveGlow is used strictly to generate speech from mel-spectrograms, replacing WaveNet as the vocoder.
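The key ingredient WaveGlow borrows from Glow is the invertible affine coupling layer; a minimal NumPy sketch (the scale/shift "network" here is a toy random linear map, not WaveGlow's mel-conditioned WN network) showing that the inverse recovers the input exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8), scale=0.1)    # toy network: x_a -> (log_s, t)

def coupling_forward(x):
    """Transform half of x conditioned on the other half; trivially invertible."""
    x_a, x_b = x[:4], x[4:]
    log_s, t = np.split(W.T @ x_a, 2)     # (8,) -> log-scale (4,), shift (4,)
    y_b = x_b * np.exp(log_s) + t
    return np.concatenate([x_a, y_b])

def coupling_inverse(y):
    y_a, y_b = y[:4], y[4:]
    log_s, t = np.split(W.T @ y_a, 2)     # y_a == x_a, so same log_s and t
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b])

x = rng.normal(size=8)
recovered = coupling_inverse(coupling_forward(x))
print(np.allclose(recovered, x))  # True
```

Because the untransformed half passes through unchanged, the inverse can recompute the same scale and shift, which is exactly what lets flow models train by maximum likelihood yet sample in a single parallel pass.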
  15. HiFi-GAN. The generator is a fully convolutional neural network: it takes a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of the raw waveform. The Multi-Period Discriminator (MPD) is a mixture of sub-discriminators, each of which only accepts equally spaced samples of the input audio. The Multi-Scale Discriminator (MSD): because each sub-discriminator in MPD only accepts disjoint samples, MSD is added to consecutively evaluate the audio sequence. The architecture of MSD is drawn from MelGAN (Kumar et al., 2019): a mixture of three sub-discriminators operating on different input scales: raw audio, ×2 average-pooled audio, and ×4 average-pooled audio. Losses: GAN loss, mel-spectrogram loss, and feature-matching loss. HiFi-GAN V1 reaches a MOS of about 4.3, similar to human quality, while generating 22.05 kHz high-fidelity audio 167.9 times faster than real time on a single V100 GPU.
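The "equally spaced samples" each MPD sub-discriminator sees can be obtained by padding the waveform and reshaping it into a 2-D array of period p, so that every column holds samples p steps apart; a minimal NumPy sketch:

```python
import numpy as np

def periodize(audio, p):
    """Pad to a multiple of p, then reshape so column j holds samples
    j, j+p, j+2p, ... - the equally spaced view an MPD branch operates on."""
    pad = (-len(audio)) % p
    x = np.pad(audio, (0, pad))
    return x.reshape(-1, p)

audio = np.arange(10.0)
view = periodize(audio, p=3)
print(view.shape)    # (4, 3)
print(view[:, 0])    # [0. 3. 6. 9.] - every 3rd sample
```

HiFi-GAN uses one such branch per period (the paper picks prime periods like 2, 3, 5, 7, 11) and applies 2-D convolutions to each reshaped view, which is how the MPD captures the diverse periodic structure of speech.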