[Deep.In. Article] AdaSpeech2: Adaptive Text to Speech with Untranscribed Data - DeepBrainAI
Like the AdaSpeech model we looked at last time, existing TTS adaptation methods have relied on paired text-speech data to synthesize a target speaker's voice. In practice, however, preparing such paired data is difficult, so it would be far more efficient to adapt the TTS model using only untranscribed speech. The most straightforward approach is to transcribe the speech with an automatic speech recognition (ASR) system, but ASR is hard to apply in certain situations, and its recognition accuracy is not always high enough, which can degrade the final adaptation quality. There have also been attempts to solve this problem by jointly training the TTS pipeline with an adaptation module, but such approaches have the disadvantage that they cannot be easily combined with other commercial TTS models.