Authors: Naihan Li (GitHub page), Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou
Abstract: Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different time steps are connected directly by self-attention, which effectively solves the long-range dependency problem. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, which a WaveNet vocoder then converts into the final audio. Experiments are conducted to test the efficiency and performance of the new network. In terms of efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. In terms of performance, rigorous human tests show that our proposed model achieves state-of-the-art quality, outperforming Tacotron2 by a gap of 0.048 and coming very close to human quality (4.39 vs. 4.44 in MOS).
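The core building block the abstract refers to is multi-head self-attention, in which every hidden state is computed from all positions at once rather than sequentially. Below is a minimal illustrative PyTorch sketch, not our released implementation: the class name, parameter names, and sizes (d_model = 512, 8 heads, borrowed from the base Transformer) are assumptions for exposition only.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product multi-head self-attention (illustrative sketch).

    Hyperparameters (d_model=512, n_heads=8) follow the base Transformer
    and are assumptions here, not necessarily our model's configuration.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        # One projection each for queries, keys, values, plus an output projection.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -- e.g. a sequence of phoneme embeddings.
        b, t, d = x.shape
        # Project and split into heads: (batch, n_heads, seq_len, d_head).
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Every position attends to every other position in one matrix multiply,
        # so all hidden states are computed in parallel (no recurrence).
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)

# Toy usage: a batch of 2 sequences of 50 embeddings.
layer = MultiHeadSelfAttention()
h = layer(torch.randn(2, 50, 512))
print(h.shape)  # torch.Size([2, 50, 512])
```

Because the attention score matrix connects every pair of positions directly, the path between any two time steps has constant length, which is the property credited above for handling long-range dependencies.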
Samples generated by our model
“Two to five inches of rain is possible by early Monday, resulting in some flooding.”
“In Texas, hospitals have passed along higher costs to local taxpayers.”
“Flooding is likely in some parishes of southern Louisiana.”
“Defending champ Tiger Woods is one of eight golfers within two strokes.”
“Soon, average life expectancy will dip below forty years in ten African countries.”
Comparison between our generated samples and recordings
“I cannot judge whether Stuart Taylor fell prey to this possibility.”
Our sample:
Recording:
“I'm telling you, the charming, graceful sick person is just a myth, an urban legend.”
Our sample:
Recording:
“Okay, now Nicole is making me recoil with horror.”
Our sample:
Recording:
“All filled out a questionnaire about their eating habits.”
Our sample:
Recording:
“Hotels, shops, restaurants, bars, and travel agencies are all conveniently located here.”