Text-to-speech technology is getting smarter over time, but the drawback is that building a natural-sounding product still takes an excessive amount of training time and resources.
Tech giant Microsoft, working with Chinese researchers, may have found a more efficient approach: a text-to-speech AI that can produce realistic speech using only 200 voice samples. The same system is capable of generating matching transcriptions as well.
The AI relies in part on Transformers, deep neural networks that loosely mimic the neurons in the brain. Transformers weigh every input and output on the fly, much like synaptic connections, which helps the system handle complex sentences efficiently. This is combined with a noise-removing (denoising) auto-encoder, which makes the text-to-speech AI more effective.
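To make those two ideas a little more concrete, here is a minimal sketch, in PyTorch, of a Transformer encoder trained with a denoising objective. It is not Microsoft's implementation; the vocabulary size, model dimensions, and masking scheme are illustrative assumptions only.

```python
# Illustrative sketch: a Transformer encoder (self-attention over every
# position at once) paired with a denoising objective (reconstruct the
# original sequence from a corrupted copy). All sizes are placeholders.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # assumed toy vocabulary of text/phoneme tokens
D_MODEL = 256       # assumed embedding width

class DenoisingTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, corrupted_tokens):
        # Self-attention lets every token attend to every other token,
        # the "considers every input and output on the fly" behaviour.
        hidden = self.encoder(self.embed(corrupted_tokens))
        return self.out(hidden)  # logits used to predict the clean tokens

def corrupt(tokens, drop_prob=0.15):
    # Randomly mask tokens so the model must learn to restore them.
    mask = torch.rand(tokens.shape) < drop_prob
    return tokens.masked_fill(mask, 0)  # assume id 0 is a <mask> token

model = DenoisingTransformer()
clean = torch.randint(1, VOCAB_SIZE, (8, 32))           # fake batch
logits = model(corrupt(clean))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), clean.reshape(-1))  # denoising loss
loss.backward()
```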
The results are not flawless, as a slightly robotic sound remains, but they are remarkably accurate, with word intelligibility of almost 100 percent. If the approach comes within reach of smaller companies, it could make natural-sounding text-to-speech feasible for far more people.
"We have proposed the almost unsupervised method for text to speech and automatic speech recognition, which leverages only few paired speech and text data and extra unpaired data. Our method consists of several keys components, including denoising auto-encoder, dual transformation, bidirectional sequence modeling, and a unified model structure to incorporate the above components. We can achieve 99.84% in terms of word level intelligible rate and 2.68 MOS for TTS, and 11.7% PER for ASR with just 200 paired data on LJSpeech dataset, demonstrating the effectiveness of our method. The further analyses verify the importance of each component of our method.", explained the study in a paper. Adding further, "For future work, we will push toward the limit of unsupervised learning by purely leveraging unpaired speech and text data, with the help of other pre-training methods. We will also leverage an advanced model for the vocoder instead of Griffin-Lim, such as WaveNet, to enhance the quality of the generated audio."