Computer-generated speech has just witnessed a revolution. A number of voice clips cloning Bill Gates’ voice were recently demonstrated; in them, we can hear several phrases spoken in the Microsoft founder’s voice.
The clips were created by MelNet, a machine learning system developed by researchers at Facebook. Bill Gates is not the only one whose voice has been cloned: MelNet has also reproduced the voices of George Takei, Stephen Hawking, and a number of other well-known figures.
"He said the same phrase thirty times."
"2 + 7 is less than ten."
According to the research paper by Sean Vasquez and Mike Lewis, MelNet was trained on a 452-hour dataset of TED Talks, along with a number of audiobooks.
The quality of computer-generated voices has improved impressively over the last few years, and 2016 was a breakthrough year for the technology with the introduction of SampleRNN and WaveNet. WaveNet, a machine learning text-to-speech program developed by DeepMind, Google’s AI lab in London, now powers the Google Assistant.
WaveNet, SampleRNN, and similar tools learn the varied tones of the human voice by feeding large amounts of data into the AI system. Unlike earlier text-to-speech systems, which essentially reconstructed audio from pre-recorded fragments, these models produce audio from scratch, one step at a time, as sketched below.
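To make that distinction concrete, here is a minimal sketch of what generating audio "from scratch" means in an autoregressive model: each new sample is predicted from the ones that came before it. This is a toy illustration only; the predict_next function is a hypothetical stand-in for the deep neural network a system like WaveNet actually uses.

```python
import numpy as np

def predict_next(context: list) -> float:
    # Hypothetical stand-in for a trained model: systems like WaveNet
    # use a deep neural network here. This toy rule just continues the
    # previous sample with a little noise.
    return 0.9 * context[-1] + 0.01 * np.random.randn()

def generate(n_samples: int, seed: float = 0.1) -> list:
    # Autoregressive loop: each new sample is conditioned on everything
    # generated so far, instead of being copied from a recording.
    audio = [seed]
    for _ in range(n_samples - 1):
        audio.append(predict_next(audio))
    return audio

waveform = generate(16000)  # roughly one second at a 16 kHz sample rate
```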
MelNet doesn’t rely on waveforms; it learns to speak from spectrograms, which represent audio as energy across frequencies over time. The Facebook researchers note that MelNet captures the "high-level structure" of speech so well that even subtle regularities of a human voice are accounted for. The primary reason is that the information in a spectrogram is packed far more densely than in an audio waveform, which helps the model produce consistent voices.
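That density difference is easy to see in practice. The snippet below is a minimal sketch using the librosa library (the 80 mel bands and hop length are illustrative assumptions, not MelNet’s exact configuration): one second of raw audio is tens of thousands of timesteps, while the same second as a mel spectrogram is only a few dozen frames along the time axis.

```python
import numpy as np
import librosa  # assumed available; any STFT library would do

sr = 22050                       # sample rate in Hz
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz tone

# Raw waveform: 22,050 timesteps for a model to predict, one by one.
print(y.shape)                   # (22050,)

# The same second as a mel spectrogram: about 44 frames of 80 frequency
# bins each. Far fewer steps along the time axis, which is why longer-
# range structure such as intonation is easier to learn.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, hop_length=512)
print(mel.shape)                 # (80, 44)
```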
However, the technology still has a long way to go: the model cannot yet reproduce the way a voice shifts over longer stretches of time, such as changes in tone across a passage that build tension or drama. AI text generation suffers from the same limitation.
Regardless, MelNet is a versatile system that has produced impressive results in a short period of time. It can also be used to create music, although that capability still needs considerable improvement.
Although this technology can be beneficial in many ways (AI assistants, helping people with speech impairments, etc.), it can also be put to questionable and dangerous uses (tampering with evidence, audio harassment, scams, etc.). The possibilities are vast, and it is up to all of us to use this technology for the betterment of the world.