Following its move to integrate OpenAI’s ChatGPT with its products, Microsoft has introduced VALL-E, a new language model for text-to-speech (TTS) synthesis that uses discrete audio codec codes as intermediate representations. (via Aitopics)
The model was trained on 60,000 hours of English speech data and can generate speech in a particular voice from a 3-second sample of it. Unlike many other AI tools, VALL-E can capture the mood and tone of a speaker, even when producing words that the original speaker never uttered.
The quality of the voice samples provided by Microsoft varies. While some of them have a natural tone, others have a robotic sound that is unmistakably artificial. AI tends to improve over time, so future recordings will probably be more believable. For now, VALL-E uses only 3-second recordings as prompts, and the technique could plausibly produce more realistic samples if it were given longer or more numerous samples of a voice.
Interestingly, the researchers behind the model said in a paper posted to arXiv, the preprint server hosted by Cornell University, that “Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.” Their experiments show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. They also found that VALL-E could preserve the speaker’s emotion and the acoustic environment of the acoustic prompt in the synthesized speech. Some examples of the work are available on GitHub.
VALL-E – Amazing and Scary
Politicians and other public figures could be impersonated. Given the divisive nature of political discourse on social media, it’s doubtful that many people would pause to question the authenticity of a sensational recording, especially if it sounded at least partially real.
The technology is also a threat to jobs. For example, some have raised the possibility that VALL-E and similar technology could cost voice actors their jobs. Businesses will likely adopt it if the technology develops to the point where it can take the place of voice actors for audiobooks or other material.
VALL-E is impressive, but it also raises ethical questions. The voices produced by this AI tool and similar technologies will sound increasingly believable as artificial intelligence advances. That would make it possible for spam calls to sound like people a potential victim actually knows.
Although Microsoft has published an ethics statement on the use of VALL-E, how the technology will be used remains uncertain. VALL-E is currently not widely accessible, which may be a good thing given the potential dangers associated with it, especially in the hands of those who intend to misuse AI-generated replicas of human voices.