Adobe is well known for its graphics and video tools, and Photoshop and Premiere have been key components of the creative workflow for many years. Its audio tools, on the other hand, have lagged behind somewhat: Audition is a pretty powerful multitrack audio editor, but sound has always played second fiddle to visuals at Adobe. One upshot of that relentless focus on graphics is that there's not a huge amount of new territory left for Photoshop to explore: it's already extremely powerful. So Adobe is finally turning some of its engineering guns towards its audio tools.
VoCo (voice conversion) is a tool that analyses a recording of someone's speech and breaks the sound down into phonemes, the individual sounds we use to form words. Once it has used these to build a database of the speaker's voice, it can transcribe new clips into text, which you can then edit directly. Moving words around in a sentence isn't exactly revolutionary (you can do that in any wave editor), but what's clever is the ability to resynthesize new words that weren't part of the original clip, in the person's own voice.
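To make the idea concrete, here's a toy concatenative sketch of that pipeline in Python. The phoneme labels, snippet lengths and sine-burst "recordings" are all invented placeholders for illustration; Adobe hasn't published how VoCo actually stores or joins phonemes.

```python
import numpy as np

SAMPLE_RATE = 16_000

def fake_snippet(freq_hz, dur_s=0.08):
    """Stand-in for a phoneme-length slice of the speaker's recordings."""
    t = np.linspace(0, dur_s, int(SAMPLE_RATE * dur_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t)

# Toy "voice database": one stored waveform per phoneme. A real system
# would keep analysed audio plus pitch and duration metadata per unit.
phoneme_db = {
    "HH": fake_snippet(200),
    "EH": fake_snippet(300),
    "L":  fake_snippet(250),
    "OW": fake_snippet(350),
}

def synthesize(phonemes):
    """Concatenate stored snippets to 'speak' a word never recorded."""
    return np.concatenate([phoneme_db[p] for p in phonemes])

# "hello" assembled from four phonemes the speaker did record
audio = synthesize(["HH", "EH", "L", "OW"])
```

The interesting part of VoCo is everything this sketch leaves out: smoothing the joins between units and matching pitch and timing so the result sounds like continuous speech rather than spliced tape.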
A sneak peek of Adobe VoCo
We’ve seen stuff a bit like this before: Vocaloid can do text-to-singing using a preset voice and Dragon Dictate claims to be able to transcribe speech to text after learning your voice, though when I tried it, the results were something of a disaster. And, of course, Melodyne and the sample editors of many DAWs now have advanced vocal manipulation tools, even if they’re not quite as ambitious as this.
What’s really interesting about this technology, potentially at least, is that it creates something from nothing. It’s not so much vocal editing as vocal synthesis, but it’s supposed to sound like the real person rather than some preset voice. The implications, if Adobe can pull it off, are huge. At the moment the results sound too clunky to really fool anyone, but remember that this is a preview of the technology; by the time it actually ships it will presumably be a good deal smoother.
Some people have raised concerns about the possibility of this technology being used maliciously, and it’s a valid point. Just as you can Photoshop someone doing something they never did, you will be able to make someone say something they never said. If the tech is good enough, it’ll sound real. Adobe surely understands this and has hinted at a watermark-style solution it is working on.
For producers it’s particularly interesting because it raises the possibility of being able not just to pitch-correct vocals but to resynthesize them completely. At present you can mess with pitch, timing and formants, but you can’t literally create new words from nowhere. So you could, in theory, change entire lyrics long after the vocalist has gone home, or change the way someone pronounces a word. That’s a remarkable prospect. It will all hinge on how seamless the tech ends up being, of course, but this has the potential to be a massive step forward in digital audio.
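It's worth seeing why even today's pitch tools are hard to build, let alone full resynthesis. The naive approach, resampling the waveform, shifts pitch, duration and formants all at once, which is exactly the artefact products like Melodyne work hard to avoid. A minimal numpy illustration of that coupling (purely illustrative, not any product's actual algorithm):

```python
import numpy as np

SR = 44_100  # sample rate in Hz

def naive_pitch_shift(signal, semitones):
    """Resample by the pitch ratio: raises the pitch, but also shortens
    the clip and drags the formants up with it."""
    ratio = 2 ** (semitones / 12)            # +12 semitones = 2x read speed
    idx = np.arange(0, len(signal), ratio)   # read the waveform faster
    return np.interp(idx, np.arange(len(signal)), signal)

tone = np.sin(2 * np.pi * 220 * np.arange(SR) / SR)  # one second of A3
octave_up = naive_pitch_shift(tone, 12)
# The shifted clip is half as long: pitch and time are entangled.
```

Serious tools decouple these with phase-vocoder or PSOLA-style techniques; generating brand-new words in a specific voice is a harder problem again, which is why VoCo is such a striking demo.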