A Montreal-based AI startup called Lyrebird has taken the wraps off a voice imitation algorithm that the team says can not only mimic the speech of a real person but also shift its emotional cadence — and do all this with just a tiny snippet of real-world audio.
The public demo, released online yesterday, consists of a series of audio samples of (fake) speech generated using their algorithm from one-minute voice samples of the speakers. They’ve used voice samples from Donald Trump, Barack Obama and Hillary Clinton to demo the tech in action — and for maximum FAKE NEWS impact, obviously.
Canadian startup mashes up speech sampling, AI and deep learning with incredible results — and more than a few ramifications
The word on the street is that there’s a new API in advanced development which can fully replicate any voice with only a minute’s worth of recorded input.
But as it approaches launch, many will ask what the actual uses for such a tool might be and, more importantly, question the potential repercussions.
Before we tuck into that, though, are we all familiar with the Australian lyrebird? This particular master of mimicry is capable of perfectly copying the bird songs it hears all around it in order to attract a mate. Its skills are so adept that it can even emulate other noises it hears, such as cameras, car alarms and chainsaws (seriously, it’s amazing — check out this classic clip from the BBC).
It’s quite fitting, then, that a Montreal-based AI startup chose the lyrebird’s name for their voice-imitating algorithm. Lyrebird is capable of fully mimicking a person’s voice (even with added emphasis and emotion) after analysing only a minute’s worth of audio recording.
Lyrebird – First impressions
Lyrebird was founded by three University of Montreal PhD students: Jose Sotelo, Kundan Kumar and Alexandre de Brébisson. Kumar and Sotelo worked together on a research paper that looked at using neural networks to generate audio from a series of samples, which then formed the basis for their deep-learning model for speech synthesis.
This idea is nothing new, of course: Adobe’s VoCo was announced a few months ago, promising a service dubbed ‘Photoshop for speech’ — but it required at least 20 minutes of original audio. By contrast, Lyrebird says it can pull that off with only a minute’s worth.
Lyrebird states that it can “compress voice DNA into a unique key [and] use this key to generate anything with its corresponding voice”. The company says it’s even capable of controlling the emotion of the voice, adding inflections of anger, sympathy, stress and more.
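To make that “voice DNA key” idea a little more concrete, here’s a minimal sketch of the general pattern such systems follow: reduce a speaker’s audio to a compact fixed-size representation, then reuse that representation to condition a synthesizer on any new text. Every name, shape and function below is a hypothetical illustration of the concept — none of it is Lyrebird’s actual API, and the “key extraction” here is a crude stand-in for what would really be a learned speaker embedding.

```python
def extract_voice_key(audio_frames, key_size=4):
    """Reduce a variable-length audio recording to a fixed-size
    'voice key' (here: a crude per-bucket average of frame values;
    a real system would use a learned speaker embedding)."""
    buckets = [[] for _ in range(key_size)]
    for i, frame in enumerate(audio_frames):
        buckets[i % key_size].append(frame)
    return [sum(b) / len(b) if b else 0.0 for b in buckets]

def synthesize(voice_key, text):
    """Stand-in for a conditioned synthesizer: a real model would
    generate audio of `text` spoken in the keyed speaker's voice."""
    return {"speaker_key": voice_key, "text": text}

# A minute of audio (toy frames), reduced once to a compact key,
# then reused to "speak" arbitrary new text in that voice:
frames = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2, 0.4]
key = extract_voice_key(frames)
clip = synthesize(key, "Hello from a cloned voice.")
```

The point of the pattern is that the expensive analysis happens once per speaker; after that, the small key is all the synthesizer needs, which is why the claimed one-minute enrolment is plausible in principle.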
Demos and examples of just how the algorithm works are available on their website, and as you can hear, the samples aren’t perfect.
Indeed, there is still some way to go. However, it’s easy to see how the technology could eventually be refined to create digital voices that are indistinguishable from the real source, raising some pretty significant concerns.
Potential problems with mimicry
At a time when authenticity and accuracy are making us question much of what we see and hear, there are obvious problems with being able to emulate anyone’s voice. Tools like these could be used for all manner of criminal purposes, and research shows that early voice authentication systems were easily abused.
But isn’t voice-authentication the silver bullet of security?
That’s a valid question, to which the answer is flatly “no”. Whilst Lyrebird is far from everyday deployment, there is no reason to believe that, even if this team doesn’t crack it, someone else won’t. Robust security always has required, and always will require, a full quiver for defending systems and accounts from the bad guys.
To their credit, though, Lyrebird aren’t shying away from the issue, and address the problem (and offer a potential solution) on their ethics page:
By releasing our technology publicly and making it available to anyone, we want to ensure that there will be no such risks. We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible. More generally, we want to raise attention about the lack of evidence that audio recordings may represent in the near future.
This does potentially open a whole giant can of worms in terms of what would hold up as evidence in an investigation, but that’s some murky territory which is better left for when (and if) applications like these become more prevalent.
However, Lyrebird themselves hope that developers will put the technology to more benign use. Ideas for the voice API range from customizable personal assistants, IoT connectivity, entertainment and gaming, and speech synthesis for people with certain disabilities, right down to audiobooks read by your celebrity of choice.
As Lyrebird’s AI and machine-learning capabilities are still in development, there is no date yet for when the API will be available, but the team are currently at the ICLR conference in France discussing their work, and may reveal more there.
And as soon as we find an app that can top the phenomenal Ms Fey, we’ll bring it to you.