On Sunday, White House Deputy Chief of Staff Dan Scavino tweeted out a viral video doctored to look like Democratic presidential candidate Joe Biden falling asleep during an interview; less than 24 hours later, Twitter branded it as "manipulated media." The footage, having originated from an older interview with actor Harry Belafonte, was far too easily debunked.
Welcome to the world of deepfakes, which use artificial intelligence to create media showing people doing or saying things they never did. Despite advances in technology, video deepfakes still haven't achieved the level of sophistication to fool an already skeptical public. But while everyone is focused on video, they should be paying attention to audio deepfakes, which have become incredibly advanced and could pose a more immediate threat to the spread of disinformation.
This is the crux of the latest episode of "Twenty Thousand Hertz," a TED-affiliated podcast that explores "the world's most interesting sounds." In the episode, host Dallas Taylor delves into how such artificial voices are made by creating his very own "Deepfake Dallas" voice, and explores the various ways that audio deepfakes could be used, from harmless goofing around to scams or even political gain.
In an interview with Salon, Taylor said, "We're over here doing this in good fun, but I see a way that if I had malicious intent, I could change the course of our country with audio just by leaking it to the right people at the right time. Someone could say that audio leaked of a president saying something . . . That is terrifying, that someone can do that and affect the entire course of our democracy in one fell swoop."
How easy is it to make a deepfake voice?
YouTube is full of deepfakes of varying degrees of quality, but the podcast spotlights the "Speaking of AI" channel where you can listen to various celebrities say incongruous things, such as Joe Biden telling a rambling story about tying an onion to his belt. Although the tale isn't quite as riveting as his showdown with a razor blade-wielding CornPop, hearing Biden's voice coming out of Grandpa from "The Simpsons" is delightful, albeit a little unsettling.
None of the videos on "Speaking of AI" would fool anyone; mixing "The Simpsons" with any other medium is a huge tip-off that this shouldn't be taken seriously. However, if you take the visuals away, that leaves the high quality of the cloned voice, which can be programmed to say anything. In skilled hands, that could be terrifyingly convincing.
Taylor enlisted the help of "Speaking of AI" creator Tim McSmythurs to make the "Deepfake Dallas" voice, a process that required compiling hours of Taylor speaking from past episodes of "Twenty Thousand Hertz" to obtain clean audio without any music, effects, or other interfering sounds. They also collected corresponding transcripts.
"Tim then goes through their magic, sending it through all of these learning machine models and then this back and forth process to slowly make it better and better and better," said Taylor.
The result, which can be heard in the episode, is seamless as the real Dallas Taylor argues with "Deepfake Dallas" over hosting the show. Sometimes, it's difficult to distinguish between man and machine, until context clues are given. It's silly but also a little bit creepy.
Most deepfake voices aren't dipping into the well of America's podcast hosts, though. That's usually reserved for well-known people who've provided plenty of audio material to the public: characters in TV shows and movies or celebrities, especially politicians.
"Why they're used so much is because we have a lot of clean audio from these people where there's not music and stuff behind it," said Taylor. "Where do we have verbatim transcripts of clean recordings? Anyone who's ever read their own audio book. You can literally copy and paste and make this stuff in no time. Who writes these audio books? Politicians influencers, people who who can change the course of history with the way that they say things. I find the entire idea of deepfakes terrifying."
Taking the deepfake to the next level
Having a deepfake voice at one's command isn't convincing enough to do damage, though. Whatever it's programmed to say can still sometimes fall into that "uncanny valley" that doesn't quite seem human, whether it's a robotic sound, weird cadence, or lack of real-world atmosphere.
That's where sound design can fill in those gaps and boost the credibility of faked audio. Taylor, who also runs audio design studio Defacto Sound, has thought through some of the problems.
"I have never heard anyone really ever speak about the power of sound designers," he said, "but every time I hear a little uncanny valley, I can change the line to make it a little bit more nuanced or change the wording to make it a little bit cleaner. If we just can't make something get out of that uncanny valley, we can put a mic rub right there to mask it.
"We can make that audio sound like it's coming from a telephone or from a bush. We did a whole Watergate episode, and so we could fake a president's voice and make it sound like it was just a secret recording in someone's pocket, and they walked into the Oval Office. I'm pretty confident we could sway the public and make them think that it was absolutely real with enough of a decent backstory."
Taylor started "Twenty Thousand Hertz" to bring the public's attention to sound – not only because it's overlooked, but because it's already a powerful presence in every part of our lives. That's precisely why the audio deepfake could have a big impact.
"Culturally, we treat sound like it's this mysterious thing that only a select few has the ability to actually craft or understand or write about or learn about," said Taylor. "Whereas I'm trying to race against the clock to tell people, 'No, it's not.' It's just one of those senses that we're just letting be a mystery.
"A really sharp talented sound designer can do this, and I could do this. You don't see it happen yet because you don't have a bunch of malicious sound designers that are going to go rogue and do this. With our world of more information getting out, more people understanding how to do this, it's only a matter of time. You could YouTube your way to being a great sound designer."
The dangers of the audio deepfake
Deepfake voices have already been wielded with ill intent. The "Twenty Thousand Hertz" episode opens with a well-known phone scam that uses a fake CEO voice to fool employees into wiring money into scammers' accounts. Synthetic voices could also mimic relatives on the phone asking for passcodes or other sensitive information. But beyond these individual scams is the potential for doing harm on a greater scale.
In the scenario the podcast imagines, damning deepfake audio of a politician could be unleashed and then spread online in minutes. Rianna Pfefferkorn, Associate Director of Surveillance and Cybersecurity at Stanford, says that such deepfake propaganda could be a "major vector to try to influence populations, influence votes."
Plus, in our hyper-partisan world, the public is primed to believe the worst. Depending on the seriousness of the propaganda, it could do some major damage.
"This technology paired with culturally how we very quickly make determinations without without really vetting the details of something is kind of a powder keg ready to explode," said Taylor. "Even if someone has debunked it in 24 hours, the entire world could come crumbling down before that happens. We don't like to be unconvinced of our opinions, so once it's out there, even if something gets debunked, something in our human nature says we want to continue to believe the thing that we believed yesterday."
The flip side of a world primed to view information with suspicion, though, is that anyone can claim that something that actually happened did not. Even live, in-the-moment video and audio have been denied, such as by the GOP House candidate who tried to claim that the George Floyd killing was a deepfake hoax.
"People can say things and then claim that it was made up, which has already happened on the highest scales. [They] now have complete plausible deniability on anything that they ever say," said Taylor.
Even though the misuse of deepfakes has clear political implications, "Twenty Thousand Hertz" has made a point to not be a political show or even one that's hooked into current events as a "silent protest against how just terrible like the news cycle and politics is." Past episodes have focused on a range of topics from ASMR and Stradivarius violins to the Netflix "ta-dum!" and whoopee cushions. Yes, a whole episode on that flatulence-faking toy.
Why, then, skate along the edge of politics with the deepfake episode, which could potentially hand the tools of deception to listeners? In this case, time was of the essence.
"The way that I thought about it was is we need to put this out sooner rather than later. So hopefully this snowball of understanding can get out there before it happens, rather than we just find ourselves in it. It could happen within the next few months," said Taylor. "This is not a matter of if it's going to happen, it will absolutely happen. This show is wrapped in a ribbon of good fun with a very sinister undertone on purpose."
Listen to the "Twenty Thousand Hertz" episode "Deepfake Dallas" wherever you get your podcasts.