Martin Heller
Contributing Writer

What are deepfakes? AI that deceives

feature
Sep 15, 2020 | 8 mins
Artificial Intelligence | Deep Learning | Machine Learning

Deepfakes extend the idea of video compositing with deep learning to make someone appear to say or do something they didn't really say or do

A virtual face, constructed of binary code.
Credit: Thinkstock

Deepfakes are media, often video but sometimes audio, that were created, altered, or synthesized with the aid of deep learning in an attempt to deceive viewers or listeners into believing a false event or false message.

The original example of a deepfake (by Reddit user /u/deepfake) swapped the face of an actress onto the body of a porn performer in a video, which was, of course, completely unethical, although not initially illegal. Other deepfakes have changed what famous people were saying, or the language they were speaking.

Deepfakes extend the idea of video (or movie) compositing, which has been done for decades. Significant video skills, time, and equipment go into video compositing; video deepfakes require much less skill, time (assuming you have GPUs), and equipment, although they are often unconvincing to careful observers.

How to create deepfakes

Originally, deepfakes relied on autoencoders, a type of unsupervised neural network, and many still do. Some people have refined that technique using GANs (generative adversarial networks). Other machine learning methods have also been used for deepfakes, sometimes in combination with non-machine learning methods, with varying results.

Autoencoders

Essentially, autoencoders for deepfake faces in images run a two-step process. Step one is to use a neural network to extract a face from a source image and encode that into a set of features and possibly a mask, typically using several 2D convolution layers, a couple of dense layers, and a softmax layer. Step two is to use another neural network to decode the features, upscale the generated face, rotate and scale the face as needed, and apply the upscaled face to another image.
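The encode-then-decode idea can be sketched with a deliberately tiny linear autoencoder in NumPy. A real deepfake pipeline uses the convolutional and dense layers described above, plus face alignment; every size and hyperparameter here is an illustrative assumption:

```python
import numpy as np

# Minimal linear autoencoder sketch (illustrative only; a production
# deepfake model uses stacked 2D convolutions, not one weight matrix).
rng = np.random.default_rng(0)

n_pixels, n_features = 64, 8               # hypothetical sizes
faces = rng.normal(size=(100, n_pixels))   # stand-in for flattened face crops

W_enc = rng.normal(scale=0.1, size=(n_pixels, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, n_pixels))

def forward(x):
    code = x @ W_enc      # step 1: encode the face into a feature vector
    recon = code @ W_dec  # step 2: decode the features back into pixels
    return code, recon

lr, losses = 0.01, []
for _ in range(200):
    code, recon = forward(faces)
    err = recon - faces                       # reconstruction error
    losses.append(float((err ** 2).mean()))
    # gradient descent on mean squared reconstruction error
    W_dec -= lr * (code.T @ err) / len(faces)
    W_enc -= lr * (faces.T @ (err @ W_dec.T)) / len(faces)

print(losses[0], losses[-1])  # reconstruction loss shrinks as training proceeds
```

The face-swap trick builds on this: train a shared encoder with one decoder per identity, then decode the source face's features with the target identity's decoder.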

Training an autoencoder for deepfake face generation requires a lot of images of the source and target faces from multiple points of view and in varied lighting conditions. Without a GPU, training can take weeks. With GPUs, it goes a lot faster.

GANs

Generative adversarial networks can refine the results of autoencoders, for example, by pitting two neural networks against each other. The generative network tries to create examples that have the same statistics as the original, while the discriminative network tries to detect deviations from the original data distribution.

Training GANs is a time-consuming iterative technique that greatly increases the cost in compute time over autoencoders. Currently, GANs are more appropriate for generating realistic single image frames of imaginary people (e.g. StyleGAN) than for creating deepfake videos. That could change as deep learning hardware becomes faster.

How to detect deepfakes

Early in 2020, a consortium from AWS, Facebook, Microsoft, the Partnership on AI's Media Integrity Steering Committee, and academics built the Deepfake Detection Challenge (DFDC), which ran on Kaggle for four months.

The contest included two well-documented prototype solutions: an introduction, and a starter kit. The winning solution, by Selim Seferbekov, also has a fairly good writeup.

The details of the solutions will make your eyes cross if you're not into deep neural networks and image processing. Essentially, the winning solution did frame-by-frame face detection and extracted SSIM (Structural Similarity) index masks. The software extracted the detected faces plus a 30 percent margin, and used EfficientNet B7 pretrained on ImageNet for encoding (classification). The solution is now open source.
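The 30 percent margin step, at least, is simple enough to sketch. The helper name and the (x0, y0, x1, y1) box format below are assumptions for illustration, not the contest code:

```python
import numpy as np

def crop_with_margin(frame, box, margin=0.30):
    """Crop a detected face from a frame, padded by a fractional margin
    on each side and clamped to the frame bounds. Sketch of the winning
    solution's 30 percent margin step; box is (x0, y0, x1, y1)."""
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = box
    mx = int((x1 - x0) * margin)   # horizontal padding in pixels
    my = int((y1 - y0) * margin)   # vertical padding in pixels
    x0, y0 = max(0, x0 - mx), max(0, y0 - my)
    x1, y1 = min(w, x1 + mx), min(h, y1 + my)
    return frame[y0:y1, x0:x1]

# Usage: a 20x20 detection in a 100x100 frame becomes a 32x32 crop.
frame = np.zeros((100, 100, 3))
face = crop_with_margin(frame, (40, 40, 60, 60))
print(face.shape)  # (32, 32, 3)
```

The padded crops would then be fed to the pretrained EfficientNet B7 encoder for classification.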

Sadly, even the winning solution could only catch about two-thirds of the deepfakes in the DFDC test database.

Deepfake creation and detection applications

One of the best open source video deepfake creation applications is currently Faceswap, which builds on the original deepfake algorithm. It took Ars Technica writer Tim Lee two weeks, using Faceswap, to create a deepfake that swapped the face of Lieutenant Commander Data (Brent Spiner) from Star Trek: The Next Generation into a video of Mark Zuckerberg testifying before Congress. As is typical for deepfakes, the result doesn't pass the sniff test for anyone with significant graphics sophistication. So, the state of the art for deepfakes still isn't very good, with rare exceptions that depend more on the skill of the "artist" than the technology.

That's somewhat comforting, given that the winning DFDC detection solution isn't very good, either. Meanwhile, Microsoft has announced, but has not released as of this writing, Microsoft Video Authenticator. Microsoft says that Video Authenticator can analyze a still photo or video to provide a percentage chance, or confidence score, that the media is artificially manipulated.

Video Authenticator was tested against the DFDC dataset; Microsoft hasn't yet reported how much better it is than Seferbekov's winning Kaggle solution. It would be typical for an AI contest sponsor to build on and improve on the winning solutions from the contest.

Facebook is also promising a deepfake detector, but plans to keep the source code closed. One problem with open-sourcing deepfake detectors such as Seferbekov's is that deepfake generation developers can use the detector as the discriminator in a GAN to guarantee that the fake will pass that detector, eventually fueling an AI arms race between deepfake generators and deepfake detectors.
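The evasion need not even be a full GAN. As a sketch, a black-box hill climb against a published detector's score illustrates the same arms-race dynamic; every name below is hypothetical, and a real attack would plug the detector in as a differentiable discriminator:

```python
def harden_fake(fake, detector_score, perturb, steps=100):
    """Sketch of the arms-race loop: repeatedly perturb a fake and keep
    any variant the open-source detector flags less confidently.
    detector_score returns the detector's fake-probability for a sample;
    perturb proposes a modified variant. Both are hypothetical hooks."""
    best = fake
    for _ in range(steps):
        candidate = perturb(best)
        if detector_score(candidate) < detector_score(best):
            best = candidate  # keep whichever variant fools the detector more
    return best

# Toy usage with scalar stand-ins: the "detector" just scores the value
# itself, and each perturbation shrinks it by 10 percent.
result = harden_fake(10.0, lambda x: x, lambda x: x * 0.9, steps=10)
print(result)
```

Once the hardened fakes are released, detectors retrain on them, and the cycle repeats.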

On the audio front, Descript Overdub and Adobe's demonstrated but as-yet-unreleased VoCo can make text-to-speech close to realistic. You train Overdub for about 10 minutes to create a synthetic version of your own voice; once trained, you can edit your voiceovers as text.

A related technology is Google WaveNet. WaveNet-synthesized voices are more realistic than standard text-to-speech voices, although not quite at the level of natural voices, according to Google's own testing. You've heard WaveNet voices if you have used voice output from Google Assistant, Google Search, or Google Translate recently.

Deepfakes and non-consensual pornography

As I mentioned earlier, the original deepfake swapped the face of an actress onto the body of a porn performer in a video. Reddit has since banned the /r/deepfake sub-Reddit that hosted that and other pornographic deepfakes, since most of the content was non-consensual pornography, which is now illegal, at least in some jurisdictions.

Another sub-Reddit for non-pornographic deepfakes still exists at /r/SFWdeepfakes. While the denizens of that sub-Reddit claim they're doing good work, you'll have to judge for yourself whether, say, seeing Joe Biden's face badly faked onto Rod Serling's body has any value, and whether any of the deepfakes there pass the sniff test for credibility. In my opinion, some come close to selling themselves as real; most can charitably be described as crude.

Banning /r/deepfake does not, of course, eliminate non-consensual pornography, which may have multiple motivations, including revenge porn, which is itself a crime in the US. Other sites that have banned non-consensual deepfakes include Gfycat, Twitter, Discord, Google, and Pornhub, and finally (after much foot-dragging) Facebook and Instagram.

In California, individuals targeted by sexually explicit deepfake content made without their consent have a cause of action against the contentโ€™s creator. Also in California, the distribution of malicious deepfake audio or visual media targeting a candidate running for public office within 60 days of their election is prohibited. China requires that deepfakes be clearly labeled as such.

Deepfakes in politics

Many other jurisdictions lack laws against political deepfakes. That can be troubling, especially when high-quality deepfakes of political figures make it into wide distribution. Would a deepfake of Nancy Pelosi be worse than the conventionally slowed-down video of Pelosi manipulated to make it sound like she was slurring her words? It could be, if produced well. For example, see this video from CNN, which concentrates on deepfakes relevant to the 2020 presidential campaign.

Deepfakes as excuses

โ€œItโ€™s a deepfakeโ€ is also a possible excuse for politicians whose real, embarrassing videos have leaked out. That recently happened (or allegedly happened) in Malaysia when a gay sex tape was dismissed as a deepfake by the Minister of Economic Affairs, even though the other man shown in the tape swore it was real.

On the flip side, the distribution of a probable amateur deepfake of the ailing President Ali Bongo of Gabon was a contributing factor to a subsequent military coup against Bongo. The deepfake video tipped off the military that something was wrong, even more than Bongo's extended absence from the media.

More deepfake examples

A recent deepfake video of All Star, the 1999 Smash Mouth classic, is an example of manipulating video (in this case, a mashup from popular movies) to fake lip synching. The creator, YouTube user ontyj, notes that he "Got carried away testing out wav2lip and now this exists…" It's amusing, although not convincing. Nevertheless, it demonstrates how much better faking lip motion has gotten. A few years ago, unnatural lip motion was usually a dead giveaway of a faked video.

It could be worse. Have a look at this deepfake video of President Obama as the target and Jordan Peele as the driver. Now imagine that it didn't include any context revealing it as fake, and included an incendiary call to action.

Are you terrified yet?



Martin Heller is a contributing writer at InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from his office in Andover, Massachusetts, from 1986 to 2010. From 2010 to August of 2012, Martin was vice president of technology and education at Alpha Software. From March 2013 to January 2014, he was chairman of Tubifi, maker of a cloud-based video editor, having previously served as CEO.

Martin is the author or co-author of nearly a dozen PC software packages and half a dozen Web applications. He is also the author of several books on Windows programming. As a consultant, Martin has worked with companies of all sizes to design, develop, improve, and/or debug Windows, web, and database applications, and has performed strategic business consulting for high-tech corporations ranging from tiny to Fortune 100 and from local to multinational.

Martinโ€™s specialties include programming languages C++, Python, C#, JavaScript, and SQL, and databases PostgreSQL, MySQL, Microsoft SQL Server, Oracle Database, Google Cloud Spanner, CockroachDB, MongoDB, Cassandra, and Couchbase. He writes about software development, data management, analytics, AI, and machine learning, contributing technology analyses, explainers, how-to articles, and hands-on reviews of software development tools, data platforms, AI models, machine learning libraries, and much more.
