Discover 4 deepfake technologies for better cyber resilience


Types of deepfakes

This section takes an in-depth look at the most advanced methods for generating deepfakes. A review of each category of deepfake is provided to give a deeper understanding of the different approaches.

Numerous models have been created for video manipulation. Different variants and combinations of GANs and encoder-decoder architectures are used to manipulate both audio and video. First, the facial region is detected and cropped; then the target face and the source data are translated into intermediate representations such as deep features, facial landmarks, UV maps or 3D morphable model (3DMM) parameters. These intermediate representations are passed to a synthesis model, or a combination of models, such as a GAN, an encoder-decoder, a Pix2Pix network or an RNN/LSTM. Finally, the output is obtained by re-rendering the generated face in the target frame.
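
As a rough illustration, the sketch below wires these generic stages together in Python. Every helper function here is a hypothetical placeholder standing in for a real face detector, representation extractor or synthesis model, not the API of any particular tool.

```python
# Sketch of the generic manipulation pipeline: detect/crop the face, map it to an
# intermediate representation, synthesize a new face, and re-render it in the frame.
# All helpers are placeholders; real systems plug in detectors, 3DMMs, GANs, etc.
import numpy as np

def detect_and_crop_face(frame):
    """Placeholder: locate the face and return the crop plus its bounding box."""
    h, w = frame.shape[:2]
    x, y, bw, bh = w // 4, h // 4, w // 2, h // 2   # assume a centred face for the sketch
    return frame[y:y + bh, x:x + bw], (x, y, bw, bh)

def to_intermediate(face):
    """Placeholder: map the crop to an intermediate representation
    (deep features, landmarks, a UV map or 3DMM parameters in real systems)."""
    return face.astype(np.float32).mean(axis=2)      # stand-in "feature map"

def synthesize(source_repr, target_repr):
    """Placeholder for the GAN / encoder-decoder / Pix2Pix / RNN synthesis model."""
    return (source_repr + target_repr) / 2           # stand-in combination

def rerender(frame, face, box):
    """Placeholder: paste the generated face back into the target frame."""
    x, y, bw, bh = box
    out = frame.copy()
    out[y:y + bh, x:x + bw] = np.dstack([face.astype(np.uint8)] * 3)
    return out

def manipulate(source_frame, target_frame):
    src_face, _ = detect_and_crop_face(source_frame)
    tgt_face, tgt_box = detect_and_crop_face(target_frame)
    generated = synthesize(to_intermediate(src_face), to_intermediate(tgt_face))
    return rerender(target_frame, generated, tgt_box)

# Usage with dummy same-sized frames:
result = manipulate(np.zeros((128, 128, 3), np.uint8), np.zeros((128, 128, 3), np.uint8))
```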

Visual manipulations

Visual manipulation is nothing new; images and videos have been faked since the early days of photography. In face-swapping, or face substitution, the face of the person in the source video is replaced by the face in the target video. Traditional face-swap methods usually follow three steps. First, these tools detect the face in the source images and select a candidate face image from a face library that is similar in appearance and pose to the input face. Second, they replace the eyes, nose and mouth of the face, adjust the lighting and colour of the candidate face image to match the appearance of the input images, and seamlessly blend the two faces. Third, the blended candidate images are ranked by computing the matching distance in the overlap region. In general, these methods give good results, but they have two important limitations. First, they completely replace the input face with the target face, so the expressions of the input face image are lost. Second, the synthetic result is very rigid and the substituted face looks unnatural, i.e. a matching pose is required to generate good results.
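
To make the blending step concrete, the sketch below uses OpenCV's Poisson blending (cv2.seamlessClone). It assumes the candidate face has already been detected and aligned to the input pose (steps one and two above); the rectangular mask is an illustrative simplification, since real tools typically use a convex hull of facial landmarks.

```python
# Simplified sketch of the classic blend step for traditional face swapping.
import cv2
import numpy as np

def blend_faces(candidate_face, target_image, face_box):
    """Seamlessly blend an aligned candidate face into the target image.

    face_box is (x, y, w, h) of the face region inside the target image.
    """
    x, y, w, h = face_box
    candidate = cv2.resize(candidate_face, (w, h))

    # Mask covering the region to clone; real tools use a landmark convex hull.
    mask = 255 * np.ones(candidate.shape[:2], dtype=np.uint8)

    # Poisson blending adjusts colour and lighting so the seam is not visible.
    centre = (x + w // 2, y + h // 2)
    return cv2.seamlessClone(candidate, target_image, mask, centre, cv2.NORMAL_CLONE)
```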

More recently, DL-based approaches have become popular for synthetic media creation because of their realistic results, and deepfakes have demonstrated how they can be applied to automated digital media manipulation. In 2017, the first deepfake video to appear online was created using a face-swapping approach, in which a celebrity’s face was placed on pornographic content. This approach used a neural network to transform a victim’s face into the features of another person while preserving the original facial expression. Over time, face-swapping software such as FakeApp and FaceSwap made it easier and faster to produce more convincing deepfakes by replacing the face in a video.

These methods typically use two encoder-decoder pairs that share a single encoder. The encoder extracts the latent features of a face from an image, and each decoder is trained to reconstruct the face of one identity: one decoder is trained on images of the source identity and the other on images of the target identity. Once training is complete, the decoders are swapped: a frame of the target is passed through the shared encoder and decoded with the decoder trained on the source, so the regenerated image carries the characteristics of the source. The resulting image has the source face on the target face while maintaining the target’s facial expressions.
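
A minimal PyTorch sketch of this shared-encoder, two-decoder setup is shown below; the layer sizes are illustrative and far smaller than those used by tools such as FakeApp or FaceSwap.

```python
# Minimal sketch of the shared-encoder / two-decoder face-swap autoencoder.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256),                          # latent face features
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(256, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 16, 16))

encoder = Encoder()                        # shared between both identities
decoder_src, decoder_tgt = Decoder(), Decoder()

# Training: each decoder learns to reconstruct its own identity from the shared code, e.g.
#   recon_src = decoder_src(encoder(src_batch));  recon_tgt = decoder_tgt(encoder(tgt_batch))

# Swap at inference: encode a frame of the target and decode with the source's decoder,
# so the source identity appears with the target's pose and expression.
target_frame = torch.rand(1, 3, 64, 64)
swapped = decoder_src(encoder(target_frame))
```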

The recently launched ZAO, REFACE and FakeApp applications are popular because of their effectiveness in producing realistic deepfakes based on face swapping. FakeApp allows selective modification of parts of the face. ZAO and REFACE have recently gone viral, used by less tech-savvy users to swap their faces with movie stars and embed themselves in well-known films and TV clips. There are many public implementations of face-swap technology using deep neural networks, such as FaceSwap, DeepFaceLab and FaceSwapGAN, which have led to the creation of a growing number of synthesized multimedia clips.

Until recently, most research focused on advances in face-swapping technology, using either a reconstructed 3D morphable model (3DMM) or GAN-based models. Korshunova et al. proposed an approach based on a convolutional neural network (CNN) that transferred semantic content, e.g. face pose, facial expression and lighting conditions, from the input image to create the same effects in another image. They introduced a loss function that was a weighted combination of style loss, content loss, light loss and total variation regularization. This method generates convincing results; however, it requires a large amount of training data, and the trained model can only transform to one specific identity at a time. Nirkin et al. presented a method that used a fully convolutional network (FCN) for face segmentation and substitution, together with a 3DMM to estimate the facial geometry and the corresponding texture. Facial reconstruction was then performed on the target image by adjusting the model parameters. These approaches have the limitation of subject- or pair-specific training. Recently, subject-independent approaches have been proposed to address this limitation.
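
The kind of weighted objective described above can be sketched as follows. The content, style and light terms are treated here as precomputed scalar losses and the weights are arbitrary, so this is a simplified stand-in for Korshunova et al.'s actual formulation; only the total-variation term is shown concretely.

```python
# Sketch of a weighted loss combining content, style, light and total-variation terms.
import torch

def total_variation(img: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt neighbouring-pixel changes to smooth the generated image."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def combined_loss(content, style, light, generated,
                  w_content=1.0, w_style=10.0, w_light=1.0, w_tv=0.1):
    # content/style/light are assumed to be precomputed scalar losses, e.g.
    # distances between deep feature maps of the generated and reference images.
    return (w_content * content + w_style * style
            + w_light * light + w_tv * total_variation(generated))
```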

In GAN-based face-swap models such as FaceSwapGAN, the addition of a VGGFace perceptual loss made the direction of the eyes appear more realistic and consistent with the input, and also helped smooth out artefacts in the segmentation mask, resulting in a high-quality output video. FSGAN made it possible to swap and recreate faces in real time following a recreate-and-blend strategy; it simultaneously manipulates pose, expression and identity while producing high-quality and temporally consistent results. These GAN-based approaches outperform several existing autoencoder-based methods, as they work without being explicitly trained on images of specific subjects. Moreover, their iterative nature makes them suitable for facial manipulation tasks such as generating realistic images of fake faces.

Some works have used a disentanglement concept for face swapping based on VAEs. RSGAN employed two separate VAEs to encode the latent representations of the face and hair regions respectively, and both encoders were conditioned to predict the attributes describing the target identity. Another approach, FSNet, presented a framework for face swapping using a latent space to separately encode the face region of the source identity and the landmarks of the target identity, which were then combined to generate the swapped face. However, these approaches do not preserve target attributes such as occlusion and lighting conditions well.
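
The disentangling idea can be sketched schematically: one latent code for the source identity, another for the target's geometry, concatenated and decoded. The toy fully connected networks and sizes below are illustrative, not the RSGAN or FSNet architectures.

```python
# Schematic sketch of disentangled face swapping: a latent code for the source
# identity and a latent code for the target's landmark geometry are combined
# and decoded into the swapped face. Sizes and layers are illustrative only.
import torch
import torch.nn as nn

identity_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
landmark_encoder = nn.Sequential(nn.Flatten(), nn.Linear(68 * 2, 32))
decoder = nn.Sequential(nn.Linear(128 + 32, 3 * 64 * 64), nn.Sigmoid())

source_face = torch.rand(1, 3, 64, 64)      # provides the identity
target_landmarks = torch.rand(1, 68, 2)     # provides the pose/expression geometry

z = torch.cat([identity_encoder(source_face),
               landmark_encoder(target_landmarks)], dim=1)
swapped = decoder(z).view(1, 3, 64, 64)     # source identity in the target geometry
```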

Facial occlusions remain a challenge for face-swapping methods: in many cases the facial region in the source or the target is partially covered by hair, glasses, a hand or some other object, which causes visual artefacts and inconsistencies in the resulting image. FaceShifter generates high-fidelity swapped faces while preserving target attributes such as pose, expression and occlusion. An identity encoder encodes the source identity, while a U-Net-style decoder extracts multi-level feature maps of the target attributes. These encoded features are passed to a novel generator with cascaded Adaptive Attentional Denormalization (AAD) layers inside residual blocks, which adaptively blend the identity and target-attribute information. Finally, another network is used to correct occlusion inconsistencies and refine the results.

Lip synchronization

The lip-sync method involves synthesizing a video of a target identity so that the mouth region in the manipulated video is consistent with a specific audio input. A key aspect of synthesizing a video from an audio segment is the movement and appearance of the lower part of the mouth and its surrounding region; to convey a message effectively and naturally, appropriate lip movements must be generated along with the expressions. In practice, lip-synchronization has many applications in the entertainment industry, such as creating photorealistic digital characters in films or games, voice robots, and dubbing films into foreign languages. It can also help the hearing impaired understand a scene by lip-reading a video generated from authentic audio.

Existing work on lip-synchronization requires the reselection of frames from a video or transcript, along with the target emotions, to synthesize lip movements. These approaches are limited to a particular emotional state and do not generalize well to unseen faces. DL models, however, are able to learn and predict mouth movements from audio features. Suwajanakorn et al. proposed a method to generate a photorealistic lip-synchronized video using a video of the target and an arbitrary audio clip as input. A recurrent neural network (RNN)-based model learns the correspondence between audio features and the shape of the mouth in each frame, and frame reselection is then used to fill in the texture around the mouth based on the landmarks. The synthesis is performed on the lower facial regions, i.e. the mouth, chin, nose and cheeks, followed by a series of post-processing steps, such as smoothing the location of the jaw and re-timing the video to align vocal pauses and the movement of the talking head, to produce videos that appear more natural and realistic. This model requires retraining and a large amount of data for each individual.
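
The core idea of mapping per-frame audio features to mouth shapes with a recurrent network can be sketched as follows; the feature dimensions and landmark counts are illustrative, not those used by Suwajanakorn et al.

```python
# Minimal sketch: a recurrent network maps a sequence of per-frame audio
# features (e.g. MFCCs) to mouth landmark coordinates for each video frame.
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, audio_dim: int = 28, hidden: int = 128, n_points: int = 18):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_points * 2)    # (x, y) per mouth landmark

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.head(h)                            # (batch, frames, n_points * 2)

model = AudioToMouth()
mouth_shapes = model(torch.rand(1, 100, 28))           # 100 frames of audio features
```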

The Speech2Vid model takes an audio clip and a static image of a subject as input and generates a video synchronized with the audio clip. It uses Mel-frequency cepstral coefficient (MFCC) features extracted from the audio input and feeds them into a CNN-based encoder-decoder. As a post-processing step, a separate CNN is used to deblur and sharpen the frames in order to preserve the quality of the visual content. This model generalizes well to unknown faces and therefore does not need to be retrained for new identities; however, it is unable to synthesize a variety of emotions in facial expressions.
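
For illustration, MFCC features of this kind can be extracted with a library such as librosa; the file name and parameter choices below are arbitrary examples, not Speech2Vid's settings.

```python
# Illustrative MFCC extraction from a speech recording.
import librosa

audio, sr = librosa.load("speech.wav", sr=16000)          # mono waveform at 16 kHz
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)    # shape: (13, n_frames)
print(mfcc.shape)
```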

Puppet-master

The puppet-master generation method, also known as facial recreation (face reenactment), is another common variety of deepfake. It manipulates a person’s facial expressions, for example by transferring the facial gestures, eye movements and head movements of a source actor to an output video that mirrors them. Facial recreation has various applications, such as altering a participant’s facial expression and mouth movement for a foreign language in an online multilingual video conference, dubbing or editing an actor’s head and facial expressions in the film industry’s post-production systems, or creating photorealistic animation for films and games.

Initially, 3D facial modelling approaches were proposed for facial recreation because of their ability to accurately capture geometry and motion and to improve photorealism in recreated faces. Thies et al. presented the first method for transferring facial expressions from an actor to a target person in real time. A basic RGB-D sensor was used to track and reconstruct 3D models of the source and target actors. For each frame, the tracked deformations of the source face were applied to the target face model, and the altered face was then blended with the original target face while preserving the facial appearance of the target. Face2Face is a more advanced facial recreation technique: it works in real time and can alter the facial movements of generic RGB video sequences, e.g. YouTube videos, using a standard webcam. A 3D model reconstruction method was combined with image rendering techniques to generate the result, so a convincing and instantaneous re-rendering of a target actor could be created with a relatively simple home setup. This work was later extended to control the facial expressions of a person in a target video from intuitive hand gestures using an inertial measurement unit.
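
The expression-transfer step of such 3DMM-based recreation can be sketched as follows: a face is modelled as a mean shape plus identity and expression deformations, so recreation keeps the target's identity coefficients and swaps in the source's expression coefficients. The bases below are random placeholders standing in for a fitted morphable model.

```python
# Schematic sketch of expression transfer with a 3D morphable model.
import numpy as np

n_vertices, n_id, n_expr = 5000, 80, 64
mean_shape = np.zeros(n_vertices * 3)
id_basis = np.random.randn(n_vertices * 3, n_id) * 0.01     # placeholder identity basis
expr_basis = np.random.randn(n_vertices * 3, n_expr) * 0.01 # placeholder expression basis

def reconstruct(alpha_id, alpha_expr):
    """Return the 3D face mesh for given identity and expression coefficients."""
    return mean_shape + id_basis @ alpha_id + expr_basis @ alpha_expr

target_id = np.random.randn(n_id)       # would be fitted once to the target actor
source_expr = np.random.randn(n_expr)   # would be tracked per frame on the source actor

# Recreated frame geometry: the target's identity driven by the source's expression.
reenacted_mesh = reconstruct(target_id, source_expr).reshape(n_vertices, 3)
```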

Subsequently, GANs were successfully applied to facial recreation because of their ability to generate photorealistic images. Pix2pixHD produced high-resolution images with greater fidelity by combining a conditional multiscale GAN (cGAN) architecture with a perceptual loss. Kim et al. proposed an approach that allowed complete re-animation of portrait videos by an actor, such as changing head pose, eye gaze and eye blinks, rather than just modifying the facial expression of the target identity, thus producing photorealistic dubbing results. First, a facial reconstruction method was used to obtain a parametric representation of the face and the lighting information of each video frame, producing a synthetic rendering of the target identity. This representation was then fed into a cGAN-based render-to-video translation network that turned the synthetic rendering into photorealistic video frames. This approach required training on videos of the target identity. Wu et al. proposed ReenactGAN, which encodes the input facial features in a boundary latent space; a target-specific transformer adapts the source boundary space to the specified target, and the latent space is then decoded into the target face. GANimation employed a dual cGAN generator conditioned on emotion action units (AUs) to transfer facial expressions, and the AU-based generator used an attention map to interpolate between the recreated and original images. Instead of relying on AU estimates, GANnotation used facial landmarks together with a self-attention mechanism for facial reconstruction; this approach introduced a triple consistency loss to minimize visual artefacts, but required the images to be synthesized in a frontal facial view for further processing. These models either required a large amount of training data to handle the target identity well at oblique angles, or lacked the ability to generate photorealistic recreations of unknown identities.
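
A common thread in these models is conditioning an image-to-image generator on the driving expression (action units or landmarks rendered as maps). The sketch below shows one simple way to do this, concatenating the condition with the input channels; the tiny architecture is illustrative, not that of any cited model.

```python
# Minimal sketch of a pix2pix-style generator conditioned on an expression map
# (e.g. an action-unit or landmark heatmap) concatenated with the input image.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, img_channels: int = 3, cond_channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + cond_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, img_channels, 3, padding=1), nn.Tanh(),
        )

    def forward(self, target_face: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # condition: the driving expression rendered as a per-pixel map
        return self.net(torch.cat([target_face, condition], dim=1))

gen = ConditionalGenerator()
fake = gen(torch.rand(1, 3, 128, 128), torch.rand(1, 1, 128, 128))
```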

Recently, single- or few-shot facial recreation approaches have been proposed to achieve recreation from a few, or even a single, source image. A self-supervised model, X2face, uses multiple modalities, such as a driving frame, facial landmarks or audio, to transfer the pose and expression from the input source to the target. X2face uses two encoder-decoder networks: an embedding network and a driving network. The embedding network learns the face representation from the source frame, while the driving network maps the pose and expression information of the driving frame to a vector map, which is used to sample the face representation from the embedding network and produce the target expression. Zakharov et al. presented a meta-transfer learning approach in which the network is first trained on multiple identities and then fine-tuned to the target identity. First, the target identity encoding is obtained by averaging the target expressions and associated landmarks from different frames. Then a pix2pixHD GAN generates the target identity using the source landmarks as input, injecting the identity encoding through adaptive instance normalization (AdaIN) layers. This approach works well at oblique angles and transfers the expression directly, without requiring an intermediate latent boundary space or an interpolation map. Zhang et al. proposed an autoencoder-based framework to learn latent representations of the target’s facial appearance and of the source face shape. These features are used as input to residual SPADE blocks for the face reconstruction task, which preserve the spatial information and concatenate the multi-scale shape feature maps in the face reconstruction decoder. This approach handles large pose changes and exaggerated facial actions better. In FaR-GAN, the learnable features from the convolution layers are used as input to the SPADE module instead of multiscale landmark masks. Normally, few-shot learning fails to fully preserve the source identity in the generated results when there is a large pose difference between the reference and target images. MarioNETte was proposed to mitigate this identity leakage by employing an attention block and target feature alignment, which helps the model better accommodate variations between facial structures; finally, identity is preserved using a novel landmark transformer influenced by the 3DMM face model.
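
The AdaIN operation used here (and in StyleGAN, discussed later) can be sketched as a per-channel normalization followed by a re-scaling and shift predicted from the conditioning code; the tensor shapes below are illustrative.

```python
# Sketch of adaptive instance normalization (AdaIN): content features are
# normalized per channel, then modulated with externally predicted statistics
# (here standing in for an identity encoding or style vector).
import torch

def adain(content: torch.Tensor, style_scale: torch.Tensor,
          style_shift: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # content: (batch, channels, H, W); style_scale/style_shift: (batch, channels)
    mean = content.mean(dim=(2, 3), keepdim=True)
    std = content.std(dim=(2, 3), keepdim=True) + eps
    normalized = (content - mean) / std
    return normalized * style_scale[:, :, None, None] + style_shift[:, :, None, None]

out = adain(torch.rand(2, 64, 32, 32), torch.rand(2, 64), torch.rand(2, 64))
```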

Real-time facial recreation approaches such as FSGAN perform both face replacement and occlusion-aware recreation; for the recreation step, a pix2pixHD-based generator produces realistic results. Videos generated with the above techniques can then be fused with fake audio to create completely fabricated content. These advances allow real-time manipulation of facial expressions and movement in videos, while making it increasingly difficult to distinguish what is real from what is fake.

Facial synthesis

Facial editing in digital images has been intensively explored for decades. It has been widely adopted in the art, animation and entertainment industries, although more recently it has been exploited to create deepfakes for impersonation purposes. Face generation involves the synthesis of photorealistic images of a human face that may or may not exist in real life. The tremendous evolution of deep generative models has made them widely adopted tools for facial image synthesis and editing. Deep learning generative models, i.e. GANs and VAEs, have been successfully used to generate fake photorealistic images of human faces. In facial synthesis, the goal is to generate non-existent but realistic-looking faces. Facial synthesis has enabled a wide range of beneficial applications, such as automatic character creation for video games and 3D facial modelling industries. AI-based facial synthesis could also be used for malicious purposes, such as synthesizing a fake, photorealistic profile picture for a fake social network account to spread disinformation. Several approaches have been proposed to generate realistic-looking facial images that humans cannot recognize as synthesized.

Since the advent of GANs in 2014, significant efforts have been made to improve the quality of the synthesized images. The images generated by the first GAN model were of low resolution and unconvincing. DCGAN was the first approach to introduce transposed convolution (deconvolution) layers in the generator in place of fully connected layers, which achieved better performance in synthetic image generation. Liu et al. proposed CoGAN, which trains a pair of coupled GANs instead of a single one to learn a joint distribution of images from two domains, with each GAN responsible for synthesizing images in one domain. The size of the generated images nevertheless remained relatively small, e.g. 64 × 64 or 128 × 128 pixels.
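
The DCGAN idea of building the generator from transposed convolutions can be sketched in miniature as follows; the layer sizes and output resolution are deliberately small and not the original configuration.

```python
# Miniature DCGAN-style generator: the latent vector is upsampled to an image
# purely with transposed convolutions (no fully connected layers).
import torch
import torch.nn as nn

generator = nn.Sequential(
    # latent z: (batch, 100, 1, 1)
    nn.ConvTranspose2d(100, 128, 4, stride=1, padding=0), nn.BatchNorm2d(128), nn.ReLU(),  # 4x4
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),    # 8x8
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),     # 16x16
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),                          # 32x32 RGB
)

fake_faces = generator(torch.randn(8, 100, 1, 1))   # batch of 8 synthetic 32x32 images
```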

The generation of high-resolution images was previously limited by memory constraints. Karras et al. presented ProGAN, a training methodology for GANs that employed an adaptive minibatch size, depending on the current output resolution, and progressively increased the resolution by adding layers to the networks during training. StyleGAN was an improved version of ProGAN. Instead of feeding the latent code directly into the generator, a mapping network learned to map the input latent vector (Z) to an intermediate latent vector (W) that controlled different visual features. The improvement was that the intermediate latent vector is free of any distribution constraint, which reduces the correlation between features (disentanglement). The layers of the generator network were controlled by an AdaIN operation that helped decide the features in the output layer. StyleGAN achieved state-of-the-art resolution in the generated images, i.e. 1024 × 1024, with fine details. StyleGAN2 further improved the perceived image quality by removing unwanted artefacts, such as details like gaze direction and teeth alignment not following the facial pose. Huang et al. presented a two-pathway generative adversarial network (TP-GAN) that could simultaneously perceive global structures and local details, as humans do, and synthesize a high-resolution frontal facial image from a single non-frontal facial image; image synthesis with this approach preserved identity under large pose and illumination variations. Zhang et al. introduced a self-attention module in convolutional GANs (SAGAN) to model global dependencies, ensuring that the discriminator can accurately relate features in distant regions of the image.
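
The mapping-network idea can be sketched as follows: a small MLP transforms z into the intermediate latent w, from which per-layer AdaIN scales and shifts ("styles") are predicted. The depths and dimensions are illustrative, not StyleGAN's actual configuration.

```python
# Schematic sketch of a StyleGAN-style mapping network: z -> w -> per-layer styles.
import torch
import torch.nn as nn

latent_dim = 512
mapping = nn.Sequential(
    nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),
    nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),   # the real model is deeper
)

n_channels_at_layer = 256
to_style = nn.Linear(latent_dim, 2 * n_channels_at_layer)   # scale and shift per channel

z = torch.randn(1, latent_dim)              # distribution-constrained input latent
w = mapping(z)                              # intermediate, less entangled latent
scale, shift = to_style(w).chunk(2, dim=1)  # would feed an AdaIN layer in the generator
```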

This self-attention mechanism further improved the semantic quality of the generated images. The BigGAN architecture used residual networks to improve the fidelity and variety of the generated samples by increasing the batch size and varying the latent distribution. In BigGAN, the latent vector was fed into multiple layers of the generator to influence features at different resolutions and levels of the hierarchy, rather than simply being added to the initial layer; as a result, the generated images were photorealistic and closely resembled real-world images from the ImageNet dataset. Zhang et al. also proposed a stacked GAN (StackGAN) model to generate high-resolution images (e.g. 256 × 256) with fine details from a given textual description.

