To Deepfake Is Human, to Detect Is Divine

This article dives into the history of deception, deepfake detection techniques, risk analysis methods, and some early responses from across the globe to help contextualise the challenge posed by deepfakes. 

India is rapidly digitising. There are good things and bad, speed-bumps on the way and caveats to be mindful of. The weekly column Terminal focuses on all that is connected and is not – on digital issues, policy, ideas and themes dominating the conversation in India and the world.

The recent deepfake video targeting the actor Rashmika Mandanna has attracted widespread attention from the media as well as the government.

History of deception 

Deepfakes, cheapfakes, and other automatically created content are collectively called synthetic media. Programmatically altered media is not a new threat: the moral panic triggered by the mass-market availability of Photoshop single-handedly paved the way for visual literacy research. In fact, video manipulation using special effects has existed ever since Alfred Clark dramatised Mary Stuart’s execution on film in 1895. Today, Hollywood studios are some of the undisputed leaders of synthetic content generation. Manipulation and media go back further still, to Aeschylus, who had Apollo descend from the sky.

The art of deception is as old as humanity itself. The science – much less so. Despite practicing deception, we do not like being targets except in specific circumstances. Our search for a recipe for detection has been documented since antiquity: the Yajurveda, composed c. 1000 BCE, famously provides one: 

A person who intends to poison food may be recognised. He does not answer questions, or they are evasive answers; he speaks nonsense, rubs the great toe along the ground, and shivers; his face is discoloured; he rubs the roots of the hair with his fingers; and he tries by every means to leave the house. (Chand, D. (1980). The Yajurveda, Sanskrit text with English translation (3rd ed.). India: VVRI Press)

The ossification of the “cue” canon is rampant despite empirical evidence that “cues” do not work. What is worse is that we are not great at detection even when trained. This collective experience should serve as a cautionary tale to those of us who demand an immediate solution to deepfakes.

Synthetic media is economically important 

In 2016, Emmanuel Lubezki picked up his third consecutive Oscar for Best Cinematography, this time for his work on The Revenant. VFX shots made up 122 of the film’s 156 minutes. In a recent blog post, the American Society of Cinematographers acknowledged that VFX is upending traditional production and the relationship between cinematographers and visual effects experts. In 2021, Amazon announced that you could ask Alexa “Amit ji, ek chutkula sunaiye” (“Amit ji, tell me a joke”) and get a response back in Mr. Bachchan’s voice. In 2023, TikTok launched the Bold Glamour filter, which could alter faces in real time.

Detection in practice 

Even though compute and storage costs have decreased, production has been outsourced, and demand has increased manifold, the cost of VFX services has risen. This apparent contradiction has one simple explanation: for VFX to be aesthetically pleasing, a crew of diverse specialists is needed. This is a major difference between traditional VFX and generative AI (hereafter GenAI). GenAI promises to do away with human expertise without having to resort to building complex software to emulate nature. This disconnect from the real world offers us an array of techniques to discriminate synthetic media from real media. Detection research focuses on identifying inconsistencies in the spatial domain, the frequency domain, biological signals, and so on. It is worth noting that detection and evasion research reinforce each other.

Ground truth 

For detection to work, it is important to ascertain a baseline against which to compare the work in question. At every stage of the modern image production and distribution workflow, inputs are altered either for aesthetic reasons or for operational efficiency.  

Figure 1: Metadata of a video downloaded from Twitter shows the file has been processed using the vireo library.

For example, when we capture a scene with our smartphone, the data captured by the image sensor is stabilised, corrected for colour, and then saved in a format requested by the user. Each of these steps alters the data from the previous stage; in some cases, the raw data may not even be of much use without them. Users can also run their own filtering and editing, altering the image data further. When the edited image is uploaded to a social media platform, the platform converts it to meet its operational goals most efficiently. For example, Twitter encodes video and audio using the vireo library. This ambiguity about what counts as the “original” has direct consequences for evidence collection and due process.
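This processing trail can be inspected directly. As a rough illustration (not Twitter’s actual pipeline), MP4-family files are sequences of length-prefixed “boxes”, and even a minimal parser can surface the kind of encoder tag visible in Figure 1. The sample bytes below are synthetic, constructed purely for the demonstration:

```python
import struct

def make_box(box_type: str, payload: bytes) -> bytes:
    """Build a minimal ISO BMFF (MP4) box: a 4-byte big-endian size,
    a 4-byte ASCII type, then the payload."""
    return struct.pack(">I", 8 + len(payload)) + box_type.encode("ascii") + payload

def parse_boxes(data: bytes):
    """Yield (type, payload) for each top-level box in an MP4 byte stream.
    Special sizes 0 and 1 (box-extends-to-end / 64-bit size) are not
    handled in this sketch."""
    offset = 0
    while offset + 8 <= len(data):
        size, = struct.unpack_from(">I", data, offset)
        box_type = data[offset + 4:offset + 8].decode("ascii")
        if size < 8:
            break
        yield box_type, data[offset + 8:offset + size]
        offset += size

# Synthetic "file": an ftyp box plus a box imitating an encoder tag
# that a processing library (e.g. vireo) might leave behind.
sample = make_box("ftyp", b"isom") + make_box("free", b"encoder=vireo")
boxes = dict(parse_boxes(sample))
```

Real-world tools such as ffprobe or exiftool do this far more thoroughly; the point is that container metadata survives as a forensic breadcrumb until a platform strips or rewrites it.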

Spatial analysis 

The recent video impersonating Rashmika Mandanna reveals a few artifacts under careful analysis. We cannot be certain if it was created using a GenAI model or some other program, but we can go through each frame and look for typical artifacts present in GenAI output. Some of the most common errors involve complex anatomical features such as eyes, eyebrows, ears, fingers, and teeth. 

It is typical to mess up the number of fingers, generate multiple sets of teeth, and set eyebrows the wrong way. Figure 2 highlights three errors in the synthetic video: the shape of the right eyelid is suspect in frame #35; the left eyelid is suspect in frame #55; and the slit in the right ear is suspect in frame #91. The most interesting is, of course, frame #10: this is where we see the face geometry change. The face swap in frame #10 is a dead giveaway and is unlikely to be present in more carefully crafted attacks. Frame #10 is equally notable for the levitating eyebrows – an error that occurs frequently in AI output. Also note the lack of shadow around the eyes in frame #55.

Blinking is another hard-to-fake function, at least for the current generation of models. Healthy adults blink once every 2–10 seconds, and a typical blink lasts 100–400 milliseconds.

On average, emulating a single blink requires editing about nine frames of a 30 FPS video (a typical 300-millisecond blink spans 0.3 × 30 ≈ 9 frames). For every blink, these frames would progressively encode the eyelids closing, staying shut, and then opening. Training image data typically contain few images in which the eyes are closed, so deepfake-generating programs find it difficult to emulate natural blinking. Figure 3 shows two such gaze-lowering/blinking simulation failures. The unnatural lack of shadow between the eyes and the brow is another artifact.
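This arithmetic can be turned into a crude plausibility check: given a per-frame eye-openness score, runs of closed-eye frames should last roughly 100–400 ms. A minimal sketch, with function names and the closed-eye threshold chosen purely for illustration:

```python
FPS = 30
BLINK_MS = (100, 400)  # typical human blink duration range

def frames_per_blink(duration_ms: float, fps: int = FPS) -> int:
    """Number of video frames a blink of the given duration spans."""
    return round(duration_ms * fps / 1000)

def blink_runs(openness, closed_thresh=0.2):
    """Given per-frame eye-openness scores in [0, 1], return the length
    (in frames) of each run where the eye is effectively closed."""
    runs, run = [], 0
    for score in openness:
        if score < closed_thresh:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs

def plausible(run_frames: int, fps: int = FPS) -> bool:
    """A closed-eye run is physiologically plausible if it maps back
    to a 100-400 ms blink at the given frame rate."""
    lo = frames_per_blink(BLINK_MS[0], fps)
    hi = frames_per_blink(BLINK_MS[1], fps)
    return lo <= run_frames <= hi
```

In practice the openness score would come from facial-landmark geometry (for example, an eye-aspect-ratio measure); here it is just a list of numbers.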

A common mistake novice animators make is forgetting to add creases to garments. Lack of attention to detail aside, as any creative artist will attest, recreating the appearance of fabric is challenging for several reasons. Intrinsic factors that affect folds include material thickness and stiffness, seam placement, and cuts. Extrinsic factors such as weather and movement affect the fabric and thereby change the folds. GenAI algorithms have no innate understanding of fabrics or of an individual’s physical appearance, much less of what their combined effect should be. So seams can go missing, folds can appear in visibly incorrect order, or the fold geometry may be completely wrong.

Figure 4 highlights two classes of errors: first, several folds are smoothed out; second, a linear groove on the left becomes a circular depression in the synthetic video. GenAI does not learn about the world the way humans or animals do. Models may be trained on illumination and on facial features from two distinct datasets. Naturally, it becomes challenging for these algorithms to simulate the effect of light and colour on surfaces. Thus glasses, jewellery, fluids, etc. become difficult to simulate. Figure 5 highlights how the pendant merges with the neck in the altered video.

Other methods 

Several other analytic methods have been proposed that use statistical properties of the video as discriminants. Two such methods are histogram and power spectrum analyses. An emerging class of techniques looks for indicators of liveness, such as breathing, that can be hard to simulate.
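As a sketch of the histogram idea (the names and the threshold here are illustrative, not a published detector): compare the intensity histogram of each frame with the previous one and flag sharp jumps, which can indicate spliced or regenerated content.

```python
def histogram(pixels, bins=16):
    """Normalised intensity histogram of an iterable of 0-255 pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def chi_square(h1, h2, eps=1e-9):
    """Chi-square distance between two normalised histograms;
    0 for identical distributions, larger for dissimilar ones."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def flag_anomalies(frames, threshold=0.25):
    """Indices of frames whose histogram jumps sharply from the previous
    frame - a crude cue for spliced or regenerated regions."""
    hists = [histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if chi_square(hists[i - 1], hists[i]) > threshold]
```

A power spectrum analysis works in the same spirit but in the frequency domain, where some generators leave characteristic periodic artifacts.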

Risk analysis needs threat models 

To understand threats, we need to understand what motivates threat actors and how attacks may play out. Motivation can be difficult to ascertain. A traditional approach is to consider if the actor is motivated by money, ideology, compromise, or ego; more recent approaches may consider influence factors. Identifying the key elements of a threat – who, what, why, and how – is called threat modeling. As Tom Hanks recently discovered, a celebrity’s likeness may be used to promote a good or service illegally. Journalists can be targeted for several reasons.

One Voice of America journalist was impersonated to promote a cryptocurrency trading platform. In other cases, journalists have been targeted to spread disinformation. A politician contesting an election may be targeted by rivals. Threat models also evolve over time. One example is emerging platforms like Civitai, backed by Andreessen Horowitz, which allow users to pay bounties and outsource generation to third parties. Such services minimise the skill required to create deepfakes and pose a new class of threat. A more everyday example of a changing threat model is an individual going through an acrimonious breakup with a romantic partner. For a corporate entity, ransomware operators can plant deepfakes involving executives as an extortion tactic. For a news organisation, a potential attack can degrade the quality of its content library. Nation-state actors may falsify satellite imagery, which can impede disaster relief, ecological surveys, and military operations. Policymakers must consider as many threat models as they can.
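The who/what/why/how framing lends itself to a simple structured record. A minimal sketch, with field names of my own choosing, capturing a few of the scenarios above:

```python
from dataclasses import dataclass

@dataclass
class ThreatModel:
    """Minimal who/what/why/how record for a deepfake threat.
    Field names are illustrative, not a formal standard."""
    actor: str       # who
    target: str      # what is at risk
    motivation: str  # why: money, ideology, compromise, or ego
    vector: str      # how the attack plays out

# Examples drawn from the scenarios discussed above.
models = [
    ThreatModel("scammer", "celebrity likeness", "money",
                "impersonation ad for a product or service"),
    ThreatModel("ransomware operator", "corporate executives", "money",
                "planted deepfake used for extortion"),
    ThreatModel("nation-state", "satellite imagery consumers", "ideology",
                "falsified imagery impeding relief or operations"),
]

motivations = {m.motivation for m in models}
```

Enumerating threats this way makes gaps visible: a policymaker can check which actor/motivation combinations a proposed rule does or does not cover.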

Early efforts at mitigation 

Deepfakes represent one form of technology that facilitates gender-based violence. Creepshots, nonconsensually captured imagery (such as via drones), and stalkerware are other technologies of control that require consideration. It is worth noting the diverse set of responses emerging from both market participants (i.e., companies involved in the creation and distribution of content) and non-market participants (i.e., non-profits, government, academia, etc.). One early effort, led by Adobe, The New York Times, and Twitter, aimed to create a standard for enhancing the attribution of content. Another distribution-side intervention is media companies’ self-imposed restraint from reposting a deepfake when reporting on it, to prevent unintended amplification. Additionally, companies have updated their terms of use to prohibit the creation and/or distribution of deepfakes.

The response from governments globally has been diverse. In a recent case, the Metropolitan Police chose not to prosecute at least two cases of audio clips allegedly impersonating the Mayor of London. On remedy, one view from legal scholars is to treat deepfakes as a violation of unregistered trademarks; in the US, the Lanham Act makes this possible. Common law countries like India, which do not allow prosecution over unregistered trademarks, may instead allow action under passing-off doctrine. Others have pointed to Canadian jurisprudence, which set an important precedent in R v Jarvis (2019) by formulating the concept of a reasonable expectation of privacy even in semi-public places.

Suman Kar is the founder of Banbreach and a cybersecurity practitioner. His Twitter handle is @banbreach.