Research Radar · Deepfake Detection · arXiv · March 2026

Monthly arXiv Radar

March 2026 Deepfake Detection Papers: Gaze, Parts, Structured Reasoning, and VLM Semantics

Deepfake detection research in March 2026 is moving past simple artifact spotting. The strongest papers now combine anatomy-aware cues, part-level reasoning, and vision-language semantics to generalize across new generators, a shift that matters to both researchers and practitioners working on deepfake detection, face forgery detection, and AI media trust.

What This Month Signals

The most credible March 2026 trend is forensic specialization: instead of hoping a generic backbone will notice everything, top papers explicitly model gaze, facial parts, or staged reasoning to capture evidence in a more controllable way.

Paper 01 · 2026-03-31 · cs.CV

GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

Authors & Institutions

Yaning Zhang

Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), China

Linlin Shen

Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, China

National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China

Shenzhen Institute of Artificial Intelligence and Robotics for Society, China

Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, China

Zitong Yu

School of Computing and Information Technology, Great Bay University, China

Chunjie Ma

Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), China

Zan Gao

Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), China

Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, China

What Problem It Solves

GazeCLIP targets both attribution and detection, asking whether gaze-aware cues can improve generalization to unseen forgery methods.

Key Result

On the authors' benchmark, the method beats prior state of the art by 6.56% average accuracy for attribution and 5.32% AUC for detection under unseen-generator settings.

Abstract

Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to the limited exploration in visual modalities alone. They tend to assess the attribution or detection performance of models on unseen advanced generators, coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we conduct a novel and fine-grained benchmark to evaluate the DFAD performance of networks on novel generators like diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, which is devised to enhance the generalization to unseen face forgery attacks. Built upon the novel observation that there are significant distribution differences between pristine and forged gaze vectors, and the preservation of the target gaze in facial images generated by GAN and diffusion varies significantly, we design a visual perception encoder to employ the inherent gaze differences to mine global forgery embeddings across appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted via a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) to generate dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Codes will be available on GitHub.

Research Starting Point

Deepfake detectors often overfocus on image appearance and fail badly once a new generator produces artifacts unlike the training set. The authors start from the observation that forged faces also exhibit differences in gaze behavior and gaze preservation, especially across GAN and diffusion pipelines, and that this cue has not been fully exploited. They are motivated by the need to improve both deepfake attribution and detection in a way that generalizes to unseen generators rather than collapsing on the next model release.

Method

GazeCLIP builds a gaze-aware CLIP-style framework in which visual forgery cues and gaze-based prompts are fused into a more stable forensic embedding space. The method introduces a gaze-aware image encoder and a language refinement encoder with adaptive word selection so that the text branch becomes more precise when describing authenticity cues. The paper also constructs a more fine-grained benchmark focused on attribution and detection under novel diffusion and flow-based generators, which strengthens the credibility of its evaluation story.
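To make the fusion idea concrete, here is a minimal sketch of gaze-and-appearance fusion followed by CLIP-style prompt matching. This is an illustration under loose assumptions, not the paper's architecture: the fixed convex blend stands in for the learned gaze-aware image encoder (GIE), and the random vectors stand in for real gaze, image, and refined language embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings to unit length, as in CLIP-style matching.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse_gaze_and_appearance(image_embed, gaze_embed, alpha=0.5):
    # Hypothetical fusion: a convex combination of appearance and gaze
    # features, projected back onto the unit sphere. The paper's GIE is
    # a learned module, not a fixed blend like this.
    fused = alpha * image_embed + (1.0 - alpha) * gaze_embed
    return l2_normalize(fused)

def classify(fused_embed, prompt_embeds, labels):
    # Zero-shot matching: cosine similarity against one prompt embedding
    # per class (e.g. per generator family for attribution).
    sims = l2_normalize(prompt_embeds) @ fused_embed
    return labels[int(np.argmax(sims))]

rng = np.random.default_rng(0)
image_embed = l2_normalize(rng.normal(size=512))
gaze_embed = l2_normalize(rng.normal(size=512))
prompts = rng.normal(size=(3, 512))  # placeholder language embeddings
labels = ["pristine", "gan-forged", "diffusion-forged"]

fused = fuse_gaze_and_appearance(image_embed, gaze_embed)
predicted = classify(fused, prompts, labels)
```

The point of the sketch is the interface: gaze evidence enters as a first-class embedding rather than being left for the visual backbone to rediscover.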

Paper Summary

The paper is compelling because it adds a new anatomical clue—gaze consistency—to the deepfake detection toolbox, instead of endlessly recycling the same texture-focused paradigm. That shift helps explain why the method improves on unseen generators, not just on familiar datasets. For readers following face forgery defense, GazeCLIP is a strong example of how multimodal reasoning can become practically useful.

Paper 02 · 2026-03-27 · cs.CV

Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection

Authors & Institutions

Kutub Uddin

School of Electronics and Information Engineering, Korea Aerospace University, Goyang, South Korea

Nusrat Tasnim

School of Electronics and Information Engineering, Korea Aerospace University, Goyang, South Korea

Byung Tae Oh

School of Electronics and Information Engineering, Korea Aerospace University, Goyang, South Korea

What Problem It Solves

Face2Parts is designed to capture coarse-to-fine dependencies between the full frame, face crop, and key subregions like eyes, lips, and nose.

Key Result

The paper reports strong average AUC across a wide set of benchmark datasets, including 98.42% on FaceForensics++ and competitive cross-dataset performance on DFDC, DFD, and CDF variants.

Abstract

Multimedia data, particularly images and videos, is integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, amateur or skilled counterfeiters can simulate them to create deepfakes, often for slanderous motives. To address this challenge, several forensic methods have been developed to ensure the authenticity of the content. The effectiveness of these methods depends on their focus, with challenges arising from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each method has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Considering these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation (HFR) that takes advantage of coarse-to-fine information to improve deepfake detection. The proposed method involves extracting features from the frame, face, and key facial regions (i.e., lips, eyes, and nose) separately to explore the coarse-to-fine relationships. This approach enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in both intra-, inter-dataset, and inter-manipulation settings. The proposed method achieves an average AUC of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR, respectively. The results demonstrate that our approach generalizes effectively and achieves promising performance to outperform the existing methods.

Research Starting Point

Deepfake detection methods often succeed by specializing: one model is strong at facial boundaries, another at eye regions, and another at mouth artifacts. The authors start from the idea that these strengths should not compete but be integrated, because forgeries leave evidence at different scales and in different parts of the image. Their goal is to design a detector that captures this coarse-to-fine diversity explicitly, rather than hoping one monolithic feature map will discover it all.

Method

Face2Parts extracts features from the full frame, the face crop, and several key facial regions such as lips, eyes, and nose, then models their interactions through channel attention and deep triplet learning. This hierarchical feature representation is meant to capture both global context and small local artifacts while learning how those regions reinforce one another. The evaluation spans intra-dataset, inter-dataset, and inter-manipulation settings, which is critical because many detectors fail precisely when the manipulation style changes.
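The layered extraction and the triplet objective can be sketched with plain numpy. The helper names and the softmax-style channel reweighting are illustrative assumptions; the paper's HFR and channel-attention modules are learned networks, not fixed formulas.

```python
import numpy as np

def channel_attention(features, temperature=1.0):
    # Squeeze-and-excitation-style reweighting (a stand-in for the
    # paper's channel-attention mechanism): weight each channel by a
    # softmax over its average response across regions.
    squeezed = features.mean(axis=0)            # per-channel average
    weights = np.exp(squeezed / temperature)
    weights /= weights.sum()
    return features * weights                   # reweighted channels

def hierarchical_representation(frame_feat, face_feat, part_feats):
    # Coarse-to-fine stack: frame, face crop, then each part (lips,
    # eyes, nose). Stacking plus attention is a simplification of HFR.
    regions = np.stack([frame_feat, face_feat, *part_feats])  # (R, C)
    return channel_attention(regions).reshape(-1)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Deep triplet learning objective: pull same-class representations
    # together and push real/fake representations apart.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

frame = np.linspace(0.0, 1.0, 8)
face = np.linspace(1.0, 2.0, 8)
parts = [np.full(8, v) for v in (0.5, 1.5, 2.5)]   # lips, eyes, nose
rep = hierarchical_representation(frame, face, parts)   # shape (40,)
loss_easy = triplet_loss(np.zeros(3), np.zeros(3), np.ones(3))  # 0.0
```

The triplet term is what forces the concatenated region features to organize around authenticity rather than identity or pose.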

Paper Summary

Face2Parts is useful because it formalizes a very intuitive forensic workflow: first inspect the whole image, then zoom into the face, then zoom further into the most suspicious parts. The strong benchmark results suggest that this layered inspection process is not only interpretable but also effective. For practitioners, it is a reminder that deepfake detection can still improve by better structuring evidence, not just by scaling model size.

Paper 03 · 2026-03-23 · cs.CV

VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

Authors & Institutions

Xinghan Li

Institute of Trustworthy Embodied AI, Fudan University, China

Shanghai Key Laboratory of Multimodal Embodied AI, China

Junhao Xu

Institute of Trustworthy Embodied AI, Fudan University, China

Shanghai Key Laboratory of Multimodal Embodied AI, China

Jingjing Chen

Institute of Trustworthy Embodied AI, Fudan University, China

Shanghai Key Laboratory of Multimodal Embodied AI, China

What Problem It Solves

VIGIL separates planning from examination so the detector decides which parts deserve inspection before part-level evidence is injected.

Key Result

Across OmniFake and cross-dataset tests, the authors report stronger generalization than both expert detectors and concurrent MLLM-based approaches.

Abstract

Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Building on this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence–conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.

Research Starting Point

MLLM-based deepfake detectors promise explainability, but many of them blur together two very different steps: deciding what to inspect and claiming what the evidence means. That fusion makes hallucination harder to detect because the model effectively invents both the observation and the conclusion at once. The paper is motivated by the need to separate these stages, so that deepfake reasoning looks more like forensic analysis and less like fluent improvisation.

Method

VIGIL uses a plan-then-examine pipeline in which the system first selects the facial parts worth investigating and only then injects region-specific forensic evidence into the reasoning process. The model also uses stage-gated evidence delivery and progressive training with part-aware reinforcement rewards, so that explanations stay tied to plausible anatomy and consistent evidence chains. To test generalization more rigorously, the paper introduces OmniFake, a five-level benchmark that expands from foundational generators to in-the-wild social media data.
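The plan-then-examine control flow, with stage-gated evidence injection, can be mimicked in a few lines. Everything here is a hypothetical stand-in: the score dictionaries play the role of the model's global perception and the independently sourced part-level evidence, and the threshold replaces the learned reward-shaped policy.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    part: str
    evidence: str
    suspicious: bool

def plan(global_scores, top_k=2):
    # Stage 1: choose which facial parts warrant inspection using only
    # global cues. No part-level evidence is visible yet (stage gating),
    # so the selection cannot be biased by external signals.
    ranked = sorted(global_scores, key=global_scores.get, reverse=True)
    return ranked[:top_k]

def examine(parts, evidence_store, threshold=0.5):
    # Stage 2: only now is part-level evidence injected, and only for
    # the parts the planner itself selected.
    findings = []
    for part in parts:
        score, note = evidence_store[part]
        findings.append(Finding(part, note, score >= threshold))
    return findings

global_scores = {"eyes": 0.9, "lips": 0.7, "nose": 0.2}
evidence_store = {
    "eyes": (0.8, "blink asymmetry"),
    "lips": (0.3, "texture consistent"),
    "nose": (0.1, "no visible artifacts"),
}
selected = plan(global_scores)                  # ["eyes", "lips"]
findings = examine(selected, evidence_store)
verdict = "fake" if any(f.suspicious for f in findings) else "real"
```

Because planning and examination are separate functions, an auditor can check whether the selected parts were reasonable independently of whether the evidence supports the verdict, which is exactly the reviewability VIGIL is after.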

Paper Summary

The biggest contribution of VIGIL is structural: it treats explainable deepfake detection as a pipeline design problem, not just a prompting problem. By forcing the detector to choose parts first and explain second, the framework makes it easier to distinguish grounded evidence from hallucinated storytelling. That makes the paper especially relevant for teams that want detectors whose explanations can be reviewed by humans, not just admired in demos.

Paper 04 · 2026-03-25 · cs.CV

Unleashing Vision-Language Semantics for Deepfake Video Detection

Authors & Institutions

Jiawen Zhu

Singapore Management University, Singapore

Yunqi Miao

The University of Warwick, UK

Xueyi Zhang

Nanyang Technological University, Singapore

Jiankang Deng

Imperial College London, UK

Guansong Pang

Singapore Management University, Singapore

What Problem It Solves

VLAForge asks how to convert cross-modal semantics into a stronger discriminative signal for both classical face swaps and newer full-face synthetic videos.

Key Result

The paper reports substantial gains over prior video deepfake detection methods at both frame and video level across face-swapping and full-face generation benchmarks.

Abstract

Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength – the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance model's discriminability in deepfake detection. This work i) enhances the visual perception of VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue – Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.

Research Starting Point

Vision-language models such as CLIP have shown impressive transfer ability, yet many deepfake video detection methods still use them as if they were just stronger visual encoders. The authors argue that this wastes the most distinctive part of the model: the cross-modal semantic space itself. Their motivation is to turn that latent semantic alignment into a discriminative cue for deepfake detection, especially when generalizing across both classical face swaps and newer full-face synthetic videos.

Method

The proposed VLAForge framework adds a ForgePerceiver to mine subtle forgery cues while preserving the original vision-language alignment learned by the pretrained VLM. It then introduces an identity-aware vision-language alignment score, supported by identity-informed prompts, so the cross-modal space becomes more sensitive to authenticity mismatches. This allows the detector to combine artifact perception and semantic comparison instead of relying on one of them alone.
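A toy version of the identity-aware alignment idea: compare each frame embedding against an "authentic" and a "forged" text prompt for the same identity, then pool frame scores into a video decision. The fixed unit vectors and the simple similarity margin are assumptions for illustration; VLAForge learns its identity-informed prompts and couples them with the ForgePerceiver's forgery cues.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def vla_score(frame_embed, real_prompt, fake_prompt):
    # Illustrative identity-aware VLA score: how much closer the frame
    # sits to this identity's "authentic" prompt than to its "forged"
    # prompt. Positive means the frame looks authentic for that person.
    return cosine(frame_embed, real_prompt) - cosine(frame_embed, fake_prompt)

def video_decision(frame_embeds, real_prompt, fake_prompt):
    # Frame-level scores averaged into a video-level decision.
    scores = [vla_score(f, real_prompt, fake_prompt) for f in frame_embeds]
    return ("real" if np.mean(scores) > 0 else "fake"), scores

# Synthetic 2-D embeddings for a single identity, for readability.
real_prompt = np.array([1.0, 0.0])
fake_prompt = np.array([0.0, 1.0])
frames = [np.array([0.1, 0.9]), np.array([0.2, 0.8])]
decision, scores = video_decision(frames, real_prompt, fake_prompt)
# decision == "fake": every frame aligns with the forged prompt
```

The sketch shows why the cross-modal space matters: the decision comes from comparing the frame against language-side descriptions of the identity, not from the visual encoder alone.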

Paper Summary

The paper’s central message is that deepfake video detection can gain real robustness by using vision-language semantics properly rather than as decoration. VLAForge shows that semantic alignment, identity priors, and forgery-specific perception can work together instead of competing. For readers following the future of deepfake defense, this is a meaningful step toward detectors that are both more generalizable and more conceptually grounded.