GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection
Authors & Institutions
Yaning Zhang
Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), China
Linlin Shen
Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, China
National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China
Shenzhen Institute of Artificial Intelligence and Robotics for Society, China
Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, China
Zitong Yu
School of Computing and Information Technology, Great Bay University, China
Chunjie Ma
Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), China
Zan Gao
Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), China
Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, China
What Problem It Solves
GazeCLIP targets both attribution and detection, asking whether gaze-aware cues can improve generalization to unseen forgery methods.
Key Result
On the authors' benchmark, the method beats prior state of the art by 6.56% average accuracy for attribution and 5.32% AUC for detection under unseen-generator settings.
Abstract
Current deepfake attribution and detection works tend to generalize poorly to novel generative methods because they explore the visual modality alone. They also evaluate attribution or detection performance on unseen advanced generators only coarsely and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we construct a novel fine-grained benchmark to evaluate the DFAD performance of networks on novel generators such as diffusion and flow models. We further introduce a gaze-aware model based on CLIP, designed to enhance generalization to unseen face forgery attacks. Building on the novel observation that pristine and forged gaze vectors differ significantly in distribution, and that GAN- and diffusion-generated facial images differ markedly in how well they preserve the target gaze, we design a visual perception encoder that exploits these inherent gaze differences to mine global forgery embeddings across the appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts, extracted via a gaze encoder, with common forged-image embeddings to capture general attribution patterns, transforming features into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) that generates dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state of the art by 6.56% ACC and 5.32% AUC on average under the attribution and detection settings, respectively. Code will be available on GitHub.
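The abstract's motivating observation is that pristine and forged gaze vectors differ in distribution. There are many ways such a gap could be quantified; the following is a minimal NumPy sketch that compares the mean unit gaze directions of two sets of 3-D gaze vectors. The function name, the 3-D gaze representation, and the choice of statistic are illustrative assumptions, not the paper's actual measurement.

```python
import numpy as np

def gaze_distribution_gap(real_gaze, fake_gaze):
    """Distance between the mean unit gaze directions of two sets of
    3-D gaze vectors (one vector per row). A larger value indicates a
    larger gap between the pristine and forged gaze distributions.
    This is an illustrative statistic, not the paper's method."""
    real_u = real_gaze / np.linalg.norm(real_gaze, axis=1, keepdims=True)
    fake_u = fake_gaze / np.linalg.norm(fake_gaze, axis=1, keepdims=True)
    return float(np.linalg.norm(real_u.mean(axis=0) - fake_u.mean(axis=0)))
```

On synthetic data, two gaze populations centered on different directions yield a much larger gap than two halves of the same population, which is the kind of separation the paper's observation implies.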
Research Starting Point
Deepfake detectors often focus too narrowly on image appearance and fail once a new generator produces artifacts unlike those in the training set. The authors start from the observation that forged faces also differ in gaze behavior and in how well the target gaze is preserved, especially across GAN and diffusion pipelines, and that this cue has not been fully exploited. They are motivated by the need to improve both deepfake attribution and detection in a way that generalizes to unseen generators rather than collapsing with each new model release.
Method
GazeCLIP builds a gaze-aware CLIP-style framework in which visual forgery cues and gaze-based prompts are fused into a more stable forensic embedding space. The method introduces a gaze-aware image encoder and a language refinement encoder with adaptive word selection, so that the text branch describes authenticity cues more precisely. The paper also constructs a fine-grained benchmark focused on attribution and detection under novel diffusion- and flow-based generators, which strengthens the credibility of its evaluation.
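The overall flow described above can be sketched as follows: fuse an appearance embedding with a gaze embedding into a shared space, then attribute the image by CLIP-style cosine matching against one text embedding per generator class. This is a minimal NumPy sketch under loud assumptions: the fusion (concatenation plus a linear projection), the embedding dimensions, and all function names are illustrative and do not reproduce the paper's GIE or LRE modules.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse_gaze_and_image(img_emb, gaze_emb, w_proj):
    # Hypothetical fusion: concatenate appearance and gaze embeddings,
    # then linearly project into a shared forensic feature space.
    fused = np.concatenate([img_emb, gaze_emb], axis=-1)
    return l2_normalize(fused @ w_proj)

def attribute(img_emb, gaze_emb, text_embs, w_proj, temperature=0.07):
    # CLIP-style matching: cosine similarity between the fused visual
    # embedding and one text embedding per generator class, softmaxed
    # into a probability over candidate generators.
    v = fuse_gaze_and_image(img_emb, gaze_emb, w_proj)
    t = l2_normalize(text_embs)
    logits = (t @ v) / temperature
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```

With random placeholder embeddings (e.g. a 512-d image embedding, a 128-d gaze embedding, and four class prompts), `attribute` returns a valid probability distribution over the candidate generators; in the paper, the text embeddings would instead come from the language refinement encoder.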
Paper Summary
The paper is compelling because it adds a new behavioral cue, gaze consistency, to the deepfake detection toolbox instead of endlessly recycling the same texture-focused paradigm. That shift helps explain why the method improves on unseen generators, not just on familiar datasets. For readers following face forgery defense, GazeCLIP is a strong example of how multimodal reasoning can become practically useful.