LoCC: Detection and Localization of Lip-Syncing Deepfakes via Counterfactual Frame Consistency
Authors & Institutions
Soumyya Kanti Datta
University at Buffalo, State University of New York
Shan Jia
University at Buffalo, State University of New York
Siwei Lyu
University at Buffalo, State University of New York
What Problem It Solves
The paper addresses fine-grained localization of lip-sync manipulation: security reviewers want to know which frames or segments are fake, not only whether the whole video is suspicious.
Key Result
The authors report superior performance over state-of-the-art methods on LAV-DF, AVDF1M, FakeAVCeleb, and KODF, with generalization across compression levels and datasets.
Abstract
Lip-syncing deepfakes are among the most challenging forms of manipulated media because their artifacts are localized almost exclusively to the mouth region and evolve dynamically over time. Detecting such deepfakes requires precise temporal and spatial modeling of lip motion. In this paper, we propose LoCC, a novel detection framework that performs fine-grained detection and localization of lip-syncing deepfakes at both segment and frame levels. Unlike prior approaches that analyze videos holistically, our method evaluates whether each frame aligns with a counterfactual estimate generated from its temporal neighbors. Real videos exhibit strong and stable consistency, whereas lip-sync deepfakes introduce localized inconsistencies. Following a teacher-student learning paradigm, our model effectively captures these frame-level discrepancies and achieves superior performance over state-of-the-art methods on multiple benchmark lip-syncing deepfake datasets, including LAV-DF, AVDF1M, FakeAVCeleb, and KODF, and generalizes well across compression levels and datasets.
Research Starting Point
Lip-sync manipulations are hard to detect because only the mouth region changes and the edited segment may be short, so holistic video or audio-visual detectors can miss the local inconsistency.
Method
LoCC trains a diffusion model on real mouth frames to predict a middle frame from its adjacent frames; this prediction serves as the counterfactual estimate against which the actual frame is compared. A teacher model learns segment-level inconsistency from the reconstruction errors and temporal relations; a student model distills this into frame-wise predictions before a transformer aggregates longer context.
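The core idea of scoring a frame against an estimate built from its neighbors can be illustrated with a minimal sketch. This is not the authors' implementation: the diffusion predictor is replaced here by a simple pixel-wise average of the two neighboring frames, frames are assumed to be flattened mouth crops stored as lists of floats, and the function names and the median-based threshold `k` are hypothetical choices made for illustration only.

```python
from statistics import median

def counterfactual_estimate(prev_frame, next_frame):
    """Stand-in predictor (assumption): pixel-wise mean of the two neighbors.
    LoCC uses a diffusion model trained on real mouth frames instead."""
    return [(a + b) / 2.0 for a, b in zip(prev_frame, next_frame)]

def consistency_errors(frames):
    """Mean absolute error between each interior frame and its estimate.
    Entry i of the result corresponds to frame i + 1."""
    errors = []
    for t in range(1, len(frames) - 1):
        estimate = counterfactual_estimate(frames[t - 1], frames[t + 1])
        err = sum(abs(x - y) for x, y in zip(frames[t], estimate)) / len(estimate)
        errors.append(err)
    return errors

def flag_frames(errors, k=3.0):
    """Flag errors well above the median, using the median absolute
    deviation (MAD) as a robust scale. Hypothetical rule, not from the paper."""
    med = median(errors)
    mad = median([abs(e - med) for e in errors])
    threshold = med + k * max(mad, 1e-6)  # floor avoids a zero-width band
    return [e > threshold for e in errors]

# Toy sequence: a smooth ramp of 9 "frames" with frame 4 perturbed.
frames = [[float(t)] * 4 for t in range(9)]
frames[4] = [9.0] * 4

errors = consistency_errors(frames)
flags = flag_frames(errors)
```

In this toy sequence the perturbed frame receives the largest error, and its two neighbors also show elevated errors because the perturbed frame contaminates their estimates. A raw reconstruction error alone is therefore a blunt signal, which is one plausible reason to learn from the errors together with temporal relations rather than threshold them directly.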
Paper Summary
LoCC is valuable for forensic workflows because it produces localized evidence rather than a single opaque score. The counterfactual framing is especially useful for short-form or partially edited videos where only a few mouth frames carry the manipulation signal.
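To show how frame-wise predictions might be turned into the segment-level evidence a reviewer would inspect, the following is a minimal sketch under stated assumptions. The function name, the fixed `threshold`, and the `min_len` filter are hypothetical and not taken from the paper, where a transformer aggregates longer context.

```python
def scores_to_segments(scores, threshold=0.5, min_len=2):
    """Group consecutive frames whose score exceeds `threshold` into
    half-open (start, end) frame-index segments, dropping runs shorter
    than `min_len` frames. All parameters are illustrative assumptions."""
    segments = []
    start = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments

# Hypothetical per-frame fake scores for a 10-frame clip.
frame_scores = [0.1, 0.2, 0.9, 0.8, 0.95, 0.1, 0.7, 0.2, 0.6, 0.9]
segments = scores_to_segments(frame_scores)
```

Here `segments` contains the runs covering frames 2 to 4 and frames 8 to 9, while the isolated high score at frame 6 is discarded by the minimum-length filter. Dividing the indices by the video frame rate would give timestamps for review.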