Lightweight Complementary-Cue Fusion for Robust Video Face Forgery Detection
Authors & Institutions
Sunghwan Baek
Carnegie Mellon University, USA
Tariq Anwaar
Carnegie Mellon University, USA
Karanveer Singh
Carnegie Mellon University, USA
Rita Singh
Carnegie Mellon University, USA
What Problem It Solves
The paper tests whether carefully chosen handcrafted forensic cues can make video face forgery detection more robust without resorting to a larger model.
Key Result
The fusion module adds only 292 parameters to the Xception baseline. The fused models raise average AUC from 74.8% to 78.6% on FaceForensics++ and from 70.5% to 74.9% on DFDC-Preview, and outperform F3Net, SRM, and SPSL across eight benchmarks.
Abstract
Current face video forgery detectors use wide or dual-stream backbones. We show that a single, lightweight fusion of two handcrafted cues can achieve higher accuracy with a much smaller model. Based on the Xception baseline model (21.9 million parameters), we build two detectors: LFWS, which adds a 1x1 convolution to combine a low-frequency Wavelet-Denoised Feature (WDF) with a phase-spectrum channel derived from Spatial-Phase Shallow Learning (SPSL), and LFWL, which merges WDF with Local Binary Patterns (LBP) in the same way. This extra module adds only 292 parameters, keeping the total at 21.9 million, smaller than F3Net (22.5 million) and less than half the size of SRM (55.3 million). Even with this minimal overhead, the fused models increase the average area under the curve (AUC) from 74.8% to 78.6% on FaceForensics++ and from 70.5% to 74.9% on DFDC-Preview, gains of 3.8 and 4.4 percentage points over the Xception baseline. They also consistently outperform F3Net, SRM, and SPSL on eight public benchmarks, without extra data or test-time augmentation. These results show that carefully paired, handcrafted features, combined through a lightweight fusion block, can provide competitive robustness at a significantly lower cost than comparable frequency-based detectors. Our findings suggest a need to reevaluate scale-driven design choices in face video forgery detection.
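The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given above; the variable names are illustrative and not taken from the paper.

```python
# Parameter counts quoted in the abstract (in raw parameters).
XCEPTION_PARAMS = 21.9e6   # Xception baseline
FUSION_PARAMS = 292        # added fusion module
F3NET_PARAMS = 22.5e6
SRM_PARAMS = 55.3e6

# Average AUC (%) before and after fusion, as quoted in the abstract.
AUC = {
    "FaceForensics++": (74.8, 78.6),
    "DFDC-Preview": (70.5, 74.9),
}

# Relative parameter overhead of the fusion module over the baseline.
overhead = FUSION_PARAMS / XCEPTION_PARAMS

# Size of the fused model relative to the two larger comparison models.
fused_total = XCEPTION_PARAMS + FUSION_PARAMS
ratio_to_f3net = fused_total / F3NET_PARAMS
ratio_to_srm = fused_total / SRM_PARAMS

# Absolute gains are differences of percentages, i.e. percentage points.
gains_pp = {name: round(after - before, 1) for name, (before, after) in AUC.items()}

print(f"overhead: {overhead:.6%} of the baseline")
print(f"fused / F3Net: {ratio_to_f3net:.3f}, fused / SRM: {ratio_to_srm:.3f}")
print(f"gains (percentage points): {gains_pp}")
```

The overhead is on the order of a thousandth of a percent of the baseline, which is why the total still rounds to 21.9 million.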
Research Starting Point
Many video face forgery detectors widen the backbone or add parallel streams, which raises deployment cost.
Method
The authors add a tiny fusion block to Xception, combining Wavelet-Denoised Features with either phase-spectrum cues or Local Binary Patterns.
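The fusion idea can be illustrated with a minimal NumPy sketch, under several assumptions that are not taken from the paper: `haar_lowpass` is a crude one-level Haar low-pass used as a stand-in for the Wavelet-Denoised Feature, `lbp_8neighbour` is a basic 3x3 Local Binary Pattern, and the channel sizes in `fuse_1x1` are chosen for illustration only, so the parameter count here does not reproduce the paper's 292.

```python
import numpy as np

def haar_lowpass(gray):
    """One-level Haar low-pass: keep only 2x2 block averages (assumes even H and W).
    Hypothetical stand-in for the paper's WDF, not its actual definition."""
    h, w = gray.shape
    blocks = gray.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return np.repeat(np.repeat(blocks, 2, axis=0), 2, axis=1)

def lbp_8neighbour(gray):
    """Basic 3x3 LBP code per pixel, using edge padding to keep the input size."""
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    # Offsets of the eight neighbours, clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h, w), dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes += (neighbour >= gray).astype(np.uint8) * np.uint8(1 << bit)
    return codes

def fuse_1x1(cue_a, cue_b, weight, bias):
    """Concatenate two (C, H, W) cue maps and mix channels per pixel,
    which is what a 1x1 convolution computes."""
    stacked = np.concatenate([cue_a, cue_b], axis=0)           # (Ca + Cb, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]

# Illustrative usage on a random grayscale frame.
rng = np.random.default_rng(0)
frame = rng.random((8, 8))
cue_a = haar_lowpass(frame)[None]                              # (1, 8, 8)
cue_b = (lbp_8neighbour(frame) / 255.0)[None]                  # (1, 8, 8), scaled to [0, 1]

# Two input channels mixed into three output channels (sizes are assumptions).
weight = rng.standard_normal((3, 2))
bias = np.zeros(3)
fused = fuse_1x1(cue_a, cue_b, weight, bias)                   # (3, 8, 8)
n_params = weight.size + bias.size                             # 2 * 3 + 3 = 9
```

In this sketch the fused map would be passed to the backbone in place of the raw frame. The point of the design is that the only learned parameters added are the 1x1 mixing weights and biases, so the cost stays negligible next to the backbone.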
Paper Summary
The main lesson is that deepfake detection does not always need a larger backbone if the forensic cues are selected and fused well. By combining low-frequency wavelet-denoised features with phase or texture cues through a tiny fusion block, the paper offers a cost-conscious alternative for teams that need broader benchmark robustness without adding data, augmentation, or heavy inference cost.