Research Radar | Face Swapping | arXiv | June 2026

Monthly arXiv Radar

June 2026 Face Swapping Papers: Conversational Talking Faces, Fast Portrait Animation, and Privacy Protection

June 2026 face swapping research splits into two product directions: more interactive talking faces and stronger defenses against unauthorized identity transfer. The month is less about a single swap model and more about the surrounding system requirements: speed, multi-person behavior, and protection.

What This Month Signals

The month shows synthesis moving toward interactive systems while defenses become more threat-model-specific. That combination is exactly where buyer requirements now sit: believable motion, low latency, and guardrails against misuse.

Paper 01 | 2026-06-30 | cs.CV

Towards Flexible, Natural, Efficient Interaction for Conversational Talking Face Generation

Authors & Institutions

Baiqin Wang

MAIS, Institute of Automation, Chinese Academy of Sciences

School of Artificial Intelligence, University of Chinese Academy of Sciences

Sen Chen

MAIS, Institute of Automation, Chinese Academy of Sciences

School of Artificial Intelligence, University of Chinese Academy of Sciences

Jiankuo Zhao

MAIS, Institute of Automation, Chinese Academy of Sciences

School of Artificial Intelligence, University of Chinese Academy of Sciences

Xiangyu Liu

MAIS, Institute of Automation, Chinese Academy of Sciences

School of Artificial Intelligence, University of Chinese Academy of Sciences

Zhen Lei

MAIS, Institute of Automation, Chinese Academy of Sciences

School of Artificial Intelligence, University of Chinese Academy of Sciences

CAIR, HKISI, Chinese Academy of Sciences

School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology

Xiangyu Zhu

MAIS, Institute of Automation, Chinese Academy of Sciences

School of Artificial Intelligence, University of Chinese Academy of Sciences

What Problem It Solves

The paper addresses the gap between speaking-only generation and real conversation, where arbitrary participant counts, long sessions, non-verbal feedback, and low latency all have to coexist.

Key Result

The authors report superior interaction quality while sustaining real-time generation at 30 FPS, the rate the paper targets for continuous online conversation.
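To make the 30 FPS figure concrete, the sketch below converts a frame rate into a per-frame time budget and checks a set of timings against it. The function names and the sample timings are hypothetical illustrations, not measurements from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def sustains_real_time(per_frame_ms: list[float], fps: float = 30.0) -> bool:
    """True if the average generation time fits inside the frame budget.

    per_frame_ms holds hypothetical measured generation times per frame.
    """
    if not per_frame_ms:
        raise ValueError("need at least one timing sample")
    average = sum(per_frame_ms) / len(per_frame_ms)
    return average <= frame_budget_ms(fps)


budget = frame_budget_ms(30.0)  # roughly 33.3 ms per frame
print(f"Budget at 30 FPS: {budget:.1f} ms")
print(sustains_real_time([28.0, 31.5, 30.2]))  # illustrative timings only
```

Averaging is the simplest possible check; a production system would more likely look at tail latency, since a few slow frames are enough to cause visible stutter.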

Abstract

Conversational talking face generation has recently attracted increasing attention, aiming to synthesize interactive talking videos where characters speak, listen, and respond dynamically to each other. This task presents three core challenges: 1) Flexibility: enabling multi-round dialogues with an arbitrary number of participants; 2) Naturalness: maintaining coherent motion and appropriate non-verbal feedback throughout the interaction; and 3) Efficiency: achieving real-time generation and low computation overhead for long-term continuous online conversation. Despite recent advances, existing methods still fall short in balancing all three requirements. To bridge this gap, we introduce InterTalk, a novel and efficient framework designed for highly interactive conversational talking face generation. Built upon a motion-based architecture, InterTalk supports real-time conversation synthesis. Our method achieves strong flexibility by explicitly modeling multi-round conversational dynamics among each participant, eliminating constraints on their numbers. To enhance interactivity, we incorporate motion feedback from multiple participants and introduce an iterative generation strategy for more natural behaviors. Besides, we disentangle motion into several facial components, enabling targeted refinements for natural response such as precise lip sync and realistic eye blinking. Finally, we construct a new multi-person conversational dataset and enrich it with 3D face-based data augmentation. Extensive experiments demonstrate that InterTalk achieves superior interaction quality while maintaining real-time performance at 30 FPS.

Research Starting Point

Talking-face systems are moving from single clips to persistent agents, tutors, assistants, and meeting avatars, where listening behavior and turn-taking matter as much as lip sync.

Method

The framework models conversational dynamics participant by participant, uses feedback motion from other speakers/listeners, iteratively refines behavior, and separates facial components so lip motion, eye blinking, and response gestures can be improved independently.
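The description above can be pictured with a toy loop. This is only a sketch under strong assumptions: motion is reduced to three scalar components, feedback is a plain average over the other participants, and the names `Motion`, `feedback_for`, and `step` are hypothetical. It is not InterTalk's architecture, which the paper describes as a learned motion-based model.

```python
from dataclasses import dataclass


@dataclass
class Motion:
    """Toy per-participant motion, split into hypothetical components."""
    lips: float = 0.0
    eyes: float = 0.0
    head: float = 0.0


def feedback_for(motions: dict[str, Motion], me: str) -> Motion:
    """Average the motion of every participant except `me`."""
    others = [m for name, m in motions.items() if name != me]
    if not others:  # a lone participant simply gets its own motion back
        return motions[me]
    n = len(others)
    return Motion(
        lips=sum(m.lips for m in others) / n,
        eyes=sum(m.eyes for m in others) / n,
        head=sum(m.head for m in others) / n,
    )


def step(motions: dict[str, Motion], audio: dict[str, float],
         weight: float = 0.3, iterations: int = 2) -> dict[str, Motion]:
    """Advance one time step for any number of participants.

    Lips follow each participant's own audio level, while eyes and head
    are blended with feedback from the others. Repeating the update is a
    stand-in for the iterative generation strategy.
    """
    current = dict(motions)
    for _ in range(iterations):
        updated = {}
        for name, own in current.items():
            fb = feedback_for(current, name)
            updated[name] = Motion(
                lips=audio.get(name, 0.0),
                eyes=(1 - weight) * own.eyes + weight * fb.eyes,
                head=(1 - weight) * own.head + weight * fb.head,
            )
        current = updated
    return current


# Speaker A is talking and nodding; listener B starts still.
state = step({"A": Motion(head=1.0), "B": Motion()}, audio={"A": 1.0})
print(state["B"])  # B's head motion has moved toward A's
```

Because the update iterates over a dictionary of participants, nothing in the loop depends on how many there are, which mirrors the flexibility requirement. Keeping the components separate is what lets the lips be driven differently from the eyes and head.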

Paper Summary

InterTalk broadens the face-swapping/talking-head stack toward interactive digital humans. The practical question shifts from “can it lip-sync a clip?” to “can it sustain a believable exchange with multiple roles under real-time constraints?”

Paper 02 | 2026-06-29 | cs.CV

SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation

Authors & Institutions

Juncheng Ma

Shenzhen Graduate School, Peking University, China

Yuxuan Du

Shenzhen Graduate School, Peking University, China

Yanan Sun

Shanghai AI Laboratory, China

Zhening Xing

Shanghai AI Laboratory, China

Changlin Li

Tencent Hunyuan, China

Zhenyu Tang

Shenzhen Graduate School, Peking University, China

Bo Li

vivo, China

Peng-Tao Jiang

vivo, China

Li Yuan

Shenzhen Graduate School, Peking University, China

Daquan Zhou

Shenzhen Graduate School, Peking University, China

Yonghong Tian

Shenzhen Graduate School, Peking University, China

What Problem It Solves

The paper addresses a mismatch in generic diffusion caching: methods built for text-conditioned generation do not account for the spatial and modality imbalances of audio-driven portrait animation.

Key Result

The method reports up to 4.12x acceleration on HunyuanVideo-Avatar and 3.75x on Wan-S2V with near-lossless visual fidelity and precise audio alignment.
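As a quick worked example, the snippet below translates the reported speedup factors into the share of inference time removed. The helper names and the 100-second baseline are hypothetical; only the 4.12x and 3.75x factors come from the paper.

```python
def accelerated_latency(baseline_s: float, speedup: float) -> float:
    """Latency after acceleration, given a baseline and a speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_s / speedup


def time_saved_fraction(speedup: float) -> float:
    """Fraction of the original inference time that is removed."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


for model, speedup in [("HunyuanVideo-Avatar", 4.12), ("Wan-S2V", 3.75)]:
    saved = time_saved_fraction(speedup)
    # 100 s is an assumed baseline, chosen only to make the numbers readable.
    after = accelerated_latency(100.0, speedup)
    print(f"{model}: {saved:.1%} less time, 100 s -> {after:.1f} s")
```

In other words, the best case amounts to cutting roughly three quarters of the inference time. These are "up to" figures, so typical gains may be lower.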

Abstract

Diffusion Transformers (DiTs) have significantly advanced audio-driven portrait animation, but their high computational cost leads to substantial inference latency. Although training-free diffusion caching accelerates inference significantly, existing methods are primarily developed for text-conditioned generation and overlook the spatial and modality imbalances inherent in audio-driven portrait animation. In this paper, we propose SyncCache, a training-free caching acceleration method tailored for DiT-based portrait animation that explicitly exploits asymmetric dynamics. Specifically, high-frequency dynamics driven by audio conditions and concentrated in human regions are more challenging and critical to cache and reuse than the low-frequency visual background in portrait animation. First, we introduce Spatially-Asymmetric Probing to prioritize error sensitivity in dynamic human regions. Second, through Modality-Decoupled Caching, we bypass heavy DiT blocks by reusing stable inter-block residuals, while continuously recomputing lightweight audio blocks to preserve precise lip synchronization. Furthermore, we introduce a cache ratio to control cache capacity and formulate memory-adaptive cache selection as an offline dynamic programming problem without online overhead. Extensive experiments demonstrate that SyncCache achieves superior speed-quality trade-offs, delivering up to 4.12x acceleration on HunyuanVideo-Avatar and 3.75x on Wan-S2V with near-lossless visual fidelity and precise audio alignment.

Research Starting Point

Portrait animation diffusion models are becoming powerful but slow; production avatar systems need speedups that do not break lip synchronization or facial detail.

Method

SyncCache combines Spatially-Asymmetric Probing, Modality-Decoupled Caching, and memory-adaptive offline cache selection. The design keeps recomputing audio-sensitive parts while bypassing expensive DiT blocks where residuals remain stable.
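The caching idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: `heavy_block` and `audio_block` are toy stand-ins for the expensive DiT blocks and the lightweight audio blocks, and the reuse schedule is supplied by hand rather than produced by the paper's probing and dynamic-programming steps, which are not reproduced here.

```python
def heavy_block(x: list[float]) -> list[float]:
    """Toy stand-in for an expensive block; returns a residual to add to x."""
    return [0.5 * v for v in x]


def audio_block(x: list[float], audio: list[float]) -> list[float]:
    """Toy stand-in for a lightweight audio-conditioned block."""
    return [v + 0.1 * a for v, a in zip(x, audio)]


def run_steps(x: list[float], audio_per_step: list[list[float]],
              reuse_schedule: list[bool]) -> tuple[list[float], int]:
    """Run denoising-style steps with residual caching.

    reuse_schedule[t] is True when step t should reuse the cached residual.
    The schedule is assumed to be fixed offline, so no decision is made
    during inference. Returns the final state and the heavy-call count.
    """
    if len(audio_per_step) != len(reuse_schedule):
        raise ValueError("need one schedule entry per step")
    cached_residual = None
    heavy_calls = 0
    for audio, reuse in zip(audio_per_step, reuse_schedule):
        if reuse and cached_residual is not None:
            residual = cached_residual      # skip the heavy computation
        else:
            residual = heavy_block(x)       # recompute and refresh the cache
            cached_residual = residual
            heavy_calls += 1
        x = [v + r for v, r in zip(x, residual)]
        x = audio_block(x, audio)           # audio path is never cached
    return x, heavy_calls


audio = [[1.0, 1.0]] * 4
_, calls = run_steps([1.0, 2.0], audio, [False, True, True, False])
print(calls)  # heavy block ran on 2 of the 4 steps
```

The point of the split is that the audio path still sees fresh input at every step even when the heavy residual is reused, which is how the design aims to keep lip sync intact while saving compute.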

Paper Summary

SyncCache is valuable because it attacks inference cost without retraining the generator. For avatar products, that means faster previews, lower cloud cost, and a more realistic path to interactive audio-driven portrait generation.

Paper 03 | 2026-06-30 | cs.CV

Phantom: A Unified Face-Swap Deepfake Protection Framework with Latent and Spatial Constraints

Authors & Institutions

Jungkon Kim

Samsung Electronics, AI Platform Center

Cheolseung Jung

Samsung Electronics, AI Platform Center

Jong-Min Choi

Samsung Electronics, AI Platform Center

Juseong Lee

Samsung Electronics, AI Platform Center

What Problem It Solves

The paper targets weaknesses of prior adversarial protections: fixed or random targets give ambiguous latent guidance, and perturbations without explicit spatial constraints spill into identity-irrelevant regions.

Key Result

On UniFace, INSwapper, and SimSwap, Phantom improves protection success rates in dodging scenarios by 27.8%, 25.6%, and 16.6%, respectively. In impersonation scenarios it yields up to 10.2% higher protection while also improving perceptual fidelity.

Abstract

Face-swapping deepfakes pose an escalating threat to personal privacy by enabling unauthorized identity manipulation. While adversarial approaches have demonstrated success against black-box face recognition (FR) models, their applicability to face-swapping scenarios remains underexplored. In particular, reliance on fixed or random targets yields ambiguous latent guidance, and the lack of explicit spatial constraints causes perturbations to spill into identity-irrelevant regions. These issues are further exacerbated by identity-style disentanglement, which suppresses adversarial signals during deepfake generation. In this paper, we present Phantom, a unified face-swap deepfake protection framework that jointly constrains perturbations in latent and spatial domains. Phantom adaptively synthesizes identity-shifted yet attribute-preserving targets to guide identity-aware latent optimization, and applies masked perturbations confined to semantically relevant facial regions. Extensive experiments on state-of-the-art face-swapping deepfakes demonstrate that Phantom improves protection success rates in dodging scenarios by 27.8%, 25.6%, and 16.6% on UniFace, INSwapper, and SimSwap, respectively, while also enhancing visual quality. Furthermore, Phantom generalizes to impersonation scenarios, yielding up to 10.2% higher protection while improving perceptual fidelity. These results underscore the effectiveness of jointly leveraging latent and spatial constraints for robust and coherent facial privacy protection.

Research Starting Point

Deepfake detection is reactive; people and brands also need controls that make unauthorized face swapping fail before the manipulated video is created.

Method

Phantom jointly constrains perturbations in the latent and spatial domains: it synthesizes identity-shifted, attribute-preserving targets to guide identity-aware latent optimization, then applies masked perturbations only to the facial regions that matter for swapping.
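The spatial half of that idea can be illustrated with a masked, bounded perturbation. The sketch assumes a per-pixel budget `eps` (an L-infinity style bound, which the summary above does not specify) and a binary face mask. The function name is hypothetical, and the latent optimization that would produce `delta` is not shown.

```python
def apply_masked_perturbation(image: list[float], delta: list[float],
                              mask: list[int],
                              eps: float = 8 / 255) -> list[float]:
    """Add a bounded perturbation only where the mask is 1.

    image: pixel values in [0, 1], flattened to a list.
    delta: raw perturbation, assumed to come from some optimizer.
    mask:  1 for pixels inside the relevant facial region, 0 elsewhere.
    eps:   assumed per-pixel budget; the paper's actual bound may differ.
    """
    if not (len(image) == len(delta) == len(mask)):
        raise ValueError("image, delta, and mask must have equal length")
    protected = []
    for px, d, m in zip(image, delta, mask):
        bounded = max(-eps, min(eps, d)) * m  # clip, then zero outside mask
        protected.append(max(0.0, min(1.0, px + bounded)))
    return protected


image = [0.50, 0.50, 0.99]
delta = [0.20, 0.20, 0.20]
mask = [1, 0, 1]  # the middle pixel is background and stays untouched
print(apply_masked_perturbation(image, delta, mask, eps=0.05))
```

Pixels outside the mask come back unchanged, which is the property that keeps noise out of identity-irrelevant regions. The final clamp keeps the result a valid image.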

Paper Summary

Phantom is important because it treats face-swap defense as its own threat model rather than borrowing face-recognition attacks unchanged. For consumer photo services and celebrity/brand protection, the spatially constrained design is especially relevant because protection must not make the source image look obviously damaged.