Towards Flexible, Natural, Efficient Interaction for Conversational Talking Face Generation
Authors & Institutions
Baiqin Wang
MAIS, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
Sen Chen
MAIS, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
Jiankuo Zhao
MAIS, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
Xiangyu Liu
MAIS, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhen Lei
MAIS, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
CAIR, HKISI, Chinese Academy of Sciences
School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology
Xiangyu Zhu
MAIS, Institute of Automation, Chinese Academy of Sciences
School of Artificial Intelligence, University of Chinese Academy of Sciences
What Problem It Solves
The paper addresses the gap between speaking-only generation and real conversation, where arbitrary participant counts, long sessions, non-verbal feedback, and low latency all have to coexist.
Key Result
The authors report superior interaction quality while maintaining real-time generation at 30 FPS (a budget of about 33 ms per frame), which is fast enough for continuous online conversation.
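The 30 FPS figure translates into a per-frame time budget. The arithmetic can be sketched as below. The function names and the example latencies in the usage note are hypothetical and are not measurements from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def is_real_time(per_frame_ms: float, fps: float = 30.0) -> bool:
    """True if a generator that needs per_frame_ms per frame keeps up with fps."""
    return per_frame_ms <= frame_budget_ms(fps)
```

For example, a hypothetical generator that needs 25 ms per frame fits the 30 FPS budget, while one that needs 40 ms per frame does not.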
Abstract
Conversational talking face generation has recently attracted increasing attention, aiming to synthesize interactive talking videos where characters speak, listen, and respond dynamically to each other. This task presents three core challenges: 1) Flexibility: enabling multi-round dialogues with an arbitrary number of participants; 2) Naturalness: maintaining coherent motion and appropriate non-verbal feedback throughout the interaction; and 3) Efficiency: achieving real-time generation and low computation overhead for long-term continuous online conversation. Despite recent advances, existing methods still fall short in balancing all three requirements. To bridge this gap, we introduce InterTalk, a novel and efficient framework designed for highly interactive conversational talking face generation. Built upon a motion-based architecture, InterTalk supports real-time conversation synthesis. Our method achieves strong flexibility by explicitly modeling multi-round conversational dynamics among each participant, eliminating constraints on their numbers. To enhance interactivity, we incorporate motion feedback from multiple participants and introduce an iterative generation strategy for more natural behaviors. Besides, we disentangle motion into several facial components, enabling targeted refinements for natural response such as precise lip sync and realistic eye blinking. Finally, we construct a new multi-person conversational dataset and enrich it with 3D face-based data augmentation. Extensive experiments demonstrate that InterTalk achieves superior interaction quality while maintaining real-time performance at 30 FPS.
Research Starting Point
Talking-face systems are moving from single clips to persistent agents, tutors, assistants, and meeting avatars, where listening behavior and turn-taking matter as much as lip sync.
Method
Built on a motion-based architecture, the framework models multi-round conversational dynamics for each participant, so the number of participants is not fixed. It feeds motion feedback from the other participants into an iterative generation strategy, and it disentangles motion into separate facial components so that behaviors such as lip sync and eye blinking can be refined independently.
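The control flow described above can be illustrated with a minimal toy sketch. It is not the authors' implementation. The component names (in particular "pose"), the scalar motion values, the averaging of feedback, and the hand-written update rules in `predict_motion` are all assumptions standing in for learned networks. The sketch only shows three ideas: motion is kept per component, each participant is updated from the other participants' motion, and the update is repeated for several passes.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical component split; the source names only lip sync and eye blinking.
COMPONENTS = ("lip", "eye", "pose")


def _rest_motion() -> Dict[str, float]:
    return {c: 0.0 for c in COMPONENTS}


@dataclass
class Participant:
    name: str
    motion: Dict[str, float] = field(default_factory=_rest_motion)


def mean_feedback(others: List[Participant]) -> Dict[str, float]:
    """Average the other participants' motion, component by component."""
    if not others:
        return _rest_motion()
    return {c: sum(p.motion[c] for p in others) / len(others) for c in COMPONENTS}


def predict_motion(audio: float, own: Dict[str, float],
                   feedback: Dict[str, float]) -> Dict[str, float]:
    """Toy stand-in for a learned generator; each component has its own rule."""
    return {
        "lip": audio,                                       # lips follow own speech
        "eye": 0.5 * own["eye"] + 0.5 * feedback["eye"],    # eyes drift toward partners
        "pose": 0.5 * own["pose"] + 0.5 * feedback["lip"],  # head reacts to partners' speech
    }


def generate_frame(participants: List[Participant], audio: Dict[str, float],
                   num_iters: int = 2) -> Dict[str, Dict[str, float]]:
    """Run num_iters refinement passes for one frame, for any number of participants."""
    for _ in range(num_iters):
        updates = {}
        # Compute every update from the same snapshot so participant order is irrelevant.
        for p in participants:
            others = [q for q in participants if q is not p]
            updates[p.name] = predict_motion(audio.get(p.name, 0.0), p.motion,
                                             mean_feedback(others))
        for p in participants:
            p.motion = updates[p.name]
    return {p.name: dict(p.motion) for p in participants}


people = [Participant("a"), Participant("b"), Participant("c")]
frame = generate_frame(people, {"a": 1.0}, num_iters=2)
```

In this toy setup only participant "a" speaks. After a single pass the listeners "b" and "c" have not yet seen any lip motion from "a", so their "pose" stays at 0.0. After the second pass it becomes 0.25, which is the point of iterating: reactions to other participants appear only once their motion has been generated.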
Paper Summary
InterTalk extends talking-head generation toward interactive digital humans. The practical question shifts from “can it lip-sync a clip?” to “can it sustain a believable exchange among multiple participants under real-time constraints?”