MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion
Authors & Institutions
Bin Liu
School of Communication and Information Engineering, Shanghai University, Shanghai, China
Zhixiang Xiong
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
Zhifen He
School of Communication and Information Engineering, Shanghai University, Shanghai, China
Bo Li
School of Communication and Information Engineering, Shanghai University, Shanghai, China
What Problem It Solves
The work tackles how to generate more detailed and temporally coherent 3D talking-head motion from speech, with better lip and eye synchronization than earlier approaches.
Key Result
The authors report clear improvements over prior methods, particularly in lip-sync accuracy and eye-movement synchronization, the aspects of motion quality viewers tend to notice first.
Abstract
MMTalker is a 3D speech-driven talking-head system that combines multiresolution facial geometry with multimodal feature fusion. It uses mesh parameterization, differentiable sampling, graph convolutions, and cross-attention to improve lip sync and expressive detail in generated facial motion.
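To make the cross-attention fusion concrete, here is a minimal PyTorch sketch of a dual cross-attention block in which audio features attend over geometry features and vice versa. The tensor shapes, the mean-pooling step, and the linear fusion head are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DualCrossAttentionFusion(nn.Module):
    """Hypothetical dual cross-attention fusion; dimensions are assumptions."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # audio queries attend over geometry keys/values, and the reverse
        self.audio_to_geo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.geo_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        # audio: (B, T_a, dim) speech features; geo: (B, T_g, dim) geometry features
        a_attended, _ = self.audio_to_geo(audio, geo, geo)    # audio enriched by geometry
        g_attended, _ = self.geo_to_audio(geo, audio, audio)  # geometry enriched by audio
        # pool the geometry stream onto the audio timeline before fusing
        # (one simple choice; the paper may align the streams differently)
        g_pooled = g_attended.mean(dim=1, keepdim=True).expand_as(a_attended)
        return self.fuse(torch.cat([a_attended, g_pooled], dim=-1))  # (B, T_a, dim)

fusion = DualCrossAttentionFusion()
out = fusion(torch.randn(2, 50, 256), torch.randn(2, 120, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```

The point of the two-directional attention is that each modality can query the other: lip-relevant geometry regions get weighted by the current audio frame, and audio features get grounded in the mesh structure they must drive.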
Research Starting Point
Avatar systems adjacent to face swapping still struggle with the hardest part of motion realism: making speech-driven mouth dynamics and facial detail look synchronized rather than merely plausible. The paper starts from the observation that audio-to-face generation is highly ill-posed, especially when the target is detailed 3D geometry rather than 2D portrait animation. It is motivated by the need for richer geometry and stronger multimodal fusion in creator and avatar pipelines.
Method
MMTalker introduces a multiresolution representation of facial geometry, differentiable non-uniform sampling for detailed supervision, a residual graph convolutional network, and dual cross-attention for multimodal feature fusion. A lightweight regressor then predicts vertex-level displacements, so the method operates on explicit 3D motion rather than relying solely on image-space effects. A hedged sketch of the graph-convolution and regressor stages follows below.
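As a rough illustration of the last two stages, the sketch below pairs a residual graph convolution over per-vertex features with a small MLP that regresses 3D displacements. The row-normalized adjacency, layer widths, activation, and toy four-vertex mesh are all placeholders, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ResidualGraphConv(nn.Module):
    """One residual graph-convolution block (illustrative, not the paper's)."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (B, V, dim) per-vertex features; adj: (V, V) row-normalized mesh adjacency
        neighbors = torch.einsum("vw,bwd->bvd", adj, x)  # aggregate neighbor features
        return x + self.act(self.linear(neighbors))      # residual connection

class DisplacementRegressor(nn.Module):
    """Lightweight head mapping fused vertex features to xyz displacements."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)  # (B, V, 3) per-vertex displacement

# toy usage: 4 vertices on a square, self-loops added, then row-normalized
edges = torch.tensor([[1., 1., 0., 1.],
                      [1., 1., 1., 0.],
                      [0., 1., 1., 1.],
                      [1., 0., 1., 1.]])
adj = edges / edges.sum(dim=1, keepdim=True)
feats = torch.randn(2, 4, 32)       # (batch, vertices, feature dim), all assumed sizes
gcn = ResidualGraphConv(32)
disp = DisplacementRegressor(32)(gcn(feats, adj))
print(disp.shape)  # torch.Size([2, 4, 3])
```

Keeping the regressor this small is plausible because the heavy lifting (multimodal alignment, neighborhood smoothing) has already happened upstream; the head only needs to decode features into local offsets.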
Paper Summary
This is relevant to face swapping because the same buyers increasingly compare swap tools against talking-avatar systems. Better 3D motion fidelity is becoming part of the competitive baseline for any face-editing product that needs believable video output.