Emotional 3D talking head generation synthesizes vivid facial expressions with precise lip synchronization for immersive interactions. In this paper, we introduce DisenEMO, a novel framework designed to disentangle emotion and content from facial motions, thereby facilitating the synthesis of personalized and expressive audio-driven facial animations. To achieve precise emotional disentanglement, we incorporate an intensity perception constraint that enables accurate perception of emotion categories and their intensities, leading to the generation of subtle emotional expressions. To ensure the temporal consistency of facial expressions, we introduce facial dynamic modeling, which refines motion trajectories to better capture emotional nuances. Finally, a motion decoder integrates emotional features with audio features extracted from driving speech, producing 3D talking heads with enhanced emotional expressiveness and realism.
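To make the fusion step concrete, the following is a minimal sketch of a motion decoder that cross-attends audio-derived content features to an emotion embedding and regresses per-frame 3D facial motion. The module names, dimensions, and transformer-based design are illustrative assumptions, not the exact DisenEMO architecture.

```python
# Sketch only: dimensions and module choices are assumptions for illustration.
import torch
import torch.nn as nn


class MotionDecoder(nn.Module):
    def __init__(self, audio_dim=768, emo_dim=128, motion_dim=70, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)   # project frame-aligned speech features
        self.emo_proj = nn.Linear(emo_dim, hidden)       # project the emotion embedding
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, motion_dim)        # predict per-frame motion parameters

    def forward(self, audio_feat, emo_feat):
        # audio_feat: (B, T, audio_dim) content features from the driving speech
        # emo_feat:   (B, emo_dim) emotion feature from the reference facial motion
        tgt = self.audio_proj(audio_feat)
        memory = self.emo_proj(emo_feat).unsqueeze(1)    # (B, 1, hidden)
        fused = self.decoder(tgt, memory)                # cross-attend content to emotion
        return self.head(fused)                          # (B, T, motion_dim)


decoder = MotionDecoder()
motion = decoder(torch.randn(2, 100, 768), torch.randn(2, 128))
print(motion.shape)  # torch.Size([2, 100, 70])
```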
DisenEMO takes an audio sequence as the content input and a facial motion sequence as the emotional reference. Using facial motion as the emotional source allows for accurate facial expression reconstruction.
Emotion and content features are derived from facial motion inputs, then swapped and recombined to reconstruct the corresponding outputs. This cyclic approach reduces the dependency on specific paired training samples.
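A minimal sketch of this swap-and-recombine cyclic objective is given below, assuming separate content and emotion encoders plus a shared motion decoder; the function and encoder names are placeholders, not the paper's actual modules or loss weighting.

```python
# Sketch only: encoder/decoder interfaces and the L1 losses are assumptions.
import torch.nn.functional as F


def cross_reconstruction_loss(content_enc, emotion_enc, motion_dec, motion_a, motion_b):
    """motion_a, motion_b: two facial-motion sequences of shape (B, T, motion_dim)."""
    c_a, c_b = content_enc(motion_a), content_enc(motion_b)   # content features
    e_a, e_b = emotion_enc(motion_a), emotion_enc(motion_b)   # emotion features

    # Self-reconstruction: own content + own emotion should recover the input.
    rec_a = motion_dec(c_a, e_a)
    rec_b = motion_dec(c_b, e_b)

    # Cross-reconstruction: swap emotion features, decode, then re-encode the
    # swapped outputs and swap back; the cycle should recover the originals.
    swap_ab = motion_dec(c_a, e_b)   # content of a, emotion of b
    swap_ba = motion_dec(c_b, e_a)   # content of b, emotion of a
    cyc_a = motion_dec(content_enc(swap_ab), emotion_enc(swap_ba))
    cyc_b = motion_dec(content_enc(swap_ba), emotion_enc(swap_ab))

    return (F.l1_loss(rec_a, motion_a) + F.l1_loss(rec_b, motion_b)
            + F.l1_loss(cyc_a, motion_a) + F.l1_loss(cyc_b, motion_b))
```

Because the cycle only needs two motion sequences with different emotions, it avoids requiring exactly paired samples of the same utterance spoken with every emotion.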
Comparison of facial motions generated on 3D-HDTF (left, without emotion) and MEAD-3D (right, with various emotions). Our method produces expressive facial movements that match the target emotions, achieving performance comparable to the ground truth with a noticeable range of motion.
t-SNE visualization of emotion features on MEAD-3D: (a) with the triplet constraint, (b) without the triplet constraint. Features are clustered according to emotion categories; darker colors indicate higher emotion intensity.
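For reference, a plot of this kind can be produced with a sketch like the one below, assuming per-sample emotion embeddings together with category and intensity labels; the variable names and color scheme are placeholders, not the paper's actual data interface.

```python
# Sketch only: feature/label arrays and styling are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_emotion_tsne(features, categories, intensities, out_path="tsne_emotion.png"):
    # features: (N, D) emotion embeddings; categories: (N,) integer labels;
    # intensities: (N,) values in [0, 1], higher means stronger emotion.
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    cmap = plt.get_cmap("tab10")
    for c in np.unique(categories):
        mask = categories == c
        base = np.array(cmap(int(c) % 10)[:3])
        # Scale the base color toward black so darker points mark higher intensity.
        colors = base[None, :] * (1.0 - 0.6 * intensities[mask])[:, None]
        plt.scatter(coords[mask, 0], coords[mask, 1], c=colors, s=8,
                    label=f"emotion {int(c)}")
    plt.legend(markerscale=2, fontsize=8)
    plt.savefig(out_path, dpi=200)
```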
To be updated.