ImaginTalk

Abstract

Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals, offering significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues, prompting us to propose Consistent Video-to-Speech (CV2S) as an extended task to enhance cross-modal consistency. To tackle emerging challenges, we introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input, operating within a discrete space. Specifically, we propose a discrete lip aligner that predicts discrete speech tokens from lip videos to capture semantic information, while an error detector identifies misaligned tokens, which are subsequently refined through masked language modeling with BERT. To further enhance the expressiveness of the generated speech, we develop a style diffusion transformer equipped with a face-style adapter that adaptively customizes identity and prosody dynamics across both the channel and temporal dimensions while ensuring synchronization with lip-aware semantic features. Extensive experiments demonstrate that ImaginTalk can generate high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion compared to state-of-the-art baselines.
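The detect-then-refine step mentioned in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `MASK` id, the 0.5 threshold, and the `nearest_neighbour_fill` helper are all hypothetical stand-ins. In ImaginTalk the error probabilities would come from the learned error detector and the refill from BERT-style masked language modeling.

```python
MASK = -1  # hypothetical mask id, chosen so it cannot collide with a real token id


def mask_flagged(tokens, error_probs, threshold=0.5):
    """Replace every token whose estimated error probability exceeds the threshold."""
    return [MASK if p > threshold else t for t, p in zip(tokens, error_probs)]


def refine(masked, fill_fn):
    """Fill all MASK positions in parallel; fill_fn only ever sees the masked input."""
    out = list(masked)
    for i, tok in enumerate(masked):
        if tok == MASK:
            out[i] = fill_fn(masked, i)
    return out


def nearest_neighbour_fill(seq, i):
    """Toy stand-in for a masked language model: copy the nearest unmasked token,
    searching left first and then right."""
    for j in range(i - 1, -1, -1):
        if seq[j] != MASK:
            return seq[j]
    for j in range(i + 1, len(seq)):
        if seq[j] != MASK:
            return seq[j]
    return 0  # arbitrary fallback when every position is masked


predicted = [5, 9, 7, 3]           # tokens from a (hypothetical) lip aligner
error_probs = [0.1, 0.9, 0.2, 0.8]  # scores from a (hypothetical) error detector
masked = mask_flagged(predicted, error_probs)
refined = refine(masked, nearest_neighbour_fill)
```

Only the flagged positions change, so tokens the detector trusts pass through the refinement untouched.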

Overview of ImaginTalk

The input face video is first processed by the face-style adapter to extract the global identity style $\mathbf{c}_\text{id}$ and the temporal prosody style $\mathbf{c}_\text{emo}$, while the lip region of interest (ROI) is cropped and processed by a discrete lip aligner to learn semantic features $\mathbf{c}_\text{lip}$. In parallel, masked speech features are obtained via the codec encoder and the forward diffusion process. Finally, the Style-DiT takes these features as inputs and predicts concrete scores through $N$ Style-DiT blocks and 12 linear heads, one for each of the 12 token levels.
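As a rough illustration of how masked speech tokens can arise from a forward diffusion process in a discrete space, the sketch below masks each codec token independently with probability `t`. This is an assumption-laden toy: the linear masking rate, the `MASK_ID` value, and the 1024-entry codebook are hypothetical, and the page does not state the noise schedule ImaginTalk actually uses. Only the 12 token levels come from the description above.

```python
import random

NUM_LEVELS = 12  # from the overview: 12 token levels
MASK_ID = 1024   # hypothetical: one id past an assumed 1024-entry codebook


def forward_mask(tokens, t, rng):
    """Absorbing-state forward step: each token becomes MASK_ID with probability t.

    tokens: list of NUM_LEVELS lists of int token ids (one list per level).
    t:      noise level in [0, 1]; 0 keeps everything, 1 masks everything.
    rng:    a random.Random instance, passed in for reproducibility.
    """
    return [[MASK_ID if rng.random() < t else tok for tok in level]
            for level in tokens]


rng = random.Random(0)
# Toy "codec tokens": 12 levels, 5 frames each.
clean = [[(3 * level + i) % 1024 for i in range(5)] for level in range(NUM_LEVELS)]
noisy = forward_mask(clean, 0.5, rng)
```

A denoiser trained on such pairs would then predict the original token at each masked position, one output head per level.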

Qualitative Comparison

Text Transcript: Dogs are sitting by the door.

Predicted Facial Emotion: Happy

Text Transcript: Dogs are sitting by the door.

Predicted Facial Emotion: Neutral

Text Transcript: No price is too high when true love is at stake.

Predicted Facial Emotion: Disgust

Text Transcript: Please dig my potatoes up before frost.

Predicted Facial Emotion: Surprised

Text Transcript: Before Thursday's exam, review every formula.

Predicted Facial Emotion: Sad

Text Transcript: They enjoy it when I audition.

Predicted Facial Emotion: Sad

Text Transcript: I just saw Jim near the new archaeological museum.

Predicted Facial Emotion: Surprised

Text Transcript: The eastern coast is a place for pure pleasure and excitement.

Predicted Facial Emotion: Happy

Text Transcript: They're both exactly the same size and shape to an uncanny degree.

Predicted Facial Emotion: Neutral

Text Transcript: But some institutions would share much more pain than others.

Predicted Facial Emotion: Neutral

Mel-Spectrogram Comparison

As shown in the figure, we compare our method with text-free L2S and text-dependent F2S methods. The L2S methods exhibit severe over-smoothing or acoustic artifacts, which significantly degrade speech quality and limit their practical value. The F2S methods generate higher-quality speech with the aid of reference text, but they struggle to align with the video, resulting in poor lip synchronization. In contrast, our method generates speech with richer acoustic details and precise lip synchronization, benefiting from our cross-modal diffusion process in the discrete space, which effectively addresses the one-to-many mapping issue.

Demo Video
