Qualitative Comparison

Text Transcript: Dogs are sitting by the door.

Predicted Facial Emotion: Happy



Text Transcript: Dogs are sitting by the door.

Predicted Facial Emotion: Neutral



Text Transcript: No price is too high when true love is at stake.

Predicted Facial Emotion: Disgust



Text Transcript: Please dig my potatoes up before frost.

Predicted Facial Emotion: Surprised



Text Transcript: Before Thursday's exam, review every formula.

Predicted Facial Emotion: Sad



Text Transcript: They enjoy it when I audition.

Predicted Facial Emotion: Sad



Text Transcript: I just saw Jim near the new archaeological museum.

Predicted Facial Emotion: Surprised



Text Transcript: The eastern coast is a place for pure pleasure and excitement.

Predicted Facial Emotion: Happy



Text Transcript: They're both exactly the same size and shape to an uncanny degree.

Predicted Facial Emotion: Neutral



Text Transcript: But some institutions would share much more pain than others.

Predicted Facial Emotion: Neutral



Mel-Spectrogram Comparison

As shown in the figure, we compare text-free L2S methods with text-dependent F2S methods. The L2S methods exhibit severe over-smoothing or acoustic artifacts, which significantly degrade speech quality and limit their practical value. The F2S methods generate higher-quality speech with the aid of reference text, but they struggle to align with the video, resulting in poor lip synchronization. In contrast, our method generates speech with richer acoustic detail and precise lip synchronization. We attribute this to our cross-modal diffusion process in the discrete space, which effectively addresses the one-to-many mapping issue.