Facetron: A Multi-speaker Face-to-Speech Model Based on Cross-Modal Latent Representations
Accepted to EUSIPCO 2023
Abstract
In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions.
Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images.
We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the performance of the linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for multi-speaker and unseen conditions, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA).
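As a rough illustration of the conditioning scheme described above, the sketch below concatenates frame-level linguistic features (from a lip-reading model) with a time-broadcast face embedding to form the generator's auxiliary condition. All dimensions and the concatenation itself are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Hypothetical sizes: number of frames, linguistic dim, speaker dim.
T, D_LING, D_SPK = 40, 256, 128

rng = np.random.default_rng(0)
linguistic = rng.standard_normal((T, D_LING))  # stand-in for lip-reading features
speaker = rng.standard_normal(D_SPK)           # stand-in for the face embedding

# The two features are independent: the speaker identity can be changed by
# replacing `speaker` while keeping `linguistic` (the content) fixed.
condition = np.concatenate(
    [linguistic, np.broadcast_to(speaker, (T, D_SPK))], axis=1
)
print(condition.shape)  # (40, 384)
```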
Small dataset
We trained our model using a small dataset consisting of four speakers (s1, s2, s4, s29).
Although the Lip2Wav samples show speech quality similar to Facetron's, the pronunciation in the Facetron samples is more accurate than in all previous models.
| Speaker | Silent video | Reference | Facetron (Ours) | Lip2Wav | Voc-based | GAN-based | Text |
| --- | --- | --- | --- | --- | --- | --- | --- |
| s1 | | | | | | | place white with y 2 now |
| s2 | | | | | | | lay blue in d 1 now |
| s4 | | | | | | | bin green by u 1 please |
| s29 | | | | | | | set white in r 8 please |
Large dataset - seen
Out of the 33 speakers in the full dataset, we set four speakers (s1, s2, s4, s29) as unseen speakers and excluded them from the training process.
Samples below are from seen speakers.
Facetron outperforms Lip2Wav in terms of speaker similarity, speech quality, and accuracy.
| Speaker | Silent video | Reference | Facetron (Ours) | Lip2Wav | Text |
| --- | --- | --- | --- | --- | --- |
| s5 | | | | | bin red at f 6 please |
| s6 | | | | | bin green with n 8 soon |
| s11 | | | | | bin white at f 7 again |
| s14 | | | | | bin blue by k 9 please |
| s15 | | | | | bin blue by k 6 now |
| s23 | | | | | bin green with s 7 again |
Large dataset - unseen
Samples below are from unseen speakers.
Facetron generates a voice of the correct gender (a male voice for a male face, a female voice for a female face).
Facetron also maintains speech quality for unseen speakers, whereas Lip2Wav does not.
| Speaker | Silent video | Reference | Facetron (Ours) | Lip2Wav | Text |
| --- | --- | --- | --- | --- | --- |
| s1 | | | | | lay blue with y 7 soon |
| s1 | | | | | place green with r 4 now |
| s2 | | | | | bin white at a 2 again |
| s2 | | | | | set green in i 1 now |
| s4 | | | | | bin green with b 3 please |
| s4 | | | | | lay white by z 4 again |
| s29 | | | | | bin green in q 6 now |
| s29 | | | | | place white in a 0 now |
Ablation study - disentanglement
Samples below show successful disentanglement of linguistic and speaker identity features in Facetron.
They are synthesized using lip features and a face embedding from different speakers in the large-dataset scenario.
Facetron extracts lip features from the original lip movements (without speech) and estimates the face embedding from a target face.
Therefore, the synthesized speech should contain the same text as the reference speech, and the voice should match the target face.
| Reference speech with original face | Target face | Synthesized speech with target face |
| --- | --- | --- |
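The feature swap behind these disentanglement samples can be sketched as follows, under the assumption that the generator conditions on a simple concatenation of the two feature streams (the dimensions are illustrative, not from the paper): the linguistic stream comes from the source speaker's lips, while the identity embedding comes from a different target face.

```python
import numpy as np

T, D_LING, D_SPK = 40, 256, 128  # hypothetical sizes

rng = np.random.default_rng(1)
lip_feats_src = rng.standard_normal((T, D_LING))  # content: source speaker's lip movements
face_emb_tgt = rng.standard_normal(D_SPK)         # identity: a *different* speaker's face

# Content from the source lips, voice identity from the target face; because
# the two streams are disentangled, each can be swapped independently.
cond = np.concatenate(
    [lip_feats_src, np.broadcast_to(face_emb_tgt, (T, D_SPK))], axis=1
)
print(cond.shape)  # (40, 384)
```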
Ablation study - effect of cosine similarity (CS) loss
We conducted an ablation study to verify the effectiveness of the CS loss.
Facetron, which is trained with the CS loss, produces clearer and more intelligible speech for the predicted unseen speaker than the model trained without it.
The samples from the model without the CS loss show unclear pronunciation and poor sound quality because the generation process is not confined to a specific speaker's characteristics.
| Reference speech with original face | With CS loss (Facetron) | Without CS loss |
| --- | --- | --- |
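A cosine-similarity loss of the kind ablated here is commonly written as one minus the cosine of the angle between the predicted embedding and its target; the sketch below is a minimal numpy version under that assumption (the paper's exact formulation may differ).

```python
import numpy as np

def cosine_similarity_loss(pred, target, eps=1e-8):
    """1 - cos(pred, target): pulls the face-predicted embedding toward the
    speaker embedding from the pre-trained acoustic model."""
    cos = np.dot(pred, target) / (
        np.linalg.norm(pred) * np.linalg.norm(target) + eps
    )
    return 1.0 - cos

v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity_loss(v, v))   # ~0.0 for identical embeddings
print(cosine_similarity_loss(v, -v))  # ~2.0 for opposite embeddings
```

Minimizing this loss constrains the predicted embedding's direction to that of a specific speaker, which matches the observation above that generation without it is not confined to one speaker's characteristics.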