Facetron

Facetron: A Multi-speaker Face-to-Speech Model Based on Cross-Modal Latent Representations

Accepted to EUSIPCO 2023


Abstract


In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions.

Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and can be controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images.
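The conditioning scheme described above can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the authors' implementation; the module names, argument names, and tensor shapes are all assumptions.

    import torch
    import torch.nn as nn

    class Facetron(nn.Module):
        """Minimal sketch of the described pipeline (hypothetical module names)."""
        def __init__(self, lip_encoder, face_encoder, generator):
            super().__init__()
            self.lip_encoder = lip_encoder    # lip-reading model -> linguistic features
            self.face_encoder = face_encoder  # cross-modal predictor -> speaker embedding
            self.generator = generator        # GAN generator -> speech waveform

        def forward(self, lip_frames, face_image):
            # Linguistic features from lip movements: (batch, time, dim_l)
            linguistic = self.lip_encoder(lip_frames)
            # Speaker embedding predicted from a face image: (batch, dim_s)
            speaker = self.face_encoder(face_image)
            # Broadcast the speaker embedding over time and concatenate, keeping
            # the two conditions separate so either can be swapped at inference.
            speaker = speaker.unsqueeze(1).expand(-1, linguistic.size(1), -1)
            condition = torch.cat([linguistic, speaker], dim=-1)
            return self.generator(condition)  # (batch, num_samples)

Because the two conditions enter the generator as independent streams, either one can be replaced on its own, which is what the disentanglement study below exploits.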

We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the performance of the linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for the multi-speaker and unseen conditions, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA).


Small dataset


We trained our model using a small dataset consisting of four speakers (s1, s2, s4, s29).
Although the Lip2Wav samples show speech quality similar to Facetron's, the pronunciation of the Facetron samples is more accurate than that of all previous models.


Speaker  Silent video  Reference  Facetron (Ours)  Lip2Wav  Voc-based  GAN-based  Text
s1       [video]       [audio]    [audio]          [audio]  [audio]    [audio]    place white with y 2 now
s2       [video]       [audio]    [audio]          [audio]  [audio]    [audio]    lay blue in d 1 now
s4       [video]       [audio]    [audio]          [audio]  [audio]    [audio]    bin green by u 1 please
s29      [video]       [audio]    [audio]          [audio]  [audio]    [audio]    set white in r 8 please


Large dataset - seen


Out of the 33 speakers in the full dataset, we set four speakers (s1, s2, s4, s29) as unseen speakers and excluded them from the training process.
Samples below are from seen speakers.
Facetron outperforms Lip2Wav in terms of speaker similarity, speech quality, and accuracy.


Speaker  Silent video  Reference  Facetron (Ours)  Lip2Wav  Text
s5       [video]       [audio]    [audio]          [audio]  bin red at f 6 please
s6       [video]       [audio]    [audio]          [audio]  bin green with n 8 soon
s11      [video]       [audio]    [audio]          [audio]  bin white at f 7 again
s14      [video]       [audio]    [audio]          [audio]  bin blue by k 9 please
s15      [video]       [audio]    [audio]          [audio]  bin blue by k 6 now
s23      [video]       [audio]    [audio]          [audio]  bin green with s 7 again


Large dataset - unseen


Samples below are from unseen speakers.
Facetron generates a voice of the correct gender (a male voice for a male face and a female voice for a female face).
Facetron also maintains speech quality for unseen speakers, whereas Lip2Wav does not.


Speaker  Silent video  Reference  Facetron (Ours)  Lip2Wav  Text
s1       [video]       [audio]    [audio]          [audio]  lay blue with y 7 soon
s1       [video]       [audio]    [audio]          [audio]  place green with r 4 now
s2       [video]       [audio]    [audio]          [audio]  bin white at a 2 again
s2       [video]       [audio]    [audio]          [audio]  set green in i 1 now
s4       [video]       [audio]    [audio]          [audio]  bin green with b 3 please
s4       [video]       [audio]    [audio]          [audio]  lay white by z 4 again
s29      [video]       [audio]    [audio]          [audio]  bin green in q 6 now
s29      [video]       [audio]    [audio]          [audio]  place white in a 0 now


Ablation study - disentanglement


Samples below show the successful disentanglement of the linguistic and speaker identity features in Facetron.
They are synthesized using lip features and a face embedding taken from different speakers in the large-dataset scenario.
Facetron extracts lip features from the original lip movements (without speech) and estimates the face embedding from a target face.
Therefore, the synthesized speech should contain the same text as the reference speech, while its voice should match the target face.
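As a minimal sketch of this cross-pairing, reusing the hypothetical Facetron module sketched in the abstract (model, lip_frames_speaker_a, and face_image_speaker_b are placeholder names):

    import torch

    # Hypothetical cross-pairing for the disentanglement test, assuming `model`
    # is an instance of the Facetron sketch above: linguistic content comes from
    # speaker A's silent video, voice identity from speaker B's face image.
    linguistic = model.lip_encoder(lip_frames_speaker_a)   # what is said
    speaker = model.face_encoder(face_image_speaker_b)     # who appears to say it
    speaker = speaker.unsqueeze(1).expand(-1, linguistic.size(1), -1)
    waveform = model.generator(torch.cat([linguistic, speaker], dim=-1))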

Reference speech with original face  Target face  Synthesized speech with target face
[audio]                              [image]      [audio]


Ablation study - effect of cosine similarity (CS) loss


We conducted an ablation study to verify the effectiveness of the CS loss.
Facetron, which is trained with the CS loss, produces clearer and more intelligible speech samples for the predicted unseen speaker than the model trained without it.
The samples from the model without the CS loss show unclear pronunciation and poor sound quality because the generation process is not confined to a specific speaker's characteristics.
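A cosine-similarity loss of the kind described here is typically implemented as 1 - cos(pred, ref). The sketch below is a generic version (our assumption, not necessarily the paper's exact formulation), pulling the face-derived speaker embedding toward the embedding produced by the pre-trained acoustic model:

    import torch.nn.functional as F

    def cosine_similarity_loss(pred_embedding, ref_embedding):
        """Generic CS loss sketch: 1 - cos(pred, ref), averaged over the batch.

        pred_embedding: speaker embedding predicted from the face image
        ref_embedding:  speaker embedding from the pre-trained acoustic model
        """
        cos = F.cosine_similarity(pred_embedding, ref_embedding, dim=-1)
        return (1.0 - cos).mean()

Minimizing this loss drives the predicted embedding toward the reference speaker's direction, which matches the observation above that the model trained with the CS loss stays confined to a specific speaker's characteristics.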

Reference speech with original face  With CS loss (Facetron)  Without CS loss
[audio]                              [audio]                  [audio]