Facetron: A Multi-speaker Face-to-Speech Model Based on Cross-Modal Latent Representations
Accepted to EUSIPCO 2023
Abstract
In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions.
Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images.
We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the performance of the linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for multi-speaker and unseen conditions, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA).
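As a rough illustration of the conditioning scheme described above, the sketch below concatenates frame-level linguistic features (from a lip-reading model) with a time-broadcast face embedding to form the generator's auxiliary condition. All dimensions and the concatenation itself are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Hypothetical sizes: number of frames, linguistic dim, speaker dim.
T, D_LING, D_SPK = 40, 256, 128

rng = np.random.default_rng(0)
linguistic = rng.standard_normal((T, D_LING))  # stand-in for lip-reading features
speaker = rng.standard_normal(D_SPK)           # stand-in for the face embedding

# The two features are independent: the speaker identity can be changed by
# replacing `speaker` while keeping `linguistic` (the content) fixed.
condition = np.concatenate(
    [linguistic, np.broadcast_to(speaker, (T, D_SPK))], axis=1
)
print(condition.shape)  # (40, 384)
```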
Small dataset
We trained our model using a small dataset consisting of four speakers (s1, s2, s4, s29).
Although the Lip2Wav samples show speech quality similar to Facetron's, the pronunciation in the Facetron samples is more accurate than in all previous models.
| Speaker | Silent video | Reference | Facetron (Ours) | Lip2Wav | Voc-based | GAN-based | Text |
| --- | --- | --- | --- | --- | --- | --- | --- |
| s1 | | | | | | | place white with y 2 now |
| s2 | | | | | | | lay blue in d 1 now |
| s4 | | | | | | | bin green by u 1 please |
| s29 | | | | | | | set white in r 8 please |
Large dataset - seen
Out of the 33 speakers in the full dataset, we set four speakers (s1, s2, s4, s29) as unseen speakers and excluded them from the training process.
Samples below are from seen speakers.
Facetron outperforms Lip2Wav in terms of speaker similarity, speech quality, and accuracy.
| Speaker | Silent video | Reference | Facetron (Ours) | Lip2Wav | Text |
| --- | --- | --- | --- | --- | --- |
| s5 | | | | | bin red at f 6 please |
| s6 | | | | | bin green with n 8 soon |
| s11 | | | | | bin white at f 7 again |
| s14 | | | | | bin blue by k 9 please |
| s15 | | | | | bin blue by k 6 now |
| s23 | | | | | bin green with s 7 again |
Large dataset - unseen
Samples below are from unseen speakers.
Facetron generates a voice of the correct gender (a male voice for a male face, a female voice for a female face).
Facetron also maintains speech quality for unseen speakers, whereas Lip2Wav does not.
| Speaker | Silent video | Reference | Facetron (Ours) | Lip2Wav | Text |
| --- | --- | --- | --- | --- | --- |
| s1 | | | | | lay blue with y 7 soon |
| s1 | | | | | place green with r 4 now |
| s2 | | | | | bin white at a 2 again |
| s2 | | | | | set green in i 1 now |
| s4 | | | | | bin green with b 3 please |
| s4 | | | | | lay white by z 4 again |
| s29 | | | | | bin green in q 6 now |
| s29 | | | | | place white in a 0 now |
Ablation study - disentanglement
Samples below show successful disentanglement of linguistic and speaker identity features in Facetron.
They are synthesized using lip features and a face embedding from different speakers in the large-dataset scenario.
Facetron extracts lip features from the original lip movements (without speech) and estimates the face embedding from a target face.
Therefore, the synthesized speech should contain the same text as the reference speech, and the voice should match the target face.
| Reference speech with original face | Target face | Synthesized speech with target face |
| --- | --- | --- |
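The feature swap behind these disentanglement samples can be sketched as follows, under the assumption that the generator conditions on a simple concatenation of the two feature streams (the dimensions are illustrative, not from the paper): the linguistic stream comes from the source speaker's lips, while the identity embedding comes from a different target face.

```python
import numpy as np

T, D_LING, D_SPK = 40, 256, 128  # hypothetical sizes

rng = np.random.default_rng(1)
lip_feats_src = rng.standard_normal((T, D_LING))  # content: source speaker's lip movements
face_emb_tgt = rng.standard_normal(D_SPK)         # identity: a *different* speaker's face

# Content from the source lips, voice identity from the target face; because
# the two streams are disentangled, each can be swapped independently.
cond = np.concatenate(
    [lip_feats_src, np.broadcast_to(face_emb_tgt, (T, D_SPK))], axis=1
)
print(cond.shape)  # (40, 384)
```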
Ablation study - effect of cosine similarity (CS) loss
We conducted an ablation study to verify the effectiveness of the CS loss.
Facetron, which is trained with the CS loss, produces clearer and more intelligible speech for the predicted unseen speaker than the model trained without it.
The samples from the model without the CS loss show unclear pronunciation and poor sound quality because the generation process is not confined to a specific speaker's characteristics.
| Reference speech with original face | With CS loss (Facetron) | Without CS loss |
| --- | --- | --- |
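A cosine-similarity loss of the kind ablated here is commonly written as one minus the cosine of the angle between the predicted embedding and its target; the sketch below is a minimal numpy version under that assumption (the paper's exact formulation may differ).

```python
import numpy as np

def cosine_similarity_loss(pred, target, eps=1e-8):
    """1 - cos(pred, target): pulls the face-predicted embedding toward the
    speaker embedding from the pre-trained acoustic model."""
    cos = np.dot(pred, target) / (
        np.linalg.norm(pred) * np.linalg.norm(target) + eps
    )
    return 1.0 - cos

v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity_loss(v, v))   # ~0.0 for identical embeddings
print(cosine_similarity_loss(v, -v))  # ~2.0 for opposite embeddings
```

Minimizing this loss constrains the predicted embedding's direction to that of a specific speaker, which matches the observation above that generation without it is not confined to one speaker's characteristics.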