An End to End Bilingual TTS System for Fongbe and Yoruba
Abstract
This paper presents an end-to-end bilingual text-to-speech (TTS) system for Yoruba and Fongbe based on FastSpeech 2, a non-autoregressive model. Starting from this baseline, a simple concatenation of speaker, language and phoneme embeddings is used as input to the encoder and the decoder. The model is trained on a multi-speaker dataset collected for both languages. Two types of input are compared: a phoneme representation shared between the two languages and a language-specific phoneme representation. Experiments with both input representations show that the shared representation gives smoother results. With either input set, however, the proposed model is able to synthesize speech in each language with voice cloning ability, producing waveforms of good quality with high fidelity and naturalness. The proposed bilingual system is also compared with the same model trained on a monolingual dataset, showing that the bilingual dataset yields more accurate results.
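The two phoneme representations and the embedding concatenation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_vocab`, `make_table`, `encoder_input`), the embedding sizes, the random initialisation and the toy phoneme symbols are all assumptions made for illustration, and the symbols are placeholders rather than the real Yoruba or Fongbe inventories.

```python
import random


def build_vocab(phonemes_by_lang, shared):
    """Map phoneme symbols to integer IDs.

    shared=True  -> one ID per symbol, reused across languages.
    shared=False -> one ID per (language, symbol) pair.
    """
    vocab = {}
    for lang in sorted(phonemes_by_lang):
        for ph in phonemes_by_lang[lang]:
            key = ph if shared else (lang, ph)
            if key not in vocab:
                vocab[key] = len(vocab)
    return vocab


def make_table(num_rows, dim, rng):
    """Random embedding table (hypothetical stand-in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def encoder_input(phoneme_ids, speaker_id, lang_id, tables):
    """Concatenate phoneme, speaker and language embeddings per phoneme."""
    ph_tab, spk_tab, lang_tab = tables
    # List "+" is concatenation, so each frame has dim_ph + dim_spk + dim_lang values.
    return [ph_tab[p] + spk_tab[speaker_id] + lang_tab[lang_id] for p in phoneme_ids]


# Toy placeholder inventories, NOT the real phoneme sets of either language.
toy_inventory = {"yo": ["a", "b", "k"], "fon": ["a", "b", "d"]}

shared_vocab = build_vocab(toy_inventory, shared=True)
specific_vocab = build_vocab(toy_inventory, shared=False)

rng = random.Random(0)
tables = (
    make_table(len(shared_vocab), 4, rng),  # phoneme embeddings, assumed dim 4
    make_table(2, 2, rng),                  # two hypothetical speakers, dim 2
    make_table(2, 2, rng),                  # two languages, dim 2
)

# Two phonemes spoken by speaker 0 in language 1.
frames = encoder_input([0, 1], speaker_id=0, lang_id=1, tables=tables)
```

With the shared representation, symbols common to both languages map to a single ID, so the vocabulary is smaller and those phonemes receive training signal from both languages. The language-specific representation keeps a separate entry for every (language, symbol) pair.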
