Home Finetune VITS and MMS TTS on Low Resource Language Like Uyghur
Post
Cancel

Finetune VITS and MMS TTS on Low Resource Language Like Uyghur

Introduction

VITS is a light weight, low-latency model for English text-to-speech (TTS). Massively Multilingual Speech (MMS) is an extension of VITS for multilingual TTS, that supports over 1100 languages. Both use the same underlying VITS architecture, consisting of a discriminator and a generator for GAN-based training. They differ in their tokenizers: the VITS tokenizer transforms English input text into phonemes, while the MMS tokenizer transforms input text into character-based tokens. You should fine-tune VITS-based checkpoints if you want to use a permissive English TTS model and fine-tune MMS-based checkpoints for every other cases. Coupled with the right data and the following training recipe, you can get an excellent finetuned version of every VITS/MMS checkpoints in 20 minutes with as little as 80 to 150 samples.

This post is licensed under CC BY 4.0 by the author.