Fine-tune VITS and MMS TTS on a Low-Resource Language Like Uyghur

By Piyazon

Posted Sep 26 1 min read

Introduction

VITS is a lightweight, low-latency model for English text-to-speech (TTS). Massively Multilingual Speech (MMS) is an extension of VITS for multilingual TTS that supports over 1100 languages. Both use the same underlying VITS architecture, which consists of a discriminator and a generator for GAN-based training. They differ in their tokenizers: the VITS tokenizer transforms English input text into phonemes, while the MMS tokenizer transforms input text into character-based tokens. Fine-tune a VITS-based checkpoint if you want a permissively licensed English TTS model, and fine-tune an MMS-based checkpoint in all other cases. Coupled with the right data and the following training recipe, you can get an excellent fine-tuned version of any VITS/MMS checkpoint in 20 minutes with as few as 80 to 150 samples.
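To make the tokenizer difference concrete, here is a toy sketch of character-based tokenization in plain Python. It is not the actual MMS tokenizer: the function names, the reserved `<pad>` id, and the two sample strings are assumptions made only for illustration. The point is that a character-based scheme needs no pronunciation dictionary, which is part of why it is convenient for a low-resource language.

```python
from typing import Dict, List


def build_char_vocab(corpus: List[str]) -> Dict[str, int]:
    """Map every distinct character in the corpus to an integer id.

    Id 0 is reserved for a padding token (a choice made for this sketch).
    """
    chars = sorted({ch for line in corpus for ch in line.lower()})
    vocab = {"<pad>": 0}
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Turn text into character ids, dropping characters not in the vocabulary."""
    return [vocab[ch] for ch in text.lower() if ch in vocab]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Invert encode() by looking each id up in the reversed vocabulary."""
    inverse = {idx: ch for ch, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


# Sample strings in Arabic-script Uyghur, used here only as example input.
corpus = ["ياخشىمۇسىز", "رەھمەت"]
sample = corpus[1]

vocab = build_char_vocab(corpus)
ids = encode(sample, vocab)
print(ids)
print(decode(ids, vocab))
```

A phoneme-based tokenizer, by contrast, would first convert the text to a phoneme sequence with a language-specific grapheme-to-phoneme step, which this sketch does not attempt.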

Blogging, Tutorial

This post is licensed under CC BY 4.0 by the author.
