Diana Hungarian TTS (VITS)
Hungarian single-speaker TTS model trained on the KTH/hungarian-single-speaker-tts dataset (10 hours, single female speaker from LibriVox).
Converted from jaywalnut310/vits format to HuggingFace Transformers.
Audio Samples
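"Egy magyar készítésű TTS modell vagyok." ("I am a Hungarian-made TTS model.")
"Alapvetően az Egri Csillagok felolvasásából születtem, így lehet, hogy egy kicsit monoton a hangom néha." ("I was essentially born from a reading of Egri Csillagok, so my voice may sound a little monotonous at times.")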
CSS10 Hungarian: Single Speaker Speech Dataset
The corpus features a single speaker, with 4515 segments extracted from a single LibriVox audiobook. It contains about 10 hours of audio data.
Training
The model was trained on a single RTX 3090 GPU for 3 days (200K steps) with a batch size of 16. We saved some checkpoints together with the optimizer states, so the model can be trained further; however, we did not find any noticeable improvement after step 150K. The model was recently converted to HuggingFace Transformers format for easier usage.
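For a rough sense of scale, the step count can be converted into passes over the corpus. This is only a back-of-the-envelope sketch: it assumes all 4515 segments were used for training in every epoch, which the card does not state (a validation split would raise the figure slightly).

```python
# Figures taken from the model card.
TRAINING_STEPS = 200_000
BATCH_SIZE = 16
NUM_SEGMENTS = 4515  # assumption: every segment is in the training split


def approx_epochs(steps: int, batch_size: int, num_segments: int) -> float:
    """Estimate how many times the training loop saw each segment."""
    return steps * batch_size / num_segments


epochs_total = approx_epochs(TRAINING_STEPS, BATCH_SIZE, NUM_SEGMENTS)
epochs_at_150k = approx_epochs(150_000, BATCH_SIZE, NUM_SEGMENTS)
print(f"~{epochs_total:.0f} epochs in total, ~{epochs_at_150k:.0f} by step 150K")
```

Under that assumption the run covers roughly 700 passes over the data, with the plateau reported above starting at roughly 530.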
Usage
Requires: pip install phonemizer and sudo apt install espeak-ng.
from transformers import VitsTokenizer, VitsModel, set_seed
from phonemizer import phonemize
import re, torch

def hungarian_cleaners(text):
    text = text.lower()
    text = re.sub(r'\bmr\.', 'mister', text)
    text = re.sub(r'\bdr\.', 'doktor', text)
    phonemes = phonemize(text, language='hu', backend='espeak',
                         strip=True, preserve_punctuation=True, with_stress=True)
    return re.sub(r'\s+', ' ', phonemes)

tokenizer = VitsTokenizer.from_pretrained("legekka/diana-hungarian-tts-vits-hf")
model = VitsModel.from_pretrained("legekka/diana-hungarian-tts-vits-hf")

text = "Helló! Diána hangja vagyok."
cleaned = hungarian_cleaners(text)
inputs = tokenizer(cleaned, return_tensors="pt", phonemize=False)

set_seed(42)
with torch.no_grad():
    outputs = model(inputs["input_ids"])

from scipy.io.wavfile import write
write("output.wav", 22050, outputs.waveform[0].numpy())
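The snippet above passes a float array to scipy, which writes a floating-point WAV file. If a player or downstream tool expects 16-bit PCM instead, the waveform can be converted first. The following is a minimal sketch using only the standard library; `write_pcm16_wav` is a hypothetical helper, not part of this model or of Transformers, and it assumes the samples are mono floats in the range [-1.0, 1.0].

```python
import struct
import wave


def write_pcm16_wav(path, samples, sampling_rate=22050):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sampling_rate)
        wav.writeframes(bytes(frames))
```

With the variables from the usage example, a call might look like `write_pcm16_wav("output_pcm16.wav", outputs.waveform[0].tolist())`.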
Model Details
| Parameter | Value |
|---|---|
| Architecture | VITS |
| Sampling rate | 22050 Hz |
| Vocab size | 178 |
| Text cleaner | hungarian_cleaners (IPA via espeak-ng) |
| Training steps | 200K |
| GPU | RTX 3090 |
| License | CC-BY-4.0 |
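The sampling rate in the table is what ties the number of generated samples to playback time. As a small illustration, the helper below (a hypothetical name, not part of the model's API) converts a sample count into seconds at 22050 Hz.

```python
SAMPLING_RATE_HZ = 22050  # value from the Model Details table


def duration_seconds(num_samples: int, sampling_rate: int = SAMPLING_RATE_HZ) -> float:
    """Return the playback length in seconds of a mono waveform."""
    return num_samples / sampling_rate


# Two seconds of audio at 22050 Hz correspond to 44100 samples.
print(duration_seconds(44100))
```

With the usage example, the sample count would be `outputs.waveform.shape[-1]`.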