Higgs Audio v3 TTS (4B) — transformers trust_remote_code port
A trust_remote_code packaging of bosonai/higgs-audio-v3-tts-4b
that loads with plain 🤗 transformers (no SGLang). The weights are the original
checkpoint, copied unchanged; only a small modeling_*.py / configuration_*.py
pair and an auto_map were added.
The model is a standard Qwen3-4B backbone plus a fused multi-codebook audio
embedding/head. Reference-audio encoding and waveform decoding use the
transformers-native bosonai/higgs-audio-v2-tokenizer
(higgs_audio_v2_tokenizer), loaded automatically on first use.
Requires transformers >= 5.5.
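The version requirement can be checked before loading, so a too-old install fails with a clear message rather than an obscure import error inside the remote code. The following is a minimal sketch using only the standard library; the helper names (`parse_release`, `meets_minimum`, `check_transformers`) are illustrative and not part of this repo or of transformers.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn '5.5.0.dev0' into (5, 5, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # e.g. '0rc1': keep the 0, ignore the suffix
            break
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "5.5") -> bool:
    """Compare numeric release parts only; pre-release suffixes are ignored."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers(minimum: str = "5.5") -> None:
    """Raise RuntimeError if transformers is missing or older than `minimum`."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        raise RuntimeError("transformers is not installed") from None
    if not meets_minimum(installed, minimum):
        raise RuntimeError(
            f"transformers {installed} found, but >= {minimum} is required"
        )
```

Calling `check_transformers()` once before `from_pretrained` is enough. Because the comparison ignores pre-release suffixes, a development build such as `5.5.0.dev0` is treated as meeting the requirement.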
Usage
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "multimodalart/higgs-audio-v3-tts-4b-transformers"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Zero-shot TTS
wav = model.generate_speech("Hello, this is Higgs Audio running on transformers.", tokenizer)
torchaudio.save("out.wav", wav.unsqueeze(0), model.config.sample_rate)

# Voice cloning from a reference clip (+ optional transcript)
ref, sr = torchaudio.load("reference.wav")
wav = model.generate_speech(
    "Now speaking in the cloned voice.",
    tokenizer,
    reference_audio=ref,
    reference_sample_rate=sr,
    reference_text="optional transcript of the reference clip",
    temperature=0.7,
    top_p=0.95,
)
torchaudio.save("clone.wav", wav.unsqueeze(0), model.config.sample_rate)
generate_speech returns a mono 24 kHz waveform as a 1-D CPU float32 tensor of shape [L], where L is the number of samples.
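Since the result is a plain 1-D float waveform, it can also be written without torchaudio. The sketch below uses only the standard-library `wave` module and assumes the samples lie roughly in [-1, 1] (values outside that range are clipped). The helper names `duration_seconds` and `write_pcm16_wav` are hypothetical and not part of the model's API.

```python
import struct
import wave


def duration_seconds(num_samples: int, sample_rate: int = 24000) -> float:
    """Length of a mono clip in seconds."""
    return num_samples / sample_rate


def write_pcm16_wav(dest, samples, sample_rate: int = 24000) -> int:
    """Write mono float samples as 16-bit PCM WAV.

    `dest` is a path or a writable binary file object; `samples` is any
    iterable of floats. Returns the number of frames written.
    """
    frames = bytearray()
    count = 0
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
        count += 1
    with wave.open(dest, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))
    return count
```

With the tensor from the usage example, this would be called as `write_pcm16_wav("out.wav", wav.tolist(), model.config.sample_rate)`, and `duration_seconds(wav.shape[0])` gives the clip length.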
Notes
- Generation uses Higgs' delay pattern across 8 codebooks (vocab 1026, incl. BOC/EOC specials); de-delay + decode are handled internally.
- The codec runs in fp32 (decode is unstable in bf16); the LM backbone runs in the dtype you load it in (bf16 recommended).
- License: research/non-commercial, inherited from the upstream checkpoint; see LICENSE.
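To illustrate the delay pattern mentioned in the notes, here is a small pure-Python sketch. It assumes the common layout in which codebook k is shifted right by k steps, padded with a begin-of-codebook token in front and an end-of-codebook token behind. Both that layout and the token ids 1024/1025 are assumptions made for illustration; the model's own remote code is authoritative and already does this internally.

```python
NUM_CODEBOOKS = 8
BOC_ID = 1024  # assumed id of the begin-of-codebook special token
EOC_ID = 1025  # assumed id of the end-of-codebook special token


def apply_delay(codes, boc=BOC_ID, eoc=EOC_ID):
    """Shift row k right by k steps.

    `codes` is a list of K rows with T code ids each; the result has K rows
    of length T + K - 1, padded with `boc` in front and `eoc` behind.
    """
    k = len(codes)
    return [
        [boc] * i + list(row) + [eoc] * (k - 1 - i)
        for i, row in enumerate(codes)
    ]


def undo_delay(delayed):
    """Inverse of apply_delay: realign the rows and drop the padding."""
    k = len(delayed)
    t = len(delayed[0]) - (k - 1)
    return [row[i:i + t] for i, row in enumerate(delayed)]


# Toy example: 8 codebooks, 5 frames, ids chosen so each row is recognizable.
codes = [[r * 10 + c for c in range(5)] for r in range(NUM_CODEBOOKS)]
delayed = apply_delay(codes)
restored = undo_delay(delayed)
```

In this layout, frame t of codebook k is produced at generation step t + k, so the full set of 8 codes for one audio frame only becomes available after 8 steps. Undoing the shift realigns them before the codec decodes the waveform.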