Higgs Audio v3 TTS (4B) — transformers trust_remote_code port

A trust_remote_code packaging of bosonai/higgs-audio-v3-tts-4b that loads with plain 🤗 transformers (no SGLang). The weights are the original checkpoint, copied unchanged; only a small modeling_*.py / configuration_*.py pair and an auto_map were added.

The model is a standard Qwen3-4B backbone plus a fused multi-codebook audio embedding/head. Reference-audio encoding and waveform decoding use the transformers-native bosonai/higgs-audio-v2-tokenizer (higgs_audio_v2_tokenizer), loaded automatically on first use.

Requires transformers >= 5.5.

Usage

import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "multimodalart/higgs-audio-v3-tts-4b-transformers"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Zero-shot TTS
wav = model.generate_speech("Hello, this is Higgs Audio running on transformers.", tokenizer)
torchaudio.save("out.wav", wav.unsqueeze(0), model.config.sample_rate)

# Voice cloning from a reference clip (+ optional transcript)
ref, sr = torchaudio.load("reference.wav")
wav = model.generate_speech(
    "Now speaking in the cloned voice.",
    tokenizer,
    reference_audio=ref,
    reference_sample_rate=sr,
    reference_text="optional transcript of the reference clip",
    temperature=0.7,
    top_p=0.95,
)
torchaudio.save("clone.wav", wav.unsqueeze(0), model.config.sample_rate)

generate_speech returns a mono 24 kHz waveform as a 1-D CPU float32 tensor of shape [L], where L is the number of samples.
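If torchaudio is not available, the returned waveform can also be written with the Python standard library. The following is a minimal sketch, not part of this repo: the helper name save_wav_pcm16 is hypothetical, and it assumes the samples lie roughly in [-1, 1] (values outside that range are clipped).

```python
import struct
import wave


def save_wav_pcm16(samples, path, sample_rate=24000):
    """Write float samples (assumed to lie in [-1, 1]) as mono 16-bit PCM.

    Hypothetical helper for illustration; returns the clip duration in seconds.
    """
    # Clip first so out-of-range values cannot overflow int16.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    pcm = struct.pack(
        "<%dh" % len(clipped),
        *(int(round(s * 32767)) for s in clipped),
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm)
    return len(clipped) / sample_rate


# Usage with the tensor returned by generate_speech (Tensor.tolist() gives floats):
# save_wav_pcm16(wav.tolist(), "out.wav", model.config.sample_rate)
```

Converting through a Python list is slow for long clips; it is used here only to keep the sketch dependency-free.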

Notes

  • Generation uses Higgs' delay pattern across 8 codebooks (vocabulary size 1026, including the BOC/EOC special tokens); removing the delay and decoding to a waveform are handled internally.
  • The codec runs in fp32 (decode is unstable in bf16); the LM backbone runs in the dtype you load it in (bf16 recommended).
  • License: research/non-commercial, inherited from the upstream checkpoint — see LICENSE.
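
For readers unfamiliar with delay patterns, here is a toy sketch of the general idea: codebook k is shifted right by k steps so that each step can condition on the coarser codebooks of earlier frames. This is not the repo's implementation. The padding scheme and the ids BOC_ID = 1024 and EOC_ID = 1025 are assumptions chosen for illustration (1024 codec entries plus two specials would be consistent with a vocabulary of 1026, but the actual layout is not stated here).

```python
# Illustrative only: these ids and the padding scheme are assumptions,
# not values read from the checkpoint.
NUM_CODEBOOKS = 8
BOC_ID = 1024  # assumed "beginning of codes" id
EOC_ID = 1025  # assumed "end of codes" id


def apply_delay(codes):
    """Shift codebook k right by k steps.

    `codes` is a list of equal-length rows, one per codebook. Each output row
    has length num_frames + num_codebooks - 1: BOC fills the leading gap and
    EOC fills the trailing gap.
    """
    num_codebooks = len(codes)
    return [
        [BOC_ID] * k + list(row) + [EOC_ID] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


def remove_delay(delayed):
    """Invert apply_delay, recovering the time-aligned codes."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Round trip on dummy codes: 8 codebooks, 5 frames.
codes = [[k * 10 + t for t in range(5)] for k in range(NUM_CODEBOOKS)]
delayed = apply_delay(codes)
restored = remove_delay(delayed)
```

With generate_speech none of this is needed by the caller; the sketch only shows what "de-delay" refers to.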