---
license: other
license_name: bosonai-higgs-audio-v3
license_link: LICENSE
language:
- en
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- voice-cloning
- higgs-audio
- qwen3
- custom_code
base_model:
- bosonai/higgs-audio-v3-tts-4b
library_name: transformers
---

# Higgs Audio v3 TTS (4B) — transformers `trust_remote_code` port

A `trust_remote_code` packaging of [`bosonai/higgs-audio-v3-tts-4b`](https://huggingface.co/bosonai/higgs-audio-v3-tts-4b)
that loads with plain 🤗 transformers (no SGLang). The weights are the original
checkpoint, copied unchanged; only a small `modeling_*.py` / `configuration_*.py`
pair and an `auto_map` were added.

The model is a standard **Qwen3-4B** backbone plus a fused multi-codebook audio
embedding/head. Reference-audio encoding and waveform decoding use the
transformers-native [`bosonai/higgs-audio-v2-tokenizer`](https://huggingface.co/bosonai/higgs-audio-v2-tokenizer)
(`higgs_audio_v2_tokenizer`), loaded automatically on first use.

Requires `transformers >= 5.5`.

## Usage

```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "multimodalart/higgs-audio-v3-tts-4b-transformers"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Zero-shot TTS
wav = model.generate_speech("Hello, this is Higgs Audio running on transformers.", tokenizer)
torchaudio.save("out.wav", wav.unsqueeze(0), model.config.sample_rate)

# Voice cloning from a reference clip (+ optional transcript)
ref, sr = torchaudio.load("reference.wav")
wav = model.generate_speech(
    "Now speaking in the cloned voice.",
    tokenizer,
    reference_audio=ref,
    reference_sample_rate=sr,
    reference_text="optional transcript of the reference clip",
    temperature=0.7,
    top_p=0.95,
)
torchaudio.save("clone.wav", wav.unsqueeze(0), model.config.sample_rate)
```

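`generate_speech` returns a mono 24 kHz waveform as a 1-D CPU float32 tensor of
shape `[L]` (`L` samples), which is why the examples above call `unsqueeze(0)`
to add a channel dimension before `torchaudio.save`.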
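
If `torchaudio` is not available for saving, a mono float waveform at 24 kHz can be written with the standard library alone. The following is a minimal sketch, not part of this repo: `write_pcm16` is a hypothetical helper, the sine wave stands in for a real `wav.tolist()`, and the 24 kHz constant is taken from the description above rather than read from `model.config.sample_rate`.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # assumed to match model.config.sample_rate


def write_pcm16(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] as 16-bit PCM; return duration in seconds."""
    # Clip first so out-of-range values cannot overflow the int16 range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    frames = struct.pack(f"<{len(ints)}h", *ints)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(frames)
    return len(ints) / sample_rate


# Stand-in for `wav.tolist()`: one second of a 440 Hz tone.
samples = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
path = os.path.join(tempfile.gettempdir(), "higgs_sketch.wav")
duration = write_pcm16(path, samples)
```

With a real result, `write_pcm16("out.wav", wav.tolist())` would take the place of the synthetic tone.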

## Notes

- Generation uses the Higgs delay pattern across 8 codebooks (vocab 1026,
  including the BOC/EOC special tokens); undoing the delay and decoding to a
  waveform are handled internally.
- The codec runs in fp32 (decode is unstable in bf16); the LM backbone runs in
  the dtype you load it in (bf16 recommended).
- License: research/non-commercial, inherited from the upstream checkpoint — see
  [`LICENSE`](LICENSE).
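
For readers unfamiliar with delay patterns, the idea mentioned in the notes above can be sketched in plain Python. This is an illustration of the general technique, not the code shipped in this repo: it assumes codebook `k` is shifted right by `k` steps, and the ids `BOC_ID = 1024` and `EOC_ID = 1025` are guesses at how the two specials fill out the 1026-entry vocabulary. The function names are hypothetical.

```python
NUM_CODEBOOKS = 8
BOC_ID = 1024  # assumed id for the begin-of-codes special
EOC_ID = 1025  # assumed id for the end-of-codes special


def apply_delay(codes):
    """Shift codebook k right by k steps.

    codes: list of K rows, each holding T code ids.
    Returns K rows of length T + K - 1, padded with BOC on the left
    and EOC on the right.
    """
    num_codebooks = len(codes)
    return [
        [BOC_ID] * k + list(row) + [EOC_ID] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


def revert_delay(delayed):
    """Undo apply_delay: slice each row back to its original T steps."""
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]


# Toy input: 8 codebooks, 5 time steps of arbitrary code ids.
codes = [[(k * 10 + t) % 1024 for t in range(5)] for k in range(NUM_CODEBOOKS)]
delayed = apply_delay(codes)
restored = revert_delay(delayed)
```

In this layout, each generation step emits one token per codebook, and the codes for a single audio frame are spread over 8 consecutive steps; reverting the delay realigns them before the codec decodes the waveform.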