OpenYourMind

Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated — MTPLX 4-bit (MoE MTP head)

Overview

This is the MLX 4-bit build of OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated with the Qwen3.5 MoE Multi-Token-Prediction (MTP) head included, packaged for MTPLX native MTP speculative decoding on Apple Silicon.

The language/vision weights are byte-identical to the -MLX-4bit build. The only additions are the MTP head (mtp.safetensors, BF16, 4.7 GB) and a config.json pointer (mlx_lm_extra_tensors.mtp_file). That sidecar is ignored by plain mlx-lm/mlx-vlm, so this folder still loads as an ordinary MLX model — but with MTPLX it also drives speculative decoding.

  • Language: 4-bit, group size 64 (MoE routing gates kept at higher precision by the model's quant predicate), ≈ 4.5 bits/weight.
  • MTP head: 1 layer, MoE (router + 256 experts / 8 active + shared expert), full self-attention, BF16 (785 tensors). MTPLX stacks the experts into switch_mlp at load and verifies every drafted token against the target model.
  • Vision: the BF16 vision tower from the base build is still present; MTPLX runs the text path only. For image input, use the -MLX-4bit repo with mlx-vlm.

⚠️ Requires MTPLX with Qwen3.5-MoE MTP support

The Qwen3.5 MTP head is an MoE block. MTPLX ≤ 0.3.7 only supported a dense Qwen MTP head and will reject this model with invalid-mtp-tensor-layout. Support is added in MTPLX PR #84.

Until that lands in a release, install from the branch:

pip install "git+https://github.com/janfeddersen-wq/MTPLX.git@qwen3-5-moe-mtp"
# after the PR is merged & released:  pip install -U mtplx

Usage

MODEL=OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MTPLX-4bit

# one-shot, with acceptance stats
mtplx ask --model "$MODEL" --prompt "Explain Rayleigh scattering simply." --mtp --stats --yes

# interactive terminal chat
mtplx start cli --model "$MODEL" --yes

# OpenAI-compatible server
mtplx quickstart --model "$MODEL" --port 8000 --yes

--yes accepts the "family-compatible-unverified" gate (no recorded exactness baseline is shipped). Add --no-mtp to compare against plain autoregressive decoding.

Measured (M5 Max, 128 GB)

120-token greedy run: depth-1 acceptance ≈ 70 %, accepted_by_depth = [40, 19, 3] of [57, 57, 56] drafted → 120 tokens in 57 target verify passes (≈ 2.1 tokens/verify), ~52 decode tok/s. (Contrary to the earlier note on the base card, the MoE MTP head does yield a real speedup once a runtime can consume it.)

Known limitation — MoE exactness

At temperature 0, MTP vs non-MTP greedy output is ~98 % identical and re-converges immediately, but occasionally flips a single token. This is the MoE router hitting a near-tie that resolves differently under batched verification vs single-token decode (an inherent MoE/FP effect), not a drafting error — the target model verifies every token. Strict bit-exactness for MoE heads is still being worked out (e.g. fp32 router logits during verify); see PR #84.

Files

File Description Size
model-*-of-00014.safetensors 4-bit language weights + BF16 vision tower ~65 GB
mtp.safetensors MoE MTP head (BF16) 4.7 GB
config.json Qwen3_5MoeForConditionalGeneration + quantization + mlx_lm_extra_tensors.mtp_file
tokenizer*, chat_template.jinja, generation_config.json, processor configs Standard

Total on disk: ~70 GB.

Hardware

Needs roughly ≥ 80 GB unified memory to load with usable context (65 GB base + ~5 GB BF16 MTP + KV cache). Runs comfortably on 96 GB+ M-series Macs.

Support & Community

☕ If these models are useful to you, consider supporting my work — it funds compute for more & larger abliterations.

Buy Me A Coffee

buymeacoffee.com/oym.kuato

Notes

Disclaimer

Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.

Downloads last month
1,820
Safetensors
Model size
20B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MTPLX-4bit