Mellum2-12B-A2.5B-Instruct Q4_K_M on Jetson Orin Nano 8GB

OneBadUnit · June 2, 2026, 11:36pm

I tested Mellum2-12B-A2.5B-Instruct Q4_K_M on a Jetson Orin Nano 8GB.

System:

Jetson Orin Nano 8GB
Ubuntu 22.04.5
25W power mode
NVMe storage

Testing performed:

Built current official llama.cpp successfully.
Built the Mellum2 branch of llama.cpp successfully.
Downloaded the community GGUF successfully.
Verified model download and local file loading.

Results:

CUDA-enabled Mellum2 build consistently failed with CUDA allocation errors during model loading.
CPU-only Mellum2 build loaded successfully.
CPU-only inference technically worked, but generation speed was extremely slow and not practical for real-world use.
No usable configuration was found on my Jetson Orin Nano 8GB during testing.

This appears to be neither a download issue nor a GGUF corruption issue. The model can load in CPU-only mode, but I was unable to achieve practical GPU-accelerated inference on this hardware.

John6666 · June 3, 2026, 4:55am

Hmm… Even outside Jetson, if the practical memory budget is only around 8 GB, I think using LLMs much beyond the 7B–8B class is already pretty rough…

Below is my current read of the situation for Jetson Orin Nano 8GB + GGUF + llama.cpp, especially for trying to run Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf.

Short version

I would not treat this as a “bad GGUF file” first.

I would treat it as a memory-budget problem plus a Jetson runtime-contract problem:

Orin Nano 8GB has 8GB shared LPDDR5 memory, not “8GB VRAM + lots of normal system RAM”.
MoE “active parameters” help compute cost, but the runtime still has to deal with the full quantized weight file, KV cache, CUDA buffers, OS memory, file cache, and server overhead.
Mellum2-12B-A2.5B-Instruct is a 12B total / 2.5B active MoE model, so “A2.5B” does not make it equivalent to a 2.5B dense model for memory residency.
For this board, I would generally start from 2B–4B models, then maybe test 8B low-bit / edge-MoE models only after the Jetson stack and llama.cpp settings are known-good.
I would keep Mellum2 on this board in the “interesting experiment” bucket, not the “practical recommendation” bucket.

Why this is hard on Orin Nano 8GB

The key issue is not just parameter count. It is the whole runtime memory picture.

Component	Why it matters on Orin Nano 8GB
GGUF model weights	The quantized file still has to be mapped/loaded and used by the runtime. An 8GB-class GGUF is already too close to the total memory pool.
KV cache	Grows with context length, number of layers, KV heads, cache dtype, and parallel slots. Long context is expensive even if the weights fit.
CUDA buffers	GPU offload needs extra temporary and persistent CUDA allocations.
OS + desktop + services	They consume memory from the same 8GB pool.
File cache / mmap behavior	Can help or hurt depending on pressure; it does not create more physical RAM.
Swap	Can prevent crashes, but if actual inference is paging heavily, performance can become unusable.
CPU offload	On a desktop with 32GB+ RAM, this can be a real escape hatch. On Orin Nano 8GB, CPU and GPU share the same small memory pool.

This is why a model that is “interesting on a 32GB RAM PC” can still be a bad fit for Orin Nano 8GB.

Mellum2-specific issue

The model itself is interesting. It is not a toy model.

JetBrains/Mellum2-12B-A2.5B-Instruct is described as a coding/software-engineering model with:

12B total parameters
2.5B active parameters per token
MoE with 64 experts and 8 active experts
131,072-token context
GQA with 32 Q heads and 4 KV heads
software engineering, code generation/editing, tool use, function calling, and agentic workflows as core target use cases

But for Orin Nano 8GB, the important part is:

12B total weights still matter. “2.5B active” helps per-token compute, but it does not magically make the full model fit like a 2.5B dense model.

That is also why larger Qwen MoE models can be interesting on a 32GB RAM CPU box, but not necessarily on Orin Nano 8GB. On a PC, llama.cpp can sometimes make good use of system RAM. On Orin Nano 8GB, the “system RAM” is still the same tiny shared pool.

Practical interpretation of NVIDIA/Jetson guidance

A practical guide on NVIDIA’s forum says that, with llama.cpp and Q4_K GGUF files, Orin Nano 8GB can fit approximately:

Class	Approximate upper range from the guide
LLMs	up to around 10B parameters
VLMs	up to around 4B parameters

Source: AI Models That Run on Jetson Orin Nano Super 8GB — A Practical Guide

I would read that as an upper-bound / best-case sizing guide, not as “10B will be comfortable”.

A more conservative operational reading is:

GGUF size / model class	My expectation on Orin Nano 8GB
<= 2GB	Usually comfortable if the runtime is configured correctly.
2GB–4GB	Best practical target zone.
4GB–5.5GB	Possible, but context/batch/KV/cache/offload settings matter a lot.
5.5GB–7GB	Experimental; expect tuning and instability.
7GB+	Usually not a good practical recommendation on this board.
8GB-class GGUF	The model file itself is too close to the total memory pool. Expect CUDA allocation failures, very short context, or unusable CPU fallback.

So, for this board, I would not start with a 12B MoE Q4 model. I would start with 2B–4B and then test 8B low-bit only if the baseline is stable.
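
As a rough illustration of the sizing table above, here is a small shell sketch that maps a GGUF file size to those tiers. The function names (gguf_tier, file_size_mib) and the tier labels are my own shorthand, and the thresholds are just the table's GB boundaries converted to MiB, so treat them as a judgment call rather than a hard rule:

```shell
# gguf_tier SIZE_MIB: map a GGUF file size in MiB to the rough tiers from the table.
# Hypothetical helper; thresholds are 2, 4, 5.5 and 7 GB expressed in MiB.
gguf_tier() (
  size=$1
  if   [ "$size" -le 2048 ]; then echo "comfortable"
  elif [ "$size" -le 4096 ]; then echo "best practical target"
  elif [ "$size" -le 5632 ]; then echo "possible, settings matter"
  elif [ "$size" -le 7168 ]; then echo "experimental"
  else                            echo "not recommended"
  fi
)

# file_size_mib PATH: size of a file in MiB, rounded down.
file_size_mib() (
  bytes=$(wc -c < "$1")
  echo $(( bytes / 1048576 ))
)
```

Something like gguf_tier "$(file_size_mib ./model.gguf)" would then print the tier for a local file. It only looks at file size, so it says nothing about context length or KV cache settings.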

First thing to check: JetPack / L4T version

There is a very relevant recent NVIDIA forum thread where Gemma 4 E4B Q4_K_M failed with CUDA OOM on Orin Nano, then NVIDIA identified a known memory issue in r36.4.7 that was fixed in r36.5.

Source:

Gemma4 E4B on Jetson Orin Nano fails due to CUDA out of memory issue

So before changing models, I would check the Jetson software stack:

cat /etc/nv_tegra_release
dpkg-query --show nvidia-l4t-core

If this shows an affected JetPack/L4T combination, update the runtime first. Otherwise model-level debugging can be misleading.
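
If you want to script that check, a minimal sketch could look like this. It assumes the first line of /etc/nv_tegra_release has the usual "# R36 (release), REVISION: 4.7, ..." layout, and the helper names (l4t_version, l4t_is_affected) are hypothetical. It only flags the one release named in the NVIDIA thread:

```shell
# l4t_version: read the first line of /etc/nv_tegra_release on stdin and
# print the version as e.g. "36.4.7". Prints nothing if the layout differs.
l4t_version() {
  sed -n 's/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9][0-9.]*\).*/\1.\2/p' | head -n 1
}

# l4t_is_affected VERSION: succeeds only for r36.4.7, the release the
# forum thread describes as having the known memory issue.
l4t_is_affected() {
  [ "$1" = "36.4.7" ]
}
```

Usage would be along the lines of: v=$(head -n 1 /etc/nv_tegra_release | l4t_version); l4t_is_affected "$v" && echo "update L4T first".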

Recommended baseline: use the Jetson-oriented llama.cpp container first

Before doing custom builds, I would first try the NVIDIA AI IoT / Jetson AI Lab llama.cpp container.

The NVIDIA forum guide points to this container tag:

ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin

Source:

NVIDIA forum practical guide

Jetson AI Lab’s Gemma 4 E2B page also says the model is configured to run on Jetson with vLLM and llama.cpp, and describes E2B as the edge-first low-memory member of the Gemma 4 family:

Jetson AI Lab — Gemma 4 E2B

A conservative smoke test would be something like:

sudo docker run -it --rm --pull always \
  --runtime=nvidia \
  --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server \
    -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S \
    -c 1024 \
    -ngl auto \
    -fa on \
    -ctk q8_0 \
    -ctv q8_0 \
    -b 64 \
    -ub 32 \
    --host 0.0.0.0 \
    --port 8080

This is not because Gemma 4 E2B is necessarily the best coding model. It is because it is a known-good Jetson baseline.

If this fails, I would suspect the Jetson stack, container/runtime, CUDA setup, power mode, memory pressure, or JetPack/L4T version before blaming Mellum2.

Runtime tuning checklist

These are the knobs I would try, in order.

1. Reduce context length aggressively

Do not start with 8K, 32K, or 128K context on Orin Nano 8GB.

Start with:

-c 512

Then only increase if stable:

-c 1024
-c 2048

For a very marginal model:

-c 256

In llama.cpp, KV cache memory grows roughly linearly with --ctx-size (-c). Long context is not free.

Reference:

llama.cpp server docs

2. Reduce batch and micro-batch

For memory-constrained Jetson runs, I would not use large defaults.

Try:

-b 64 -ub 32

If it still fails:

-b 32 -ub 16

This may slow prompt processing, but it can avoid allocation failures.

Reference:

llama.cpp server docs

3. Quantize the KV cache

Try safer KV cache quantization first:

-ctk q8_0 -ctv q8_0

If the model is still too tight:

-ctk q4_0 -ctv q4_0

I would not start with q4_0 unless memory is really tight. It can hurt output quality, but it is worth trying as a last-resort fit test. Note that llama.cpp needs Flash Attention enabled (-fa on) for a quantized V cache.

Reference:

llama.cpp server docs

4. Use `-fit` / fitting mode when available

Recent llama.cpp builds have model-fitting behavior in the server path. The Gemma 4 E4B NVIDIA forum log explicitly shows:

common_init_result: fitting params to device memory
llama_params_fit_impl: projected to use 5533 MiB of device memory vs. 6387 MiB of free device memory

Source:

Gemma4 E4B CUDA OOM thread

So I would let llama.cpp fit parameters when using the Jetson container, unless debugging a suspected fitting bug.

5. Do not blindly maximize GPU layers

-ngl 999 or --n-gpu-layers 99 can be fine when there is enough memory. On Orin Nano 8GB, it can also push the system over the edge.

Try:

-ngl auto

or step manually:

-ngl 8
-ngl 16
-ngl 24

If CPU-only works but GPU mode fails, the model may be too tight for the CUDA path even if the weights can be mapped.
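
The manual stepping above can be sketched as a small loop. This is only an illustration: find_max_ngl and NGL_STEPS are names I made up, and the helper assumes you pass it a command that loads the model, exits on its own, and accepts -ngl as a trailing argument:

```shell
# find_max_ngl CMD [ARGS...]: run "CMD ARGS... -ngl N" for each N in $NGL_STEPS
# (default "8 16 24") and print the last N for which the command succeeded.
# Stops at the first failure; prints an empty line if even the first step fails.
find_max_ngl() (
  best=""
  for n in ${NGL_STEPS:-8 16 24}; do
    if "$@" -ngl "$n" >/dev/null 2>&1; then
      best=$n
    else
      break
    fi
  done
  echo "$best"
)
```

For example, assuming llama-bench is part of your build, something like find_max_ngl llama-bench -m ./model.gguf -n 8 should report the highest step that still loaded. A long-running llama-server would not work here because it never exits.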

6. Try MoE CPU expert offload only as an experiment

llama.cpp has MoE-related CPU offload options such as --cpu-moe / --n-cpu-moe in recent builds.

This can help on systems with real CPU RAM headroom, for example:

desktop PC with 32GB/64GB RAM
small GPU with large system RAM

On Orin Nano 8GB, the CPU and GPU share the same limited memory pool, so this is not a true escape hatch. Still, for MoE models, it may be worth testing:

-ncmoe 20

But I would not expect it to rescue an 8GB-class Mellum2 GGUF on an 8GB Jetson.

Reference:

llama.cpp server docs

7. Avoid `--mlock` on low-memory Jetson

--mlock can be useful on some systems, but on an 8GB Jetson it can make memory pressure worse by preventing paging.

In this case I would avoid:

--mlock

Reference:

llama.cpp server docs

8. Try `--no-mmap` only as a late diagnostic

mmap behavior can interact with page cache and memory pressure. I would keep the default first, then try:

--no-mmap

only if debugging load behavior.

Reference:

llama.cpp server docs

System-level memory tuning

Jetson AI Lab has a specific RAM optimization guide:

Jetson AI Lab — RAM Optimization

The jetson-containers setup guide also has practical advice on swap and disabling the desktop GUI:

jetson-containers setup guide

Disable the desktop GUI

If using the desktop environment, stopping it can free a meaningful amount of memory.

Temporary:

sudo init 3

Restore:

sudo init 5

For Orin Nano 8GB, even a few hundred MB can matter.

Add NVMe swap

Swap is not a performance solution, but it can prevent immediate OOM kills and help identify whether the failure is a hard fit problem.

Example:

sudo systemctl disable nvzramconfig
sudo fallocate -l 16G /mnt/16GB.swap
sudo mkswap /mnt/16GB.swap
sudo swapon /mnt/16GB.swap

Persist in /etc/fstab:

/mnt/16GB.swap none swap sw 0 0

Prefer NVMe over SD card for swap if available.
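
A small sketch for making the swap setup repeatable, assuming the same /mnt/16GB.swap path as above. ensure_line and swap_active are hypothetical helper names. The first avoids duplicate /etc/fstab entries if you run the setup twice, and the second checks /proc/swaps-style text for an active swap file:

```shell
# ensure_line FILE LINE: append LINE to FILE only if that exact line is not already there.
ensure_line() (
  file=$1
  line=$2
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
)

# swap_active PATH: read /proc/swaps-style text on stdin and succeed
# if PATH is listed as an active swap area (the header line is skipped).
swap_active() {
  awk -v p="$1" 'NR > 1 && $1 == p { found = 1 } END { exit !found }'
}
```

From a root shell this would be used as ensure_line /etc/fstab "/mnt/16GB.swap none swap sw 0 0" and swap_active /mnt/16GB.swap < /proc/swaps. I would try it against a copy of fstab first.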

Clear caches before a load test

For repeatable testing:

sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

This does not fix the model, but it makes memory-pressure tests cleaner.

Use max power / clocks after the model actually fits

Power mode will not make an oversized model fit, but it matters once the model runs.

Check and set power mode:

sudo nvpmodel -q
sudo nvpmodel -m <mode>
sudo jetson_clocks

The exact mode depends on the installed JetPack / board configuration.

If building llama.cpp yourself

If not using the Jetson container, build with CUDA explicitly.

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=87

cmake --build build --config Release -j

Orin is compute capability 8.7 / SM87, and NVIDIA forum logs for Orin show ARCHS = 870.

If experimenting with memory-constrained CUDA behavior, these build options may be worth knowing about:

-DGGML_CUDA_FORCE_MMQ=ON
-DGGML_CUDA_FA_ALL_QUANTS=ON

But I would first test the NVIDIA container before going deep into custom builds.

A realistic Mellum2 last-try profile

If you still want to try Mellum2 just to see whether it can load, I would use an extreme low-memory profile.

This is a load experiment, not a practical recommendation:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m ./Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf \
  -c 256 \
  -ngl auto \
  -fa on \
  -ctk q4_0 \
  -ctv q4_0 \
  -b 32 \
  -ub 16 \
  --host 0.0.0.0 \
  --port 8080

If your llama.cpp build supports MoE CPU expert offload:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m ./Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf \
  -c 256 \
  -ngl auto \
  -fa on \
  -ctk q4_0 \
  -ctv q4_0 \
  -b 32 \
  -ub 16 \
  -ncmoe 20 \
  --host 0.0.0.0 \
  --port 8080

Again, I would not expect this to become a good daily-use setup. If it loads but runs very slowly, that is still useful information: it confirms that the bottleneck is not simply “unsupported model”, but practical memory and bandwidth limits.

Better model targets for this board

For Orin Nano 8GB, I would pick models by actual GGUF size and runtime stability, not leaderboard score alone.

Practical shortlist

Model family	Why it is more realistic than Mellum2 Q4	Suggested quant range
Gemma 4 E2B	Explicit Jetson AI Lab support; edge-first low-memory target.	Q4_K_S / Q4_K_M
Gemma 4 E4B	Supported on Jetson, but already near the upper edge on Nano 8GB.	Q3 / Q4, short context
Qwen 4B-class models	Good instruction/coding/tool-use balance if GGUF support is current.	Q4 / IQ4 / Q3
Qwen2.5-Coder-3B	Older but still a useful coding-specific baseline.	Q4 / Q5
StarCoder2-3B	Better for FIM/completion-style coding than chat-agent use.	Q4 / Q5
LFM2 / LFM2.5 8B-A1B	Edge-oriented MoE; interesting if Q3/IQ4 GGUF fits.	Q3 / IQ4
Granite Code 3B	Practical code model, enterprise-friendly posture.	Q4 / Q5
OpenCoder 1.5B/8B	Code-specialized candidate; 8B needs low-bit.	1.5B Q5/Q8, 8B Q3

Models I would not recommend first on Orin Nano 8GB

Model type	Why I would avoid it first
Mellum2 Q4_K_M	12B total MoE; Q4 file is too close to / above practical memory budget.
12B+ dense models	Even if they load, context and speed will likely be poor.
30B-A3B / 35B-A3B MoE	Interesting on 32GB RAM PCs, not on 8GB shared-memory Jetson.
Qwen3-Coder-Next 80B-A3B	Very interesting model, wrong memory class for this board.
Devstral-style 20B+ coding agents	Good benchmark story, wrong memory budget for Orin Nano 8GB.
Long-context runs	Context length will often fail before “model intelligence” matters.

Suggested debugging sequence

I would debug in this order.

Step	Goal	Command / action
1	Confirm JetPack/L4T	`cat /etc/nv_tegra_release`
2	Confirm available memory	`free -h`, `tegrastats`
3	Stop desktop GUI	`sudo init 3`
4	Add NVMe swap	16GB swap on NVMe if available
5	Test known-small GGUF	Gemma 4 E2B or a 2B–3B GGUF
6	Test 4B-class model	Qwen 4B / Gemma E4B / coder 3B
7	Tune ctx/KV/batch	`-c 512`, `-ctk q8_0`, `-b 64`, `-ub 32`
8	Try 8B low-bit	only after baseline is stable
9	Try Mellum2	only as a load experiment
10	Decide	if small models work but Mellum2 fails, the conclusion is memory budget, not setup failure
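
For step 2, a minimal sketch of a pre-flight check that reads MemAvailable before a load attempt. The helper names and the idea of a safety margin are my own assumptions, and on Jetson MemAvailable is only an approximation of what a CUDA allocation can actually get:

```shell
# mem_available_mib: read /proc/meminfo-style text on stdin, print MemAvailable in MiB.
mem_available_mib() {
  awk '$1 == "MemAvailable:" { print int($2 / 1024) }'
}

# has_headroom AVAIL_MIB NEED_MIB MARGIN_MIB: succeed if AVAIL - NEED >= MARGIN.
has_headroom() {
  [ $(( $1 - $2 )) -ge "$3" ]
}
```

Example: avail=$(mem_available_mib < /proc/meminfo); has_headroom "$avail" 4463 1024 || echo "too tight". Here 4463 stands in for the GGUF size in MiB and 1024 for an arbitrary margin.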

Example “known-good first” profile

sudo docker run -it --rm --pull always \
  --runtime=nvidia \
  --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server \
    -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S \
    -c 1024 \
    -ngl auto \
    -fa on \
    -ctk q8_0 \
    -ctv q8_0 \
    -b 64 \
    -ub 32 \
    --host 0.0.0.0 \
    --port 8080

If this works, move to a 3B/4B coding model.

If this fails, Mellum2 is not the right next test. Fix the Jetson runtime first.

Example “4B-class” profile

sudo docker run -it --rm --pull always \
  --runtime=nvidia \
  --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server \
    -hf <repo>:<quant> \
    -c 512 \
    -ngl auto \
    -fa on \
    -ctk q8_0 \
    -ctv q8_0 \
    -b 64 \
    -ub 32 \
    --host 0.0.0.0 \
    --port 8080

If it fails:

-c 256
-ctk q4_0
-ctv q4_0
-b 32
-ub 16

If it only works with -c 256, it may technically run but not be useful for coding assistance.

How I would summarize the recommendation

For Orin Nano 8GB, I would frame it this way:

Mellum2 is an interesting coding MoE, but the Orin Nano 8GB memory budget is probably the wrong target for the Q4_K_M GGUF. The board has 8GB shared memory, so CPU offload is not the same escape hatch that it is on a 32GB RAM PC. I would first validate the Jetson stack with NVIDIA’s Jetson-oriented llama.cpp container and a known-small model such as Gemma 4 E2B. Then I would test 3B–4B coding/instruct models, or possibly 8B-A1B / 8B low-bit models. I would only try Mellum2 with very short context, quantized KV cache, tiny batch sizes, and possibly unified-memory fallback as a load experiment.

Final practical take

I would not spend too much time trying to force Mellum2-12B-A2.5B-Instruct-Q4_K_M onto an Orin Nano 8GB.

I would do this instead:

Make sure JetPack/L4T is not on a known-problem release.
Use the Jetson-oriented llama.cpp container.
Disable GUI and add NVMe swap if possible.
Validate with Gemma 4 E2B or another 2B–3B GGUF.
Try 3B–4B coding models.
Try 8B low-bit / 8B-A1B edge-MoE models only after that.
Treat Mellum2 as an experiment, not the practical target.

For this board, the useful question is probably not:

“Can I run a 12B MoE somehow?”

It is more:

“Which 2B–4B or low-bit 8B model gives the best coding usefulness per GB of actual Jetson memory?”

OneBadUnit · June 3, 2026, 6:57pm

Hi! Thanks for the suggestions. I worked through them in order on a Jetson Orin Nano 8GB running llama.cpp.

System

Upgraded from JetPack/L4T 36.4.7 to 36.5.0
Verified 25W mode (nvpmodel)
Verified jetson_clocks
Monitored with tegrastats

Qwen2.5-Coder-7B-Instruct Q4_K_M

Loaded and ran normally
~11.2-11.3 tokens/sec generation
Performance appears consistent with reported Orin Nano results

Tests performed

-ngl 60, 80, 99: no meaningful difference
Context 1024 vs 2048: minimal difference
Flash Attention: no meaningful difference
Batch/microbatch tuning: no meaningful difference
KV cache q8_0: significantly slower
KV cache q4_0: degraded output quality

Memory observations

Qwen 7B used ~5.1 GB RAM under load
Modest swap usage
System remained stable

Granite/Mellum test results

Using ibm-granite/granite-4.0-h-small-GGUF:Q4_K_M:

-ngl 99: CUDA OOM, attempted ~18.6 GiB allocation
-ngl 20: CUDA OOM, attempted ~9.0 GiB allocation
-ngl 10: model loaded successfully and generated output

However, at -ngl 10 performance was extremely slow. Generation started eventually, but first-token latency was very long, memory usage was near system limits, swap was active, and even interrupting the process took noticeable time.

John6666 · June 3, 2026, 9:22pm

Oh. That is an important data point. I was estimating the headroom quite conservatively because Jetson uses shared CPU/GPU RAM, but it looks like there may be room for larger models than I expected.

Thanks for reporting the actual measurements. I would update my earlier take based on your Qwen2.5-Coder-7B-Instruct Q4_K_M result.

The important correction is:

7B Q4_K_M is clearly not out of range for Jetson Orin Nano 8GB.
It can be practical when the model architecture, GGUF size, context length, JetPack/L4T version, and llama.cpp backend are friendly.

Your Qwen test is especially useful because it is not just a “loaded once” result. It looks like a stable practical baseline:

Observation	What I think it means
`Qwen2.5-Coder-7B-Instruct Q4_K_M` loaded and ran normally	7B Q4 is a real target on this board, not just a theoretical fit.
~11.2–11.3 tok/s generation	Very usable for local coding assistance on a small edge board.
~5.1GB RAM under load	Consistent with a ~4.7GB GGUF plus runtime/KV/buffer overhead.
Modest swap usage, stable system	This does not sound like “barely alive by swapping”; it sounds usable.
`-ngl 60`, `80`, `99` made little difference	Once the important path is offloaded, decode may be memory-bandwidth/runtime limited rather than layer-count limited.
Context `1024` vs `2048` made little difference	Qwen2.5 7B has GQA and a relatively small KV cache at ordinary chat lengths.
Flash Attention made little difference	At short/medium context, attention may not be the dominant cost.
Batch/microbatch tuning made little difference	Mostly affects prompt processing, not necessarily steady-state decode.
KV cache `q8_0` was significantly slower	KV quantization is not automatically faster; it can hit slower kernels or dequant overhead.
KV cache `q4_0` degraded quality	Useful as an emergency memory lever, not necessarily a good default when the model already fits.

So I would now treat Qwen2.5-Coder-7B-Instruct Q4_K_M as the known-good coding baseline for this device.

Revised practical tiering

My previous framing was too conservative if read as “Jetson Orin Nano 8GB can only really do 2B–4B”. Based on your Qwen result, I would revise the tiers like this:

Tier	Orin Nano 8GB interpretation
2B–4B Q4/Q5	Safe baseline. Good first test for JetPack/container/llama.cpp sanity.
7B Q4_K_M	Practical. Your Qwen2.5-Coder result proves this can be a real target.
8B Q4 / 8B low-bit	Worth testing carefully. Architecture and GGUF size matter a lot.
9B Q4	Possible but more aggressive; likely sensitive to context/runtime/settings.
12B MoE Q4	Mostly experimental on this board. Active parameters can be misleading.
30B+ total MoE / 32B total models	Not a practical 8GB shared-memory Jetson target unless the goal is boundary testing.

For a normal 8GB discrete GPU, 7B–8B Q4_K_M models are already a common practical target. The Jetson Orin Nano 8GB is a special case because its 8GB LPDDR5 is shared by CPU and GPU, so it is not the same as “8GB VRAM plus separate system RAM”.

NVIDIA lists the Orin Nano Super Developer Kit as 8GB 128-bit LPDDR5, 102GB/s, with 1024 CUDA cores, 32 Tensor Cores, and 7W–25W power modes: Jetson Orin Nano Super Developer Kit.

That still means shared-memory constraints are real. But your result shows that the practical ceiling is not as low as I first implied.
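
As a back-of-envelope sanity check on the "memory-bandwidth limited" reading, here is a rough sketch. It assumes a dense model reads every weight once per generated token, ignores KV cache traffic, and uses the 102GB/s peak figure, which real workloads will not reach. The function names are hypothetical:

```shell
# decode_ceiling_tps BANDWIDTH_GBPS WEIGHTS_GB: bandwidth divided by bytes read
# per token (approximated by the weight file size), printed with one decimal.
decode_ceiling_tps() {
  LC_ALL=C awk -v bw="$1" -v w="$2" 'BEGIN { printf "%.1f\n", bw / w }'
}

# efficiency_pct OBSERVED_TPS CEILING_TPS: observed speed as a percentage of the ceiling.
efficiency_pct() {
  LC_ALL=C awk -v o="$1" -v c="$2" 'BEGIN { printf "%.0f\n", 100 * o / c }'
}
```

With the ~4.68GB Qwen2.5-Coder-7B Q4_K_M file, decode_ceiling_tps 102 4.68 gives 21.8 tok/s as an idealized ceiling, and efficiency_pct 11.2 21.8 gives 51, so the measured ~11.2 tok/s sits at roughly half of it. This estimate does not carry over directly to MoE models, which read only part of their weights per token.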

Why Qwen2.5-Coder-7B had more headroom than expected

I think the result makes sense for several reasons.

1. The actual Q4_K_M GGUF is only about 4.68GB

The Qwen2.5-Coder-7B-Instruct Q4_K_M file is about 4.68GB:

Qwen/Qwen2.5-Coder-7B-Instruct-GGUF — Q4_K_M file

That leaves some room for:

llama.cpp runtime buffers
CUDA allocations
KV cache
OS/services
mmap/page-cache behavior
server overhead

A rough mental model is:

Qwen2.5-Coder-7B Q4_K_M weights:  ~4.68GB
KV cache at normal chat lengths:   relatively small
runtime/CUDA/server overhead:      additional memory
observed under load:               ~5.1GB RAM

So the ~5.1GB RAM observation is quite plausible. This model is not close to the same memory class as an 8GB-class GGUF or a 19GB-class GGUF.

2. Qwen2.5 7B uses GQA, so ordinary-context KV cache is small

Qwen2.5 7B uses Grouped-Query Attention. Its model card lists:

28 layers
28 query heads
4 key/value heads
GQA
RoPE
SwiGLU
RMSNorm
QKV bias

Source:

Qwen/Qwen2.5-7B-Instruct

The 4 KV heads matter. With GQA, the model stores fewer KV heads than ordinary full multi-head attention. That keeps KV cache smaller.

A rough f16 KV cache estimate for Qwen2.5 7B is:

2 (one K and one V tensor per layer)
× 4 KV heads
× 128 head_dim
× 2 bytes per f16 value
× 28 layers
= 57,344 bytes/token
= 56 KiB/token

Approximate KV cache size:

Context	Approx. f16 KV cache
512	~28MB
1024	~56MB
2048	~112MB
4096	~224MB
8192	~448MB

This is small compared with the 4.68GB weight file. That explains why ctx 1024 vs 2048 did not change much.

It also explains why KV quantization was not automatically helpful. If KV cache is already small, quantizing it saves little absolute memory, while it may introduce slower kernels, dequant overhead, or output-quality loss.
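
The estimate above is easy to reproduce for other models. A minimal sketch, with hypothetical function names, using the Qwen2.5 7B numbers from the model card (4 KV heads, head_dim 128, 28 layers, 2 bytes per f16 value):

```shell
# kv_bytes_per_token KV_HEADS HEAD_DIM LAYERS BYTES_PER_VALUE
# The leading 2 accounts for one K and one V tensor per layer.
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# kv_cache_mib BYTES_PER_TOKEN CONTEXT: total KV cache size in MiB, rounded down.
kv_cache_mib() {
  echo $(( $1 * $2 / 1048576 ))
}
```

kv_bytes_per_token 4 128 28 2 prints 57344, and kv_cache_mib 57344 2048 prints 112, matching the table. This ignores any per-slot duplication and assumes plain full attention, so treat it as a lower-bound sketch.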

3. Qwen2.5-Coder-7B is a mature dense-transformer path

Qwen2.5-Coder-7B-Instruct is a code-specialized dense model in the Qwen2.5-Coder family. The technical report describes the Qwen2.5-Coder series as including 0.5B, 1.5B, 3B, 7B, 14B, and 32B models, with continued pretraining on more than 5.5T tokens and evaluations across code generation, completion, reasoning, and repair.

This matters because Qwen2.5-Coder-7B is not just a random 7B chat model. It is a strong code-specialized model that happens to fit into a realistic GGUF size for this board.

That makes it a very good baseline.

Why Granite H-Small behaved so differently

Your Granite result also makes sense, but I would not compare granite-4.0-h-small to the Qwen 7B run as if they were adjacent model sizes.

IBM’s naming is a little easy to misread here. Granite 4.0 H-Small is not a small 7B-class model. IBM describes Granite 4.0 H-Small as a 32B total / 9B active hybrid MoE model, while H-Tiny is 7B total / 1B active and H-Micro/Micro are 3B-class models.

So the granite-4.0-h-small result is consistent with expectations:

Model	Practical memory class
`Qwen2.5-Coder-7B Q4_K_M`	~4.68GB dense GGUF; fits with useful headroom.
`Granite 4.0 H-Small Q4_K_M`	32B total / 9B active hybrid model; not a 7B-class target.
`Mellum2-12B-A2.5B Q4_K_M`	12B total MoE; active 2.5B does not remove weight-residency cost.

Your Granite observations are exactly what I would expect from a model in the wrong memory class for this board:

Granite H-Small result	Interpretation
`-ngl 99`: CUDA OOM, attempted ~18.6GiB allocation	Full/near-full offload is impossible on 8GB shared RAM.
`-ngl 20`: CUDA OOM, attempted ~9.0GiB allocation	Still above the practical device-memory budget.
`-ngl 10`: loaded but extremely slow	Technically possible to start, but the system is near limits and paging/offload/latency dominate.

For Granite on Orin Nano 8GB, I would test smaller variants instead:

Granite model	Why it is more relevant
`granite-4.0-h-micro-GGUF`	3B hybrid model; safer Jetson memory class.
`granite-4.0-micro-GGUF`	3B conventional transformer option.
`granite-4.0-h-tiny-GGUF`	7B total / 1B active; much more comparable to your successful Qwen 7B run.
`granite-4.0-h-small-GGUF`	32B total / 9B active; useful boundary test, not a practical Nano 8GB target.

Why Mellum2 is still a different case from Qwen2.5-Coder-7B

Mellum2 remains interesting, but I would still not put it in the same class as Qwen2.5-Coder-7B on this board.

Mellum2 is a 12B total / 2.5B active MoE model. The technical report describes 64 experts, 8 active experts, GQA, sliding-window attention, and a multi-token prediction head.

The crucial point is:

Active parameters reduce per-token compute, but total resident weights still matter.

So even if Mellum2 runs with 2.5B active parameters per token, it is still a 12B-total MoE model for weight storage and runtime layout. That makes it very different from a 4.68GB dense Qwen2.5-Coder-7B GGUF.

I would now phrase Mellum2 like this:

Question	Updated answer
Is Mellum2 impossible?	Not necessarily. It is worth experimenting with if the goal is boundary testing.
Is Mellum2 Q4_K_M a practical recommendation for Orin Nano 8GB?	I still doubt it.
Does Qwen2.5-Coder-7B success imply Mellum2 should also work?	No. The memory class and architecture are different.
What would make Mellum2 more interesting?	A high-quality lower-bit GGUF, careful MoE offload behavior, and very short context tests.

JetPack/L4T probably mattered too

Your upgrade from JetPack/L4T 36.4.7 to 36.5.0 is also important.

There is a related NVIDIA forum thread where Gemma4 E4B failed with CUDA OOM on Orin Nano, and NVIDIA later confirmed it working on r36.5 / JetPack 6.2.2. The thread also mentions a known memory issue in r36.4.7 that was fixed in r36.5.

Reference:

Gemma4 E4B on Jetson Orin Nano fails due to CUDA out of memory issue

So part of the Qwen success may be that you were no longer testing on a release with a known memory problem.

Useful checks for future posts:

cat /etc/nv_tegra_release
dpkg-query --show nvidia-l4t-core
sudo nvpmodel -q
tegrastats
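
To turn tegrastats output into numbers that are easy to paste into a post, something like the following may help. It assumes tegrastats lines contain a "RAM used/totalMB" field, which is what I have seen on Orin but may differ between releases. The helper names are hypothetical:

```shell
# tegrastats_ram: read tegrastats lines on stdin, print "USED TOTAL" (MB) per line.
tegrastats_ram() {
  sed -n 's/.*RAM \([0-9][0-9]*\)\/\([0-9][0-9]*\)MB.*/\1 \2/p'
}

# peak_ram_mb: read tegrastats lines on stdin, print the highest RAM "used" value seen.
peak_ram_mb() {
  tegrastats_ram | awk '$1 > max { max = $1 } END { print max + 0 }'
}
```

For example, timeout 60 tegrastats --interval 1000 | peak_ram_mb while a model is generating should give the peak shared-memory use over a minute.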

What I would test next

Given your result, I would not spend most of the time trying to force Mellum2 first. I would use your Qwen result as the reference point and compare other models against it.

Practical next candidates

Priority	Candidate	Reason
1	Keep `Qwen2.5-Coder-7B-Instruct Q4_K_M`	Known-good coding baseline on this exact board.
2	`Qwen2.5-Coder-7B` other quants	Compare Q4_K_M vs Q5_K_M or IQ4/IQ3 if available.
3	Qwen 7B/8B-ish newer coder/instruct GGUFs	Same broad architecture family may preserve good runtime behavior.
4	`Granite 4.0 H-Tiny`	7B total / 1B active; much more relevant than H-Small.
5	`Granite 4.0 Micro` / `H-Micro`	3B-class safe Granite tests.
6	LFM2 / LFM2.5 8B-A1B low-bit	Interesting edge-MoE class, but should be tested against the Qwen baseline.
7	StarCoder2-3B / OpenCoder / Granite Code 3B	Useful for code completion or smaller coding tasks.

Models I would keep as boundary tests

Model	Why
Mellum2 Q4_K_M	Interesting 12B MoE, but still likely too large for comfortable Nano 8GB use.
Granite 4.0 H-Small	32B total / 9B active. Your results already show the boundary.
Qwen 30B-A3B / 35B-A3B MoE	Interesting on 32GB RAM PCs, not this board.
Qwen3-Coder-Next 80B-A3B	Very interesting model class, wrong memory class for Orin Nano 8GB.

Updated conclusion

My revised interpretation is:

Orin Nano 8GB has enough practical headroom for well-behaved 7B Q4_K_M models.
The successful Qwen2.5-Coder-7B run shows that clearly.
But that extra headroom does not automatically extend to every active-small MoE model, because total GGUF size, resident weights, KV layout, architecture, and JetPack/L4T behavior still dominate.

So I would no longer say “stay mostly under 4B” for coding models on this board.

I would say:

2B–4B is the safe zone.
7B Q4_K_M is now proven practical by your Qwen result.
8B low-bit / 8B Q4 is worth exploring.
9B Q4 is possible but aggressive.
12B MoE Q4 is still mostly experimental.
32B/9B-active or 30B+ MoE is not a practical Nano 8GB target.

The most useful takeaway for me is:

Use actual GGUF size and architecture, not just parameter count or active parameter count.

That explains all three results:

Model	Result	Likely reason
`Qwen2.5-Coder-7B Q4_K_M`	Practical	~4.68GB dense GGUF, GQA, small KV, mature runtime path.
`Granite 4.0 H-Small Q4_K_M`	OOM or extremely slow	32B total / 9B active hybrid model; wrong memory class.
`Mellum2-12B-A2.5B Q4_K_M`	Still doubtful	12B total MoE; active 2.5B does not make it behave like a 2.5B dense model in memory.

OneBadUnit · June 4, 2026, 8:38pm

Thanks for the detailed explanation. I’m still pretty new to local AI and Jetson hardware, so a lot of this has been me experimenting, learning, and trying to figure out what actually works versus what should work on paper.

The Qwen 7B result surprised me. Before these tests I would have assumed a 7B model was pushing the limits of an Orin Nano 8GB, but it’s been stable and usable enough that I’m now using it as my baseline.

Your explanation also helped me understand why Granite H-Small and Mellum2 aren’t directly comparable to Qwen, even though the active parameter counts can make them look similar at first glance.

At this point I’m less focused on forcing Mellum2 to work and more interested in understanding where the practical limits of the hardware really are. I may still try a few of the smaller Granite variants and other models just for comparison and to gather more data.

Either way, I appreciate the time you took to write all of that up. I’ve learned quite a bit from this little experiment.
