---
base_model:
- ibm-granite/granite-4.1-8b
datasets:
- eaddario/imatrix-calibration
language:
- en
license:
- apache-2.0
pipeline_tag: text-generation
tags:
- gguf
- quant
- target_bpw
- experimental
---

# Experimental global target bits‑per‑weight quantization of [ibm-granite/granite-4.1-8b](https://huggingface.co/ibm-granite/granite-4.1-8b)

Using **non-standard** (forked) [LLaMA C++][llm] release [b9358][llm-rel] for quantization.

Original model: [ibm-granite/granite-4.1-8b][mdl]

From the original model creators:

> [![mof-class3-qualified](https://mot.isitopen.ai/modules/mof/assets/badge_class3_qualified.png)](https://mot.isitopen.ai/model/1160)
>
> # Granite-4.1-8B
>
> **Model Summary:**
> Granite-4.1-8B is a 8B parameter long-context instruct model finetuned from *Granite-4.1-8B-Base* using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. Granite 4.1 models have gone through an improved post-training pipeline, including supervised finetuning and reinforcement learning alignment, resulting in enhanced tool calling, instruction following, and chat capabilities.
>
> - **Developers:** Granite Team, IBM
> - **HF Collection:** [Granite 4.1 Language Models HF Collection](https://huggingface.co/collections/ibm-granite/granite-41-language-models)
> - **Technical Blog:** [Granite-4.1 Blog](https://huggingface.co/blog/ibm-granite/granite-4-1)
> - **GitHub Repository:** [ibm-granite/granite-4.1-language-models](https://github.com/ibm-granite/granite-4.1-language-models)
> - **Website**: [Granite Docs](https://www.ibm.com/granite/docs/)
> - **Release Date**: April 29th, 2026
> - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
>
> **Supported Languages:**
> English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.1 models for languages beyond these languages.
>
> **Intended use:**
> The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.

---

# ⚠️ PLEASE READ THIS BEFORE USING THESE EXPERIMENTAL VERSIONS! ⚠️

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but my focus has been primarily on quantization and pruning.

The method to produce these experimental versions involves using a custom version of [`llama-imatrix`][imx] to generate an imatrix that includes tensor statistics, and a custom version of [`llama-quantize`][qtz], which computes a per-tensor quantization error, to automatically select the lowest error quantization recipe that achieves a global target bits‑per‑weight (bpw). More details on the implementation and test results are available [here][bpw].

There are two pull requests ([#14891][imtx-pr] & [#15550][qtz-pr]) to merge these changes back into the core llama.cpp project. This may or may not ever happen, so until then the modified versions will be available on [GitHub][gh].

For testing and comparison, I use models produced by [Bartowski][btk] (see credits below) and [Unsloth][ust] ([Daniel and Michael Han][ust-ai] do some really interesting stuff!), but when they don't provide versions of the required model, tests and comparisons are against standard quantization obtained by simply running `llama-quantize` with no further optimizations.

All experimental versions were generated using an appropriate imatrix created from datasets available at [eaddario/imatrix-calibration][ical].
In `llama.cpp`, an imatrix is a calibration file derived from running representative text through the model and collecting activation statistics. It is used to weight quantization error so that error in more “important” directions (as estimated from activations) is penalized more heavily.

The process to generate these models is roughly as follows:

1. Convert the original model's [safetensors][sfts] to [GGUF][ggf] F16
2. Estimate the [Perplexity][ppl] score for the F16 model (baseline) using the [wikitext-2-raw-v1][wki-dat] dataset, and save the [logits][lgt]
3. Generate an [imatrix][imx-dat] from the most appropriate [calibration dataset][ical]
4. Quantize the baseline model targeting a bpw average (e.g. `llama-quantize --target-bpw 4.5678 --state-file --imatrix imatrix.gguf baseline-model-F16.gguf 12`)
5. Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), GPQA-Diamond, HellaSwag, MMLU-Redux, Truthful QA and WinoGrande scores for each quantized model
6. Keep the version with the best 𝜌PPL and μKLD scores
7. Repeat until all desired quants are created

### Misconceptions about BF16 to F16 Conversion

A common concern when converting BFloat16 ([BF16][bf16]) models to Float16 (F16) is the potential for accuracy loss. Specifically:

- Weight Clipping (Overflow): Clipping, or overflow, is often feared but only occurs if a model's weights exceed the range of ±65,504 (the largest finite F16 value). This is a relatively rare issue in practice.
- Subnormal Zeroing (Underflow): A more frequent occurrence is underflow, where weights with a magnitude smaller than approximately 5.96×10⁻⁸ (the smallest F16 subnormal) are converted to zero.

Crucially, when the F16 model is subsequently used for quantization, the resulting degradation in metrics like Perplexity ([PPL][ppl]) or Kullback–Leibler Divergence ([KLD][kld]) is minimal. Any variations are typically restricted to the hundredths or thousandths decimal places compared to the BF16 model.
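As a rough illustration of the overflow and underflow conditions described above, the following sketch counts how many values in a tensor would become infinite or zero when cast to F16. It is not the validation code used for this repository: the function name `f16_conversion_report` is hypothetical, and the weights are assumed to be already loaded as a float32 NumPy array (BF16 values upcast to float32 without loss).

```python
import numpy as np


def f16_conversion_report(weights: np.ndarray) -> dict:
    """Count values that overflow to inf or underflow to zero when cast to F16."""
    w = np.asarray(weights, dtype=np.float32)
    # Perform the actual cast; silence the overflow warning NumPy may emit.
    with np.errstate(over="ignore"):
        as_f16 = w.astype(np.float16)
    # Overflow: a finite source value that became +/-inf after the cast.
    overflow = np.isinf(as_f16) & np.isfinite(w)
    # Underflow: a non-zero source value that became zero after the cast.
    underflow = (as_f16 == 0) & (w != 0)
    return {
        "total": int(w.size),
        "overflow": int(overflow.sum()),
        "underflow": int(underflow.sum()),
    }


# Toy tensor: one value beyond the F16 range and one far below its resolution.
sample = np.array([0.5, -1.25, 7.0e4, 1.0e-9, 0.0], dtype=np.float32)
report = f16_conversion_report(sample)
print(report)
```

Running a check like this over every tensor before conversion is one way to confirm that no weight would be clipped; a non-zero `overflow` count would be the signal to keep the model in BF16.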
However, considering that weight clipping presents a more substantial risk to model integrity, every BF16 base model undergoes validation prior to the conversion process. Consequently, no models hosted in this repository exhibit performance degradation due to overflow clipping.

While BF16 offers a wider dynamic range, performance remains a key factor.

- Conversion Speed: Tests, such as timing `convert_hf_to_gguf.py`, show a notable performance difference, with conversion to BF16 being 15–30% slower than to F16.
- Inference Speed: A less pronounced but still present difference (3–6%) is observed during inference.

Although native BF16 support has been introduced by many chip manufacturers, the slower performance **may** stem from the entire software and hardware stack (firmware, libraries, etc.) not being fully optimized yet.

The choice to prioritize F16 over BF16 is driven by a focus on maximizing performance in specific deployment environments. My primary objective is not large-scale quantization production, a domain where others like [Bartowski][btk] and [Unsloth][ust] excel, but rather optimizing inference performance for resource-constrained environments. Since BF16 support is not yet widespread in areas like mobile, edge, and embedded devices, using F16 ensures broader compatibility and easier optimization for these use cases.

# Advantages and disadvantages of the global target bits‑per‑weight quantization process

### Advantages

1. **Target arbitrary size models**
   - When specifying `--target-bpw 4.5678` for instance, the algorithm will produce a model (nearly) exactly of that size, which is very useful for maximizing VRAM usage. In a system with 24GB VRAM and a 70B model, standard quants might produce a 16.8GB file (too small, quality left on the table) or a 24.1GB file (won't fit). This approach can generate a 23.85GB file to utilize the hardware fully.
2. **Data-driven mixed precision can often improve quality at a fixed size**
   - Instead of using hardcoded heuristics (e.g. make `attn_v` Q5_K for a 70B model) that may be sub‑optimal for a given architecture or size, the quantization mix is determined by the actual error sensitivity of the specific model's weights. In practice, this often yields a better quality/size trade-off, especially in aggressive quantization scenarios (1.5 to 3.5 bpw) or for unusual architectures.
   - **Please note**: `llama.cpp`’s heuristics have been tuned across many models and are highly optimized; although the target bpw method often produces better quality (>75% based on tests with 130 models from 11 different families), it can also lose in surprising cases.
3. **Allows better like-for-like comparisons between models and families**
   - Standard `llama.cpp` quantization uses hardcoded rules like: *"use Q4_K_M, except bump some tensors up/down, except fall back if incompatible, except keep some tensors unquantized..."* and, for that reason, two different models quantized with the same Q4_K_M type can end up with very different bpw (e.g. 4.75 and 4.30).
   - All things being equal, the performance of a model is usually proportional to its overall bpw size; models with a higher bpw tend to perform better than lower bpw models. A model that has simply been given more bits will typically perform better (lower perplexity, better eval scores, etc.) even if the underlying quantization method is identical. That means the comparison is not a controlled experiment, because it is between models with different effective compression ratios.
   - `--target-bpw` tries to address that by making the experiment more controlled: each model gets quantized to land on (approximately) the same global byte budget, so that the models' performance differences are more attributable to architecture/training differences, quantization error behaviour at the same compression ratio, the optimizer’s allocation decisions, etc.

### Disadvantages

1. **Quantization process is significantly slower than standard**
   - This approach can take 5x-10x longer as it quantizes a sample of most tensors into 15 different formats, dequantizes them back to floats, computes error diffs, and selects the best size/error option that fits the global bpw budget.
   - However, the `--state-file` option will save/use the above-mentioned computations so that future quantizations of the same model can be generated at normal speed. It also allows the computation to be interrupted and resumed at a later time.
2. **The optimization target is only a proxy for the model's performance quality**
   - The process minimizes a per-tensor estimated error computed from sampled rows, not the actual perplexity or divergence of output distributions (a future version may address this). Since errors interact nonlinearly across layers, there is no guarantee it will select the best possible quantization recipe subject to the bpw size constraint.
3. **An imatrix with activations data is required for best results**
   - Activation data is required to compute the bias factor (i.e. the systematic error projected onto activation directions). If the imatrix file does not contain activation data, the `--target-bpw` option will refuse to run.
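To make the idea of fitting a per-tensor recipe to a global bit budget more concrete, here is a minimal sketch. It is a simple greedy heuristic written for illustration only and is not the algorithm implemented in the modified `llama-quantize`; the names `TensorOptions`, `allocate`, and `file_size_gb`, as well as the tensor sizes and error values, are all made up. The size helper also ignores metadata and other file overhead.

```python
from dataclasses import dataclass


@dataclass
class TensorOptions:
    name: str
    n_weights: int
    # (bits per weight, estimated error) pairs, sorted by increasing bpw
    options: list


def file_size_gb(n_params: float, bpw: float) -> float:
    """Approximate size in GB (10^9 bytes) from parameter count and average bpw."""
    return n_params * bpw / 8 / 1e9


def allocate(tensors, target_bpw):
    """Greedily upgrade tensors while the global bit budget allows it."""
    total_weights = sum(t.n_weights for t in tensors)
    budget_bits = target_bpw * total_weights
    choice = {t.name: 0 for t in tensors}  # start at the smallest option
    used_bits = sum(t.options[0][0] * t.n_weights for t in tensors)
    if used_bits > budget_bits:
        raise ValueError("target bpw is below the smallest available recipe")
    while True:
        best = None
        for t in tensors:
            i = choice[t.name]
            if i + 1 >= len(t.options):
                continue  # already at the largest option
            (bpw0, err0), (bpw1, err1) = t.options[i], t.options[i + 1]
            extra = (bpw1 - bpw0) * t.n_weights
            if used_bits + extra > budget_bits:
                continue  # this upgrade would not fit
            gain = (err0 - err1) / extra  # error reduction per extra bit
            if best is None or gain > best[0]:
                best = (gain, t, extra)
        if best is None:
            break  # no remaining upgrade fits the budget
        _, t, extra = best
        choice[t.name] += 1
        used_bits += extra
    recipe = {t.name: t.options[choice[t.name]][0] for t in tensors}
    return recipe, used_bits / total_weights


# Hypothetical tensors with made-up error estimates.
tensors = [
    TensorOptions("attn_v", 1000, [(2.0, 9.0), (4.0, 3.0), (8.0, 0.5)]),
    TensorOptions("ffn_up", 3000, [(2.0, 6.0), (4.0, 4.0), (8.0, 3.5)]),
]
recipe, achieved = allocate(tensors, target_bpw=4.0)
print(recipe, achieved)
print(round(file_size_gb(8.79e9, 2.5), 2))
```

With these toy numbers the small but sensitive tensor receives the extra bits first, which is the behaviour the data-driven approach is meant to capture. A greedy pass like this one can stop short of the target when the next upgrade does not fit, so it should be read as a sketch of the trade-off rather than a way to reproduce the published recipes.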
---

# Models

### Bits per weight, size, perplexity and KL Divergence scores

| Model                                             |     BPW | Size (GB) |                μPPL |   𝜌PPL |               μKLD |    Same Top-P |
| ------------------------------------------------- | ------: | --------: | ------------------: | -----: | -----------------: | ------------: |
| [granite-4.1-8b-F16](./granite-4.1-8b-F16.gguf) | 16.0006 | 17.6 | 8.691178 ±0.065443 | 100% | N/A | N/A |
| [granite-4.1-8b-Q1_L](./granite-4.1-8b-Q1_L.gguf) | 1.7500 | 1.93 | 87.318832 ±0.781580 | 57.61% | 2.889523 ±0.005948 | 34.309 ±0.125 |
| [granite-4.1-8b-Q2_K](./granite-4.1-8b-Q2_K.gguf) | 2.5000 | 2.75 | 12.534216 ±0.095606 | 86.12% | 0.644965 ±0.002755 | 67.231 ±0.124 |
| [granite-4.1-8b-Q3_K](./granite-4.1-8b-Q3_K.gguf) | 3.5000 | 3.85 | 9.381594 ±0.070128 | 96.18% | 0.173887 ±0.001079 | 82.732 ±0.100 |
| [granite-4.1-8b-Q4_K](./granite-4.1-8b-Q4_K.gguf) | 4.4999 | 4.95 | 8.867438 ±0.067303 | 98.88% | 0.047917 ±0.000392 | 90.937 ±0.076 |
| [granite-4.1-8b-Q5_K](./granite-4.1-8b-Q5_K.gguf) | 5.4999 | 6.05 | 8.766150 ±0.066421 | 99.48% | 0.018940 ±0.000165 | 94.120 ±0.062 |
| [granite-4.1-8b-Q6_K](./granite-4.1-8b-Q6_K.gguf) | 6.4998 | 7.15 | 8.755199 ±0.066400 | 99.74% | 0.007326 ±0.000066 | 96.165 ±0.051 |
| [granite-4.1-8b-Q7_K](./granite-4.1-8b-Q7_K.gguf) | 7.4998 | 8.25 | 8.751241 ±0.066500 | 99.82% | 0.003568 ±0.000040 | 97.235 ±0.043 |
| [granite-4.1-8b-Q8_0](./granite-4.1-8b-Q8_0.gguf) | 8.4988 | 9.34 | 8.749119 ±0.066517 | 99.85% | 0.002052 ±0.000024 | 97.749 ±0.039 |

### ARC, GPQA-Diamond, HellaSwag, MMLU-Redux, Truthful QA, and WinoGrande scores

Scores generated using [llama-perplexity][ppl] with 750 tasks per test, and a context size of 1024 tokens.
For the test data used in the generation of these scores, follow the appropriate links: [ARC Challenge, Truthful QA][tst-dat], [GPQA-Diamond][gpqa-dat], [HellaSwag][hsw-tst], [MMLU-Redux][mrdx], [WinoGrande][wng-tst]

| Model                                             |   ARC Challenge |    GPQA-Diamond | HellaSwag |      MMLU-Redux |     Truthful QA |      WinoGrande | Avg Score |
| ------------------------------------------------- | --------------: | --------------: | --------: | --------------: | --------------: | --------------: | --------: |
| [granite-4.1-8b-Q1_L](./granite-4.1-8b-Q1_L.gguf) | 36.5333 ±1.7594 | 19.1919 ±2.8058 | 36.00 | 27.2000 ±1.6260 | 28.9333 ±1.6569 | 52.5333 ±1.8246 | 33.40 |
| [granite-4.1-8b-Q2_K](./granite-4.1-8b-Q2_K.gguf) | 60.4000 ±1.7870 | 29.7980 ±3.2586 | 70.00 | 59.2000 ±1.7958 | 33.4667 ±1.7242 | 65.2000 ±1.7405 | 53.01 |
| [granite-4.1-8b-Q3_K](./granite-4.1-8b-Q3_K.gguf) | 62.0000 ±1.7736 | 21.7172 ±2.9377 | 79.33 | 69.2000 ±1.6869 | 39.6000 ±1.7870 | 71.7333 ±1.6453 | 57.26 |
| [granite-4.1-8b-Q4_K](./granite-4.1-8b-Q4_K.gguf) | 66.9333 ±1.7190 | 23.2323 ±3.0089 | 79.73 | 71.4667 ±1.6500 | 38.9333 ±1.7816 | 73.4667 ±1.6132 | 58.96 |
| [granite-4.1-8b-Q5_K](./granite-4.1-8b-Q5_K.gguf) | 66.4000 ±1.7259 | 22.7273 ±2.9858 | 79.87 | 72.1333 ±1.6382 | 38.5333 ±1.7783 | 73.4667 ±1.6132 | 58.86 |
| [granite-4.1-8b-Q6_K](./granite-4.1-8b-Q6_K.gguf) | 67.0667 ±1.7172 | 24.7475 ±3.0746 | 80.13 | 72.6667 ±1.6284 | 38.2667 ±1.7759 | 73.7333 ±1.6080 | 59.44 |
| [granite-4.1-8b-Q7_K](./granite-4.1-8b-Q7_K.gguf) | 66.4000 ±1.7259 | 26.7677 ±3.1544 | 80.27 | 72.1333 ±1.6382 | 38.5333 ±1.7783 | 73.6000 ±1.6106 | 59.62 |
| [granite-4.1-8b-Q8_0](./granite-4.1-8b-Q8_0.gguf) | 66.8000 ±1.7207 | 26.7677 ±3.1544 | 80.53 | 72.4000 ±1.6334 | 38.4000 ±1.7771 | 73.2000 ±1.6184 | 59.68 |

### Tokens per second benchmarks

Scores generated using [llama-bench][bch]. Standard (`llama-quantize` with no optimization) Q4_K_M quantization included for comparison.
| model                                             |     size | params | backend  | threads |          test |           t/s |
| ------------------------------------------------- | -------: | -----: | -------- | ------: | ------------: | ------------: |
| [granite-4.1-8b-Q1_L](./granite-4.1-8b-Q1_L.gguf) | 1.79 GiB | 8.79 B | BLAS,MTL | 12 | pp512 | 783.11 ±0.52 |
| [granite-4.1-8b-Q1_L](./granite-4.1-8b-Q1_L.gguf) | 1.79 GiB | 8.79 B | BLAS,MTL | 12 | tg128 | 68.68 ±0.17 |
| [granite-4.1-8b-Q1_L](./granite-4.1-8b-Q1_L.gguf) | 1.79 GiB | 8.79 B | BLAS,MTL | 12 | pp1024+tg1024 | 108.35 ±1.28 |
| [granite-4.1-8b-Q2_K](./granite-4.1-8b-Q2_K.gguf) | 2.56 GiB | 8.79 B | BLAS,MTL | 12 | pp512 | 728.97 ±10.22 |
| [granite-4.1-8b-Q2_K](./granite-4.1-8b-Q2_K.gguf) | 2.56 GiB | 8.79 B | BLAS,MTL | 12 | tg128 | 68.76 ±0.21 |
| [granite-4.1-8b-Q2_K](./granite-4.1-8b-Q2_K.gguf) | 2.56 GiB | 8.79 B | BLAS,MTL | 12 | pp1024+tg1024 | 108.98 ±0.24 |
| [granite-4.1-8b-Q3_K](./granite-4.1-8b-Q3_K.gguf) | 3.58 GiB | 8.79 B | BLAS,MTL | 12 | pp512 | 733.45 ±9.51 |
| [granite-4.1-8b-Q3_K](./granite-4.1-8b-Q3_K.gguf) | 3.58 GiB | 8.79 B | BLAS,MTL | 12 | tg128 | 63.63 ±1.20 |
| [granite-4.1-8b-Q3_K](./granite-4.1-8b-Q3_K.gguf) | 3.58 GiB | 8.79 B | BLAS,MTL | 12 | pp1024+tg1024 | 94.51 ±1.15 |
| [granite-4.1-8b-Q4_K](./granite-4.1-8b-Q4_K.gguf) | 4.61 GiB | 8.79 B | BLAS,MTL | 12 | pp512 | 771.63 ±0.97 |
| [granite-4.1-8b-Q4_K](./granite-4.1-8b-Q4_K.gguf) | 4.61 GiB | 8.79 B | BLAS,MTL | 12 | tg128 | 66.33 ±1.24 |
| [granite-4.1-8b-Q4_K](./granite-4.1-8b-Q4_K.gguf) | 4.61 GiB | 8.79 B | BLAS,MTL | 12 | pp1024+tg1024 | 105.98 ±4.76 |
| [granite-4.1-8b-Q5_K](./granite-4.1-8b-Q5_K.gguf) | 5.63 GiB | 8.79 B | BLAS,MTL | 12 | pp512 | 673.26 ±34.19 |
| [granite-4.1-8b-Q5_K](./granite-4.1-8b-Q5_K.gguf) | 5.63 GiB | 8.79 B | BLAS,MTL | 12 | tg128 | 51.29 ±3.09 |
| [granite-4.1-8b-Q5_K](./granite-4.1-8b-Q5_K.gguf) | 5.63 GiB | 8.79 B | BLAS,MTL | 12 | pp1024+tg1024 | 83.45 ±2.31 |
| [granite-4.1-8b-Q6_K](./granite-4.1-8b-Q6_K.gguf) | 6.65 GiB | 8.79 B | BLAS,MTL | 12 | pp512 | 703.41 ±23.92 |
| [granite-4.1-8b-Q6_K](./granite-4.1-8b-Q6_K.gguf) | 6.65 GiB | 8.79 B | BLAS,MTL | 12 | tg128 | 52.12 ±1.38 |
| [granite-4.1-8b-Q6_K](./granite-4.1-8b-Q6_K.gguf) | 6.65 GiB | 8.79 B | BLAS,MTL | 12 | pp1024+tg1024 | 87.04 ±0.22 |
| [granite-4.1-8b-Q7_K](./granite-4.1-8b-Q7_K.gguf) | 7.68 GiB | 8.79 B | BLAS,MTL | 12 | pp512 | 614.53 ±0.48 |
| [granite-4.1-8b-Q7_K](./granite-4.1-8b-Q7_K.gguf) | 7.68 GiB | 8.79 B | BLAS,MTL | 12 | tg128 | 49.47 ±0.59 |
| [granite-4.1-8b-Q7_K](./granite-4.1-8b-Q7_K.gguf) | 7.68 GiB | 8.79 B | BLAS,MTL | 12 | pp1024+tg1024 | 83.45 ±0.24 |
| [granite-4.1-8b-Q8_0](./granite-4.1-8b-Q8_0.gguf) | 8.70 GiB | 8.79 B | BLAS,MTL | 12 | pp512 | 800.32 ±0.73 |
| [granite-4.1-8b-Q8_0](./granite-4.1-8b-Q8_0.gguf) | 8.70 GiB | 8.79 B | BLAS,MTL | 12 | tg128 | 46.66 ±0.04 |
| [granite-4.1-8b-Q8_0](./granite-4.1-8b-Q8_0.gguf) | 8.70 GiB | 8.79 B | BLAS,MTL | 12 | pp1024+tg1024 | 77.87 ±0.30 |

# Metrics used

**[Perplexity][ppx]:** one of the key metrics used in NLP evaluation. It measures the quality of a language model by evaluating how well it predicts the next token given a particular sequence of words. A PPL of **1** indicates an exact match between predicted and actual tokens, whereas values greater than one indicate a degree of "surprise", i.e. the generated token differs from the expected one.

**[Kullback–Leibler (KL) Divergence][kld]:** a statistical measure of how much one probability distribution differs from another. When quantizing models (or altering the original tensors in any way, for that matter), the closer the quantized model's output probability distribution stays to that of the original model the better; thus, the closer to **0** the better.

**[AI2 Reasoning Challenge (ARC)][arc]:** a benchmark to evaluate the ability of AI models to answer complex science questions that require logical reasoning beyond pattern matching.
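As a rough illustration of the Perplexity and KL Divergence definitions above, the sketch below computes both from small probability arrays. It is not how `llama-perplexity` is implemented; the function names are hypothetical, and the inputs are assumed to be strictly positive probabilities (the probability assigned to each correct token for perplexity, and full per-token distributions for the divergence).

```python
import numpy as np


def perplexity(token_probs) -> float:
    """exp of the mean negative log-probability assigned to the correct tokens."""
    p = np.asarray(token_probs, dtype=np.float64)
    return float(np.exp(-np.mean(np.log(p))))


def mean_kl_divergence(base_probs, quant_probs) -> float:
    """Mean over tokens of KL(base || quant); rows are per-token distributions."""
    p = np.asarray(base_probs, dtype=np.float64)
    q = np.asarray(quant_probs, dtype=np.float64)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))


# Toy data: two tokens, vocabulary of two entries.
base = [[0.5, 0.5], [0.9, 0.1]]
quant = [[0.25, 0.75], [0.8, 0.2]]

print(perplexity([1.0, 1.0]))           # a perfect predictor
print(perplexity([0.25, 0.25]))         # as uncertain as a uniform 4-way guess
print(mean_kl_divergence(base, base))   # identical distributions
print(mean_kl_divergence(base, quant))  # diverging distributions
```

A perfect predictor yields a perplexity of 1, and identical distributions yield a divergence of 0, which matches the "closer to 1" and "closer to 0" readings given in the definitions.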
**[GPQA-Diamond][gpqa]:** the 198-question "diamond" subset of GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.

**[HellaSwag][hsw]:** the Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations (bit of a mouthful!) is a benchmark designed to test commonsense natural language inference. It requires the model to predict the most likely ending of a sentence.

**[MMLU][mmlu]:** the Massive Multitask Language Understanding benchmark evaluates LLMs’ general knowledge and problem-solving abilities across 57 subjects, including elementary mathematics, US history, computer science, and law.

**[Truthful QA][tqa]:** evaluates how well LLMs generate truthful responses to questions. It identifies whether AI models can avoid generating false or misleading information, particularly in areas where human knowledge is prone to misconceptions.

**[WinoGrande][wng]:** based on the [Winograd Schema Challenge][wng-chl], is a natural language understanding task requiring models to resolve ambiguities in sentences involving pronoun references.

## Credits

[LLaMA C++][llm] has a large and vibrant community of [contributors][llm-ctt] (~1,600 last time I checked) that actively maintain and extend its functionality, adding new models and architectures almost as fast as they appear. Considering the breakneck speed at which the AI/ML field is advancing, this alone is a remarkable feat!

While I'm grateful to all contributors, I want to recognise three in particular:

* [Colin Kealty][btk] (Bartowski), for the many contributions and for being one of the best sources of high quality quantized models available on Hugging Face
* [Georgi Gerganov][ggg] for his amazing work with **llama.cpp** and the **ggml/gguf** libraries
* [Iwan Kawrakow][ikk] for being one of the key authors behind the many quantization algorithms and the imatrix functionality.

[arc]: https://llm-stats.com/benchmarks/ai2-reasoning-challenge-(arc)
[base]: https://huggingface.co/ibm-granite/granite-4.1-8b
[b-q4km]: https://huggingface.co/bartowski
[bch]: https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench
[bf16]: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
[bpw]: https://github.com/ggml-org/llama.cpp/discussions/18531
[btk]: https://huggingface.co/bartowski
[ggf]: https://huggingface.co/docs/hub/en/gguf
[ggg]: https://github.com/ggerganov
[gh]: https://github.com/EAddario/llama.cpp/tree/master
[gpqa]: https://arxiv.org/abs/2311.12022
[gpqa-dat]: https://huggingface.co/datasets/eaddario/benchmark
[hsw-tst]: https://github.com/klosax/hellaswag_text_data
[hsw]: https://rowanzellers.com/hellaswag
[ical]: https://huggingface.co/datasets/eaddario/imatrix-calibration
[ikk]: https://github.com/ikawrakow
[imtx-pr]: https://github.com/ggml-org/llama.cpp/pull/14891
[imx-dat]: https://huggingface.co/eaddario/granite-4.1-8b-GGUF/tree/main/imatrix
[imx]: https://github.com/EAddario/llama.cpp/tree/imatrix
[kld]: https://en.wikipedia.org/wiki/Kullback–Leibler_divergence
[lgt]: https://huggingface.co/eaddario/granite-4.1-8b-GGUF/tree/main/logits
[llm-ctt]: https://github.com/ggml-org/llama.cpp/graphs/contributors
[llm-rel]: https://github.com/ggml-org/llama.cpp/releases/tag/b9358
[llm]: https://github.com/ggerganov/llama.cpp
[mdl]: https://huggingface.co/ibm-granite/granite-4.1-8b
[mmlu]: https://en.wikipedia.org/wiki/MMLU
[mrdx]: https://huggingface.co/datasets/Green-Sky/mmlu-redux-2.0-for-llama.cpp
[ppl]: https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity
[ppx]: https://huggingface.co/docs/transformers/en/perplexity
[qtz-pr]: https://github.com/ggml-org/llama.cpp/pull/15550
[qtz]: https://github.com/EAddario/llama.cpp/tree/quantize
[sfts]: https://huggingface.co/docs/safetensors/en/index
[tqa]: https://github.com/sylinrl/TruthfulQA
[tst-dat]: https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/tree/main
[u-q4km]: https://huggingface.co/unsloth
[ust-ai]: https://unsloth.ai
[ust]: https://huggingface.co/unsloth
[wki-dat]: https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1
[wng-chl]: https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf
[wng-tst]: https://huggingface.co/datasets/ikawrakow/winogrande-eval-for-llama.cpp/tree/main
[wng]: https://winogrande.allenai.org