Hello, I am fine-tuning an SLM for an AI festival. I am aiming to make the model stronger in a specific language. Unfortunately, there are not many fine-tuning-ready datasets for the language I’m aiming for, and because of my hardware limitations and internet restrictions, I cannot continue pretraining the model. I wanted to ask two things: Can I use LoRA to simulate continued pretraining? And how can I build a QA dataset from raw Wikipedia dumps? I should mention that the model I want to improve can already produce simple outputs in the target language, but it struggles with longer, more complex generations. I tried using LoRA to continue pretraining the model, but it only made the model worse at the target language. I also tried a QA dataset in the target language, and again got bad results.
Hi @Bidram — nice project. I’ll take your two questions in order, plus the “longer, more complex outputs” goal underneath them, since they connect.
1. LoRA to simulate continued pretraining — yes, with honest caveats.
You can run continued pretraining (plain next-token prediction on raw text) through a LoRA adapter, not just instruction tuning. Be realistic about one thing: a low-rank adapter has limited room to add genuinely new knowledge or a new language — it’s much better at strengthening what the base already half-knows. You said the model already produces simple text in the language, so you’re in the good case: amplifying an existing ability, not teaching from zero.
To get the most from it:
- Use a higher rank than the SFT default (try 64–128) and apply LoRA to all linear layers (q, k, v, o and the MLP gate/up/down), not just q/v.
- Check tokens-per-word for your language. If the tokenizer shreds it into single bytes, that caps quality more than anything else, and the real fix is extending the tokenizer and training the new embeddings — heavier, but it’s often the actual bottleneck for low-resource languages.
- Given your hardware, do it as QLoRA (4-bit base + LoRA), short sequences packed together, small batch with gradient accumulation.
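The tokens-per-word check above can be sketched in a few lines. This is a minimal sketch, not a standard metric implementation: `tokens_per_word` and `byte_level_tokenize` are hypothetical helper names, words are split on whitespace only, and the byte-level function is just a worst-case stand-in so the snippet runs without downloading anything.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        words = text.split()
        if not words:
            continue  # skip empty lines so they do not distort the ratio
        n_words += len(words)
        n_tokens += len(tokenize(text))
    return n_tokens / n_words if n_words else 0.0


def byte_level_tokenize(text: str) -> List[int]:
    """Worst-case stand-in (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    sample = ["a short sample sentence", "another line of text"]
    print(f"tokens per word: {tokens_per_word(sample, byte_level_tokenize):.2f}")
```

With a real Hugging Face tokenizer you would pass something like `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]` as `tokenize` and run it over a few thousand sentences of your language, then compare the number against the same tokenizer on English text.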
A sequence that works well: first, LoRA continued-pretraining on raw target-language text to build fluency and length; then a smaller QA/SFT pass for task format. Fluency comes from the first stage — the QA set mostly teaches answer shape.
2. A QA set from Wikipedia dumps — pick the route that fits your constraints.
- Source: skip the raw XML and pull the cleaned per-language parquet from the wikimedia/wikipedia dataset on the Hub. You download just your language once, which helps if bandwidth is restricted.
- With no outside API and limited compute (sounds like your situation), a fully local, templated approach goes surprisingly far: split each article into passages and build extractive QA from structure. A lead sentence (“X is a …”) → “What is X?”; section headings → a question about that section; dated or numbered facts → fill-in-the-blank. Cheap, no model needed, and every answer stays grounded in the text.
- If you can run even a small instruct model locally, have it draft question/answer pairs from each passage (give it the passage, ask for a few questions answerable only from that text), then drop any pair whose answer isn’t actually in the passage. Generate in the target language if the model can — English-then-translate loses quality.
- Either route: dedupe, discard unanswerable or garbled pairs, and store as {question, optional context, answer}.
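The templated route plus the "drop ungrounded pairs, then dedupe" step could look roughly like this. It is a sketch under stated assumptions: the regex and the question template are English placeholders (a real version needs patterns and templates written for the target language), and `lead_sentence_qa`, `is_grounded`, `dedupe` and `build_qa` are hypothetical names, not part of any library.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical English pattern for a lead sentence such as "X is a ...".
# For another language, swap in that language's copula and word order.
LEAD_PATTERN = re.compile(r"^(?P<subject>[^.]{1,80}?) (?:is|was) (?P<definition>[^.]+)\.")


def lead_sentence_qa(title: str, passage: str) -> Optional[Dict[str, str]]:
    """Turn a definitional lead sentence into one extractive QA pair."""
    match = LEAD_PATTERN.match(passage.strip())
    if match is None:
        return None
    return {
        "question": f"What is {title}?",
        "context": passage,
        "answer": match.group("definition").strip(),
    }


def is_grounded(pair: Dict[str, str]) -> bool:
    """Keep a pair only if its answer appears verbatim in its context."""
    return bool(pair["answer"]) and pair["answer"] in pair["context"]


def dedupe(pairs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop repeated (question, answer) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair["question"].strip().lower(), pair["answer"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def build_qa(articles: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Run the template over articles shaped like {"title": ..., "text": ...}."""
    pairs = []
    for article in articles:
        pair = lead_sentence_qa(article["title"], article["text"])
        if pair is not None and is_grounded(pair):
            pairs.append(pair)
    return dedupe(pairs)
```

The same `is_grounded` filter works for pairs drafted by a local instruct model: anything whose answer string is not found in the passage gets discarded before it reaches the training set.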
On the “struggles with longer, complex outputs” symptom specifically: that’s usually a fluency and coverage gap, so I’d put more effort into the raw-text continued-pretraining stage than into the QA set — the QA mostly fixes format, not depth.
Good luck at the festival — happy to go deeper on any of these if you share the language and the base model you’re starting from.
Thank you very much for your help! My target language is Persian (Iranian), and for the base model I used Qwen 3 0.6B Instruct. I chose this model because I am targeting low-end devices, such as phones and school computers.
My main limitation is hardware. I have an RTX 3050 Mobile with 4 GB of VRAM, 32 GB of RAM, and an Intel i5-11400H.
If you have time, I have some more questions:
I tried to continue pretraining through a LoRA adapter using the Unsloth library in Python with these LoRA adapter settings:
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)
```
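To get a feel for what the rank setting buys, here is a rough count of trainable LoRA parameters for these seven target modules. The layer shapes are my assumption about the Qwen3 0.6B architecture (hidden size 1024, 28 layers, 2048-wide query projection, 1024-wide key/value projections, 3072-wide MLP), and `lora_trainable_params` is a hypothetical helper. Each adapted linear layer adds `r * (in_features + out_features)` weights.

```python
from typing import Dict, Tuple

# Assumed (in_features, out_features) per linear layer in one decoder block.
ASSUMED_LINEAR_SHAPES: Dict[str, Tuple[int, int]] = {
    "q_proj": (1024, 2048),
    "k_proj": (1024, 1024),
    "v_proj": (1024, 1024),
    "o_proj": (2048, 1024),
    "gate_proj": (1024, 3072),
    "up_proj": (1024, 3072),
    "down_proj": (3072, 1024),
}
ASSUMED_NUM_LAYERS = 28


def lora_trainable_params(
    rank: int,
    shapes: Dict[str, Tuple[int, int]] = ASSUMED_LINEAR_SHAPES,
    num_layers: int = ASSUMED_NUM_LAYERS,
) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


if __name__ == "__main__":
    for rank in (8, 64):
        print(f"r={rank}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions the count for rank 64 agrees with the "Trainable parameters = 40,370,176" line in the Unsloth log further down the thread, and rank 8 gives one eighth of that.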
I used a .jsonl file containing semi-cleaned raw Wikipedia text in the target language, but the results were not very good and were weaker than the original model. Unfortunately, I do not have the training logs, but as I remember, the model started with a loss of around 2.3, and by the end of training it reached about 1.5.
I used an 800 MB cleaned Persian Wikipedia .jsonl file to test whether the model would improve or not, but the results were negative, at least in my setup.
For fine-tuning, I used two datasets:
Persian-Wiki-QA (a Persian Wikipedia QA dataset)
Alpaca-Persian-Cleaned (a cleaned Persian translation of Alpaca, translated using Google Translate)
This is the training script I used to train the model:
```python
import os

from datasets import load_dataset
from unsloth import FastLanguageModel
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

MODEL_PATH = r"F:\Deep Learning\Model\Qwen 3 0.6B SafeTensor"
DATA_PATH = r"E:\Ava1\Dataset-pars\Wikipedia-Persian\Wikipedia-Persian-textonly.jsonl"
OUTPUT_DIR = r"F:\Deep Learning\Outputs\qwen3-0.6b-persian-pretrain-lora"
MAX_SEQ_LENGTH = 256
LOAD_IN_4BIT = True


def get_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-* folder, or None."""
    if not os.path.exists(output_dir):
        return None
    checkpoints = []
    for name in os.listdir(output_dir):
        path = os.path.join(output_dir, name)
        if os.path.isdir(path) and name.startswith("checkpoint-"):
            try:
                step = int(name.split("-")[-1])
                checkpoints.append((step, path))
            except ValueError:
                # Folder name does not end in a step number; skip it.
                pass
    if not checkpoints:
        return None
    checkpoints.sort(key=lambda x: x[0])
    return checkpoints[-1][1]


def main():
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    # Load the 4-bit quantized base model and its tokenizer.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_PATH,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=None,
        load_in_4bit=LOAD_IN_4BIT,
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Attach LoRA adapters to all attention and MLP projections.
    model = FastLanguageModel.get_peft_model(
        model,
        r=8,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
        use_rslora=False,
        loftq_config=None,
    )

    dataset = load_dataset("json", data_files=DATA_PATH, split="train")

    def clean_example(example):
        text = example.get("Text", "")
        if text is None:
            text = ""
        return {"text": str(text).strip()}

    dataset = dataset.map(
        clean_example,
        remove_columns=dataset.column_names,
        load_from_cache_file=False,
    )
    # Drop very short articles.
    dataset = dataset.filter(
        lambda x: len(x["text"]) > 50,
        load_from_cache_file=False,
    )

    eos_token = tokenizer.eos_token or ""

    def add_eos(example):
        return {"text": example["text"] + eos_token}

    dataset = dataset.map(add_eos, load_from_cache_file=False)
    # Only the first 2000 articles are used in this run.
    dataset = dataset.select(range(min(2000, len(dataset))))

    def tokenize_function(examples):
        # Each article is truncated to MAX_SEQ_LENGTH tokens; the rest is discarded.
        outputs = tokenizer(
            examples["text"],
            truncation=True,
            max_length=MAX_SEQ_LENGTH,
            padding=False,
        )
        outputs["labels"] = outputs["input_ids"].copy()
        return outputs

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        batch_size=1000,
        remove_columns=["text"],
        load_from_cache_file=False,
    )

    latest_checkpoint = get_latest_checkpoint(OUTPUT_DIR)
    if latest_checkpoint:
        print(f"Resuming from checkpoint: {latest_checkpoint}")
    else:
        print("No checkpoint found. Starting fresh.")

    # Causal LM collator (mlm=False): pads each batch and builds the labels.
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )

    trainer = Trainer(
        model=model,
        train_dataset=tokenized_dataset,
        data_collator=data_collator,
        args=TrainingArguments(
            output_dir=OUTPUT_DIR,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=32,
            warmup_steps=20,
            max_steps=100,
            learning_rate=1e-4,
            fp16=True,
            bf16=False,
            logging_steps=5,
            optim="adamw_8bit",
            weight_decay=0.01,
            lr_scheduler_type="cosine",
            save_steps=50,
            save_total_limit=2,
            report_to="none",
            seed=3407,
            remove_unused_columns=False,
            dataloader_num_workers=0,
        ),
    )
    trainer.train(resume_from_checkpoint=latest_checkpoint)

    model.save_pretrained(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    print(f"Done. Saved to: {OUTPUT_DIR}")


if __name__ == "__main__":
    main()
```
Is my Persian data too small? I have another ~15 GB of semi-clean .txt Persian Wikipedia text that could help.
I thought about translating good English datasets using a local 1.8B translator (it works well, but it is not amazing) and then fine-tuning a small SLM to make the dataset more natural, but you said “English-then-translate loses quality”.
I wrote a new training script with a LoRA adapter rank of 64 and an alpha of 64. It's slower now, but I'm letting it train to see what comes of it. These are the logs:
[transformers] ==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\\ /| Num examples = 1,109,504 | Num Epochs = 1 | Total steps = 34,672
O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 8
\ / Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
"-____-" Trainable parameters = 40,370,176 of 636,420,096 (6.34% trained)
0%| | 0/34672 [00:00<?, ?it/s][transformers] `use_return_dict` is deprecated! Use `return_dict` instead!
{'loss': '2.918', 'grad_norm': '1.558', 'learning_rate': '9e-06', 'epoch': '0.0002884'}
{'loss': '2.773', 'grad_norm': '1.016', 'learning_rate': '1.9e-05', 'epoch': '0.0005768'}
{'loss': '2.621', 'grad_norm': '0.7811', 'learning_rate': '2.9e-05', 'epoch': '0.0008653'}
{'loss': '2.546', 'grad_norm': '0.7259', 'learning_rate': '3.9e-05', 'epoch': '0.001154'}
{'loss': '2.322', 'grad_norm': '0.6498', 'learning_rate': '4.9e-05', 'epoch': '0.001442'}
{'loss': '2.378', 'grad_norm': '0.6687', 'learning_rate': '5.9e-05', 'epoch': '0.001731'}
{'loss': '2.293', 'grad_norm': '0.7748', 'learning_rate': '6.9e-05', 'epoch': '0.002019'}
{'loss': '2.309', 'grad_norm': '0.7728', 'learning_rate': '7.9e-05', 'epoch': '0.002307'}
{'loss': '2.214', 'grad_norm': '0.7984', 'learning_rate': '8.9e-05', 'epoch': '0.002596'}
{'loss': '2.203', 'grad_norm': '0.7599', 'learning_rate': '9.9e-05', 'epoch': '0.002884'}
{'loss': '2.227', 'grad_norm': '0.8078', 'learning_rate': '0.0001', 'epoch': '0.003173'}
{'loss': '2.179', 'grad_norm': '0.7982', 'learning_rate': '0.0001', 'epoch': '0.003461'}
{'loss': '2.155', 'grad_norm': '0.7713', 'learning_rate': '0.0001', 'epoch': '0.003749'}
{'loss': '2.102', 'grad_norm': '0.7508', 'learning_rate': '0.0001', 'epoch': '0.004038'}
{'loss': '2.094', 'grad_norm': '0.7127', 'learning_rate': '0.0001', 'epoch': '0.004326'}
{'loss': '2.146', 'grad_norm': '0.7476', 'learning_rate': '0.0001', 'epoch': '0.004615'}
{'loss': '2.163', 'grad_norm': '0.7814', 'learning_rate': '0.0001', 'epoch': '0.004903'}
{'loss': '2.079', 'grad_norm': '0.7486', 'learning_rate': '0.0001', 'epoch': '0.005192'}
{'loss': '2.142', 'grad_norm': '0.7015', 'learning_rate': '0.0001', 'epoch': '0.00548'}
{'loss': '2.083', 'grad_norm': '0.7789', 'learning_rate': '0.0001', 'epoch': '0.005768'}
{'loss': '2.085', 'grad_norm': '0.7249', 'learning_rate': '0.0001', 'epoch': '0.006057'}
{'loss': '2.03', 'grad_norm': '0.7872', 'learning_rate': '0.0001', 'epoch': '0.006345'}
{'loss': '2.085', 'grad_norm': '0.7494', 'learning_rate': '0.0001', 'epoch': '0.006634'}
{'loss': '2.087', 'grad_norm': '0.7177', 'learning_rate': '0.0001', 'epoch': '0.006922'}
{'loss': '2.073', 'grad_norm': '0.6864', 'learning_rate': '0.0001', 'epoch': '0.00721'}
{'loss': '2.007', 'grad_norm': '0.7131', 'learning_rate': '9.999e-05', 'epoch': '0.007499'}
{'loss': '2.074', 'grad_norm': '0.7246', 'learning_rate': '9.999e-05', 'epoch': '0.007787'}
{'loss': '2.011', 'grad_norm': '0.6955', 'learning_rate': '9.999e-05', 'epoch': '0.008076'}
{'loss': '1.99', 'grad_norm': '0.666', 'learning_rate': '9.999e-05', 'epoch': '0.008364'}
{'loss': '2.01', 'grad_norm': '0.655', 'learning_rate': '9.999e-05', 'epoch': '0.008653'}
{'loss': '2.032', 'grad_norm': '0.6227', 'learning_rate': '9.999e-05', 'epoch': '0.008941'}
{'loss': '2.003', 'grad_norm': '0.7089', 'learning_rate': '9.999e-05', 'epoch': '0.009229'}
{'loss': '2.074', 'grad_norm': '0.6487', 'learning_rate': '9.999e-05', 'epoch': '0.009518'}
{'loss': '2.073', 'grad_norm': '0.6404', 'learning_rate': '9.999e-05', 'epoch': '0.009806'}
{'loss': '2.052', 'grad_norm': '0.6166', 'learning_rate': '9.999e-05', 'epoch': '0.01009'}
{'loss': '2.011', 'grad_norm': '0.6447', 'learning_rate': '9.999e-05', 'epoch': '0.01038'}
{'loss': '1.962', 'grad_norm': '0.6773', 'learning_rate': '9.999e-05', 'epoch': '0.01067'}
{'loss': '1.97', 'grad_norm': '0.6069', 'learning_rate': '9.998e-05', 'epoch': '0.01096'}
{'loss': '1.982', 'grad_norm': '0.6222', 'learning_rate': '9.998e-05', 'epoch': '0.01125'}
{'loss': '2.021', 'grad_norm': '0.6198', 'learning_rate': '9.998e-05', 'epoch': '0.01154'}
{'loss': '1.985', 'grad_norm': '0.629', 'learning_rate': '9.998e-05', 'epoch': '0.01183'}
{'loss': '1.959', 'grad_norm': '0.6498', 'learning_rate': '9.998e-05', 'epoch': '0.01211'}
{'loss': '2.03', 'grad_norm': '0.6532', 'learning_rate': '9.998e-05', 'epoch': '0.0124'}
{'loss': '1.933', 'grad_norm': '0.6459', 'learning_rate': '9.998e-05', 'epoch': '0.01269'}
{'loss': '1.985', 'grad_norm': '0.6385', 'learning_rate': '9.997e-05', 'epoch': '0.01298'}
{'loss': '1.995', 'grad_norm': '0.6946', 'learning_rate': '9.997e-05', 'epoch': '0.01327'}
{'loss': '1.957', 'grad_norm': '0.6225', 'learning_rate': '9.997e-05', 'epoch': '0.01356'}
{'loss': '1.969', 'grad_norm': '0.5927', 'learning_rate': '9.997e-05', 'epoch': '0.01384'}
{'loss': '1.993', 'grad_norm': '0.6287', 'learning_rate': '9.997e-05', 'epoch': '0.01413'}
{'loss': '1.964', 'grad_norm': '0.6393', 'learning_rate': '9.997e-05', 'epoch': '0.01442'}
{'loss': '1.954', 'grad_norm': '0.6973', 'learning_rate': '9.997e-05', 'epoch': '0.01471'}
{'loss': '1.981', 'grad_norm': '0.5865', 'learning_rate': '9.996e-05', 'epoch': '0.015'}
{'loss': '1.903', 'grad_norm': '0.6059', 'learning_rate': '9.996e-05', 'epoch': '0.01529'}
{'loss': '1.972', 'grad_norm': '0.6095', 'learning_rate': '9.996e-05', 'epoch': '0.01557'}
{'loss': '1.939', 'grad_norm': '0.6175', 'learning_rate': '9.996e-05', 'epoch': '0.01586'}
{'loss': '1.919', 'grad_norm': '0.5895', 'learning_rate': '9.996e-05', 'epoch': '0.01615'}
{'loss': '1.915', 'grad_norm': '0.5853', 'learning_rate': '9.995e-05', 'epoch': '0.01644'}
{'loss': '1.883', 'grad_norm': '0.6089', 'learning_rate': '9.995e-05', 'epoch': '0.01673'}
{'loss': '1.935', 'grad_norm': '0.6074', 'learning_rate': '9.995e-05', 'epoch': '0.01702'}
{'loss': '1.931', 'grad_norm': '0.6369', 'learning_rate': '9.995e-05', 'epoch': '0.01731'}
{'loss': '1.896', 'grad_norm': '0.6057', 'learning_rate': '9.995e-05', 'epoch': '0.01759'}
{'loss': '1.944', 'grad_norm': '0.5966', 'learning_rate': '9.994e-05', 'epoch': '0.01788'}
{'loss': '1.888', 'grad_norm': '0.6135', 'learning_rate': '9.994e-05', 'epoch': '0.01817'}
{'loss': '2.014', 'grad_norm': '0.6026', 'learning_rate': '9.994e-05', 'epoch': '0.01846'}
{'loss': '1.884', 'grad_norm': '0.6325', 'learning_rate': '9.994e-05', 'epoch': '0.01875'}
{'loss': '1.881', 'grad_norm': '0.6144', 'learning_rate': '9.994e-05', 'epoch': '0.01904'}
{'loss': '1.927', 'grad_norm': '0.616', 'learning_rate': '9.993e-05', 'epoch': '0.01932'}
{'loss': '1.924', 'grad_norm': '0.6267', 'learning_rate': '9.993e-05', 'epoch': '0.01961'}
2%|█▍ | 683/34672 [53:56<42:15:31, 4.48s/it]
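Two quick numbers help when reading logs like these: how many tokens a run actually processes, and what a loss value means as perplexity. The sketch below uses only settings that appear in the two scripts in this thread (steps, batch size, gradient accumulation, 256-token sequences). `tokens_seen` and `perplexity` are hypothetical helper names, and the perplexity conversion assumes the logged loss is the mean natural-log cross-entropy per token.

```python
import math


def tokens_seen(steps: int, per_device_batch: int, grad_accum: int,
                seq_len: int, num_gpus: int = 1) -> int:
    """Upper bound on tokens processed: steps x sequences per step x tokens per sequence."""
    return steps * per_device_batch * grad_accum * num_gpus * seq_len


def perplexity(loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(loss)


# First script: max_steps=100, batch 1, accumulation 32, sequences truncated to 256.
first_run = tokens_seen(steps=100, per_device_batch=1, grad_accum=32, seq_len=256)

# Second script: 34,672 steps, batch 4, accumulation 8, packed 256-token blocks.
packed_run = tokens_seen(steps=34_672, per_device_batch=4, grad_accum=8, seq_len=256)

if __name__ == "__main__":
    print(f"first run, at most: {first_run:,} tokens")
    print(f"packed run:         {packed_run:,} tokens")
    print(f"packed run sees about {packed_run / first_run:.0f}x more tokens")
    print(f"perplexity at loss 2.918: {perplexity(2.918):.1f}")
    print(f"perplexity at loss 1.924: {perplexity(1.924):.1f}")
```

The first figure is an upper bound because the unpacked run trains on truncated articles that can be shorter than 256 tokens. The packed figure equals the 1,109,504 examples in the log multiplied by 256 tokens each. Training loss on Wikipedia text alone does not say whether chat quality improved, so a held-out Persian evaluation set is still needed.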
And the script:
```python
import os

from datasets import load_dataset
from unsloth import FastLanguageModel
from transformers import TrainingArguments, Trainer

MODEL_PATH = r"F:\Deep Learning\Model\Qwen 3 0.6B SafeTensor"
DATA_PATH = r"E:\Ava1\Dataset-pars\Wikipedia-Persian\Wikipedia-Persian-textonly.jsonl"
OUTPUT_DIR = r"F:\Deep Learning\Outputs\qwen3-0.6b-persian-pretrain-lora3"
MAX_SEQ_LENGTH = 256
LOAD_IN_4BIT = True
NUM_PROC = 1


def prepare_dataset(tokenizer):
    print("📚 Loading dataset...")
    dataset = load_dataset(
        "json",
        data_files=DATA_PATH,
        split="train",
    )
    text_column = "Text" if "Text" in dataset.column_names else dataset.column_names[0]

    print("Fast tokenization (no truncation)...")

    def tokenize_function(examples):
        return tokenizer(
            examples[text_column],
            add_special_tokens=True,
            truncation=False,
        )

    tokenized = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names,
        desc="Tokenizing",
    )

    print("Packing into continuous GPT blocks...")

    def group_texts(examples):
        # Concatenate every article in the batch, then cut into fixed-size blocks.
        concatenated = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated["input_ids"])
        # Drop the tail that does not fill a whole block.
        total_length = (total_length // MAX_SEQ_LENGTH) * MAX_SEQ_LENGTH
        result = {
            k: [t[i:i + MAX_SEQ_LENGTH] for i in range(0, total_length, MAX_SEQ_LENGTH)]
            for k, t in concatenated.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

    lm_dataset = tokenized.map(
        group_texts,
        batched=True,
        batch_size=1000,
        desc="Packing blocks",
    )
    lm_dataset = lm_dataset.with_format("torch")
    return lm_dataset


def main():
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    print("Loading model with Unsloth...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_PATH,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=None,
        load_in_4bit=LOAD_IN_4BIT,
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    print("Applying LoRA adapters...")
    model = FastLanguageModel.get_peft_model(
        model,
        r=64,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=64,
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )

    train_dataset = prepare_dataset(tokenizer)
    print(f"Packed samples: {len(train_dataset):,}")

    print("Starting continued pretraining...")
    trainer = Trainer(
        model=model,
        train_dataset=train_dataset,
        args=TrainingArguments(
            output_dir=OUTPUT_DIR,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=8,
            warmup_steps=100,
            num_train_epochs=1,
            learning_rate=1e-4,
            bf16=True,
            logging_steps=10,
            optim="adamw_8bit",
            weight_decay=0.01,
            lr_scheduler_type="cosine",
            save_steps=500,
            save_total_limit=2,
            report_to="none",
            seed=3407,
            remove_unused_columns=False,
        ),
    )
    trainer.train()

    model.save_pretrained(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    print(f"Done! Model saved to: {OUTPUT_DIR}")


if __name__ == "__main__":
    main()
```
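One thing worth checking in a packing step like `group_texts`: some tokenizers do not append an end-of-sequence token on their own, even with `add_special_tokens=True`. I believe the Qwen tokenizers behave this way, but that is an assumption to verify by printing the last token ids of one tokenized article. If no separator is present, neighbouring articles run together inside a block with no boundary. A minimal sketch of packing with an explicit separator, working on plain lists of token ids (`pack_with_eos` is a hypothetical name):

```python
from typing import Iterable, List


def pack_with_eos(docs: Iterable[List[int]], eos_id: int, block_size: int) -> List[List[int]]:
    """Concatenate tokenized documents with an EOS separator, then cut fixed-size blocks."""
    stream: List[int] = []
    for ids in docs:
        if not ids:
            continue  # skip empty documents
        stream.extend(ids)
        if ids[-1] != eos_id:
            stream.append(eos_id)  # mark the document boundary
    # Drop the tail that does not fill a whole block.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5]]
    print(pack_with_eos(toy_docs, eos_id=0, block_size=3))
```

In a real pipeline `eos_id` would be `tokenizer.eos_token_id`, and the same logic would sit inside the batched `map` function that builds the blocks.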
For now, based on what I have seen while following Reuben’s Konkani LLM project from the sidelines, I’ve put together a guide outlining key considerations for creating datasets and fine-tuning models for low-resource languages:
- low_resource_lang_ft/low_resource_lang_ft_01_start_here_20260603.md · John6666/knowledge_base_md_for_rag_1 at main
- low_resource_lang_ft/low_resource_lang_ft_02_language_digital_diagnosis_20260603.md · John6666/knowledge_base_md_for_rag_1 at main
- low_resource_lang_ft/low_resource_lang_ft_03_data_acquisition_routes_20260603.md · John6666/knowledge_base_md_for_rag_1 at main
- low_resource_lang_ft/low_resource_lang_ft_04_validation_evaluation_governance_20260603.md · John6666/knowledge_base_md_for_rag_1 at main
- low_resource_lang_ft/low_resource_lang_ft_05_finetuning_lifecycle_20260603.md · John6666/knowledge_base_md_for_rag_1 at main
- low_resource_lang_ft/low_resource_lang_ft_06_case_studies_and_resource_shelves_20260603.md · John6666/knowledge_base_md_for_rag_1 at main
- low_resource_lang_ft/low_resource_lang_ft_07_templates_and_checklists_20260603.md · John6666/knowledge_base_md_for_rag_1 at main
Thank you so much for the detailed help. I’ll spend some time reading through all of those guides carefully. I really appreciate it.
Hello, I read your guide and learned several new things from it. One part that I found especially helpful was the section about checking whether a tokenizer supports my target language well.
After reading that, I tested the Qwen 3 0.6B tokenizer on Persian, and its performance was quite poor. I also tested Qwen 3.5 0.8B, which was better, but still not good enough for strong Persian support.
So I wanted to ask: where can I find a base model that is truly strong for Persian?
Or can I somehow fine-tune a tokenizer for Persian?
I did think of one approach myself, though: if I can't improve the tokenizer, I can help it. Maybe I'll build a normalizer first, then extend it step by step and measure the changes in the tokenized output.
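A starting point for such a normalizer might look like the sketch below. The rules are a small, assumed subset of what Persian normalizers commonly do (Arabic yeh and kaf mapped to their Persian forms, tatweel removed, Arabic diacritics optionally stripped, spaces collapsed); `normalize_persian` and `token_count_change` are hypothetical names, and whether each rule helps should be decided by measuring, not assumed.

```python
import re
import unicodedata
from typing import Callable, List

# Character-level replacements (assumed rule set, extend as needed).
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
    "\u0640": None,      # tatweel (kashida) is removed
})
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan .. sukun
SPACES = re.compile(r"[ \t\u00A0]+")


def normalize_persian(text: str, strip_diacritics: bool = True) -> str:
    """Apply Unicode NFC, unify letter variants, and collapse repeated spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_MAP)
    if strip_diacritics:
        text = DIACRITICS.sub("", text)
    text = SPACES.sub(" ", text)
    return text.strip()


def token_count_change(text: str, tokenize: Callable[[str], List]) -> int:
    """Tokens after normalization minus tokens before (negative means fewer tokens)."""
    return len(tokenize(normalize_persian(text))) - len(tokenize(text))


if __name__ == "__main__":
    raw = "\u0643\u062A\u0627\u0628\u0640\u0640  \u0639\u0644\u064A"
    print(normalize_persian(raw))
    # Byte-level stand-in tokenizer so the example runs without a model.
    print(token_count_change(raw, lambda s: list(s.encode("utf-8"))))
```

To measure the effect on a real model, pass a function built from the actual tokenizer as `tokenize` and sum `token_count_change` over a sample of the corpus after each new rule. The zero-width non-joiner (U+200C) is deliberately left untouched here, since it carries meaning in Persian spelling.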