PULI LlumiX 32K base (6.74 billion parameters)

For further details, or to test our instruct model, see our demo site.

  • Trained with OpenChatKit (GitHub)
  • The LLaMA-2-7B-32K model was continually pretrained on a Hungarian dataset
  • The model has been extended to a context length of 32K with position interpolation
  • Checkpoint: 100 000 steps
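Position interpolation, mentioned in the list above, rescales position indices so that the extended window maps back into the position range seen during the original pretraining. The following is a minimal sketch, assuming the linear variant applied to rotary position embeddings and an original LLaMA-2 context of 4,096 tokens (neither detail is stated in this card). `rope_angles` and its parameters are illustrative names, not part of any library.

```python
def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """Rotary angles for one position.

    scale > 1 applies linear position interpolation: the position index
    is divided by `scale` before the usual rotary frequencies are applied.
    """
    pos = position / scale
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]


ORIGINAL_CONTEXT = 4096    # assumption: LLaMA-2 pretraining context length
EXTENDED_CONTEXT = 32768   # from this model card
scale = EXTENDED_CONTEXT / ORIGINAL_CONTEXT

# The last position of the extended window, after interpolation,
# still falls inside the original position range.
last_interpolated_position = (EXTENDED_CONTEXT - 1) / scale
print(scale, last_interpolated_position < ORIGINAL_CONTEXT)
```

Under these assumptions, every position in the 32K window maps to a fractional position below 4,096, which is why the model can reuse the rotary frequencies it was originally trained with.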

Dataset for continued pretraining

  • Hungarian: 7.9 billion words from 763K documents, each longer than 5,000 words
  • English: Long Context QA (1 billion words), BookSum (42 million words)

Limitations

  • max_seq_length = 32 768
  • float16
  • vocab size: 32 000
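The limits above translate into two practical checks: the prompt plus the generated tokens must fit in the 32,768-token window, and the float16 weights need roughly two bytes per parameter. The sketch below illustrates both under stated assumptions: `fit_prompt` is a hypothetical helper (not part of transformers) that keeps the most recent tokens, and the memory figure covers weights only, ignoring activations and the key-value cache.

```python
MAX_SEQ_LENGTH = 32768   # from the limitations list
PARAMS = 6.74e9          # parameter count from the model title
BYTES_PER_PARAM = 2      # float16


def fit_prompt(token_ids, max_new_tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Drop the oldest tokens so that prompt + generation fits the window."""
    budget = max_seq_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    if len(token_ids) > budget:
        return token_ids[-budget:]
    return token_ids


# Rough size of the weights alone, in GiB.
weights_gib = PARAMS * BYTES_PER_PARAM / 2**30

# Stand-in token ids; a real prompt would come from the tokenizer.
too_long = list(range(40000))
print(len(fit_prompt(too_long, max_new_tokens=30)), round(weights_gib, 1))
```

Truncating from the left is only one possible policy; for question answering over long documents it may be preferable to keep the beginning of the document and the question instead.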

Usage with pipeline

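from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer

# Load the base model and its tokenizer from the Hugging Face Hub
model = LlamaForCausalLM.from_pretrained("NYTK/PULI-LlumiX-32K")
tokenizer = LlamaTokenizer.from_pretrained("NYTK/PULI-LlumiX-32K")
# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

# Generate up to 30 new tokens and print the prompt with its continuation
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])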

Citation

If you use this model, please cite the following paper:

@inproceedings {yang-llumix,
    title = {The First Instruct-Following Large Language Models for Hungarian},
    booktitle = {2024 IEEE 3rd Conference on Information Technology and Data Science (CITDS) Proceedings},
    year = {2024},
    publisher = {University of Debrecen},
    address = {Debrecen, Hungary},
    author = {Zijian {\relax Gy}őző Yang and Réka Dodé and Gerg\H{o} Ferenczi and Péter Hatvani and Enik\H{o} Héja and Gábor Madarász and Noémi Ligeti-Nagy and Bence Sárossy and {\relax Zs}ófia Szaniszló and Tamás Váradi and Tamás Verebélyi and Gábor Prószéky},
    pages = {247--252},
    isbn = {9798350387889}
}