Pre-train ALBERT from scratch for Persian/Farsi language

1. Pre-train ALBERT from Scratch for the Persian Language

There are currently a few open-source language models for Farsi. We have crawled new samples to build a new dataset, which we want to use to pre-train ALBERT for Farsi from scratch.

2. Language

Persian/Farsi

3. Model

ALBERT xLarge

4. Datasets

Wikipedia dump
Common Crawl dump
Random web scrapes

5. Training scripts

There are already Flax scripts for language-model pre-training that we can easily use for ALBERT:

transformers/examples/flax/language-modeling at master · huggingface/transformers · GitHub

6. Challenges

Building a dataset pipeline (ETL approach) for the newly crawled .txt files
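
To make the ETL idea a bit more concrete, here is a minimal sketch in plain Python (standard library only). It is not the project's actual pipeline. The function names (`extract`, `transform`, `load`, `run_etl`), the `min_chars` threshold, and the specific cleaning rules (mapping Arabic yeh and kaf to their Persian forms, collapsing whitespace, dropping exact duplicate lines) are all assumptions chosen for illustration.

```python
import hashlib
import re
import unicodedata
from pathlib import Path
from typing import Iterable, Iterator

# Assumption: crawled Persian text often mixes in Arabic code points,
# so we map the two most common ones to their Persian equivalents.
_CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})
_WHITESPACE = re.compile(r"[ \t\r\f\v]+")


def normalize(line: str) -> str:
    """Apply NFC, unify yeh/kaf, collapse runs of whitespace, and strip."""
    line = unicodedata.normalize("NFC", line).translate(_CHAR_MAP)
    return _WHITESPACE.sub(" ", line).strip()


def extract(paths: Iterable[Path]) -> Iterator[str]:
    """Extract step: stream raw lines from each crawled .txt file."""
    for path in paths:
        with path.open(encoding="utf-8", errors="replace") as handle:
            yield from handle


def transform(lines: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Transform step: normalize, drop short lines and exact duplicates."""
    seen = set()
    for raw in lines:
        line = normalize(raw)
        if len(line) < min_chars:
            continue
        # Store a fixed-size digest instead of the full line to save memory.
        digest = hashlib.sha1(line.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line


def load(lines: Iterable[str], out_path: Path) -> int:
    """Load step: write one cleaned line per row; return the line count."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
            count += 1
    return count


def run_etl(input_dir: Path, out_path: Path, min_chars: int = 20) -> int:
    """Run the three steps over every .txt file directly inside input_dir."""
    paths = sorted(input_dir.glob("*.txt"))
    return load(transform(extract(paths), min_chars), out_path)
```

Calling `run_etl(Path("crawl"), Path("clean.txt"))` (both paths hypothetical) would produce a single one-sentence-or-paragraph-per-line text file. Keep the output file outside the input directory so it is not picked up as input on a later run. For a corpus of Common Crawl size, the in-memory `seen` set would likely need to be replaced with something more scalable.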

Sounds amazing, I’m also interested in participating.

Thanks, sure! We are waiting for the project to be accepted.

Alright finalizing it!

I’m interested in joining.

I’d be happy to help you with this project!

@patrickvonplaten, do I need to request TPU access, or can I collaborate using the credentials of one of the team members, since I already have access to the hf-flax group?