Pre-train ALBERT from scratch for Persian/Farsi language

ghofrani · July 7, 2021, 1:42pm

1. Pre-train Albert from Scratch for the Persian Language

Some open-source language models already exist for Farsi. We have crawled new samples to build a new dataset, which we want to use to pre-train ALBERT for Farsi from scratch.

2. Language

Persian/Farsi

3. Model

ALBERT xLarge

4. Datasets

Wikipedia dump
Common Crawl dump
Random web scrapes

5. Training scripts

There are already Flax scripts to pre-train Albert that we can easily use:

transformers/examples/flax/language-modeling at master · huggingface/transformers · GitHub

6. Challenges

Building a dataset pipeline (ETL approach) for the newly crawled .txt files
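
As a rough illustration of what an ETL pass over crawled text files could look like, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration and not part of the proposal: the function names (`extract`, `transform`, `load`, `run_etl`), the one-document-per-line file layout, the `min_chars` threshold, and the choice of normalization (mapping Arabic yeh and kaf to their Persian forms, collapsing whitespace, dropping exact duplicates). A real pipeline would likely need language filtering and near-duplicate detection as well.

```python
import re
import unicodedata
from pathlib import Path
from typing import Iterable, Iterator

# Arabic code points that often show up in crawled Persian text,
# mapped to their Persian equivalents (an assumed, minimal mapping).
_CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})
_WHITESPACE = re.compile(r"\s+")


def extract(paths: Iterable[Path]) -> Iterator[str]:
    """Extract: yield raw lines from each crawled .txt file."""
    for path in paths:
        with path.open(encoding="utf-8", errors="replace") as handle:
            yield from handle


def transform(lines: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Transform: normalize characters, drop short lines and exact duplicates."""
    seen = set()
    for line in lines:
        text = unicodedata.normalize("NFC", line).translate(_CHAR_MAP)
        # Collapse runs of whitespace; the zero-width non-joiner used in
        # Persian is not whitespace for `re`, so it is left untouched.
        text = _WHITESPACE.sub(" ", text).strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        yield text


def load(lines: Iterable[str], out_path: Path) -> int:
    """Load: write one cleaned line per row and return how many were written."""
    count = 0
    with out_path.open("w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
            count += 1
    return count


def run_etl(input_dir: Path, out_path: Path, min_chars: int = 20) -> int:
    """Run the three stages over every .txt file directly inside input_dir."""
    paths = sorted(input_dir.glob("*.txt"))
    return load(transform(extract(paths), min_chars), out_path)
```

Calling `run_etl(Path("crawl"), Path("clean/fa_corpus.txt"))` (both paths hypothetical) would stream the files through the three stages without holding the raw corpus in memory; only the set of already-seen lines grows. The output file should live outside the input directory so it is not picked up as input on a later run.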

Mojtaba · July 7, 2021, 2:43pm

Sounds amazing, I’m also interested in participating.

ghofrani · July 7, 2021, 2:44pm

Thanks, sure. We are waiting for acceptance.

patrickvonplaten · July 7, 2021, 6:06pm

Alright finalizing it!

alphareality · July 8, 2021, 3:00pm

I’m interested in joining.

m3hrdadfi · July 10, 2021, 4:21pm

I’d be happy to help you with this project!

@patrickvonplaten, do I need to request TPU access, or can I collaborate using the credentials of one of the team members, since I have access to the hf-flax group?
