arxiv:2412.17562

ERUPD -- English to Roman Urdu Parallel Dataset

Published on Dec 23, 2024

Abstract

A novel parallel dataset of Roman Urdu is created using a hybrid approach, combining synthetic and real-world data, to address language processing challenges and serve machine translation, sentiment analysis, and multilingual education.

Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu -- a Latin-script adaptation of Urdu widely used in digital communication -- by creating a novel parallel dataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicate language processing. We tackle these challenges with a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We further refine the dataset through a human evaluation phase, addressing linguistic inconsistencies and ensuring accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu's diverse linguistic features and serves as a critical resource for machine translation, sentiment analysis, and multilingual education.
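
As a rough illustration of the hybrid approach described in the abstract, the following minimal sketch merges synthetic and real-world sentence pairs into one parallel list while dropping copies that differ only in case or spacing. It is not the authors' pipeline: the `SentencePair` type, the `source` labels, the `normalize` and `merge_pairs` helpers, and the example sentences are all hypothetical and are not taken from the dataset. Spelling variants of the same sentence are deliberately kept, since the abstract treats phonetic variability as a feature the data should capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One English / Roman Urdu pair (hypothetical schema, not the dataset's)."""
    english: str
    roman_urdu: str
    source: str  # assumed labels: "synthetic" or "real"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def merge_pairs(*sources):
    """Merge lists of pairs, keeping the first copy of each normalized pair."""
    seen = set()
    merged = []
    for pairs in sources:
        for pair in pairs:
            key = (normalize(pair.english), normalize(pair.roman_urdu))
            if key not in seen:
                seen.add(key)
                merged.append(pair)
    return merged


# Made-up example sentences for illustration only.
synthetic = [
    SentencePair("How are you?", "Aap kaise hain?", "synthetic"),
    SentencePair("Thank you", "Shukriya", "synthetic"),
]
real = [
    SentencePair("how are  you?", "aap kaise hain?", "real"),  # same pair up to case and spacing
    SentencePair("How are you?", "Ap kese hain?", "real"),     # spelling variant, kept
]

merged = merge_pairs(synthetic, real)
for pair in merged:
    print(pair.source, "|", pair.english, "|", pair.roman_urdu)
```

The human evaluation phase mentioned in the abstract is not modeled here; in a sketch like this it would sit after the merge, as a manual review of the surviving pairs.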
