This repository provides scripts to translate the SQuAD v1.0 dataset into Sinhala using the Google Translate API, and to clean and validate the resulting QA data by correcting answer spans and removing invalid entries.
- Translates context, question, and answer fields to Sinhala.
- Recalculates answer start indices post-translation.
- Filters out QA pairs where the translated answer is not found in the translated context.
- Preserves the SQuAD format for compatibility with existing QA models and tools.
git clone https://github.com/j-ranasinghe/SQuAD-Translation.git
cd SQuAD-Translation
pip install -r requirements.txt
- Go to Google Cloud Console.
- Enable the Cloud Translation API.
- Create a service account.
- Download the key and save it as credentials.json in the project root.
🚀 Running the Translation
python translation_pipeline.py
🧹 Cleaning the Translations - To clean the translated dataset (e.g., fix answer spans, remove invalid answers):
python clean_translations.py