This repository contains:
- Scripts for training and evaluating Sinhala fine-tuned models on QA tasks.
- A `data/` folder with the Sinhala QA datasets, including the training, validation, and test sets used for model development and benchmarking.
  - 📦 The translated dataset used for training is publicly available on Hugging Face: SiQuAD
Below are key companion repositories used in this workflow:
- Tool developed for manually annotating QA pairs in Sinhala.
- Used to create the Sinhala QA test set for evaluation.
- Supports context selection, question writing, and answer span marking.
- Scripts and pipeline used for translating the SQuAD dataset into Sinhala.
- Includes preprocessing, automatic translation, post-editing, and alignment verification steps.
- Contains scripts for scraping Sinhala news articles.
- The gathered data was used to build a seed context dataset that supports Sinhala QA development and model fine-tuning.
- 📦 The full dataset used for extracting passages for QA is publicly available on Hugging Face: Sinhala-News-Wiki-text-corpus
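
The answer span marking and alignment verification steps mentioned above can be illustrated with a small check. This is a minimal sketch, not the code used in these repositories: it assumes SQuAD-style records with `context`, `question`, and `answers` (`text`, `answer_start`) fields, and the sample record and the helper names `span_is_aligned` and `realign` are hypothetical.

```python
from typing import Optional


def span_is_aligned(context: str, answer_text: str, answer_start: int) -> bool:
    """Return True if answer_text occurs in context exactly at answer_start."""
    if answer_start < 0:
        return False
    end = answer_start + len(answer_text)
    return context[answer_start:end] == answer_text


def realign(context: str, answer_text: str) -> Optional[int]:
    """Return the first character offset of answer_text in context, or None."""
    idx = context.find(answer_text)
    return idx if idx >= 0 else None


# Hypothetical record in the assumed SQuAD-style layout.
record = {
    "context": "කොළඹ ශ්‍රී ලංකාවේ වාණිජ අගනුවර වේ.",
    "question": "ශ්‍රී ලංකාවේ වාණිජ අගනුවර කුමක්ද?",
    "answers": {"text": ["කොළඹ"], "answer_start": [0]},
}

text = record["answers"]["text"][0]
start = record["answers"]["answer_start"][0]

if span_is_aligned(record["context"], text, start):
    print("span aligned")
else:
    # Translation can shift offsets, so fall back to searching the context.
    print("span misaligned, suggested offset:", realign(record["context"], text))
```

Offsets here are counted in Python string positions (Unicode code points), which matters for Sinhala because conjuncts such as `ශ්‍රී` span several code points. A pipeline that counts bytes or grapheme clusters instead would need a different check.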