Authors:
Petr Hyner 1,2; Petr Marek 3; David Adamczyk 2,1; Jan Hůla 3,2 and Jan Šedivý 3
Affiliations:
1 Department of Informatics and Computers, Faculty of Science, University of Ostrava, Ostrava, Czech Republic
2 Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Ostrava, Czech Republic
3 Czech Technical University in Prague, Prague, Czech Republic
Keyword(s):
Language Models, Neural Networks, Transfer Learning, Vocabulary Swap.
Abstract:
We present a simple approach for efficiently adapting pre-trained English language models to generate text in a lower-resource language, specifically Czech. We propose a vocabulary swap method that leverages parallel corpora to map tokens between languages, allowing the model to retain much of its learned capabilities. Experiments conducted on a Czech translation of the TinyStories dataset demonstrate that our approach significantly outperforms baseline methods, especially when using small amounts of training data. With only 10% of the data, our method achieves a perplexity of 17.89, compared to 34.19 for the next best baseline. We aim to contribute to work in the field of cross-lingual transfer in natural language processing, and we propose a simple-to-implement, computationally efficient method tested in a controlled environment.
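The core idea described above, mapping tokens between languages via a parallel corpus and reusing the source model's embeddings, can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the paper's exact implementation: the toy sentence pairs, whitespace tokenization, co-occurrence-based mapping, and random stand-in embedding matrix are all placeholders (real pipelines would use the models' actual tokenizers and a proper word-alignment step).

```python
# Hedged sketch of a vocabulary swap: map each target-language (Czech) token
# to a source-language (English) token via co-occurrence counts over a
# parallel corpus, then initialize the target embeddings by copying the
# mapped source rows. All data and sizes below are toy placeholders.
from collections import Counter, defaultdict

import numpy as np

# Toy parallel corpus: (English, Czech) sentence pairs, pre-tokenized.
parallel = [
    (["the", "cat", "sleeps"], ["kocka", "spi"]),
    (["the", "dog", "sleeps"], ["pes", "spi"]),
    (["a", "cat", "runs"], ["kocka", "bezi"]),
]

# Count how often each Czech token co-occurs with each English token.
cooc = defaultdict(Counter)
for en_toks, cs_toks in parallel:
    for cs in cs_toks:
        cooc[cs].update(en_toks)

# Map each Czech token to its most frequently co-occurring English token.
token_map = {cs: counts.most_common(1)[0][0] for cs, counts in cooc.items()}

# Source-model vocabulary and a random stand-in for its embedding matrix.
en_vocab = {t: i for i, t in enumerate(["the", "a", "cat", "dog", "sleeps", "runs"])}
rng = np.random.default_rng(0)
en_emb = rng.standard_normal((len(en_vocab), 8))

# Build the Czech vocabulary and copy over the mapped English embedding rows,
# so the adapted model starts from pre-trained representations.
cs_vocab = {t: i for i, t in enumerate(token_map)}
cs_emb = np.stack([en_emb[en_vocab[token_map[cs]]] for cs in cs_vocab])
```

In practice a raw co-occurrence argmax is noisy (frequent function words dominate), so a real mapping would typically filter stop words or use a dedicated aligner; the point here is only the shape of the procedure: derive a token-to-token map from parallel data, then transplant embeddings before continued training.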