The Armenian Wikipedia (hywiki) training pipeline got stuck at the step where it generates the backtesting data, as shown in the screenshot below:
I decided to let it continue running for over 10 hours, since I wasn't sure whether it was stuck because of a bug or simply because of the amount of data it was processing.
I asked @MGerlach whether he had ever faced this issue, and he said:
In another project, I recently came across the fact that the standard sentence-processing pipeline didn't work for Armenian Wikipedia. Looking at some articles in hywiki, I quickly realized that Armenian uses "։" as a sentence marker (which is not the same character as a colon). It thus makes sense that it gets stuck when generating the backtesting data, where we extract individual sentences with existing links. In this case, the sentences will be way too long, so it probably gets hung up somewhere.
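To illustrate why this trips up a standard tokenizer, here is a minimal sketch in Python. It assumes an NLTK Punkt-style sentence tokenizer; the tokenizer the pipeline actually uses may differ, but any splitter keyed on ".", "!", and "?" would behave the same way:

```python
# Minimal repro sketch: assumes an NLTK Punkt-style sentence tokenizer,
# which may differ from the one the pipeline actually uses.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time model download ("punkt_tab" on newer NLTK)

# "։" (Armenian full stop) and ":" (colon) are different codepoints.
print(hex(ord("։")))  # 0x589 -> ARMENIAN FULL STOP
print(hex(ord(":")))  # 0x3a  -> COLON

# Two short Armenian sentences, separated only by "։".
text = "Երևանը Հայաստանի մայրաքաղաքն է։ Այն գտնվում է Հրազդան գետի ափին։"

# Punkt only splits on ".", "!", "?", so the whole text comes back as
# one ever-growing "sentence" -- consistent with the hang above.
print(sent_tokenize(text))  # -> a single-element list
```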
The goal is to replace the Armenian sentence-ending symbol "։" (U+0589) with one the tokenizer recognizes, so that the link recommendation algorithm can run sentence tokenization successfully.
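One possible fix is to normalize the Armenian full stop to an ASCII period before the text reaches the tokenizer. The sketch below assumes this preprocessing approach is acceptable for extracting sentences; the helper name normalize_hy_sentence_marks is hypothetical:

```python
from nltk.tokenize import sent_tokenize

ARMENIAN_FULL_STOP = "\u0589"  # "։"

def normalize_hy_sentence_marks(text: str) -> str:
    """Map the Armenian full stop to "." so a Punkt-style tokenizer
    can detect sentence boundaries (hypothetical helper)."""
    return text.replace(ARMENIAN_FULL_STOP, ".")

text = "Երևանը Հայաստանի մայրաքաղաքն է։ Այն գտնվում է Հրազդան գետի ափին։"
print(sent_tokenize(normalize_hy_sentence_marks(text)))
# -> ['Երևանը Հայաստանի մայրաքաղաքն է.', 'Այն գտնվում է Հրազդան գետի ափին.']
```

An alternative would be to extend the tokenizer's own set of sentence-boundary characters, but a simple replacement at preprocessing time is the least invasive change to the existing pipeline.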