While generating datasets and "add a link" models for the 5th round of wikis, the training pipeline worked well for all models except the Tibetan Wikipedia (bowiki). As seen in the screenshot below, the pipeline kept failing at the step that runs the backtesting evaluation:
Checked the files to confirm, and indeed the bowiki.backtest.eval.csv file is missing, as shown in the screenshot below:
Talked to @MGerlach and he said:
This means that tot_ret=0, which leads to a division by zero when calculating the precision. The precision is defined as TruePositives/(TruePositives+FalsePositives), i.e. of all the links suggested by the algorithm (TP+FP), how many were indeed correct links (TP). In this case, the algorithm did not suggest any links (TP+FP=0), so the precision is not defined. We should add a check that the denominators in the precision and recall are larger than 0, and otherwise return NaN.
I believe this happens because the train and test data in bowiki are very small (only 27 sentences each). While wikistats says there are ~12k articles in bowiki, most of them seem to have no links, which is actually quite interesting (here you can check random articles in bowiki). We discard articles without links for training or testing because we need actual links to train and test.
So, overall, we just have to create a patch that adds something like:
```python
from math import nan

if tot_ret > 0:
    micro_precision = tot_TP / tot_ret
else:
    micro_precision = nan

if tot_rel > 0:
    micro_recall = tot_TP / tot_rel
else:
    micro_recall = nan
```
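For reference, the guard can also be packaged as a small helper. This is a minimal sketch, not the actual patch: the function name `safe_micro_scores` is hypothetical, and only the counter names (`tot_TP`, `tot_ret`, `tot_rel`) mirror the snippet above.

```python
from math import nan, isnan

def safe_micro_scores(tot_TP, tot_ret, tot_rel):
    """Return (micro_precision, micro_recall), using NaN when a
    denominator is zero, i.e. when the metric is undefined."""
    micro_precision = tot_TP / tot_ret if tot_ret > 0 else nan
    micro_recall = tot_TP / tot_rel if tot_rel > 0 else nan
    return micro_precision, micro_recall

# Normal case: 8 true positives out of 10 suggested links, 16 relevant links.
print(safe_micro_scores(8, 10, 16))   # (0.8, 0.5)

# bowiki-like case: the algorithm suggested no links at all (tot_ret=0),
# so precision is undefined (NaN) while recall is simply 0.
p, r = safe_micro_scores(0, 0, 16)
print(isnan(p), r)                    # True 0.0
```

Returning NaN (rather than 0) keeps the undefined case distinguishable from a genuinely zero score in downstream aggregation.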