4.1 Datasets
The first of our two datasets is a version (created by Esuli et al. [
20]) of RCV1/RCV2, a corpus of news stories published by Reuters. This version of RCV1/RCV2 contains documents each written in one of nine languages (English, Italian, Spanish, French, German, Swedish, Danish, Portuguese, and Dutch) and classified according to a set of 73 classes. The dataset consists of 10 random samples, obtained from the original RCV1/RCV2 corpus, each consisting of 1,000 training documents and 1,000 test documents for each of the nine languages (Dutch being an exception, since only 1,794 Dutch documents are available; in this case, each sample consists of 1,000 training documents and 794 test documents). Note, though, that, while each random sample is balanced at the language level (same number of training documents per language and same number of test documents per language), it is not balanced at the class level: At this level the dataset RCV1/RCV2 is highly imbalanced (the number of documents per class ranges from 1 to 3,913—see Table
1), and so is each of the 10 random samples. The fact that each language is equally represented in terms of both training and test data allows the many-shot experiments to be carried out under controlled experimental conditions, i.e., it minimizes the possibility that the effects observed for the different languages are the result of different amounts of training data. (Of course, zero-shot experiments will instead be run by excluding the relevant training set(s).) Both the original RCV1/RCV2 corpus and the version we use here are comparable at the topic level, since news stories are not direct translations of each other but simply discuss the same or related events in different languages.
The second of our two datasets is a version (created by Esuli et al. [
20]) of JRC-Acquis, a corpus of legislative texts published by the European Union. This version of JRC-Acquis contains documents each written in 1 of 11 languages (the same 9 languages of RCV1/RCV2 plus Finnish and Hungarian) and classified according to a set of 300 classes. The dataset is parallel, i.e., each document is included in 11 translation-equivalent versions, one per language. Similarly to the case of RCV1/RCV2 above, the dataset consists of 10 random samples, obtained from the original JRC-Acquis corpus, each consisting of at least 1,000 training documents for each of the 11 languages (summing up to a total of 12,687 training documents in each sample) and 4,242 test documents for each of the 11 languages. As in the case of RCV1/RCV2, this version of JRC-Acquis is not balanced at the class level (the number of positive examples per class ranges from 55 to 1,155), and the samples obtained from it are not balanced either. Note that, in this case, Esuli et al. [
20] included at most one of the 11 language-specific versions of each document in a training set, in order to avoid the presence of translation-equivalent content in the training data; this enables one to measure the contribution of training information coming from different languages in a more realistic setting. When a document is included in a test set, instead, all its 11 language-specific versions are also included, so as to allow a perfectly fair evaluation across languages, since each of the 11 languages is thus evaluated on exactly the same content.
For both datasets, the results reported in this article (similarly to those of Reference [
20]) are averages across the 10 random selections. Summary characteristics of our two datasets are reported in Table
1; excerpts from sample documents from the two datasets are displayed in Table
2.
4.3 Learners
Wherever possible, we use the same learner as used in Reference [
20], i.e.,
support vector machines (SVMs) as implemented in the
scikit-learn package.
9 For the 2nd-tier classifier of
gFun, and for all the baseline methods, we optimize the
\(C\) parameter, which trades off training error and margin, by testing all values
\(C=10^{i}\) for
\(i\in \lbrace -1,\ldots , 4\rbrace\) by means of 5-fold cross-validation. We use Platt calibration to calibrate the 1st-tier classifiers used in the Posteriors VGF and (when using averaging as the aggregation policy) the classifiers that map document views into vectors of posterior probabilities. We employ the linear kernel for the 1st-tier classifiers used in the Posteriors VGF, and the RBF kernel (i) for the classifiers used for implementing the averaging aggregation policy and (ii) for the 2nd-tier classifier.
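As an illustration, the following sketch shows how such a setup could be implemented with scikit-learn; it is not the code actually used in our experiments, and the function names are ours.
```python
# Illustrative sketch (not the code actually used in our experiments) of the SVM
# setup described above: grid search over C = 10^i for i in {-1, ..., 4} with
# 5-fold cross-validation, plus Platt (sigmoid) calibration of the 1st-tier
# classifiers. One such binary classifier is trained per class.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV

C_GRID = {"C": [10.0 ** i for i in range(-1, 5)]}

def fit_first_tier(X, y):
    """Linear-kernel SVM with C optimized by 5-fold CV, then Platt-calibrated
    so that predict_proba() returns calibrated posterior probabilities."""
    search = GridSearchCV(SVC(kernel="linear"), param_grid=C_GRID, cv=5)
    search.fit(X, y)
    calibrated = CalibratedClassifierCV(search.best_estimator_, method="sigmoid", cv=5)
    return calibrated.fit(X, y)

def fit_meta(Z, y):
    """2nd-tier (meta) classifier: RBF-kernel SVM with the same C grid."""
    search = GridSearchCV(SVC(kernel="rbf"), param_grid=C_GRID, cv=5)
    return search.fit(Z, y).best_estimator_
```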
To generate the BERT VGF (see Section
3.4), we rely on the pre-trained model released by
Huggingface10 [
66]. For each run, we train the model following the settings suggested by Devlin et al. [
17], i.e., we add one classification layer on top of the output of mBERT (the special token
[CLS]) and fine-tune the entire model end-to-end by minimizing the binary cross-entropy loss function. We use the AdamW optimizer [
36] with the learning rate set to 1e-5 and the weight decay set to 0.01. We also let the learning rate decay by means of a scheduler (StepLR) with step size equal to 25 and gamma equal to 0.1. We set the training batch size to 4 and the maximum length of the input (in terms of tokens) to 512 (which is the maximum input length of the model). Given that the number of training examples in our datasets is considerably smaller than that used by Devlin et al. [
17], in order to avoid overfitting we reduce the maximum number of epochs to 50 and apply an early-stopping criterion that terminates training after five epochs showing no improvement (in terms of
\(F_{1}^{M}\)) on the validation set (a held-out split containing 20% of the training documents). After convergence, we perform one last training epoch on the validation set.
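The following PyTorch sketch is only indicative of this fine-tuning setup, not our actual implementation; the class and variable names are ours.
```python
# Indicative PyTorch sketch of the fine-tuning setup described above (not our
# actual implementation). A single linear classification layer is placed on top
# of the [CLS] output of mBERT, and the whole model is fine-tuned end-to-end
# with binary cross-entropy.
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
from transformers import AutoModel, AutoTokenizer

class MBertClassifier(nn.Module):
    def __init__(self, num_classes, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # representation of the [CLS] token
        return self.classifier(cls)            # one logit per class

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = MBertClassifier(num_classes=73)        # e.g., 73 classes for RCV1/RCV2
criterion = nn.BCEWithLogitsLoss()             # binary cross-entropy over the classes
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = StepLR(optimizer, step_size=25, gamma=0.1)   # we assume epoch-level steps

def train_one_epoch(dataloader):
    """dataloader yields (list_of_texts, float_label_matrix) batches of size 4."""
    model.train()
    for texts, labels in dataloader:
        enc = tokenizer(list(texts), truncation=True, max_length=512,
                        padding=True, return_tensors="pt")
        logits = model(enc["input_ids"], enc["attention_mask"])
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```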
Each of the experiments we describe is performed 10 times, once on each of the 10 different samples extracted from the dataset, so that the statistical significance of the results can be assessed by means of the paired t-test mentioned in Section
3.6. All the results displayed in the tables included in this article are averages across these 10 samples and across the
\(|\mathcal {L}|\) languages in the datasets.
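For clarity, the significance test comparing two systems can be sketched as follows; scores_a and scores_b stand for the 10 per-sample values of the chosen evaluation measure, and the function name is ours.
```python
# Illustrative sketch of the significance test: a paired two-tailed t-test over
# the per-sample scores (e.g., the 10 macro-F1 values) of two competing systems.
from scipy.stats import ttest_rel

def paired_significance(scores_a, scores_b, alpha=0.05):
    t_stat, p_value = ttest_rel(scores_a, scores_b)   # two-tailed by default
    return p_value, p_value < alpha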
We run all the experiments on a machine equipped with a 12-core processor Intel Core i7-4930K at 3.40 GHz with 32 GB of RAM under Ubuntu 18.04 (LTS) and Nvidia GeForce GTX 1080 equipped with 8 GB of RAM.
4.5 Results of Many-shot CLTC Experiments
In this section, we report the results that we have obtained in our many-shot CLTC experiments on the RCV1/RCV2 and JRC-Acquis datasets.
11 These experiments are run in “everybody-helps-everybody” mode, i.e., all training data, from all languages, contribute to the classification of all unlabelled data, from all languages.
We will use the notation -X to denote a
gFun instantiation that uses only one VGF, namely, the Posteriors VGF;
gFun-X is thus equivalent to the original
Fun architecture, but with the addition of the normalization steps discussed in Section
3.6. Analogously, -M will denote the use of the MUSEs VGF (Section
3.2), -W the use of the WCEs VGF (Section
3.3), and -B the use of the BERT VGF (Section
3.4).
Tables
3 and
4 report the results obtained on RCV1/RCV2 and JRC-Acquis, respectively. We denote different setups of
gFun by indicating after the hyphen the VGFs that the variant uses. For each dataset, we report the results for seven different baselines and nine different configurations of
gFun, as well as for two distinct evaluation metrics (
\(F_{1}\) and
\(K\)) aggregated across the
\(|\mathcal {Y}|\) different classes by both micro- and macro-averaging.
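As a reminder of how the two aggregations differ, the sketch below computes micro- and macro-averaged \(F_{1}\) with scikit-learn (the computation of \(K\) is omitted here); it is illustrative only, and y_true and y_pred are binary indicator matrices of shape (number of documents, \(|\mathcal {Y}|\)).
```python
# Illustrative computation of micro- and macro-averaged F1 over the |Y| classes;
# y_true and y_pred are binary indicator matrices of shape (n_documents, |Y|).
from sklearn.metrics import f1_score

def f1_micro_macro(y_true, y_pred):
    micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
    macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
    return micro, macro
```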
The results are grouped into four batches of methods. The first one contains all the baseline methods. The remaining batches present results obtained using a selection of meaningful combinations of VGFs: The 2nd batch reports the results obtained by gFun when equipped with a single VGF; the 3rd batch reports ablation results, i.e., results obtained by removing one VGF at a time from the setting containing all VGFs; the last batch reports the results obtained by jointly using all the VGFs discussed.
The results clearly indicate that the fine-tuned version of multilingual BERT consistently outperforms all the other baselines, on both datasets. Concerning gFun’s results, among the different settings of the second batch (testing different VGFs in isolation), the only configuration that consistently outperforms mBERT in RCV1/RCV2 is gFun-B. Conversely, on JRC-Acquis, all four VGFs in isolation manage to beat mBERT for at least two evaluation measures. Most other configurations of gFun we have tested (i.e., configurations involving more than one VGF) consistently beat mBERT, with the sole exception of gFun-XMW on RCV1/RCV2.
Something that immediately stands out is that gFun-X yields better results than Fun, from which it differs only in the normalization steps of Section
3.6. This is a clear indication that these normalization steps are indeed beneficial.
Combinations relying on WCEs seem to perform comparatively better on the JRC-Acquis dataset and worse on RCV1/RCV2. This can be ascribed to the fact that the amount of information brought about by word-class correlations is higher in the case of JRC-Acquis (since this dataset contains no fewer than 300 classes) than in RCV1/RCV2 (which only contains 73 classes). Notwithstanding this, the WCEs VGF seems to be the weakest among the VGFs that we have tested. Conversely, the strongest VGF seems to be the one based on mBERT, though it is also clear from the results that the other VGFs contribute to further improving the performance of gFun; in particular, the combination gFun-XMB stands out as the top performer overall, since it is always either the best-performing model or a model not different from the best performer in a statistically significant sense.
Upon closer examination of Tables
3 and
4, the 2nd, 3rd, and 4th batches help us highlight the contribution of each signal (i.e., of the information brought about by each VGF).
Let us start from the 4th batch, where we report the results obtained by the configuration of gFun that exploits all of the available signals (gFun-XWMB). On RCV1/RCV2, such a configuration yields results superior to those of the single-VGF settings (note that, even though the result for gFun-B (.608) is higher than that for gFun-XWMB (.596), this difference is not statistically significant, with a \(p\)-value of .680 according to the two-tailed t-test that we have run). Such a result indicates that there is indeed a synergy among the heterogeneous representations.
In the 3rd batch, we investigate whether all of the signals are mutually beneficial or if there is some redundancy among them. We remove from the “full stack” (
gFun-XWMB) one VGF at a time. The removal of the BERT VGF has the worst impact on
\(F_{1}^{M}\). This was expected, since, in the single-VGF experiments,
gFun-B was the top-performing setup. Analogously, by removing representations generated by the Posteriors VGF or those generated by the MUSEs VGF, we have a smaller decrease in
\(F_{1}^{M}\) results. On the contrary, ditching WCEs results in a higher
\(F_{1}^{M}\) score (our top-scoring configuration); the difference between
gFun-XWMB and
gFun-XMB is not statistically significant in RCV1/RCV2 (with a
\(p\)-value between 0.001 and 0.05), but it is significant in JRC-Acquis. This is an interesting fact: Although in the single-VGF setting the WCEs VGF is the worst-performing one, we were not expecting its removal to be beneficial. Such a behavior suggests that the WCEs are not well aligned with the other representations, resulting in worse performance across all four metrics. This is also evident if we look at the results reported in Reference [
47]. If we remove from
gFun-XMW (.558) the Posteriors VGF, thus obtaining
gFun-MW, then we obtain an
\(F_{1}^{M}\) score of .536; by removing the MUSEs VGF, thus obtaining
gFun-XW, we lower the
\(F_{1}^{M}\) to .523; instead, by discarding the WCEs VGF, thus obtaining
gFun-XM, we increase
\(F_{1}^{M}\) to .575. This behavior tells us that the information encoded in the Posteriors and in the WCEs representations is divergent: In other words, their combination does not help in building more easily separable document embeddings. Results on JRC-Acquis follow the same pattern.
In Figure
4, we show a more in-depth analysis of the results, in which we compare, for each language, the relative improvements obtained in terms of
\(F_{1}^{M}\) (the other evaluation measures show similar patterns) by mBERT (the top-performing baseline) and a selection of
gFun configurations, with respect to the
Naïve solution.
These results confirm that the improvements brought about by
gFun-X with respect to
Fun are consistent across all languages, and not only as an average across them, for both datasets. The only configurations that underperform some monolingual naïve solutions (i.e., that have a
negative relative improvement) are
gFun-M (for Dutch) and
gFun-W (for Dutch and Portuguese) on RCV1/RCV2. These are also the only configurations that sometimes fare worse than the original
Fun. The configurations
gFun-B,
gFun-XMB, and
gFun-XWMB, all perform better than the baseline mBERT on almost all languages and on both datasets (the only exception being Portuguese when using
gFun-XWMB on RCV1/RCV2), with the improvements with respect to mBERT being markedly higher on JRC-Acquis. Again, we note that, despite the clear evidence that the VGF based on mBERT yields the highest improvements overall, all the other VGFs do contribute to improving the classification performance; the histograms of Figure
4 now reveal that the contributions are consistent across all languages. For example,
gFun-XMB outperforms
gFun-B for 6 out of 9 languages in RCV1/RCV2, and for all 11 languages in JRC-Acquis.
As a final remark, we should note that the document representations generated by the different VGFs are certainly not entirely independent (although their degree of mutual dependence would be hard to measure precisely), since they are all based on the distributional hypothesis, i.e., on the notion that systematic co-occurrence (of words and other words, of words and classes, of classes and other classes, etc.) is evidence of correlation. However, in data science, mutual independence is not a necessary condition for usefulness; we all know this, e.g., from the fact that the “bag of words” model of representing text works well despite the fact that it makes use of thousands of features that are not independent of each other. Our results show that, in the best-performing setups of gFun, several such VGFs coexist despite the fact that they are probably not mutually independent, which seems to indicate that the lack of independence of these VGFs is not an obstacle.
4.6 Results of Zero-shot CLTC Experiments
Fun was not originally designed for dealing with zero-shot scenarios, since, in the absence of training documents for a given language, the corresponding first-tier language-dependent classifier cannot be trained. Nevertheless, Esuli et al. [
20] managed to perform zero-shot cross-lingual experiments by plugging in an auxiliary classifier trained on MUSEs representations that is invoked for any target language for which training data are not available, provided that this language is among the 30 languages covered by MUSEs.
Instead,
gFun caters for zero-shot cross-lingual classification
natively, provided that at least one of the VGFs it uses is able to generate representations for a target language with no training data (among the VGFs described in this article, this is the case for the MUSEs VGF and the mBERT VGF, for all the languages they cover). To see why, consider the
gFun-XWMB instance of
gFun using the averaging procedure for aggregation (Section
3.5). Assume that there are training documents for English and that there are no training data for Danish. We train the system in the usual way (Section
2). For a Danish test document, the MUSEs VGF
12 and the mBERT VGF contribute to its representation, since Danish is one of the languages covered by MUSEs and mBERT. The aggregation function averages across all four VGFs (-XWMB) for English test documents, while it only averages across two VGFs (-MB) for Danish test documents. Note that the meta-classifier does not perceive differences between English test documents and Danish test documents, since, in both cases, the representations it receives from the first tier come down to averages of calibrated (and normalized) posterior probabilities. Therefore, any language for which there are no training examples can be dealt with by our instantiation of
gFun provided that this language is catered for by MUSEs and/or mBERT.
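The following sketch illustrates this fallback behavior of the averaging aggregation policy; the API (VGF objects with a transform() method and a set of covered languages) is purely illustrative and does not correspond to gFun's actual implementation.
```python
# Illustrative sketch of the averaging aggregation in the zero-shot case. The API
# (VGF objects with a transform() method and a set of covered languages) is
# purely illustrative and does not correspond to gFun's actual implementation.
import numpy as np

def aggregate_by_averaging(document, lang, vgfs):
    """vgfs: list of (vgf, covered_languages) pairs, where vgf.transform() maps a
    document into a vector of calibrated (and normalized) posterior probabilities.
    Only the VGFs covering the document's language contribute to the average."""
    views = [vgf.transform(document, lang) for vgf, covered in vgfs if lang in covered]
    if not views:
        raise ValueError(f"no VGF covers language {lang!r}")
    # The result has the same dimensionality no matter how many VGFs contributed,
    # so the meta-classifier is unaware of how many views were available.
    return np.mean(np.stack(views), axis=0)
```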
To obtain results directly comparable with those obtained in the zero-shot experiments of Esuli et al. [
20], we reproduce their experimental setup. Thus, we run experiments in which we start with a single source language (i.e., a language endowed with its own training data), and we add new source languages iteratively, one at a time (in alphabetical order), until all languages of the given dataset are covered. At each iteration, we train
gFun on the available source languages and test on
all the target languages. At the
\(i\)th iteration, we thus have
\(i\) source languages and
\(|\mathcal {L}|\) target (test) languages, among which
\(i\) languages have their own training examples and the other
\((|\mathcal {L}|-i)\) languages do not. For this experiment, we choose the configuration involving all the VGFs (
gFun-XWMB).
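This experimental protocol can be summarized by the following sketch, in which train_fn and eval_fn are placeholders (names are ours) for the actual gFun training and evaluation routines.
```python
# Sketch of the protocol: source languages are added one at a time (in alphabetical
# order), and at each iteration the system is tested on all target languages.
# train_fn and eval_fn are placeholders for the actual training/evaluation routines.
def incremental_source_experiment(train_sets, test_sets, train_fn, eval_fn):
    """train_sets, test_sets: dicts mapping a language to its labelled data."""
    languages = sorted(train_sets)                 # alphabetical order of addition
    results = {}
    for i in range(1, len(languages) + 1):
        sources = languages[:i]                    # the first i source languages
        model = train_fn({lang: train_sets[lang] for lang in sources})
        for target in languages:                   # test on *all* target languages
            results[(i, target)] = eval_fn(model, test_sets[target])
    return results
```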
The results are reported in Figures
5 and
6, where we compare the results obtained by
Fun and
gFun-XWMB on both datasets, for all our evaluation measures. Results are presented in a grid of three columns, in which the first one corresponds to the results of
Fun as reported in Reference [
20], the second one corresponds to the results obtained by
gFun-XWMB, and the third one corresponds to the difference between the two, in terms of absolute improvement of
gFun-XWMB w.r.t.
Fun. The results are arranged in four rows, one for each evaluation measure. Performance scores are displayed through heat-maps, in which columns represent target languages, and rows represent training iterations (with incrementally added source languages). Color coding helps interpret and compare the results: We use red for indicating low values of accuracy and green for indicating high values of accuracy (according to the evaluation measure used) for the first and second columns; the third column (absolute improvement) uses a different color map, ranging from dark blue (low improvement) to light green (high improvement). The tone intensities of the
Fun and
gFun color maps for the different evaluation measures are independent of each other, so the darkest red (respectively, the lightest green) always indicates the worst (respectively, the best) result obtained by either of the two systems
for the specific evaluation measure.
Note that the lower triangular matrix within each heat map reports results for standard (many-shot) cross-lingual experiments, while all entries above the main diagonal report results for zero-shot cross-lingual experiments. As was to be expected, results for many-shot experiments tend to display higher figures (i.e., greener cells), while results for zero-shot experiments generally display lower figures (i.e., redder cells). These figures clearly show the superiority of gFun over Fun, especially in the zero-shot setting, for which the magnitude of improvement is decidedly higher. The absolute improvement ranges from 18% (for \(K^{M}\)) to 28% (for \(K^{\mu }\)) on RCV1/RCV2, and from 35% (for \(F_{1}^{M}\)) to 44% (for \(K^{\mu }\)) on JRC-Acquis.
In both datasets, the addition of new languages to the training set tends to help
gFun improve the classification of test documents also for languages for which a training set was already available. This is witnessed by the fact that the green tonality of the columns in the lower triangular matrix gradually becomes darker; for example, in JRC-Acquis, the classification of test documents in Danish evolves stepwise from
\(K=0.52\) (when the training set consists only of Danish documents) to
\(K=0.62\) (when all languages are present in the training set).13
A direct comparison between the old and new variants of funnelling is conveniently summarized in Figure
7, where we display average values of accuracy (in terms of our four evaluation measures) obtained by each method across all experiments of the same type, i.e., standard cross-lingual (CLTC – values from the lower triangular matrices of Figures
5 and
6) or zero-shot cross-lingual (ZSCLC – values from the upper triangular matrices), as a function of the number of training languages, for both datasets. These histograms reveal that
gFun improves over
Fun in the zero-shot experiments. Interestingly enough, the addition of languages to the training set seems to have a positive impact on
gFun, in both the zero-shot and the standard cross-lingual experiments.
4.8 Learning-curve Experiments
In this section, we report the results obtained in additional experiments aiming to quantify the impact on accuracy of variable amounts of target-language training documents. Given the supplementary nature of these experiments, we limit them to the RCV1/RCV2 dataset. Furthermore, for computational reasons, we carry out these experiments only on a subset of the original languages (namely, English, German, French, and Italian). In Figure
8, we report the results, in terms of
\(F_{1}^{M}\), obtained on RCV1/RCV2. For each of the four languages we work on, we assess the performance of
gFun-XMB by varying the amount of target-language training documents; we carry out experiments with 0%, 10%, 20%, 30%, 50%, and 100% of the training documents. For example, the experiments on French (Figure
8, bottom left) are run by testing on 100% of the French test data a classifier trained with 100% of the English, German, and Italian training data and with variable proportions of the French training data. We compare the results with those obtained (using the same experimental setup) by the Naïve approach (see Sections
1 and
4.1) and by
Fun [
20].
It is immediately apparent from the plots that the two baseline systems perform very poorly when there are few target-language training examples, but this is not true for gFun-XMB, which achieves a very respectable performance even with 0% target-language training examples; indeed, gFun-XMB is able to almost bridge the gap between the zero-shot and many-shot settings, i.e., for gFun-XMB the difference between the \(F_{1}^{M}\) values obtained with 0% and 100% target-language training examples is moderate. On the contrary, for the two baseline systems considered, the inclusion of additional target-language training examples results in a substantial increase in performance; however, both baselines substantially underperform gFun-XMB, for any percentage of target-language training examples, and for each of the four target languages.
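The protocol used for these learning-curve experiments can be sketched as follows; train_fn and eval_fn are again placeholders (names are ours) for the actual training and evaluation code.
```python
# Sketch of the learning-curve protocol: the training data of the other languages
# are used in full, while the target-language training data are subsampled at the
# stated fractions. train_fn and eval_fn are placeholders for the actual routines.
import random

FRACTIONS = (0.0, 0.1, 0.2, 0.3, 0.5, 1.0)

def learning_curve(target_lang, train_sets, test_set, train_fn, eval_fn, seed=0):
    rng = random.Random(seed)
    scores = {}
    for frac in FRACTIONS:
        docs = train_sets[target_lang]
        subset = rng.sample(docs, int(len(docs) * frac))
        reduced = {**train_sets, target_lang: subset}        # other languages: 100%
        scores[frac] = eval_fn(train_fn(reduced), test_set)  # e.g., macro-F1
    return scores
```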
4.9 Precision and Recall
In this section, we look at the precision and recall values obtained by gFun for the individual languages, with the goal of investigating whether any significant language-specific pattern emerges.
Figures
9 and
10 display precision and recall (in both their macro- and micro-averaged versions) obtained for the best-performing setting (-XMB) of
gFun, in one run on RCV1/RCV2 (Figure
9), and one run on JRC-Acquis (Figure
10).
The main observation emerging from these figures is that, for each language and for each dataset, average precision is invariably higher than average recall. This can be explained by the fact that all our datasets are imbalanced at the class level (i.e., for each class the positives are far outnumbered by the negatives). In these cases, it is well known that a learner that optimizes for vanilla accuracy (or for a proxy of it, such as the hinge loss, which is our case) tends to err on the side of caution (i.e., to choose a high decision threshold); after all, on a test set in which, say, 99% of the examples are negatives, classifying all the unlabelled examples as negatives (which is the result of an extremely high decision threshold) rewards the classifier with an extremely high value of vanilla accuracy, i.e., 0.99. In other words, imbalanced data plus the hinge loss as the loss to minimize lead to a high decision threshold, which in turn means, quite obviously, higher precision and lower recall. As mentioned above, this tendency is displayed by essentially all languages and on both datasets.
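The following toy example (with synthetic data, unrelated to our datasets) illustrates the argument: on heavily imbalanced data, raising the decision threshold trades recall for precision, and the trivial rejector that labels everything as negative already attains a vanilla accuracy of about 0.99.
```python
# Toy illustration (synthetic data): with ~1% positives, higher decision thresholds
# yield higher precision and lower recall, and predicting "negative" for every
# example already gives ~0.99 vanilla accuracy.
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)              # ~1% positives
scores = rng.normal(0.10, 0.05, 10_000) + 0.15 * y_true       # noisy classifier scores

for threshold in (0.10, 0.20, 0.30):
    y_pred = (scores >= threshold).astype(int)
    print(f"threshold={threshold:.2f}  "
          f"precision={precision_score(y_true, y_pred, zero_division=0):.3f}  "
          f"recall={recall_score(y_true, y_pred, zero_division=0):.3f}")

print(f"all-negative accuracy={accuracy_score(y_true, np.zeros_like(y_true)):.3f}")
```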