Abstract
In Natural Language Processing (NLP), pre-trained language models (LLMs) are widely employed and refined for various tasks. These models have shown considerable social and geographic biases, creating skewed or even unfair representations of certain groups. Research focuses on biases toward L2 (English as a second language) regions but neglects bias within L1 (first language) regions. In this work, we ask whether regional bias within L1 regions is already inherent in pre-trained LLMs and, if so, what the consequences are in terms of downstream model performance. We contribute an investigation framework specifically tailored for low-resource regions, offering a method to identify bias without imposing strict requirements for labeled datasets. Our research reveals subtle geographic variations in the word embeddings of BERT, even between cultures traditionally perceived as similar. These nuanced features, once captured, have the potential to significantly impact downstream tasks. Generally, models exhibit comparable performance on datasets that share similarities; conversely, performance may diverge when datasets differ in the nuanced features embedded within the language. It is crucial to note that model performance estimated solely on standard benchmark datasets may not transfer to datasets whose features differ from those benchmarks. Our proposed framework plays a pivotal role in identifying and addressing biases detected in word embeddings, which are particularly evident in low-resource regions such as New Zealand.
1 Introduction
Pre-trained language models are widely used, such as for translation (Wu et al., 2016), opinion summarization (Zhu et al., 2013; Farzindar, 2014) and chatbots (Shawar and Atwell, 2007). For example, OpenAI’s ChatGPT (OpenAI, 2023) chatbot is now widely accepted and applied in real life, for instance, in the education industry for lecture design (Extance, 2023), news article production (Liu, 2022) and content creation (Cao et al., 2023). Such powerful chatbots, along with other large language models, have been pre-trained on large collective training datasets, mostly from online resources (Devlin et al., 2019; Brown et al., 2020).
Language models learn arrays containing rich features, which are called word embeddings. Recent studies indicate that embeddings exhibit systematic patterns of stereotype discrimination, mirroring human biases (Zhang et al., 2021; Wolfe and Caliskan, 2021; Nadeem et al., 2021). For example, the word embeddings illustrate a significantly higher probability of ‘he’ as the predicted pronoun for a surgeon in a simple sentence. In contrast, word embeddings demonstrate a notably elevated probability of predicting ‘she’ as the pronoun associated with a nurse (Kumar et al., 2020).
Language is intertwined with culture due to differences in word usage behavior (Loveys et al., 2018), writing styles (Ma et al., 2022), common sense knowledge, debatable topics, and value systems (Hershcovich et al., 2022). Research has shown that these demographic differences in the task domain will harm the performance of downstream Natural Language Processing (NLP) tasks (Ma et al., 2022; Ghosh et al., 2021; Sun et al., 2021; González et al., 2020; Tan et al., 2020; Loveys et al., 2018).
An additional source of bias is differences in the amount of data each group contributes to a dataset. For example, most massive datasets are collected online. Regions with smaller populations than others contain fewer online users and are hence underrepresented in the training data. Particularly, Zhang et al. (2021) show that most word embeddings reflect more of the language habits of European-educated males, neglecting other subsets of the population. This constitutes a biased selection of the population (Hershcovich et al., 2022; Ma et al., 2022) and raises concerns about the non-selected groups’ representation within the dataset (Hershcovich et al., 2022; Wolfe and Caliskan, 2021), which will probably cause harm in applications (Ghosh et al., 2021; Tan et al., 2020; González et al., 2020).
While research mostly focuses on cross-culture problems in cross-lingual language models, the monolingual model—English in this case—may not be free of culture difference bias either (Hershcovich et al., 2022). Recent studies on English language models focus on the geographic influence of non-traditional English-speaking (L2) regions (Tan et al., 2020; Ghosh et al., 2021), e.g., researchers discovered differences in emotional responses (Ghosh et al., 2021) and word usage (Ma et al., 2022) between L2 regions. Others investigate bias in models on NLP tasks for different social groups within one region (Zhang et al., 2021).
However, Ma et al. (2022) and related research examine bias in resource-abundant regions, such as the US or India, where labeled datasets for NLP tasks are easier to access. In contrast, this paper facilitates the inclusion of resource-limited regions by reducing the reliance on labeled task data: we investigate regional bias at the word embedding level, before fine-tuning, where it can influence downstream task performance. Here, we focus on the ‘sequence output’ of BERT, where the probability distribution of each word is concatenated into a sequence array. Our proposed framework is not tailored to regional bias and can be used to investigate different sources of bias while including resource-limited groups. However, to demonstrate its power and close a research gap simultaneously, we focus on regional bias within the inner-circle English group of L1 regions that have previously been neglected by research. We ask the following research questions:
1. Do regional differences in raw text data manifest in embedding space?
2. What impact do regional differences have on the performance of downstream tasks?
The regional differences identified in RQ1 may arise from diverse topics or variations in word usage frequencies across different regions. Examining these variations not only elucidates regional patterns but also offers valuable insights that can guide bias mitigation efforts, which are relevant to the bias revealed through its impact on downstream tasks in RQ2.
We approach the research questions using two different standard datasets in NLP, Sentiment140 and Reuters21578, containing tweets and news articles from six and four L1 English-speaking countries, respectively. We find that regional differences are indeed manifested in BERT embedding space as well as in regional feature space. Furthermore, these differences affect the performance of downstream learning tasks. Particularly, we investigated sentiment classification and multilabel classification on the corresponding datasets and found significant drops in performance for underrepresented regions in relation to the test set performance. These results imply that differences in embedding space indicate model performance gaps and are hence a suitable tool to analyze bias while including resource-limited groups.
The remainder of this article is organized as follows: Sect. 2 provides preliminary information for our methodology in Sect. 3. Section 4 introduces the experimental setup. Section 5 analyzes the results. Finally, Sect. 6 reviews related research before Sect. 7 concludes this paper.
2 Preliminary and Notation
Before diving into the details of our methodology, we outline the notation and provide brief preliminary definitions required for the remainder of this article. More details can be found in Appendix 1.
Notation Let \(\Phi \) denote a pre-trained language model, \(\Phi : W \rightarrow S\), where W and S denote the input and the output of \(\Phi \). Likewise, let \(\Omega \) denote a downstream NLP task model, \(\Omega : S \rightarrow T\), with input S and output T. Then let \(D_{s}\) and \(D_{t}\) denote the metric functions (distance functions) on S and T respectively: \(D_{s}: S \times S \rightarrow {\mathbb {R}}\), \(D_{t}: T \times T \rightarrow {\mathbb {R}}\). This paper aims to show that differences in embedding space, \(D_{s}(\Phi (W_{i}), \Phi (W_{j}))\), cause differences in the downstream NLP task performance, \(D_{t}(\Omega (\Phi (W_{i})), \Omega (\Phi (W_{j})))\), where \(W_{i}\) and \(W_{j}\) denote two group domains for \(\Phi \).
Wasserstein distance and Sinkhorn algorithm Wasserstein distance is a measure derived from the optimal transport problems, estimating the effort of transforming one shape into another. It can be used to measure the difference between two probability distributions, such as region-specific data distributions in embedding space (Cai and Lim, 2022). The Sinkhorn algorithm (Chizat et al., 2020) allows for efficient calculation of the Wasserstein distance.
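To make this concrete, the following minimal sketch computes a Sinkhorn-approximated Wasserstein distance between two sets of embedding vectors using the POT (Python Optimal Transport) library; the random placeholder data, the cost-matrix normalization, and the regularization strength are illustrative assumptions rather than the settings used in our experiments.

```python
# Sketch: Sinkhorn-approximated Wasserstein distance between two groups of
# embedding vectors (e.g., region-specific BERT outputs). Illustrative only.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
emb_region_a = rng.normal(size=(200, 512))  # placeholder for region A embeddings
emb_region_b = rng.normal(size=(200, 512))  # placeholder for region B embeddings

# Uniform weights over the samples of each group
a = ot.unif(emb_region_a.shape[0])
b = ot.unif(emb_region_b.shape[0])

# Pairwise squared Euclidean cost matrix between the two point clouds,
# normalized for numerical stability of the Sinkhorn iterations
M = ot.dist(emb_region_a, emb_region_b, metric="sqeuclidean")
M /= M.max()

# Entropy-regularized optimal transport cost (Sinkhorn approximation)
w2_approx = ot.sinkhorn2(a, b, M, reg=0.1)
print(float(w2_approx))
```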
Linear discriminant analysis (LDA) Linear discriminant analysis (LDA) is a statistical method for finding new features (as linear combinations of the original features) that best discriminate between classes (in our case: regions). These new features constitute the axes of a new space, typically of a smaller dimension, which we refer to as the LDA space. LDA operates on the dataset alone using the regional labels and does not need a specific task.
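As an illustration, the sketch below fits a multi-class LDA on embedding vectors labeled by region and compares group centroids in the projected space; the synthetic data and variable names are placeholders and do not reproduce our experimental setup.

```python
# Sketch: project embeddings into LDA space using regional labels and
# measure centroid distances there. Data below is synthetic/illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
regions = ["AU", "CA", "NZ", "UK", "US", "ZA"]
X = rng.normal(size=(600, 512))      # placeholder embeddings (one row per text)
y = np.repeat(regions, 100)          # region label for each embedding

lda = LinearDiscriminantAnalysis(n_components=len(regions) - 1)
Z = lda.fit_transform(X, y)          # coordinates in LDA space

# Euclidean distance between group centroids in LDA space
centroids = {r: Z[y == r].mean(axis=0) for r in regions}
print(np.linalg.norm(centroids["UK"] - centroids["US"]))
```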
Distance correlation Distance correlation measures the strength of the dependency between two variables, even if their dimensions differ (Edelmann et al., 2021). In contrast to Pearson’s correlation, distance correlation can capture nonlinear associations between variables. The population distance correlation is zero if and only if the variables are independent. This paper chooses distance correlation to measure the dependency of the task performance of different regional groups. Distance correlation ranges from 0 (independence) to 1 (perfect linear dependency).
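A minimal NumPy implementation of the sample distance correlation, following the standard double-centering formulation (Székely et al., 2007), is sketched below; it is illustrative and not the exact code used in our experiments.

```python
# Sketch: sample distance correlation between two samples x and y,
# computed from double-centered pairwise distance matrices.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def _double_centered(x):
    d = squareform(pdist(x.reshape(len(x), -1)))  # pairwise Euclidean distances
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def distance_correlation(x, y):
    a = _double_centered(np.asarray(x, dtype=float))
    b = _double_centered(np.asarray(y, dtype=float))
    dcov2 = max((a * b).mean(), 0.0)              # squared distance covariance
    dvar_x, dvar_y = (a * a).mean(), (b * b).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

# Example: scores of the same regional test sets under two different models
print(distance_correlation([0.81, 0.78, 0.74, 0.83, 0.69],
                           [0.80, 0.77, 0.71, 0.84, 0.70]))
```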
Relation measurement To examine the relation between distances in embedding space and distances in downstream task performance, we use Spearman’s rho and Kendall’s \(\tau \) rank correlation coefficients. \(\rho \) assesses the strength and direction of the relationship between two variables by examining how their ranks change, not their actual values. \(\tau \) counts directly how many pairs of data points agree or disagree in their order, making it less sensitive to tied values. Both coefficients range from -1 (strong negative correlation) via 0 (independence) to 1 (strong positive correlation).
3 Methodology
To investigate bias in pre-trained language models and analyze its impact on a specific dataset, we propose a methodology that is specifically tailored to include all sub-populations, even those with limited access to task-labeled datasets.
Figure 1 depicts the procedural flow of the proposed method, which compares regional features in text data X in three distinct ways:
I. observe differences in performance for a downstream task y to quantify the impact of both regional bias and embedding bias,
II. measure distances in embedding space, tracing the regional differences back to the embedding, and
III. measure distances in LDA space to disentangle the effect of regional and embedding bias.
The subsequent Results Analysis stage examines the correlation between differences in extrinsic metrics (performance differences derived from I) and intrinsic metrics (distances derived from II and III). It is worth noting that this methodology focuses on intrinsic features obtained directly from the model’s output and word embeddings, diverging from the conventional practice of analyzing intermediate outputs of model layers (Leteno et al., 2023) or prediction probabilities over words for tasks like text classification (Nadeem et al., 2021; Lauscher et al., 2021) and text generation (Sun et al., 2022). This methodology helps illustrate the relationship between the knowledge features in models and the task performance in a straightforward way.
Additionally, our approach recommends the use of LDA as a tool to identify the feature subspace crucial for distinguishing group features. In essence, this method successfully unveils the intricate relationship between intrinsic and extrinsic metrics, especially in the context of multi-group data, such as multiple regional groups in our case study. Notably, the intrinsic metrics derived herein can serve as effective evaluation metrics for addressing bias in pre-trained language models. Subsequently, we discuss each involved aspect in detail.
Performance To measure performance differences, a subset of text data X is employed as the training data to fine-tune the pre-trained LLMs, incorporating an additional task performance layer tailored to the specific task y. Simultaneously, a subset of X serves as the test data, subjected to the fine-tuned models to yield comprehensive performance results. The performance evaluation metrics include accuracy, AUC (area under the ROC curve), precision scores, and recall scores. Subsequently, gaps (score differences) in these evaluation metrics are calculated on a group-wise basis. The type of fine-tuning layer is dynamically determined by the nature of the task y. If there exists more than one fine-tuned model, the distance correlation of these performance metrics is calculated.
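As a hedged sketch of this group-wise evaluation, the snippet below computes the four metrics per group with scikit-learn and derives gaps against a baseline test fold; the group names and the randomly generated predictions are placeholders, not our actual model outputs.

```python
# Sketch: group-wise evaluation of a fine-tuned classifier and performance
# gaps relative to a baseline test fold. All inputs are placeholders.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "auc": roc_auc_score(y_true, y_prob),
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred)}

rng = np.random.default_rng(0)
groups = ["baseline", "AU", "CA", "NZ", "UK", "US", "ZA"]
# (true labels, predicted probabilities) per group, as produced by the model
region_data = {g: (rng.integers(0, 2, 200), rng.random(200)) for g in groups}

scores = {g: evaluate(y, p) for g, (y, p) in region_data.items()}
gaps = {g: {m: scores[g][m] - scores["baseline"][m] for m in scores[g]} for g in groups}
print(gaps["NZ"])
```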
Distance correlation measures the dependency between the performance variance patterns of two datasets. Performance variance, in this context, denotes the fluctuations in performance on a test dataset resulting from changes to the testing model. Two similar datasets are expected to show similar patterns of performance variance. For example, if one model improves the performance on dataset A by \(f\%\), the improvement on dataset B is expected to be close to \(f\%\) whenever datasets A and B are similar in their features. This measure thus indicates how reliably model performance estimated on standard test datasets transfers to other datasets.
Performance drops between regions can be caused by two different types of biases: true regional bias inherent in the dataset and bias that is already encoded in the embedding. To isolate embedding bias, we subsequently calculate distances in embedding space as well as in LDA space, which circumvents using the embedding.
Embedding To measure embedding distances, the sequence embeddings, where the probability distribution arrays of words are concatenated into one array following the order of the word sequence, are extracted for the test data by regional groups. We opt for Wasserstein distance instead of the commonly used Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence. Wasserstein distance is preferred because it relaxes the assumption that two probability distributions are measured in the same space (detailed justification can be found in Appendix 1). Due to this property, Wasserstein distance can be generalised to other contexts, such as different languages, where the presence of a common subspace is not guaranteed. We then compute the Wasserstein distance in embedding space for any pair of regional groups using the Sinkhorn algorithm. This approach captures the nuanced differences in BERT embeddings, thereby contributing to the understanding of regional distinctions within the text data. Large regional differences in embedding space indicate large differences in word usage between those regions, whereas regions that are close in embedding space use similar language.
LDA As in the embedding aspect, embeddings are extracted again for the test data by regional groups. These embeddings are projected into LDA space to calculate their group distances, if applicable. Earlier studies (Zhao et al., 2019; Bolukbasi et al., 2016) have identified specific feature spaces using embeddings within a given feature set. For instance, Zhao et al. (2019) utilized gender-related sets (e.g., woman, girl, man, boy) to define the gender feature space. The present research aims to explore variations across multiple groups within the embedding space. LDA is employed to discover the feature space that effectively distinguishes between groups. LDA conducts this exploration without prior knowledge of the feature space and extends naturally from binary-class to multi-class data.
In LDA space, the axes represent linear combinations of features, in this context derived from embeddings, with an emphasis on maximizing the distance between groups. The adoption of LDA in this method serves the purpose of identifying the space that accentuates group identity features. It is essential to note, however, that LDA is not universally applicable. Its effectiveness is contingent upon equal group sizes. In instances of uneven group distribution, this paper employs a down-sampling strategy. LDA faces challenges on the Reuters21578 dataset because the total number of data samples falls below the number of features. Despite these limitations, LDA remains a valuable technique for visualizing group identity features.
Results analysis In the results analysis phase, the group performance gaps computed in the performance phase are juxtaposed with the group distances identified in embedding and LDA space. To assess the presence of bias in the embedding space, correlation coefficients such as Spearman’s \(\rho \) and Kendall’s \(\tau \) are employed. An impact on downstream task performance is considered present when the correlation strength is moderate to strong, accompanied by significant p-values.
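The sketch below illustrates this step with SciPy; the distance and gap values are made up for illustration, and, unlike the permutation tests reported in the paper, SciPy's default analytic p-values are used here only as an example.

```python
# Sketch: correlate pairwise embedding-space (or LDA-space) distances with
# pairwise performance gaps using Spearman's rho and Kendall's tau.
from scipy.stats import kendalltau, spearmanr

# One entry per region pair (i, j): distance between the groups and the
# absolute gap in a performance metric (e.g., accuracy). Values are made up.
embedding_distances = [0.12, 0.35, 0.18, 0.44, 0.27, 0.51]
accuracy_gaps = [0.01, 0.04, 0.02, 0.05, 0.02, 0.06]

rho, rho_p = spearmanr(embedding_distances, accuracy_gaps)
tau, tau_p = kendalltau(embedding_distances, accuracy_gaps)
print(f"Spearman rho={rho:.2f} (p={rho_p:.3f}), Kendall tau={tau:.2f} (p={tau_p:.3f})")
```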
4 Experimental setup
To scrutinize regional biases within large language models using our proposed methodology, we conduct experiments on two distinct datasets tailored for different NLP tasks. This section introduces the data we used and the experiment procedure, including pre-processing, training, and evaluation. We provide our implementation, results, and scripts in our repository, alongside supplementary materials containing additional details and results: https://github.com/anniejlu/regional_bias.
Datasets and preprocessing The availability of publicly accessible datasets for NLP task evaluation containing geographic information is limited. This experiment relies on two well-established datasets: Sentiment140 (training) (Go et al., 2009) and Reuters21578 (Apt’e et al., 1994). These datasets serve as standards for sentiment analysis and document multi-labeling tasks, respectively. See Table 1 for a dataset overview and Fig. 2 for the distribution of regions within each dataset.
The Sentiment140 training sets comprise 1.6 million tweets with sentiment labels. The labeling process is performed using Go et al. (2009)’s classifier, which relies on emoticons present in the tweets, eliminating the need for human labeling. We obtain binary labels (positive and negative). All special characters and emoticons are removed from the text. Figure 3 demonstrates the process of filtering and sampling applied to the Sentiment140 dataset.
For performance evaluation (I), we randomly select five training sets, each containing 16,000 tweets. Subsequently, we use 5-fold cross-validation to split each set further. Four folds are utilized as training data, while the remaining fold is used as test data, serving as one of the baseline datasets for performance evaluation.
We randomly select a sample of 100,000 tweets from the entire dataset for the extraction of regional data based on the location content from the L1 countries Australia (AU), Canada (CA), New Zealand (NZ), the United Kingdom (UK), the United States (US), and South Africa (ZA). See Appendix “Data preprocessing” section for details.
Here, the “mixed” dataset refers to the combined set of all data with identified regions. We have 28,636 tweets together for the regional test dataset.
Tweets are particularly short, which causes problems in our LDA analysis (see Appendix 3.2 for details). To overcome these problems, we generate long-texts from the Sentiment140 data by randomly picking 10 tweets, maximising the use of the 128-token sequence size in the model, to constitute a corpus in each iteration; a sketch of this construction is given below. We repeat this procedure 100 times for statistical stability. For each pair of comparisons, the embeddings of these long-text tweets are combined, and we measure the differences in their concatenated embeddings. We generally distinguish between short-text (the original tweets) and long-text embeddings and indicate which one we use.
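A minimal sketch of the long-text construction follows; the function and variable names are illustrative and may differ from our implementation.

```python
# Sketch: build "long-text" samples by concatenating randomly drawn tweets so
# that the 128-token input window of the model is better utilized.
import random

def make_long_texts(tweets, tweets_per_text=10, n_iterations=100, seed=0):
    rng = random.Random(seed)
    long_texts = []
    for _ in range(n_iterations):
        sample = rng.sample(tweets, tweets_per_text)  # draw 10 tweets at random
        long_texts.append(" ".join(sample))           # concatenate into one corpus
    return long_texts

# Example with placeholder data
tweets = [f"example tweet number {i}" for i in range(1000)]
long_texts = make_long_texts(tweets)
print(len(long_texts), long_texts[0][:60])
```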
Reuters21578 is a multilabel dataset of business news articles from Reuters. There are 119 document topic labels such as “trade”, “crude”, and “nat-gas”. Each document can have one or multiple labels. News articles without any topic labels are removed. We then preprocess every article by removing newline characters and replacing tab spaces with single spaces. We construct a regional dataset by gathering data where the “place” field contains a single value, operating under the assumption that the authors of the news articles originate from the respective publishing countries. Articles labeled as “Multiple” denote publications in the five focus regions, as well as in multiple areas that encompass these five regions. We follow the train-test split pre-defined by Apt’e et al. (1994). Figure 4 shows the details of the train-test split of the dataset.
Sampling To save computational costs, for the model training, we randomly draw 5 samples with 16,000 tweets for Sentiment140 as illustrated in Fig. 3. For each sample, we use 5-fold cross-validation to get 5 training data samples (12,000 tweets) and 5 test data samples (4000 tweets). The test data samples here serve as the baseline for the comparison of model performance for different region datasets, labeled as “baseline" in the plots in Sect. 5. Due to dataset size constraints in Reuters21578, we construct a single pre-defined training data sample and test data sample as per Apt’e et al. (1994). The test data sample, denoted as “Test" in subsequent plots, serves as the benchmark for performance comparison. As shown in both Figs. 3 and 4, we construct regional datasets with stratified sampling. Equal-sized samples are drawn from each region. We also create a “Mixed/Multiple" sample region by drawing samples exclusively from regions where all focused regions are known in terms of distribution, serving as a baseline for datasets with known regional distributions.
Embedding and Model Training We choose BERT (Devlin et al., 2019) as the embedding in our experiments and proceed as depicted in Fig. 1: We fine-tune the BERT model for the downstream task (I) and project the test datasets into embedding space (for II and III). Aiming to simulate routine task training and testing, we use unstratified dataset samples as training data and baseline test dataset. The results on the baseline datasets represent the general model performance. For Sentiment140 data, an uncased English BERT from TensorFlow Hub with 4 hidden layers of size 512 and 8 attention heads is fine-tuned with a dropout layer and a dense classification layer.
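A hedged sketch of this fine-tuning setup is shown below; the TensorFlow Hub handles, the dropout rate, and the optimizer settings are assumptions for illustration and may differ from our exact configuration (see Appendix “Hyperparameters”).

```python
# Sketch: fine-tune a small uncased BERT (4 layers, hidden size 512, 8 heads)
# from TensorFlow Hub with a dropout layer and a dense classification head.
# Hub handles and hyperparameters below are illustrative assumptions.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  # registers ops required by the preprocessor

PREPROCESS = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2"

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
encoder_inputs = hub.KerasLayer(PREPROCESS, name="preprocessing")(text_input)
bert_outputs = hub.KerasLayer(ENCODER, trainable=True, name="bert")(encoder_inputs)

x = tf.keras.layers.Dropout(0.1)(bert_outputs["pooled_output"])
probs = tf.keras.layers.Dense(1, activation="sigmoid", name="classifier")(x)  # binary sentiment head

model = tf.keras.Model(text_input, probs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.BinaryAccuracy(), tf.keras.metrics.AUC()],
)
# model.fit(train_texts, train_labels, validation_data=(val_texts, val_labels), epochs=3)
```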
For Reuters21578 data, the multilabel classifier training adheres to the original settings in DocBERT (Adhikari et al., 2019). Due to data limitations, the multilabeling experiment cannot accommodate cross-validation settings. Consequently, this experiment features a singular multilabel classifier.
We provide the concrete hyperparameter settings in Appendix “Hyperparameters”.
Evaluation For performance evaluation (I) on Sentiment140, starting from 28,636 tweets (see Fig. 3), we downsample each region to a sample size of 1,000 (except for NZ and AU, where we used all tweets), aligned with the CA data, as well as a mixed sample of equal size containing all regions (in their original distribution), to reduce the computational cost of transferring the tweets to embedding space. The sampling process is repeated 30 times to allow for statistical stability, and we report average accuracy, area under the curve (AUC), precision, and recall. For Reuters21578, the test dataset size is manageable and does not need to be reduced prior to the embedding space transfer.
For embedding differences (II), we sample data for each region with equal sizes of 100 for Sentiment140 and a down-sampled size of 35, aligned with the AU data, for Reuters21578, and repeat the process 100 times. To emphasize the regional feature pattern, we aggregate the sequence embeddings into a longer sequence embedding. For instance, in Sentiment140, we concatenate 100 sequence embeddings to form a single embedding, facilitating the measurement of Wasserstein distance. We report the mean Wasserstein distances for each pair of regions, representing the embedding difference between the two regions.
For LDA distances (III), we randomly sample 100 tweets or artificial long-text tweets and 35 tweets from each region and mixed-region group for Sentiment140 and Reuters21578, respectively. We proceed to project the group data into a dimension-reduced space using LDA and compute the distance between the centroids of clusters for each pair of groups, utilizing Euclidean distance. We repeat the process 100 times, and the mean distance values serve as the LDA distance for analysis. However, LDA fails to project Reuters21578 data to a reasonable space due to its small data size. The total sample size is 210 (including Australia) or 500 (excluding Australia), which is smaller than the dimension, 512, of BERT embeddings.
5 Results and discussion
In the following section, we discuss the outcomes of our study, exploring how regional differences in raw text data reflect in the embedding space and examining their influence on the performance of downstream tasks. This discussion builds upon the methodology and experimental setup outlined in the previous sections. Flowing into the result analysis are three types of results as illustrated in Fig. 1: (I) Performance gaps and distance correlations as a result of training a model for the specific tasks, i.e., sentiment analysis for the Sentiment140 dataset and multi-class labeling for Reuters21578, (II) Wasserstein distances of regional groups within the embedding space, i.e., BERT in our case, and (III) distances of regional groups within LDA space.
We structure this section according to our research questions, “Do regional differences in raw text data manifest in embedding space?” and “What impact do regional differences have on the performance of downstream tasks?”.
5.1 RQ1: Do regional differences in raw text data manifest in embedding space?
It is well-known that underrepresentation of certain groups in a dataset can cause bias, since the model will focus on the majority groups and neglect others, as they barely contribute to the overall training error. Imbalance of regional groups is a problem in the datasets we use. Figure 5 illustrates the relationship between the proportion within the standard test data and the difference of the BERT embedding from the standard test data. The embedding space difference is measured by Wasserstein distance (see computation details in Appendix 1). It generally follows that the Wasserstein distance of regional data from the standard test data is smaller when the region possesses a larger share of the standard data. In other words, a larger proportion of regional data in the standard datasets may exert an influence on the features of pre-trained BERT embeddings. There is one notable exception in the case of Canadian English, which is closer to the standard test data than anticipated in both datasets. One potential explanation is that Canadian English and global English share similar exposure to the two mainstream English styles, American English and British English (Boberg, 2012). Further experiments are required to confirm this hypothesis.
We expected long-text data to contain more regional patterns and saw this confirmed in preliminary experiments (Appendix Fig. 12), so we use long-text data subsequently. We acknowledge that the synthetic long-text data complicates interpretation. However, the linguistic differences between L1 English regions are subtle, and this framework aims to capture patterns that are magnified by extended corpus sequences.
Aiming to investigate the embedding space more closely, we illustrate the differences between the region groups in that space in Fig. 6a. UK English appears more similar to CA, AU, and NZ English than to the rest. South African data is closer to UK data than to US data. This implies that British English emerges as the central point among the language groups. British English is close to Australian English and Kiwi English – this is to be expected given that AU and NZ were former British colonies. These three regions form a proximity group. American English is far from this group. Canadian English and South African English fall between this proximity group and American English. Canadian English is closer to American English (understandable, given the countries’ geographic proximity), while South African English is closer to British English (which can be explained by colonial influences).
Reuters21578 data is considered long-text data as well since it consists of news articles. Figure 6c demonstrates the regional pattern in the Reuters21578 data. Canadian data and US data, together with the standard test data and the multiple-region data, cluster closely together. Both Australian data and UK data are far away from this group. This might be a result of the dominance of American English features in the standard dataset, as it occupies over \(50\%\) of the dataset, as shown in Fig. 5.
The observed differences in embedding space can stem from various contributing factors, including distinct semantic features from different words. Employing the LDA method allows us to further discern patterns in regional features among different regions. As depicted in Fig. 6b, a similar pattern emerges. Furthermore, the correlation with performance evaluation measures aligns between LDA results and BERT embeddings for long-text, as evident in the comparable color patterns in Fig. 7. The LDA results affirm that the distinctive patterns identified in the long-text embedding space largely arise from regional feature differences. LDA fails on Reuters21578 data due to its small data size, as described in Sect. 4.
5.2 RQ2: What impact do regional differences have on the performance of downstream tasks?
Based on the discussion in the previous section, we know that regional differences do manifest in embedding space. In this section, we discuss the impact of the identified bias on the performance of downstream tasks.
We investigate the impact of regional bias on performance through pairwise correlations of BERT embeddings and performance measures for both datasets, Sentiment140 and Reuters21578. The correlation depicts the relationship between performance and regional feature differences in BERT embeddings and thus presents the impact of the regional bias. Figures 7 and 8 show the correlation matrices (the p-value results of permutation tests for both coefficients are in Appendix Figs. 14 and C5). The impact on performance is observed from two perspectives: the performance scores and the performance responses to the change of models.
We evaluate the performance of the sentiment classification task with accuracy, AUC, precision scores, and recall scores. We measure the impact with Spearman’s and Kendall’s correlation coefficients. The results in Fig. 7 show that long-text BERT embedding differences and LDA space distances have a moderate positive relationship with accuracy and AUC for the sentiment classification task. This implies that the model performance on two datasets tends to show a larger discrepancy if they have distinct regional features. Recall that the differences from the baseline data (standard test data) in embedding space are tied to each region group’s representation level in the baseline data: underrepresented region groups tend to have more distinct regional features from the baseline data. Thus, the region groups with low representation in the standard test data tend to suffer a performance drop due to the regional feature differences.
The multi-label classification task shows a similar pattern when excluding the performance on Australian data. The small size of the AU data (35 documents) contributes to an unanticipatedly high performance (see Fig. 10) despite large feature differences from the standard test dataset, as shown in Fig. 6c. In addition, the differences in BERT embeddings have a moderate positive relationship with precision scores, and short-text Sentiment140 shows the same pattern. This suggests that when two datasets possess different features in the embedding space, the model’s performance on them, evaluated in precision scores, is likely to exhibit a more significant discrepancy. Reuters21578, consisting of news articles, exhibits both the long-text and the short-text embedding patterns because the model is applied directly to the embeddings of documents, which are long. No impact on recall scores has been found for either dataset.
The performance drop for the region groups due to regional bias is demonstrated in Figs. 9 and 10. Sentiment classification performance on New Zealand data is below the average. Model performance on Australian data and UK data is also worse than on the baseline dataset. The sudden improvement of precision scores for positive labels can be explained by the imbalanced class distribution (see Table 2 in “Appendix 3.3”). No significant differences can be observed for the recall score, consistent with the absence of a regional feature impact on recall.
In Fig. 10, multi-label classification on UK data shows a substantial gap from US data. This can be explained by the large feature differences in the embedding space (see Appendix 3.3.2). The unexpectedly superior model performance on US data persists when decreasing the proportion of US data in the training data until model training stops early due to data limitations. The resemblance between the Reuters21578 data (news articles) and the collective training data used for BERT pre-training (mostly online articles) appears to hinder the model’s ability to grasp certain text features, such as word usage, during fine-tuning. This suggests a possible dominant influence of US-specific characteristics in the pre-trained BERT model. However, this conjecture requires further experiments to verify.
Lastly, we investigate the impact on the performance variation patterns caused by changing the model applied to the regional data groups. The performance correlation examines whether a model exhibits similar enhancement or deterioration patterns on different datasets. When two datasets exhibit similarity, their performances are anticipated to align closely. Figure 7 shows a moderate negative relationship between the performance distance correlation and the long-text BERT embedding differences or the long-text region group distances in LDA space. It suggests that model performance variance tends to be similar when two region datasets have similar regional features. This observation may raise concerns about estimating model performance across diverse datasets: having a better-trained model does not automatically guarantee improved prediction results for all datasets, especially when datasets exhibit distinct regional features. The performance variation examination results imply that some regional features are entangled in the feature space relevant to task performance. For example, the word “scheme” carries a negative connotation in the US but remains neutral in the UK. This divergence in sentiment could potentially impact the performance of the sentiment prediction task on the Sentiment140 dataset. This observation excludes the Reuters21578 dataset, as only one model is trained on it.
Overall, we observe regional feature differences in the embedding space. The variations from the baseline data (standard test data) captured in the embedding space reflect the representation level of each region group in the baseline data. Region groups with limited representation in standard test data are prone to experiencing a decline in performance due to regional bias. The investigation of the performance distance correlation indicates that possessing a well-trained model does not inherently ensure enhanced prediction outcomes across all datasets, particularly when those datasets manifest distinct regional features.
5.3 Summary
Differences in regional features within the embedding space can be quantified through distribution variances assessed by Wasserstein distance. Such differences reveal three English language groups (US, UK-AU-NZ, CA-ZA), as shown in Fig. 6. The observed performance disparities among regional groups, as indicated in Figs. 9 and 10, can be partially attributed to such regional feature differences, as evidenced by Spearman’s rho and Kendall’s tau in Figs. 7 and 8. The experimental findings underscore that these regional feature differences directly impact precision scores, accuracy, and AUC. Moreover, they demonstrate that such differences influence performance variations across diverse models fine-tuned on different datasets. This suggests that model performance estimated on general datasets does not translate to regional datasets, which requires additional caution when using pre-trained LLMs, particularly in underrepresented regions.
6 Related work
This section discusses recent research within related disciplines. Section 6.1 explains the distinctive features of this study compared to existing research on regional biases. Section 6.2 provides an overview of other types of bias found in language models.
6.1 Studies investigating regional differences in LLMs
González et al. (2020) and Sun et al. (2021) delved into the regional impacts on multi-lingual language models. González et al. (2020) reveal that languages featuring anti-reflexive pronouns, like Swedish, may introduce unambiguous gender bias due to distinct grammar structures and national linguistic characteristics. They measure the bias by comparing the model performance on sentences with different types of pronouns, such as feminine pronouns. The results show that language models perform worse on sentences with feminine pronouns. Sun et al. (2021) propose three metrics for assessing cross-cultural proximity: language context ratio, literal translation quality, and emotional semantic distance. These metrics illustrate how different languages behave differently. Inspired by the emotional semantic distance, this work seeks to highlight feature differences in English across various regions within the inner circle, despite the common perception of shared culture among these inner-circle regions. Overall, this paper focuses on monolingual English language models, while the above articles focus on multi-lingual language models.
Numerous studies concentrate on the regional influences on monolingual language models, particularly those focused on English. Tan et al. (2020) employ an inflectional perturbation strategy to generate adversarial attacks on pre-trained language models, simulating the language behaviours of second-language English (L2) speakers. Ghosh et al. (2021) identify biases in existing toxicity detection models, noting a tendency to favor offensive words from underrepresented non-standard English (L2) regions. They evaluated the toxicity score sensitivity of country-specific words. The results show that off-the-shelf toxicity detection models can only weakly detect toxic country-specific slang words, which were probably unseen during training. Instead of investigating the model behavior discrepancy in text generation or toxicity detection tasks for L2 English speakers, we explore the differences for L1 (first language) English speakers.
Zhang et al. (2021) demonstrate that pre-trained language models exhibit a bias towards the “white man with a high education level" on cloze-test data in the US, with the exception of BERT-large models. The authors compare the model selections with human selections and show that the model selections overlap more with those of highly-educated white men from the US community. This raises concerns about minority social group bias in existing language models. Their work investigates model performance differences between English speakers from different cultures within one region, mostly the United States. In contrast, this paper aims to study differences in regional identities, which include all demographic groups within a region, across different geographic areas.
Ma et al. (2022) propose a benchmark dataset for five English-speaking regions and illustrate the word usage differences and performance differences before and after learning regional features. They investigate the regional differences at the performance level, while this paper aims to illustrate bias at the intrinsic level, i.e., differences in embedding space. Nevertheless, the research by Ma et al. (2022) addresses bias in resource-abundant regions. This paper strives to diminish dependence on labeled task data by examining regional bias at the word embedding level before the fine-tuning process, where it can potentially impact downstream task performance. However, the proposed dataset is not yet publicly available.
6.2 Beyond regional bias in LLMs
Observation of performance difference is one of the methods to quantify gender bias in NLP and is frequently expressed via accuracy, F1-scores, log-loss of the probability, and false positive rate (Stanczak and Augenstein, 2021). In this paper, we incorporate accuracy, AUC for binary prediction, precision, and recall to quantify bias in performance results but discard the F1-score due to its sensitivity to unbalanced datasets.
Another popular way to illustrate bias, especially stereotype bias, is to apply language models to a test template (Lauscher et al., 2021; Nadeem et al., 2021) in which words with stereotype attributes are missing and must be predicted. Bias is then presented by comparing the predictions. However, template prediction does not seem effective in investigating regional difference bias, as it is difficult to construct a lexicon that properly reflects the differences between regions.
Finally, following Bolukbasi et al. (2016)’s method, Zhao et al. (2019) measure gender with a gender axis, where gender-related words should cluster at two ends of the axis. Such an axis is found by finding a space where anchor sets (such as man:woman) have the largest variation. This method is tied to biases of features with two values, such as gender, and is not suitable for features with more values. To lift this constraint for our research, we use LDA to find the feature axis for multi-class features.
7 Conclusion and future work
Pre-trained language models are widely used despite being biased towards specific social and geographic groups. Particularly for regional bias, research focuses on L2 (English as a second language) regions but neglects bias within L1 (first language) regions, typically excluding low-resource regions entirely.
This paper introduced a novel approach for detecting bias in word embedding spaces for L1 regions characterized by similar culture and language behaviors, with a specific emphasis on addressing the challenges posed by low-resource regions. However, our proposed framework using Wasserstein distance in embedding space and LDA projection is general enough to extend to other types of bias.
We apply our framework to two specific datasets with two distinct downstream tasks. Our study demonstrates that regional bias (1) manifests in embedding space and (2) strongly impacts downstream task performance. When English language features exhibit greater distinctions between two regions, the performance gap of the model on datasets from these regions widens. Regions that are underrepresented in standard data sources or for which language features differ are particularly susceptible to regional bias and performance drops.
The findings indicate the importance of evaluating the performance of LLMs across diverse test datasets, as model efficacy can fluctuate significantly due to variations in feature distributions. Additionally, interpreting the results of LLMs necessitates careful consideration of whether the training datasets accurately reflect the characteristics of the target application data. For us as a research community, our findings imply that we need to dedicate further effort to uncovering hidden effects caused by regional bias in LLMs and to mitigating such bias.
Future research efforts will prioritize the generalization of our methodology and explore mitigation strategies for addressing biases within the inner group of the English language. This includes developing robust approaches to counteract biases in regions where language features deviate from the established standard data source. Another promising avenue of research is to extend the regional bias investigations to alternative representation formats, such as graph embeddings.
Our framework incorporates the embedding space to identify distinct features within groups, along with downstream tasks for evaluating performance. Wasserstein distance and LDA space distance are employed to quantify these distinctive group features. Bias is identified when task performances exhibit moderate or strong correlations with group feature differences, as indicated by significant p-values. In future research, we plan to explore the binary decision threshold for more refined bias detection.
Availability of data and materials
All datasets are publicly available and referenced in the paper.
Code availability
Available at https://github.com/anniejlu/regional_bias.
References
Adhikari, A., Ram, A., Tang, R., & Lin, J. (2019). DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398
Apt’e, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(1994), 233–251.
Boberg, C. (2012). Standard Canadian English. Standards of English: Codified varieties around the world (p. 159).
Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems (vol. 29).
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. In Advances in neural information processing systems (vol. 33, pp. 1877–1901).
Cai, Y., & Lim, L. H. (2022). Distances between probability distributions of different dimensions. IEEE Transactions on Information Theory, 68(6), 4020–4031.
Cao, Y., Li, S., Liu, Y., Yan, Z., Dai, Y., Yu, P. S., & Sun L. (2023) A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to chatGPT. arXiv preprint arXiv:2303.04226
Chizat, L., Roussillon, P., Léger, F., Vialard, F. X., & Peyré, G. (2020). Faster Wasserstein distance estimation with the Sinkhorn divergence. In Advances in neural information processing systems (vol. 33, pp. 2257–2269).
Devlin, J., Chang, M. W., Lee, K., & Toutanova K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis: Association for Computational Linguistics
Edelmann, D., Móri, T. F., & Székely, G. J. (2021). On relationships between the Pearson and the distance correlation coefficients. Statistics & Probability Letters, 169(108), 960.
Extance, A. (2023). ChatGPT has entered the classroom: How LLMs could transform education. Nature, 623, 474–477.
Farzindar, A. (2014). Social network integration in document summarization. In Digital Arts and entertainment: Concepts, methodologies, tools, and applications (pp. 746–769). IGI Global
Ghosh, S., Baker, D., Jurgens, D., & Prabhakaran, V. (2021) Detecting cross-geographic biases in toxicity modeling on social media. In Proceedings of the seventh workshop on noisy user-generated text (W-NUT 2021), (pp. 313–328).
Go, A., Bhayani, R., & Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N project report, Stanford 1(12):2009.
González, A. V., Barrett, M., Hvingelby, R., Webster, K., & Søgaard, A. (2020). Type b reflexivization as an unambiguous testbed for multilingual multi-task gender bias. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 2637–2648).
Hershcovich, D., Frank, S., Lent, H., de Lhoneux, M., Abdou, M., Brandl, S., Bugliarello, E., Piqueras, L. C., Chalkidis, I., Cui, R., & Fierro, C. (2022). Challenges and strategies in cross-cultural NLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 6997–7013).
Kachru, B. B. (1985). Standards, codification and sociolinguistic realism: The English language in the outer circle. Cambridge: Cambridge University Press.
Kumar, V., Bhotia, T. S., Kumar, V., & Chakraborty, T. (2020). Nurse is closer to woman than surgeon? mitigating gender-biased proximities in word embeddings. Transactions of the Association for Computational Linguistics, 8, 486–503.
Lauscher, A., Lueken, T., & Glavaš, G. (2021). Sustainable modular debiasing of language models. In Findings of the Association for Computational Linguistics: EMNLP, 2021 (pp. 4782–4797).
Leteno, T., Gourru, A., Laclau, C., & Gravier, C. (2023). An investigation of structures responsible for gender bias in BERT and DistilBERT. In International symposium on intelligent data analysis (pp. 249–261). Springer.
Liu, G. (2022). The world’s smartest artificial intelligence just made its first magazine cover. Cosmopolitan
Loveys, K., Torrez, J., Fine, A., Moriarty, G., & Coppersmith, G., (2018) Cross-cultural differences in language markers of depression online. In Proceedings of the fifth workshop on computational linguistics and clinical psychology: from keyboard to clinic (pp. 78–87).
Ma, W., Datta, S., Wang, L., & Vosoughi, S. (2022). EnCBP: A new benchmark dataset for finer-grained cultural background prediction in English. In Findings of the Association for Computational Linguistics: ACL, 2022 (pp. 2811–2823).
Nadeem, M., Bethke, A., & Reddy, S (2021) StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers) (pp. 5356–5371).
OpenAI (2023) ChatGPT. https://chat.openai.com/chat
Peyré, G., Cuturi, M., et al (2017) Computational optimal transport. Center for Research in Economics and Statistics Working Papers (2017-86)
Santambrogio, F. (2015). Optimal transport for applied mathematicians. Birkäuser, NY, 55(58–63), 94.
Shawar, B. A., & Atwell, E. (2007). Chatbots: Are they really useful? Journal for Language Technology and Computational Linguistics, 22(1), 29–49.
Stanczak, K., & Augenstein, I. (2021) A survey on gender bias in natural language processing. arXiv preprint arXiv:2112.14168
Sun, J., Ahn, H., Park, C. Y., Tsvetkov, Y., & Mortensen, D. R. (2021). Cross-cultural similarity features for cross-lingual transfer learning of pragmatically motivated tasks. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online (pp. 2403–2414).
Sun, T., He, J., Qiu, X., & Huang, X. (2022) BERTScore is unfair: On social bias in language model-based metrics for text generation. In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 3726–3739)
Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769–2794.
Tan, S., Joty, S., Kan, M. Y., & Socher, R. (2020) It’s morphin’ time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (pp. 2920–2935).
Villani, C. (2009). Optimal transport: Old and new (Vol. 338). Springer.
Wolfe, R., & Caliskan, A. (2021) Low frequency names exhibit bias and overfitting in contextualizing language models. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 518–532).
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., & Klingner, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144
Zhang, S., Zhang, X., Zhang, W., & Søgaard, A. (2021). Sociolectal analysis of pretrained language models. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 4581–4588).
Zhao, J., Wang, T., Yatskar, M., Cotterell, R., Ordonez, V. & Chang, K. W. (2019) Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Volume 1 (Long and Short Papers) (pp. 629–634). Association for Computational Linguistics.
Zhu, L., Gao, S., Pan, S. J., Li, H., Deng, D. & Shahabi, C., (2013) Graph-based informative-sentence selection for opinion summarization. In Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining (pp. 408–412).
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. Katharina Dost is funded by a Ministry of Business, Innovation & Employment New Zealand Smart Idea grant.
Author information
Contributions
Jiachen Lyu designed and wrote the experiment, wrote initial draft. Katharina Dost contributed to figures, provided suggestions and edited draft. Jörg Wicker and Yun Sing Koh provided suggestions and edited draft.
Ethics declarations
Conflict of interest
We do not declare any conflicts.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Sarunas Girdzijauskas.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Preliminaries
1.1 Wasserstein Distance and Sinkhorn Divergence
Wasserstein distance is a measure derived from the optimal transport problem, estimating the effort of transforming one shape into another. The Wasserstein distance of the \(m^{th}\) moment, measuring the difference between two probability distributions \(Q_{S_{i}}\) and \(Q_{S_{j}}\) on \({\mathbb {R}}^d\), is defined as:

$$ \text {Wasserstein}_{m}(Q_{S_{i}}, Q_{S_{j}}) = \left( \inf _{\pi \in \Pi (Q_{S_{i}},Q_{S_{j}})} {\mathbb {E}}_{(u,v) \sim \pi }\left[ \Vert u - v \Vert ^{m}\right] \right) ^{1/m}, $$

where \(\Pi (Q_{S_{i}},Q_{S_{j}})\) represents the set of joint distributions with marginals \(Q_{S_{i}}\) and \(Q_{S_{j}}\), and u, v are random variables drawn from the joint distribution.
The square of the second moment of the Wasserstein distance (\(\text {Wasserstein}^{2}_{2}\)) can utilize the geometric features of the distributions (Chizat et al., 2020; Villani, 2009; Santambrogio, 2015; Peyré et al., 2017). Then the function becomes:

$$ \text {Wasserstein}^{2}_{2}(Q_{S_{i}}, Q_{S_{j}}) = \inf _{\pi \in \Pi (Q_{S_{i}},Q_{S_{j}})} {\mathbb {E}}_{(u,v) \sim \pi }\left[ \Vert u - v \Vert _{2}^{2}\right] . $$
However, the computation cost for \(\text {Wasserstein}^{2}_{2}\) is expensive. The Sinkhorn divergence computation, proposed by Chizat et al. (2020), approximates \(\text {Wasserstein}^{2}_{2}\) at a lower computation cost. Due to the properties of convex optimization, the Wasserstein distance problem can be transformed into the following entropy-regularized form:

$$ \text {Wasserstein}^{2}_{2,\beta }(Q_{S_{i}}, Q_{S_{j}}) = \inf _{\pi \in \Pi (Q_{S_{i}},Q_{S_{j}})} {\mathbb {E}}_{(u,v) \sim \pi }\left[ \Vert u - v \Vert _{2}^{2}\right] + \beta H(S_{i}, S_{j}), $$

where \(\beta H(S_i, S_j)\) is the constraint (regularization) term for the optimization problem, which is the optimal transport cost plan problem. Then the Sinkhorn distance is defined as (Chizat et al., 2020):

$$ \text {Sinkhorn}_{\beta }(Q_{S_{i}}, Q_{S_{j}}) = \text {Wasserstein}^{2}_{2,\beta }(Q_{S_{i}}, Q_{S_{j}}) - \frac{1}{2}\left( \text {Wasserstein}^{2}_{2,\beta }(Q_{S_{i}}, Q_{S_{i}}) + \text {Wasserstein}^{2}_{2,\beta }(Q_{S_{j}}, Q_{S_{j}})\right) . $$
Justification of choice The Kullback–Leibler (KL) divergence:

$$ KL(Q_{S_{i}} \Vert Q_{S_{j}}) = {\mathbb {E}}_{w \sim Q_{S_{i}}}\left[ \log \frac{Q_{S_{i}}(w)}{Q_{S_{j}}(w)}\right] $$

and the Jensen-Shannon (JS) divergence:

$$ JS(Q_{S_{i}}, Q_{S_{j}}) = \frac{1}{2} KL\left( Q_{S_{i}} \Big \Vert \frac{Q_{S_{i}}+Q_{S_{j}}}{2}\right) + \frac{1}{2} KL\left( Q_{S_{j}} \Big \Vert \frac{Q_{S_{i}}+Q_{S_{j}}}{2}\right) $$
are two of the popular measures for the difference between two distributions. JS divergence remedies two shortcomings of KL divergence: its lack of (1) symmetry and (2) guaranteed finiteness (\(KL < \infty \)) in all cases. Both KL and JS have vanishing gradients when \(|\{w_{S_{i}}:w_{S_{i}} \sim Q_{{S_{i}}}\} \cap \{w_{S_{j}}: w_{S_{j}} \sim Q_{S_{j}}\}| < \epsilon \) for an arbitrarily small number \(\epsilon > 0\), as they are derived from cross-entropy.
Recall that the Wasserstein distance, by definition, is the solution of a convex optimization problem. The subgradient of a convex function always exists, and the function reaches its optimum when the subgradient is 0. In other words, the Wasserstein distance exists for any two distributions. Therefore, this experiment uses the 1-Wasserstein distance to measure the domain distance or distribution variation before and after the proposed mapping.
1.2 Linear discriminant analysis (LDA)
Linear discriminant analysis (LDA) is a statistical method for identifying the features that distinguish groups; a Bayes classifier is approximated based on Bayes' theorem. LDA derives new variables, each a linear combination of the original features, that maximize the F-ratio: the ratio of the between-group variance to the within-group variance. In the scenario of discriminating two classes, the objective is:
$$\max _{\alpha }\; \widetilde{F}(\alpha ) = \frac{\alpha ^{\top } Scatter_{Between}\,\alpha }{\alpha ^{\top } Scatter_{Within}\,\alpha }, \qquad Scatter_{Between} = (\mu _i - \mu _j)(\mu _i - \mu _j)^{\top }, \quad Scatter_{Within} = Scatter_i + Scatter_j,$$
where \(\mu _i\), \(\mu _j\) denote the class means; \(Scatter_i\) and \(Scatter_j\) denote the class scatter matrices; \(Scatter_{Between}\) and \(Scatter_{Within}\) denote the between-group and within-group scatter matrices, respectively; \(\alpha \) denotes the coefficient vector of the canonical variate derived from the original variables; and the symbol "\(\widetilde{}\)" refers to quantities in the projected, dimension-reduced space.
The process is similar when the number of class labels is greater than 2: a projection matrix, rather than a projection vector, projects the data classes into the derived subspace. As a result, the columns of the optimal projection matrix are the eigenvectors corresponding to the largest eigenvalues associated with the canonical variates.
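As a minimal sketch of this projection (the data below is random placeholder data, not the paper's embeddings; scikit-learn is an assumption, not the authors' stated tooling):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 768))                                  # stand-in for BERT embeddings
y = rng.choice(["ca", "au", "gb", "nz", "us", "za"], size=600)   # region labels

lda = LinearDiscriminantAnalysis()        # fits at most n_classes - 1 canonical variates
Z = lda.fit_transform(X, y)               # projection onto the LDA subspace
predicted_regions = lda.predict(X)        # the implied Bayes classifier over regions
```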
1.3 Distance correlation
Correlation is a statistical metric showing the dependency between two variables. The dependency can be observed through their variation when explanatory factors change; for instance, two variables have a positive dependency when they vary in the same direction, and vice versa. The value of the correlation r represents the strength of the dependency: the more similar their variation, the larger its absolute value. This paper aims to observe how the performance of different region groups changes when models of differing quality are applied.
This paper chooses distance correlation to measure the dependency of the task performances \(T_i, T_j\) of different region groups instead of correlation, as \((1-correlation)\) fails to meet the triangle inequality (Székely et al., 2007; Edelmann et al., 2021), one of the properties of a metric. The distance correlation is defined as the normalized version of the distance covariance:
$$dCor(T_i, T_j) = \frac{dCov(T_i, T_j)}{\sqrt{dCov(T_i, T_i)\, dCov(T_j, T_j)}},$$
where the distance covariance is defined as:
$$dCov^{2}(T_i, T_j) = {\mathbb {E}}\big[\Vert T_i - T_i'\Vert \,\Vert T_j - T_j'\Vert \big] + {\mathbb {E}}\big[\Vert T_i - T_i'\Vert \big]\,{\mathbb {E}}\big[\Vert T_j - T_j'\Vert \big] - 2\,{\mathbb {E}}\big[\Vert T_i - T_i'\Vert \,\Vert T_j - T_j''\Vert \big],$$
with \((T_i', T_j')\) and \((T_i'', T_j'')\) denoting independent copies of \((T_i, T_j)\).
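The sample version of this quantity can be computed directly; the sketch below (illustrative only, for one-dimensional performance vectors such as per-model accuracy scores) uses the standard double-centering estimator:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D performance vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])              # pairwise distances within x
    b = np.abs(y[:, None] - y[None, :])              # pairwise distances within y
    # double-centre both distance matrices
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                           # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()  # squared distance variances
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))
```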
1.4 Relation measurement
To measure the relation between the metric space of \(D_{s}(\Phi (W_{i}), \Phi (W_{j}))\) and the metric space of \(D_{t}(\Omega (\Phi (W_{i})), \Omega (\Phi (W_{j})))\), we use Spearman rho, \(\rho \), and Kendall tau, \(\tau \), two non-parametric statistical methods, to measure the strength of the relation between the two metric spaces. Spearman rho calculates the correlation coefficient based on the rankings of the variables instead of their values. Let X, Y denote the random variables from the spaces \((D_{s},\Phi (W_i))\) and \((D_{t}, \Omega (\Phi (W_i)))\), respectively. Spearman rho is defined as:
$$\rho = 1 - \frac{6 \sum _{k=1}^{n} d_{k}^{2}}{n(n^{2}-1)},$$
where \(d_{k}\) is the difference between the ranks of the \(k^{th}\) pair of observations of X and Y, and n is the number of pairs.
Meanwhile, Kendall tau subtracts the number of pairs ordered in opposite directions (discordant pairs) from the number of pairs ordered in the same direction (concordant pairs), over all pairs of observations, and is defined as:
$$\tau = \frac{n_{c} - n_{d}}{n(n-1)/2},$$
where \(n_{c}\) and \(n_{d}\) denote the numbers of concordant and discordant pairs, respectively.
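Both statistics and their p-values are available in SciPy; the sketch below is illustrative only (the distance and gap values are made up, not taken from the experiments):

```python
from scipy.stats import spearmanr, kendalltau

# d_embed: pairwise distances D_s between regional embedding distributions
# d_perf:  the corresponding pairwise performance gaps D_t (same ordering)
d_embed = [0.12, 0.31, 0.08, 0.54, 0.27]
d_perf  = [0.010, 0.034, 0.006, 0.051, 0.029]

rho, p_rho = spearmanr(d_embed, d_perf)    # rank-based correlation
tau, p_tau = kendalltau(d_embed, d_perf)   # concordant vs. discordant pairs
print(rho, p_rho, tau, p_tau)
```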
Appendix 2: Implementation details
1.1 Data preprocessing
For Sentiment140, we randomly select a sample of 100,000 tweets from the entire dataset for the extraction of regional data. The location data extraction took 128 h in total. To transform the location data into region labels, the Nominatim API in the 'geopy' package is employed. This API retrieves the country code from the location content, which may contain either a district name or GPS coordinates. In instances where location fields contain free-text descriptions, the Stanford NER (Named Entity Recognition) tagger is used to identify the place name, which is then sent to Nominatim for country-code matching. In total, 36,555 tweets are identified with valid geographic information. The whole process took 60 h.
This experiment aims to uncover internal distinctions within BERT among 'inner-circle' English speakers (Kachru, 1985). Consequently, only data from Canada, Australia, New Zealand, the UK, the US, and South Africa is retained. Using the identified country codes, only rows with the country codes 'ca', 'au', 'gb', 'nz', 'us', and 'za' are kept, representing Canada ('ca'), Australia ('au'), the United Kingdom ('gb'), New Zealand ('nz'), the U.S. ('us'), and South Africa ('za'), respectively.
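A rough sketch of this filtering step is shown below. It is not the authors' code: it covers only the free-text location case (the GPS-coordinate and Stanford NER paths described above are omitted), the user-agent string and the example tweets are invented, and Nominatim's rate limits are not handled.

```python
from geopy.geocoders import Nominatim

INNER_CIRCLE = {"ca", "au", "gb", "nz", "us", "za"}
geocoder = Nominatim(user_agent="regional-bias-study")   # illustrative user agent

def country_code(location_text):
    """Resolve a free-text Twitter location field to an ISO country code (or None)."""
    match = geocoder.geocode(location_text, addressdetails=True, language="en")
    if match is None:
        return None
    return match.raw.get("address", {}).get("country_code")

tweets = [{"text": "lovely day out", "location": "Auckland, New Zealand"},
          {"text": "stuck in traffic", "location": "Mumbai, India"}]

# keep only tweets whose resolved country code belongs to an inner-circle region
tweets = [t for t in tweets if country_code(t["location"]) in INNER_CIRCLE]
```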
1.2 Hyperparameters
See Table 3.
Appendix 3: Additional results
1.1 NZ-ZA distance in LDA space
Figure 11 shows an example of the short-text BERT embeddings of different region data juxtaposed against their distances in the LDA space. The embedding difference between the New Zealand and South Africa data is comparatively small (to the left of the plot). To keep subsequent plots legible, we exclude the NZ–ZA comparison from most of them.
1.2 Validation of the LDA space
LDA is trained to maximize the separation between country groups, but there is a possibility that this process might result in the loss of essential information about the original factors. We compare the distances between countries in BERT embedding space (x-axis) and in LDA space (y-axis) to address this concern, as shown in Fig. 13. The LDA space retains information from BERT embeddings when a clear and interpretable relationship exists.
This subsection focuses only on Sentiment140, as LDA encounters problems with the small sample size of Reuters21578. Recall that there are two types of BERT embeddings for the sentiment analysis data: short text and long text. For short-text data, applying the model directly reveals explicit feature differences in the embeddings. Introducing artificial long-text data to mitigate the sparsity of the embeddings can unveil implicit features captured by the language model that are not apparent in sparse embeddings.
Figure 12 illustrates how regional data is distributed in the projected space by predicting region labels with short-text data and long-text data. LDA encounters challenges in distinguishing groups when applied to short text embeddings due to their sparsity and limited information content. However, LDA demonstrates an ability to differentiate between various country groups when dealing with long text embeddings. This observation suggests that long-text embeddings, formed by concatenating short texts, have the capacity to capture a broader range of features, potentially encompassing factors associated with specific countries. Therefore, this discussion will only rely on long text data when referring to the LDA method.
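As a rough sketch of how such artificial long-text embeddings could be constructed (the grouping size, the mean-pooling strategy, and the bert-base-uncased checkpoint are illustrative assumptions, not necessarily the paper's exact setup):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def long_text_embeddings(tweets, group_size=10):
    """Concatenate consecutive short texts into artificial long documents and embed them."""
    docs = [" ".join(tweets[i:i + group_size]) for i in range(0, len(tweets), group_size)]
    enc = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state        # (n_docs, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)        # ignore padding when averaging
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```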
Figure 13a demonstrates that the LDA results are related to the BERT embedding differences. It shows that the baseline data, which includes Twitter users from all countries and regions, is significantly different from the regional data. Figure 13b zooms in on the bottom left part of Fig. 13a. The small p-values of Spearman rho and Kendall tau confirm the positive relationship between LDA distance and BERT embedding difference. Two regional data clusters lie further apart in LDA space if their difference in BERT embedding space increases, and vice versa (Figs. 14, 15).
1.3 BERT versus performance results
This subsection explores how the BERT embedding space can influence the final performance metrics of accuracy, AUC, precision, and recall. Section “Sentiment analysis” discusses the results of the sentiment analysis task on the Sentiment140 data, while Section “Multi-label classification” discusses the results of the multi-label classification task on the Reuters21578 data.
1.3.1 Sentiment analysis
Table 4 summarizes the p-values of the significance tests for the detected correlation relationships. Significant p-values are bolded. Variation in short-text BERT embeddings is related to the precision score, while the difference in long-text BERT embeddings, which contain more regional information, is related to the accuracy and AUC results. Figure 16 illustrates the significant relationship between the difference in short- or long-text BERT embeddings and the variation in performance. The y-axis indicates the proximity of the embedding distributions of two data groups, and the x-axis shows the size of the discrepancy in the evaluation metric. For instance, in Fig. 16a, groups in the bottom left corner are relatively similar to each other, whereas groups in the top right corner are more distinct. In other words, Fig. 16a suggests that Canadian English and American English, as well as British English and American English, are similar both in their short-text BERT embedding distributions and in their precision for predicting negative labels. It also indicates that Canadian English and Kiwi English, along with American English and Kiwi English, show lower similarity in their BERT distributions and in their predictive precision for negative labels. Similarly, Fig. 16b depicts the positive relationship between the precision score for positive sentiment prediction and the short-text BERT embedding distribution. The long-text data here is believed to illustrate nuanced geographic features. The baseline data is obtained with the train-test split method, implying that it comes from the same distribution as the training data. The baseline data clusters in the top right corner, showing that it is significantly different from the regional data and has a larger performance gap with them (around \(2\%\) to \(4\%\)) (Fig. 16).
In summary, Fig. 16 demonstrates that the variation in the short-text embedding distribution is linked to the precision of prediction for both labels. The positive correlations in the plot indicate a larger gap in precision score (right in a subplot) when the embedding difference is larger (top in a subplot). Long-text data highlights group features: the greater the difference in long-text BERT embeddings, the larger the gap in accuracy and AUC tends to be across arbitrary pairs of comparison groups. Group-level feature distinctions are described further in Section “LDA space findings”.
1.3.2 Multi-label classification
Similar patterns can also be observed in Reuters21578 for the multi-label classification task. In contrast to the Sentiment140 data, which consists of tweets, the news articles in Reuters21578 do not require additional artificial data construction because their document length is already satisfactory. AUC is discarded as an evaluation metric for the multi-label setting. Figure 17 illustrates the correlation between the disparities in BERT embeddings and the variations in performance. It can be segmented into three distinct sections: the lower left, the upper right, and the upper left. In the lower-left quadrant, data from Canada, the United States, and multiple areas converge, suggesting that news publications from these regions lie close together in the BERT embedding space. Furthermore, they show a smaller performance gap in the multi-label classification task. Conversely, the upper-right corner is dominated by data from the United Kingdom, indicating significant dissimilarity from the Canada-US-Multiple group in BERT embeddings along with a notable performance gap.
In the upper-left quadrant, the data from Australia forms clusters. Although Australia exhibits a smaller performance gap, its BERT embeddings significantly differ from those of the Canada-US-Multiple group. This behavior contrasts with the UK data. One plausible explanation for this phenomenon is the relatively small size of the Australian dataset (35 articles), as shown in Fig. 2. The statistical significance of the relationship emerges upon excluding results from Australia. Given the limitations imposed by the data size in the Reuters21578 dataset, one can infer that a positive correlation exists between BERT differences and performance gaps in multi-label classification tasks.
1.3.3 Summary
Overall, there is a relationship between the BERT difference and the performance gap (accuracy difference and precision). If the language features in two regions are similar, the model tends to perform similarly on them (a smaller performance gap). We can infer that model performance will be overestimated if the language data from one region has features distinct from mainstream benchmark datasets; an undesirable performance drop might then occur on this dataset, as demonstrated in Figs. 9 and 10. Specifically, the variation in BERT embeddings, which influences predictions, exhibits a positive association with the precision scores: as the disparity in BERT embeddings increases, so does the gap in precision scores. Notably, the Reuters21578 dataset comprises news text with a length of approximately 90–100, closely resembling the artificial long-text data from Sentiment140, which ranges around 100–140. Both long-text embeddings appear to be linked to accuracy, where a larger difference in long-text embeddings corresponds to a greater accuracy gap. However, no discernible pattern has been identified for recall scores yet.
1.4 BERT versus performance distance correlation
The performance distance correlation gauges the fluctuation in performance across various models. It measures the dependence between region datasets as they respond to different models; in other words, it examines whether a model exhibits similar enhancements on one dataset when it demonstrates improvement on another dataset. This subsection discusses whether the bias affects the change in model performance. When two datasets are similar, their performances are anticipated to align closely. Consequently, the distance correlation is expected to be higher when they improve or deteriorate to a comparable extent for each fine-tuned model. This subsection excludes the Reuters21578 dataset, as only one model is trained on it.
Figure 18 validates the distance correlation as a measurement for fluctuation in performance. Not surprisingly, they have a significant relationship with each other. Distance correlation gauges the similarity in the way performance changes and how much it increases or decreases when various models are applied to the same test dataset. If a more proficient model is applied to the dataset, distance correlation can reveal how similar the growth in the evaluation metric is.
Figure 19 illustrates the connection between the BERT distribution difference and the distance correlation. In general, a negative association exists between them. To illustrate, consider Fig. 19a as an example. In the top right corner, Australian English and Kiwi English, as well as Canadian English and Kiwi English, have a lower distance correlation while their distance in the long-text BERT embedding space is large. On the other hand, Mixed English and American English, as well as British English and American English, display a stronger distance correlation in their performance, increasing together by a similar degree, when they are close in the long-text BERT embedding space. When two datasets are more similar in the long-text BERT embedding space, their performance distance correlation tends to be higher. This suggests a parallel fluctuation in performance across various fine-tuned models when two datasets are similar, and vice versa.
In summary, the performance variation resulting from different models exhibits an inverse connection with the disparity in the BERT distributions of the long-text corpus: when the BERT distributions differ, the performance varies differently. This observation may raise concerns about predicting performance across diverse datasets. It underscores that a better-trained model does not automatically guarantee improved prediction results for all datasets, owing to the nuances of BERT distribution differences. For example, in Fig. 19a, the baseline data clusters in the top-left corner of the plot, indicating a distance correlation in the range of 0.3–0.5. The components of the baseline data closely align with real-life test data, encompassing users from diverse regions and countries. However, the plot reveals only a weak to moderate correlation with our regional datasets. This suggests that the performance of the regional data across various models, fine-tuned with the baseline training data, does not improve or deteriorate as significantly as the performance on the baseline data itself. Moreover, the feature differences captured by long-text embeddings are related to the variation in distance correlation. Section “LDA space findings” further discusses how regional features expressed by the long-text corpus relate to performance.
1.5 LDA space findings
This subsection discusses the findings from the feature subspace found by LDA.
Recall that the LDA space represents a specialized feature subspace within the artificial long-text sentiment data, emphasizing maximized distances between regional groups. The outcomes derived from the LDA space align closely with the results obtained through the long-text BERT analysis, as both approaches explore relationships within extensive textual data. The findings in this subsection corroborate the connection between the regional features expressed in long-text data and the final performance outcomes. The analysis of the LDA findings is divided into two parts: the performance gap (accuracy and AUC) and the distance correlation of the performance evaluation.
Figure 20 demonstrates the positive relationship between the distance in LDA space and the performance gap. For example, in Fig. 20b, Kiwi English and British English, as well as Australian English and British English, are situated in the lower-left corner of the plot, indicating their proximity in the projected feature space. This suggests that they are closely aligned in terms of regional features, and their gap in the AUC metric is correspondingly small. In contrast, most comparisons with the baseline data, which comprises a mix of all English-speaking Twitter users, are positioned in the upper-right corner. This suggests that the regional data diverges significantly from the baseline data in the regional feature space, with larger performance gaps. In summary, the closer two region groups are in the LDA feature space, the smaller their accuracy or AUC score gaps tend to be. This implies that when two sets of regional data exhibit similar regional features, their performance gaps are expected to be comparatively small.
Figure 21 illustrates the relationship between the distance in LDA space and the distance correlation of performance. Recall that the distance correlation reflects the degree of synchronization across various fine-tuned models. The negative relationship in the plots implies that the distance correlation is small when the distance in LDA space is large. Region groups with more distinct regional features, positioned farther apart in the LDA space, tend to exhibit more pronounced differences in performance behavior when subjected to diverse fine-tuned models. As an illustration, the baseline data group clusters in the top left corner, indicating its substantial separation from the other region groups in the LDA space. The majority of this cluster displays a distance correlation smaller than 0.5, signifying a weak dependency. This weak dependency implies that the model performance on region-group data often diverges from that on the baseline data, which serves as the standard training and testing data, when applied to various fine-tuned models.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lyu, J., Dost, K., Koh, Y.S. et al. Regional bias in monolingual English language models. Mach Learn 113, 6663–6696 (2024). https://doi.org/10.1007/s10994-024-06555-6