Fair Max–Min Diversity Maximization in Streaming and Sliding-Window Models †
Figure 1. Comparison of (a) max–sum dispersion (MSD) and (b) max–min dispersion (MMD) for diversity maximization on a dataset of one hundred points. Circles denote all points in the dataset; crossmarks denote the points selected based on MSD and MMD.
Figure 2. Comparison of (a) unconstrained max–min diversity maximization and (b) fair max–min diversity maximization. A set of individuals, each described by two attributes, is partitioned into two disjoint groups, red and blue. Fair diversity maximization returns a subset of size 10 that maximizes diversity in terms of attributes and contains an equal number (i.e., k_i = 5) of elements from both groups.
Figure 3. Illustration of the SFDM1 algorithm. During stream processing, one group-blind and two group-specific candidates are maintained for each guess μ of OPT_f. Then, a subset of group-blind candidates is selected for post-processing by adding elements from the under-filled group before deleting elements from the over-filled one.
Figure 4. Illustration of post-processing in SFDM2. For each μ ∈ U′, an initial S′_μ is first extracted from S_μ by removing the elements from over-filled groups. Then, the elements in all candidates are divided into clusters. The final S′_μ is augmented from the initial solution by adding new elements from under-filled groups based on matroid intersection.
Figure 5. Illustration of the framework of the sliding-window algorithms. During stream processing, two candidate solutions A_{λ,μ} and B_{λ,μ}, along with their backups A′_{λ,μ} and B′_{λ,μ}, are maintained for each guess (λ, μ) of OPT[W]. Then, during post-processing, the elements in B_{λ,μ} and A_{λ,μ} (or the non-expired elements in A′_{λ,μ} if A_{λ,μ} has expired) are passed to an existing algorithm for solution computation.
Figure 6. Performance of SFDM1 and SFDM2 with varying parameter ε on (a) Adult (Sex, m = 2), (b) CelebA (Sex, m = 2), (c) Census (Sex, m = 2), and (d) Lyrics (Genre, m = 15) when k = 20.
Figure 7. Solution quality of different algorithms in the streaming setting with varying solution size k. The diversity values of GMM are plotted as gray lines to illustrate the “price of fairness”, i.e., the loss in diversity caused by incorporating the fairness constraints.
Figure 8. Update time of different algorithms in the streaming setting with varying solution size k.
Figure 9. Solution quality and update time on synthetic datasets in the streaming setting with varying dataset size n and number of groups m (k = 20).
Figure 10. Comparison of different algorithms on Adult for equal representation (ER) and proportional representation (PR) when k = 20.
Figure 11. Solution quality of different algorithms in the sliding-window setting with varying solution size k (w = 25 k for Adult and 100 k for the others). The diversity values of GMM are also plotted as gray lines to illustrate the “price of fairness”.
Figure 12. Update time of different algorithms in the sliding-window setting with varying solution size k (w = 25 k for Adult and 100 k for the others).
Figure 13. Solution quality and update time on synthetic datasets in the sliding-window setting with varying window size w and number of groups m (k = 20).
Abstract
1. Introduction
- Our Contributions: In this paper, we propose novel streaming and sliding-window algorithms for the max–min diversity maximization problem with fairness constraints. Our main contributions are summarized as follows:
- We formally define the problem of fair max–min diversity maximization (FDM) in metric spaces. Then, we describe the existing streaming and sliding-window algorithms for (unconstrained) max–min diversity maximization [14]. In particular, we improve the approximation ratio of the existing streaming algorithm from to for any parameter by refining the analysis of [14].
- We propose two novel streaming algorithms for FDM. Our first algorithm, called SFDM1, is -approximate for FDM when there are two groups in the dataset. It takes time per element during stream processing, where is the ratio of the maximum and minimum distances between any pair of elements, spends time for post-processing, and stores elements in memory. Our second algorithm, called SFDM2, is -approximate for FDM with an arbitrary number m of groups. SFDM2 also takes time per element during stream processing but requires a longer time for post-processing and stores elements in memory.
- We further extend our two streaming algorithms to the sliding-window model. The extended SWFDM1 and SWFDM2 algorithms achieve approximation factors of and for FDM with and an arbitrary m when any -approximation algorithm for unconstrained max–min diversity maximization is used for post-processing. Additionally, their time and space complexities increase by a factor of compared with SFDM1 and SFDM2, respectively.
- Finally, we evaluate the performance of our proposed algorithms against the state-of-the-art algorithms on several real-world and synthetic datasets. The results demonstrate that our algorithms provide solutions of comparable quality for FDM to those returned by the state-of-the-art algorithms while running several orders of magnitude faster in the streaming and sliding-window settings.
- Paper Organization: The rest of this paper is organized as follows. The related work is reviewed in Section 2. In Section 3, we introduce the basic concepts and formally define the FDM problem. In Section 4, we first propose our streaming algorithms for FDM. In Section 5, we further design our sliding-window algorithms for FDM. Our experimental setup and results are described in Section 6. Finally, we conclude the paper in Section 7.
2. Related Work
3. Preliminaries
4. Streaming Algorithms
4.1. (Unconstrained) Streaming Algorithm
Algorithm 1 SDM
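To make the threshold-based scheme that SDM follows concrete, the sketch below maintains, for each guess μ on a geometric grid between the smallest and largest pairwise distances, a candidate set whose pairwise distances are all at least μ; an arriving element joins a candidate only if it is far enough from everything already kept. This is an illustrative sketch under our own naming and simplifications, not the paper's exact pseudocode.

```python
import math

def sdm_stream(stream, k, dmin, dmax, eps, dist):
    # Geometric grid of guesses mu for the (unknown) optimal diversity,
    # spaced by a factor of (1 + eps) between dmin and dmax.
    num = math.ceil(math.log(dmax / dmin, 1 + eps))
    guesses = [dmin * (1 + eps) ** i for i in range(num + 1)]
    # One candidate set per guess; a set accepts an element only if it is
    # at distance >= mu from every element already kept.
    cands = {mu: [] for mu in guesses}
    for x in stream:
        for mu, S in cands.items():
            if len(S) < k and all(dist(x, y) >= mu for y in S):
                S.append(x)
    # Report the largest candidate, breaking ties by minimum pairwise distance.
    def div(S):
        return min((dist(a, b) for i, a in enumerate(S) for b in S[i + 1:]),
                   default=float("inf"))
    return max(cands.values(), key=lambda S: (len(S), div(S)))
```

On one-dimensional points 0..9 with k = 3, for instance, the candidate kept for the guess μ = 3.375 is {0, 4, 8}, whose diversity of 4 beats every other full candidate.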
4.2. Fair Streaming Algorithm for m = 2
Algorithm 2 SFDM1
- Theoretical Analysis: We prove that SFDM1 achieves an approximation ratio of for FDM, where , in Theorem 2. The proof is based on (i) the existence of such that (Lemma 1) and (ii) for each after post-processing (Lemma 2). Then, we analyze the complexity of SFDM1 in Theorem 3.
- Comparison with Prior Art: The idea of finding a solution and balancing it for fairness in SFDM1 has also been used for FairSwap [17]. However, FairSwap only works in the offline setting, which keeps the dataset in memory and requires random accesses for computation, whereas SFDM1 works in the streaming setting, which scans the dataset in one pass and uses only the elements in the candidates for post-processing. Compared with FairSwap, SFDM1 reduces the space complexity from to and the time complexity from to at the expense of lowering the approximation ratio by a factor of .
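The balance-then-swap idea behind this kind of post-processing can be sketched as follows: a group-blind solution is repaired by swapping in spare elements of the under-filled group (drawn from its group-specific candidate) in place of the over-filled group's closest elements. This is an illustrative sketch with our own names and simplifications, not the exact SFDM1 routine.

```python
def balance_two_groups(S, group, quota, backup, dist):
    """Repair a group-blind solution S (m = 2) so each group meets quota[g]."""
    S = list(S)
    counts = {g: sum(1 for x in S if group(x) == g) for g in quota}
    for g, need in quota.items():
        # Spare elements of group g kept during the stream but not yet selected.
        spares = [x for x in backup.get(g, []) if x not in S]
        while counts[g] < need and spares:
            # Add the spare farthest from the current solution, to lose as
            # little diversity as possible ...
            x = max(spares, key=lambda p: min(dist(p, y) for y in S))
            spares.remove(x)
            # ... and evict the other (over-filled) group's element closest to it.
            y = min((z for z in S if group(z) != g), key=lambda p: dist(p, x))
            S.remove(y)
            S.append(x)
            counts[g] += 1
    return S
```

For example, with 1-D points {0, 10, 20, 30}, groups split at 25, quotas of two each, and a backup {26, 40} for the under-filled group, the sketch swaps 40 in and evicts 20.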
4.3. Fair Streaming Algorithm for General m
Algorithm 3 SFDM2
- Matroid Intersection: Next, we describe how to use matroid intersection for solution augmentation in SFDM2. We define the first rank-k matroid based on the fairness constraint, where the ground set V is and iff , . Intuitively, a set S is fair if it is a maximal independent set in . Moreover, we define the second rank-l () matroid on the set of clusters, where the ground set V is also and if , . Accordingly, the problem of adding new elements to to ensure fairness is an instance of the matroid intersection problem, which aims to find a maximum cardinality set for and . Here, we adopt Cunningham’s algorithm [41], a well-known solution for the matroid intersection problem based on the augmentation graph in Definition 2.
Algorithm 4 Matroid Intersection
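The two partition matroids used in the augmentation step, together with a greedy construction of a common independent set, can be sketched as below. The fairness matroid allows at most quota[i] elements per group; the cluster matroid allows at most one element per cluster. The greedy augmentation shown here mirrors the GMM-like farthest-first selection rather than Cunningham's full algorithm, and all names are ours.

```python
def independent_fair(S, group, quota):
    # Independence in the fairness partition matroid: <= quota[g] per group g.
    counts = {}
    for x in S:
        counts[group(x)] = counts.get(group(x), 0) + 1
    return all(c <= quota[g] for g, c in counts.items())

def independent_cluster(S, cluster):
    # Independence in the cluster partition matroid: <= 1 element per cluster.
    seen = set()
    for x in S:
        if cluster(x) in seen:
            return False
        seen.add(cluster(x))
    return True

def greedy_common(cands, S0, group, quota, cluster, dist):
    """Starting from a partial solution S0, greedily add the candidate
    farthest from the current set that keeps both matroids independent."""
    S = list(S0)
    pool = [x for x in cands if x not in S]
    while pool:
        x = max(pool, key=lambda p: min((dist(p, y) for y in S),
                                        default=float("inf")))
        pool.remove(x)
        if (independent_fair(S + [x], group, quota)
                and independent_cluster(S + [x], cluster)):
            S.append(x)
    return S
```

The result is a common independent set of the two matroids; picking the farthest feasible candidate at each step is what favors diversity in the final solution.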
- Theoretical Analysis: We prove that SFDM2 achieves an approximation ratio of for FDM. The high-level idea of the proof is to connect the clustering procedure in post-processing with the notion of matroid and then to utilize the geometric properties of the clusters and the theoretical results of matroid intersection for approximation. Next, we first show that the set of clusters has several important properties (Lemma 3). Then, we prove that Algorithm 4 can return a fair solution for a specific based on the properties of (Lemma 4). Finally, we analyze the time and space complexities of SFDM2 in Theorem 5.
- Case 1: For each , such that , we have for each . Given the optimal solution , we define a function f that maps each to its nearest neighbor in . For two elements in these groups, we have , , and . Therefore, . Since , . According to Property (iii) of Lemma 3, it is guaranteed that and are in different clusters. By identifying all the clusters that contain for all , we find clusters for each group such that . All the clusters found in this way are guaranteed to be distinct.
- Case 2: For all such that , we are able to find k clusters that contain one element from based on Property (ii) of Lemma 3. For such a group i, even though clusters have been identified for all other groups, there are still at least clusters available for selection. Therefore, we can always find clusters that are distinct from all the clusters identified by any other group for such a group .
- Comparison with Prior Art: Existing methods have aimed to find a fair solution based on matroid intersection for fair k-center [21,22,44] and fair max–min diversity maximization [17]. SFDM2 adopts a similar method to FairFlow [17] to construct the clusters and matroids. However, FairFlow solves matroid intersection as a max-flow problem on a directed graph. Its solution is of poor quality in practice, particularly when m is large. Therefore, SFDM2 uses a different method for matroid intersection based on Cunningham’s algorithm, which initializes with a partial solution instead of an empty set for higher efficiency and adds elements greedily like GMM [39] for higher diversity. Hence, SFDM2 has a significantly higher solution quality than FairFlow in practice, though it has a slightly lower approximation ratio.
5. Sliding-Window Algorithms
5.1. (Unconstrained) Sliding-Window Algorithm
Algorithm 5 SWDM
5.2. Fair Sliding-Window Algorithms
Algorithm 6 SWFDM
- Theoretical Analysis: Subsequently, we will analyze the theoretical soundness and complexities of the extended SWFDM1 and SWFDM2 algorithms for FDM in the sliding-window model by generalizing the analyses for SFDM1 and SFDM2 in Section 4.
6. Experiments
6.1. Experimental Setup
- Datasets: Our experiments are conducted on four publicly available real-world datasets, as follows:
- Adult (https://archive.ics.uci.edu/dataset/2/adult, accessed on 12 July 2023) is a collection of records from the 1994 US Census database. We select six numeric attributes as features and normalize each of them to have zero mean and unit standard deviation. The Euclidean distance is used as the distance metric. The groups are generated from two demographic attributes: sex and race. By using them individually and in combination, there are two (sex), five (race), and ten (sex + race) groups, respectively.
- CelebA (https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, accessed on 12 July 2023) is a set of images of human faces. We use 41 pre-trained class labels as features and the Manhattan distance as the distance metric. We generate two groups from sex {‘female’, ‘male’}, two groups from age {‘young’, ‘not young’}, and four groups from their combination, respectively.
- Census (https://archive.ics.uci.edu/dataset/116/us+census+data+1990, accessed on 12 July 2023) is a set of records from the 1990 US Census data. We take 25 (normalized) numeric attributes as features and use the Manhattan distance as the distance metric. We generate 2, 7, and 14 groups from sex, age, and both of them, respectively.
- Lyrics (http://millionsongdataset.com/musixmatch, accessed on 12 July 2023) is a set of documents, each of which is the lyrics of a song. We train a topic model with 50 topics using LDA [45] implemented in Gensim (https://radimrehurek.com/gensim, accessed on 12 July 2023). Each document is represented as a 50-dimensional vector and the angular distance is used as the distance metric. We generate 15 groups based on the primary genres of songs.
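The three distance metrics used across these datasets (Euclidean on Adult, Manhattan on CelebA and Census, angular on Lyrics) are standard; for completeness, minimal implementations (our own code, not the experimental scripts):

```python
import math

def euclidean(u, v):
    # Used on Adult (z-score-normalized numeric attributes).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # Used on CelebA and Census.
    return sum(abs(a - b) for a, b in zip(u, v))

def angular(u, v):
    # Used on Lyrics topic vectors: the angle between u and v in radians.
    cos = (sum(a * b for a, b in zip(u, v))
           / (math.sqrt(sum(a * a for a in u))
              * math.sqrt(sum(b * b for b in v))))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp for float safety
```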
- Algorithms: We compare our streaming algorithms, i.e., SFDM1 and SFDM2, and sliding-window algorithms, i.e., SWFDM1 and SWFDM2, with four existing offline FDM algorithms: the -approximation FairFlow algorithm for an arbitrary m, the -approximation FairGMM algorithm for small k and m, and the -approximation FairSwap algorithm specific for in [17], as well as the -approximation FairGreedyFlow algorithm for an arbitrary m in [20]. Since no implementation of the algorithms in [17,20] is available, we implemented them ourselves, following the descriptions in the original papers. All the algorithms are implemented in Python 3. All experiments were run on a desktop with an Intel® Core™ i5-9500 3.0GHz processor and 32GB RAM running Ubuntu 20.04.3 LTS. Each algorithm was run on a single thread.
- Performance Metrics: The performance of each algorithm is evaluated in terms of efficiency, quality, and space usage. Efficiency is measured as the average update time, i.e., the average wall-clock time used to compute a solution for each arrival element in the stream. Quality is measured by the value of the diversity function of the solution returned by an algorithm. Since computing the optimal diversity of FDM is infeasible, we run GMM [39] for unconstrained diversity maximization to estimate an upper bound of for comparison. Space usage is measured by the number of distinct elements stored by each algorithm. We present only the numbers of elements stored by our proposed algorithms because the offline algorithms must keep all elements in memory for random access, and thus their space usage is always equal to the dataset (or window) size. We run each experiment 10 times with different permutations of the same dataset and report the average of each measure over the 10 runs.
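The GMM baseline used to estimate this upper bound is the classic farthest-point greedy heuristic (Gonzalez-style); a minimal sketch of our own, not the authors' code:

```python
def gmm(points, k, dist):
    # Start from the first point, then repeatedly add the point farthest
    # from the already-selected set (farthest-point greedy).
    S = [points[0]]
    d = [dist(p, S[0]) for p in points]  # distance of each point to S
    while len(S) < k:
        i = max(range(len(points)), key=d.__getitem__)
        S.append(points[i])
        # Update each point's distance to the enlarged selection.
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return S
```

On the 1-D points 0..10 with k = 3, the greedy picks 0, then 10, then 5.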
6.2. Results in Streaming Setting
- Effect of Parameter ε: Figure 6 illustrates the performance of SFDM1 and SFDM2 with different values of ε when k = 20. We range the value of ε from to on Adult, CelebA, and Census and from to on Lyrics. Since the angular distance between any two vectors is at most , larger values of ε (e.g., >0.1) lead to greater estimation errors for and thus significantly lower solution quality on Lyrics. Generally, SFDM1 has higher efficiency and smaller space usage than SFDM2 for different values of ε, but SFDM2 exhibits better solution quality. Furthermore, the running times and numbers of stored elements of both algorithms decrease significantly as the value of ε increases. This is consistent with our analyses in Section 4 because the number of guesses for , and thus the number of candidates maintained by both algorithms, is . A slightly surprising result is that the diversity values of the solutions do not degrade obviously, even when . This can be explained by the fact that both algorithms return the best solution after post-processing among all candidates, which means that they provide good solutions as long as there is some close to . We infer that such a still exists when . Nevertheless, we note that the chance of finding an appropriate value of becomes smaller as the value of ε grows, which results in a less stable solution quality. Therefore, in the experiments for the streaming setting, we always use for both algorithms on all datasets, except Lyrics, where the value of is set to . The impact of ε on the performance of SWFDM1 and SWFDM2 is generally similar to that on SFDM1 and SFDM2. However, since the number of candidate solutions is quadratic with respect to , we use a larger ε for SWFDM1 and SWFDM2 on all datasets except Lyrics, where the value of is set to .
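The trade-off above comes from the size of the geometric guess grid. A quick back-of-the-envelope helper (our own, assuming the usual log base (1 + ε) count of guesses between the minimum and maximum pairwise distances) shows why larger ε cuts both time and space:

```python
import math

def num_guesses(dmin, dmax, eps):
    # Guesses mu on a (1 + eps)-geometric grid covering [dmin, dmax];
    # grows like log(dmax / dmin) / eps, so larger eps means fewer candidates.
    return math.ceil(math.log(dmax / dmin) / math.log(1 + eps)) + 1

# For a 1000x spread of distances, eps = 0.1 needs 74 guesses; eps = 0.25 only 32.
```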
- Overview: Table 2 presents the performance of different algorithms for FDM in the streaming setting on four real-world datasets with different group partitions when the solution size k is fixed to 20. FairGMM is not included because it needs to enumerate up to candidates for solution computation and cannot scale to and . First, compared with the unconstrained solution returned by GMM, all the fair solutions are less diverse because of the additional fairness constraints. Since GMM is a -approximation algorithm and , is an upper bound of , from which we observe that all five fair algorithms return solutions with much better approximation ratios than their lower bounds. In the case of , SFDM1 runs the fastest among all five algorithms, achieving a speed-up of two to four orders of magnitude over FairSwap, FairFlow, and FairGreedyFlow. At the same time, its solution quality is close or equal to that of FairSwap in most cases. SFDM2 shows lower efficiency than SFDM1 due to the higher cost of post-processing. However, it is still much more efficient than the offline algorithms by taking advantage of stream processing. In addition, the solution quality of SFDM2 benefits from the greedy selection procedure in Algorithm 4, and is not only consistently better than that of SFDM1 but also better than that of FairSwap on the Adult and Census datasets. In the case of , SFDM1 and FairSwap are no longer applicable. In addition, FairGreedyFlow cannot finish within one day on the Census dataset, so the corresponding results are omitted. SFDM2 shows significant advantages over FairFlow and FairGreedyFlow in terms of both solution quality and efficiency: it provides up to times more diverse solutions while running several orders of magnitude faster. In terms of space usage, both SFDM1 and SFDM2 store very small fractions of the elements (<0.1% on Census) on all datasets.
SFDM2 stores slightly more elements than SFDM1 because the capacity of each group-specific candidate for group i is set to k instead of . For SFDM2, the number of stored elements increases nearly linearly with m, since the number of candidates is linear in m.
- Effect of Solution Size k: The impact of the solution size k on the performance of different algorithms in the streaming setting is shown in Figure 7 and Figure 8. Here, we vary k in when , when , or when , since we require that an algorithm picks at least one element from each group. For each algorithm, the diversity value drops with k, as the diversity function is monotonically non-increasing. At the same time, the update time grows with k, as the time complexities are linear or quadratic w.r.t. k. Compared with the solutions of GMM, all fair solutions are slightly less diverse when . The gaps in diversity values become more apparent when m is larger. Although FairGMM achieves slightly higher solution quality than any other algorithm when and , it cannot scale to larger k and m due to the enormous cost of enumeration. The solution quality of FairSwap, SFDM1, and SFDM2 is close to each other when , and better than that of FairFlow and FairGreedyFlow. Meanwhile, the efficiencies of SFDM1 and SFDM2 are orders of magnitude higher than those of the offline algorithms. Furthermore, when , SFDM2 outperforms FairFlow and FairGreedyFlow in terms of both efficiency and effectiveness across all k values. However, since the time complexity of SFDM2 is quadratic w.r.t. both k and m, its update time increases drastically with k and might approach that of FairFlow when k and m are large.
- Scalability: We evaluate the scalability of each algorithm in the streaming setting on synthetic datasets by varying the dataset size n from to and the number of groups m from 2 to 20. The results regarding solution quality and update time for different values of n and m when are presented in Figure 9. First of all, SFDM2 shows much better scalability w.r.t. m than FairFlow and FairGreedyFlow in terms of solution quality. The diversity value of the solution of SFDM2 only slightly decreases with m and is up to 3 times higher than those of FairFlow and FairGreedyFlow when . However, its update time increases more rapidly with m due to the quadratic dependence on m. Moreover, the diversity values of different algorithms grow slightly with n but are always close to each other for different values of n when . Finally, the running time of the offline algorithms is linear in n, whereas the update times of SFDM1 and SFDM2 are almost independent of n, as analyzed in Section 4.
- Equal vs. Proportional Representation: Figure 10 compares the solution quality and running time of different algorithms for two popular notions of fairness, i.e., equal representation (ER) and proportional representation (PR), when on Adult with highly skewed groups, where of the records are for males and of the records are for Whites. The diversity value of the solution of each algorithm is slightly higher for PR than for ER, as the solution for PR is closer to the unconstrained one. The running times of SFDM1 and SFDM2 are slightly shorter for PR than for ER, since fewer swapping and augmentation steps are performed on each candidate during post-processing. The results for SWFDM1 and SWFDM2 are similar and thus omitted.
6.3. Results in Sliding-Window Setting
- Overview: Table 3 shows the performance of different algorithms for sliding-window FDM on four real-world datasets with different group settings when the solution size k is fixed to 20 and the window size w is set to 25 k on Adult (as its size is smaller than 100 k) or 100 k on the other datasets. FairGMM is also omitted from Table 3 due to its high complexity. Compared with the streaming setting, the “price of fairness” becomes higher in the sliding-window setting, for two possible reasons. First, the approximation factors of our proposed algorithms are lower. Second, some minor groups contain too few elements in the window when the value of m is large (marginally larger than ), so the selection of elements from such groups is heavily restricted by the fairness constraints. Nevertheless, we still find that all fair algorithms provide solutions with much better approximations than their lower bounds.
- Effect of Solution Size k: The impact of the solution size k on the performance of different algorithms in the sliding-window setting is illustrated in Figure 11 and Figure 12. We use the same values of k as in the streaming setting. The window size w is set to 25 k for Adult and 100 k for the others. For each algorithm, the diversity value drops with k, as the diversity function is monotonically non-increasing. At the same time, the update time grows with k, as the time complexities are linear or quadratic w.r.t. k. The gaps in diversity values between unconstrained and fair solutions are much larger than those in the streaming setting, for the reasons explained in the previous paragraph. The solution quality of SWFDM1 and SWFDM2 is slightly lower than that of FairSwap when , but still better than that of FairFlow and FairGreedyFlow. Meanwhile, their efficiencies are always much higher than those of the offline algorithms. Finally, when , SWFDM2 outperforms FairFlow and FairGreedyFlow in terms of both efficiency and effectiveness across all k values.
- Scalability: We evaluate the scalability of each algorithm in the sliding-window setting on synthetic datasets by varying the number of groups m from 2 to 20 and the window size w from to . The results regarding solution quality and update time for different values of w and m when are presented in Figure 13. First of all, SWFDM2 shows much better scalability w.r.t. m than FairFlow and FairGreedyFlow in terms of solution quality: the diversity value of its solution only slightly decreases with m, whereas those of FairFlow and FairGreedyFlow drop drastically. Nevertheless, the update time of SWFDM2 increases more rapidly with m, since its time complexity is quadratic w.r.t. m. Furthermore, the results on the diversity values of different algorithms with varying w are similar to those with varying k. As expected, the running time of the offline algorithms is nearly linear in w. However, unlike in the streaming setting, the update times of SWFDM1 and SWFDM2 increase with w, because more candidates are non-expired and thus considered in post-processing for a larger value of w.
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Ahmed, M. Data summarization: A survey. Knowl. Inf. Syst. 2019, 58, 249–273.
- Qin, L.; Yu, J.X.; Chang, L. Diversifying Top-K Results. Proc. VLDB Endow. 2012, 5, 1124–1135.
- Zheng, K.; Wang, H.; Qi, Z.; Li, J.; Gao, H. A survey of query result diversification. Knowl. Inf. Syst. 2017, 51, 1–36.
- Gollapudi, S.; Sharma, A. An axiomatic approach for result diversification. In Proceedings of the 18th International Conference on World Wide Web, Madrid, Spain, 20–24 April 2009; pp. 381–390.
- Rafiei, D.; Bharat, K.; Shukla, A. Diversifying web search results. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 781–790.
- Kunaver, M.; Pozrl, T. Diversity in recommender systems—A survey. Knowl. Based Syst. 2017, 123, 154–162.
- Zadeh, S.A.; Ghadiri, M.; Mirrokni, V.S.; Zadimoghaddam, M. Scalable Feature Selection via Distributed Diversity Maximization. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 2876–2883.
- Celis, L.E.; Keswani, V.; Straszak, D.; Deshpande, A.; Kathuria, T.; Vishnoi, N.K. Fair and Diverse DPP-Based Data Summarization. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 715–724.
- Borodin, A.; Lee, H.C.; Ye, Y. Max-Sum diversification, monotone submodular functions and dynamic updates. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Scottsdale, AZ, USA, 21–23 May 2012; pp. 155–166.
- Drosou, M.; Pitoura, E. DisC diversity: Result diversification based on dissimilarity and coverage. Proc. VLDB Endow. 2012, 6, 13–24.
- Abbassi, Z.; Mirrokni, V.S.; Thakur, M. Diversity maximization under matroid constraints. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 32–40.
- Indyk, P.; Mahabadi, S.; Mahdian, M.; Mirrokni, V.S. Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Dallas, TX, USA, 15–17 May 2014; pp. 100–108.
- Ceccarello, M.; Pietracaprina, A.; Pucci, G. Fast Coreset-based Diversity Maximization under Matroid Constraints. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA, 5–9 February 2018; pp. 81–89.
- Borassi, M.; Epasto, A.; Lattanzi, S.; Vassilvitskii, S.; Zadimoghaddam, M. Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Amsterdam, The Netherlands, 1–3 July 2019; pp. 254–268.
- Bauckhage, C.; Sifa, R.; Wrobel, S. Adiabatic Quantum Computing for Max-Sum Diversification. In Proceedings of the 2020 SIAM International Conference on Data Mining, Cincinnati, OH, USA, 7–9 May 2020; pp. 343–351.
- Ceccarello, M.; Pietracaprina, A.; Pucci, G.; Upfal, E. MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension. Proc. VLDB Endow. 2017, 10, 469–480.
- Moumoulidou, Z.; McGregor, A.; Meliou, A. Diverse Data Selection under Fairness Constraints. In Proceedings of the 24th International Conference on Database Theory, Nicosia, Cyprus, 23–26 March 2021; pp. 13:1–13:25.
- Drosou, M.; Pitoura, E. Diverse Set Selection Over Dynamic Data. IEEE Trans. Knowl. Data Eng. 2014, 26, 1102–1116.
- Zhang, G.; Gionis, A. Maximizing diversity over clustered data. In Proceedings of the 2020 SIAM International Conference on Data Mining, Cincinnati, OH, USA, 7–9 May 2020; pp. 649–657.
- Addanki, R.; McGregor, A.; Meliou, A.; Moumoulidou, Z. Improved Approximation and Scalability for Fair Max-Min Diversification. In Proceedings of the 25th International Conference on Database Theory, Online, 29 March–1 April 2022; pp. 7:1–7:21.
- Chiplunkar, A.; Kale, S.; Ramamoorthy, S.N. How to Solve Fair k-Center in Massive Data Models. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1877–1886.
- Jones, M.; Nguyen, H.; Nguyen, T. Fair k-Centers via Maximum Matching. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 4940–4949.
- Kleindessner, M.; Awasthi, P.; Morgenstern, J. Fair k-Center Clustering for Data Summarization. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3448–3457.
- Schmidt, M.; Schwiegelshohn, C.; Sohler, C. Fair Coresets and Streaming Algorithms for Fair k-means. In Proceedings of the 17th International Workshop on Approximation and Online Algorithms, Munich, Germany, 12–13 September 2019; pp. 232–251.
- Huang, L.; Jiang, S.H.; Vishnoi, N.K. Coresets for Clustering with Fairness Constraints. Adv. Neural Inf. Process. Syst. 2019, 32, 7587–7598.
- El Halabi, M.; Mitrović, S.; Norouzi-Fard, A.; Tardos, J.; Tarnawski, J.M. Fairness in Streaming Submodular Maximization: Algorithms and Hardness. Adv. Neural Inf. Process. Syst. 2020, 33, 13609–13622.
- Wang, Y.; Fabbri, F.; Mathioudakis, M. Fair and Representative Subset Selection from Data Streams. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1340–1350.
- Wang, Y.; Fabbri, F.; Mathioudakis, M. Streaming Algorithms for Diversity Maximization with Fairness Constraints. In Proceedings of the 38th IEEE International Conference on Data Engineering, Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 41–53.
- Cevallos, A.; Eisenbrand, F.; Zenklusen, R. Max-Sum Diversity via Convex Programming. In Proceedings of the 32nd International Symposium on Computational Geometry, Boston, MA, USA, 14–18 June 2016; pp. 26:1–26:14. [Google Scholar]
- Cevallos, A.; Eisenbrand, F.; Zenklusen, R. Local Search for Max-Sum Diversification. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, Barcelona, Spain, 16–19 January 2017; pp. 130–142. [Google Scholar]
- Aghamolaei, S.; Farhadi, M.; Zarrabi-Zadeh, H. Diversity Maximization via Composable Coresets. In Proceedings of the 27th Canadian Conference on Computational Geometry, Kingston, ON, Canada, 10–12 August 2015; pp. 38–48. [Google Scholar]
- Chandra, B.; Halldórsson, M.M. Approximation Algorithms for Dispersion Problems. J. Algorithms 2001, 38, 438–465. [Google Scholar] [CrossRef]
- Erkut, E. The discrete p-dispersion problem. Eur. J. Oper. Res. 1990, 46, 48–60. [Google Scholar] [CrossRef]
- Ravi, S.S.; Rosenkrantz, D.J.; Tayi, G.K. Heuristic and Special Case Algorithms for Dispersion Problems. Oper. Res. 1994, 42, 299–310. [Google Scholar] [CrossRef] [Green Version]
- Hassin, R.; Rubinstein, S.; Tamir, A. Approximation algorithms for maximum dispersion. Oper. Res. Lett. 1997, 21, 133–137. [Google Scholar] [CrossRef]
- Epasto, A.; Mahdian, M.; Mirrokni, V.; Zhong, P. Improved Sliding Window Algorithms for Clustering and Coverage via Bucketing-Based Sketches. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, Virtual, 9–12 January 2022; pp. 3005–3042. [Google Scholar]
- Bhaskara, A.; Ghadiri, M.; Mirrokni, V.S.; Svensson, O. Linear Relaxations for Finding Diverse Elements in Metric Spaces. Adv. Neural Inf. Process. Syst. 2016, 29, 4098–4106. [Google Scholar]
- Wang, Y.; Mathioudakis, M.; Li, J.; Fabbri, F. Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), Minneapolis, MN, USA, 27–29 April 2023; pp. 91–99.
- Gonzalez, T.F. Clustering to Minimize the Maximum Intercluster Distance. Theor. Comput. Sci. 1985, 38, 293–306.
- Korte, B.; Vygen, J. Combinatorial Optimization: Theory and Algorithms; Springer: Berlin/Heidelberg, Germany, 2012.
- Cunningham, W.H. Improved Bounds for Matroid Partition and Intersection Algorithms. SIAM J. Comput. 1986, 15, 948–957.
- Chakrabarty, D.; Lee, Y.T.; Sidford, A.; Singla, S.; Wong, S.C. Faster Matroid Intersection. In Proceedings of the 60th IEEE Annual Symposium on Foundations of Computer Science, Baltimore, MD, USA, 9–12 November 2019; pp. 1146–1168.
- Nguyen, H.L. A note on Cunningham’s algorithm for matroid intersection. arXiv 2019, arXiv:1904.04129.
- Chen, D.Z.; Li, J.; Liang, H.; Wang, H. Matroid and Knapsack Center Problems. Algorithmica 2016, 75, 27–52.
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
| Dataset | Group | *n* | *m* | # Features | Distance Function |
|---|---|---|---|---|---|
| Adult | Sex | 48,842 | 2 | 6 | Euclidean |
| Adult | Race | 48,842 | 5 | 6 | Euclidean |
| Adult | S + R | 48,842 | 10 | 6 | Euclidean |
| CelebA | Sex | 202,599 | 2 | 41 | Manhattan |
| CelebA | Age | 202,599 | 2 | 41 | Manhattan |
| CelebA | S + A | 202,599 | 4 | 41 | Manhattan |
| Census | Sex | 2,426,116 | 2 | 25 | Manhattan |
| Census | Age | 2,426,116 | 7 | 25 | Manhattan |
| Census | S + A | 2,426,116 | 14 | 25 | Manhattan |
| Lyrics | Genre | 122,448 | 15 | 50 | Angular |
| Synthetic | - | – | 2–20 | 2 | Euclidean |
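For concreteness, the three distance functions named in the table above can be sketched as follows. This is an illustrative implementation, not code from the paper; the function names are ours, and "angular" is assumed to mean the arccosine of cosine similarity normalized to [0, 1], a common convention for the term.

```python
import math

def euclidean(u, v):
    # L2 distance between two equal-length feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # L1 distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))

def angular(u, v):
    # Angular distance: arccos of cosine similarity, scaled to [0, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_sim = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_sim) / math.pi
```

All three are metrics, which is what the approximation guarantees of the algorithms compared below rely on.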
| Dataset | Group | GMM Diversity | FairSwap Diversity | FairSwap Time (s) | FairFlow Diversity | FairFlow Time (s) | FairGreedyFlow Diversity | FairGreedyFlow Time (s) | SFDM1 Diversity | SFDM1 Time (s) | SFDM1 #Elem | SFDM2 Diversity | SFDM2 Time (s) | SFDM2 #Elem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adult | Sex | 5.0226 | 4.1485 | 7.06 | 3.1190 | 5.45 | 2.1315 | 59.99 | 3.9427 | 0.0256 | 90.2 | 4.1710 | 0.0965 | 120.4 |
| Adult | Race | 5.0226 | - | - | 1.3702 | 5.82 | 1.1681 | 58.59 | - | - | - | 3.1373 | 1.0175 | 312.3 |
| Adult | S + R | 5.0226 | - | - | 1.0049 | 6.55 | 0.8490 | 60.92 | - | - | - | 2.9182 | 3.0914 | 620.6 |
| CelebA | Sex | 13.0 | 11.4 | 35.13 | 8.4 | 22.97 | 5.0 | 705.4 | 9.8 | 0.0188 | 87.2 | 10.9 | 0.0410 | 122.3 |
| CelebA | Age | 13.0 | 11.4 | 31.69 | 7.2 | 23.06 | 5.0 | 657.4 | 10.4 | 0.0225 | 94.6 | 10.8 | 0.0591 | 128.0 |
| CelebA | S + A | 13.0 | - | - | 6.3 | 24.17 | 3.5 | 312.8 | - | - | - | 10.4 | 0.1124 | 193.1 |
| Census | Sex | 35.0 | 27.0 | 372.3 | 17.5 | 254.6 | - | - | 27.0 | 0.0321 | 121.5 | 31.0 | 0.0931 | 163.0 |
| Census | Age | 35.0 | - | - | 8.5 | 294.0 | - | - | - | - | - | 21.0 | 0.8671 | 676.0 |
| Census | S + A | 35.0 | - | - | 5.0 | 347.5 | - | - | - | - | - | 19.0 | 3.7539 | 1276.0 |
| Lyrics | Genre | 1.5476 | - | - | 0.4244 | 14.84 | 0.6732 | 302.6 | - | - | - | 1.4463 | 2.6785 | 677.2 |
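The GMM column in the tables gives the (unfair) upper-bound baseline: Gonzalez's greedy farthest-point heuristic for max–min diversification, cited above. A minimal sketch of that heuristic, under our own naming (the paper's actual implementation may differ in tie-breaking and starting point):

```python
def gmm(points, k, dist):
    """Greedy farthest-point heuristic (Gonzalez-style) for unconstrained
    max-min diversification: start from an arbitrary point, then repeatedly
    add the point whose minimum distance to the current selection is largest.
    Gives a 2-approximation for max-min diversity in any metric space."""
    selected = [points[0]]
    while len(selected) < k:
        farthest = max(
            (p for p in points if p not in selected),
            key=lambda p: min(dist(p, s) for s in selected),
        )
        selected.append(farthest)
    return selected
```

Because GMM ignores group labels entirely, its diversity value is the same for every grouping of a dataset, which is why it occupies a single merged cell per dataset in the original layout.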
| Dataset | Group | GMM Diversity | FairSwap Diversity | FairSwap Time (s) | FairFlow Diversity | FairFlow Time (s) | FairGreedyFlow Diversity | FairGreedyFlow Time (s) | SWFDM1 Diversity | SWFDM1 Time (s) | SWFDM1 #Elem | SWFDM2 Diversity | SWFDM2 Time (s) | SWFDM2 #Elem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adult | Sex | 4.9598 | 4.0568 | 6.11 | 3.0660 | 5.47 | 2.0501 | 13.95 | 3.5445 | 0.870 | 431.5 | 3.4052 | 0.489 | 551.8 |
| Adult | Race | 4.9598 | - | - | 1.2364 | 5.87 | 0.4162 | 7.71 | - | - | - | 2.5212 | 1.056 | 616.1 |
| Adult | S + R | 4.9598 | - | - | 0.9105 | 6.43 | 0.3276 | 4.15 | - | - | - | 1.7843 | 1.027 | 799.6 |
| CelebA | Sex | 12.0 | 11.3 | 28.12 | 7.7 | 23.22 | 6.0 | 138.4 | 9.7 | 2.526 | 427.7 | 8.7 | 0.875 | 560.6 |
| CelebA | Age | 12.0 | 10.6 | 26.99 | 7.3 | 23.11 | 6.0 | 121.5 | 10.5 | 2.966 | 418.5 | 8.7 | 0.976 | 537.2 |
| CelebA | S + A | 12.0 | - | - | 6.2 | 24.10 | 4.0 | 112.1 | - | - | - | 8.8 | 2.119 | 864.1 |
| Census | Sex | 32.0 | 30.0 | 26.33 | 18.0 | 21.49 | 11.0 | 109.9 | 29.0 | 2.593 | 377.0 | 28.0 | 1.614 | 397.0 |
| Census | Age | 32.0 | - | - | 5.0 | 22.49 | 2.0 | 76.15 | - | - | - | 13.0 | 2.568 | 646.0 |
| Census | S + A | 32.0 | - | - | 2.0 | 24.11 | 5.0 | 128.4 | - | - | - | 13.0 | 3.077 | 796.0 |
| Lyrics | Genre | 1.5586 | - | - | 0.2522 | 20.12 | 0.6432 | 133.4 | - | - | - | 1.2166 | 4.694 | 1132.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, Y.; Fabbri, F.; Mathioudakis, M.; Li, J. Fair Max–Min Diversity Maximization in Streaming and Sliding-Window Models. Entropy 2023, 25, 1066. https://doi.org/10.3390/e25071066