[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3485447.3512188acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

Fair k-Center Clustering in MapReduce and Streaming Settings

Published: 25 April 2022 Publication History

Abstract

Center-based clustering techniques are fundamental to many real-world applications such as data summarization and social network analysis. In this work, we study the problem of fairness aware k-center clustering over large datasets. We are given an input dataset comprising a set of n points, where each point belongs to a specific demographic group characterized by a protected attribute, such as race or gender. The goal is to identify k clusters such that all clusters have considerable representation from all groups and the maximum radius of these clusters is minimized.
The majority of the prior techniques do not scale beyond 100K points for k = 50. To address the scalability challenges, we propose an efficient 2-round algorithm for the MapReduce setting that is guaranteed to be a 9-approximation to the optimal solution. Additionally, we develop a 2-pass streaming algorithm that is efficient and has a low memory footprint. These theoretical results are complemented with an empirical evaluation on million-scale datasets, demonstrating that our techniques are effective to identify high-quality fair clusters and efficient as compared to the state-of-the-art.

References

[1]
[n.d.]. Google + dataset, SNAP, http://snap.stanford.edu/data/ego-Gplus.html.
[2]
[n.d.]. Pokec dataset, SNAP, http://snap.stanford.edu/data/soc-Pokec.html.
[3]
Saba Ahmadi, Sainyam Galhotra, Barna Saha, and Roy Schwartz. 2020. Fair Correlation Clustering. CoRR abs/2002.03508(2020). arXiv:2002.03508https://arxiv.org/abs/2002.03508
[4]
Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. 2019. Clustering Without Over-Representation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 267–275.
[5]
Paul Beame, Paraschos Koutris, and Dan Suciu. 2017. Communication Steps for Parallel Query Processing. Journal of the ACM (JACM) 64, 6 (2017), 40.
[6]
Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. 2019. Fair algorithms for clustering. In Conference on Neural Information Processing Systems. 4955–4966.
[7]
Ioana O. Bercea, Martin Groß, Samir Khuller, Aounon Kumar, Clemens Rösner, Daniel R. Schmidt, and Melanie Schmidt. 2019. On the cost of essentially fair clusterings. In International Workshop on Approximation Algorithms for Combinatorial Optimization Problems.
[8]
Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. 2019. Solving K-Center Clustering (with Outliers) in MapReduce and Streaming, Almost as Accurately as Sequentially. Proc. VLDB Endow. 12, 7 (2019), 766–778.
[9]
Deeparnab Chakrabarty, Sanjeev Khanna, and Shi Li. 2015. On (1, ε)-restricted assignment makespan minimization. In Annual ACM-SIAM Symposium on Discrete Algorithms.
[10]
Daqing Chen, Sai Laing Sain, and Kun Guo. 2012. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management 19, 3(2012), 197–208.
[11]
Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. 2017. Fair clustering through fairlets. In Proc. 31st Conference on Neural Information Processing Systems. 5029–5037.
[12]
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California, USA, December 6-8, 2004. 137–150.
[13]
Alina Ene, Sungjin Im, and Benjamin Moseley. 2011. Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 681–689.
[14]
Sainyam Galhotra. 2021. Robust Algorithms for Clustering with Applications to Data Integration. (2021).
[15]
Teofilo F. Gonzalez. 1985. Clustering to Minimize the Maximum Intercluster Distance. Theor. Comput. Sci. 38(1985), 293 – 306.
[16]
Aniko Hannak, Gary Soeller, David Lazer, Alan Mislove, and Christo Wilson. 2014. Measuring Price Discrimination and Steering on E-Commerce Web Sites. In Proceedings of the 2014 Conference on Internet Measurement Conference(IMC ’14). Association for Computing Machinery, New York, NY, USA, 305–318.
[17]
Anikó Hannák, Claudia Wagner, David Garcia, Alan Mislove, Markus Strohmaier, and Christo Wilson. 2017. Bias in Online Freelance Marketplaces: Evidence from TaskRabbit and Fiverr. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. Association for Computing Machinery, New York, NY, USA, 1914–1933.
[18]
Elfarouk Harb and Ho Shan Lam. 2020. KFC: A Scalable Approximation Algorithm for k-center Fair Clustering. In Advances in Neural Information Processing Systems. 14509–14519.
[19]
Lingxiao Huang, Shaofeng Jiang, and Nisheeth Vishnoi. 2019. Coresets for clustering with fairness constraints. In Proc. 33rd Conference on Neural Information Processing Systems. 7587–7598.
[20]
IBM. 2019. IBM ILOG CPLEX 12.9. (2019).
[21]
Sungjin Im and Benjamin Moseley. 2015. Fast and better distributed mapreduce algorithms for k-center clustering. In Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures. 65–67.
[22]
Srinivasa KG, K Venugopal, and L Patnaik. 2006. Feature extraction using fuzzy c-means clustering for data mining systems. IJCSNS 6, 3A (2006), 230.
[23]
Bjornar Larsen and Chinatsu Aone. 1999. Fast and effective text mining using linear-time document clustering. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. 16–22.
[24]
Gustavo Malkomes, Matt J Kusner, Wenlin Chen, Kilian Q Weinberger, and Benjamin Moseley. 2015. Fast distributed k-center clustering with outliers on massive data. Advances in Neural Information Processing Systems 28 (2015), 1063–1071.
[25]
Richard Matthew McCutchen and Samir Khuller. 2008. Streaming algorithms for k-center clustering with outliers and with anonymity. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques. Springer, 165–178.
[26]
Clemens Rösner and Melanie Schmidt. 2018. Privacy Preserving Clustering with Constraints. In Proc. 45th International Colloquium on Automata, Languages and Programming. 96:1–96:14.
[27]
Badrul M Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2002. Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering. In Proceedings of the fifth international conference on computer and information technology, Vol. 1. 291–324.
[28]
David B. Shmoys and Éva Tardos. 1993. An approximation algorithm for the generalized assignment problem. Mathematical programming 62, 1-3 (1993), 461–474.
[29]
Till Speicher, Muhammad Ali, Giridhari Venkatadri, Filipe Nunes Ribeiro, George Arvanitakis, Fabrício Benevenuto, Krishna P Gummadi, Patrick Loiseau, and Alan Mislove. 2018. Potential for Discrimination in Online Targeted Advertising. In Conference on Fairness, Accountability and Transparency. 5–19.

Cited By

View all
  • (2024)New Algorithms for Distributed Fair k-Center Clustering: Almost Accurate as Sequential AlgorithmsProceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems10.5555/3635637.3663057(1938-1946)Online publication date: 6-May-2024
  • (2024)Fast and Accurate Fair k-Center Clustering in Doubling MetricsProceedings of the ACM Web Conference 202410.1145/3589334.3645568(756-767)Online publication date: 13-May-2024
  • (2023)F3KM: Federated, Fair, and Fast k-meansProceedings of the ACM on Management of Data10.1145/36267281:4(1-25)Online publication date: 12-Dec-2023

Index Terms

  1. Fair k-Center Clustering in MapReduce and Streaming Settings
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Please enable JavaScript to view thecomments powered by Disqus.

        Information & Contributors

        Information

        Published In

        cover image ACM Conferences
        WWW '22: Proceedings of the ACM Web Conference 2022
        April 2022
        3764 pages
        ISBN:9781450390965
        DOI:10.1145/3485447
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Sponsors

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 25 April 2022

        Permissions

        Request permissions for this article.

        Check for updates

        Author Tags

        1. disparate impact
        2. fairness
        3. k-center clustering

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        WWW '22
        Sponsor:
        WWW '22: The ACM Web Conference 2022
        April 25 - 29, 2022
        Virtual Event, Lyon, France

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)91
        • Downloads (Last 6 weeks)6
        Reflects downloads up to 03 Jan 2025

        Other Metrics

        Citations

        Cited By

        View all
        • (2024)New Algorithms for Distributed Fair k-Center Clustering: Almost Accurate as Sequential AlgorithmsProceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems10.5555/3635637.3663057(1938-1946)Online publication date: 6-May-2024
        • (2024)Fast and Accurate Fair k-Center Clustering in Doubling MetricsProceedings of the ACM Web Conference 202410.1145/3589334.3645568(756-767)Online publication date: 13-May-2024
        • (2023)F3KM: Federated, Fair, and Fast k-meansProceedings of the ACM on Management of Data10.1145/36267281:4(1-25)Online publication date: 12-Dec-2023

        View Options

        Login options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format.

        HTML Format

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media