[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3488560.3498485acmconferencesArticle/Chapter ViewAbstractPublication PageswsdmConference Proceedingsconference-collections
research-article
Open access

k-Clustering with Fair Outliers

Published: 15 February 2022 Publication History

Abstract

Clustering problems and clustering algorithms are often overly sensitive to the presence of outliers: even a handful of points can greatly affect the structure of the optimal solution and its cost. This is why many algorithms for robust clustering problems have been formulated in recent years. These algorithms discard some points as outliers, excluding them from the clustering. However, outlier selection can be unfair: some categories of input points may be disproportionately affected by the outlier removal algorithm.
We study the problem of k-clustering with fair outlier removal and provide the first approximation algorithm for well-known clustering formulations, such as k-means and k-median. We analyze this algorithm and prove that it has strong theoretical guarantees. We complement this result with an empirical evaluation showing that, while standard methods for outlier removal have a disproportionate impact across categories of input points, our algorithm equalizes the impact while retaining strong experimental performances on multiple real--world datasets. We also show how the fairness of outlier removal can influence the performance of a downstream learning task. Finally, we provide a coreset construction, which makes our algorithm scalable to very large datasets.

Supplementary Material

MP4 File (WSDM22-fp603.mp4)
Clustering algorithms are often overly sensitive to the presence of outliers: even a handful of points can greatly affect the structure of the optimal solution and its cost. This is why many algorithms for clustering with outliers have been formulated in recent years. These algorithms discard some points as outliers, excluding them from the clustering. However, outlier selection can be unfair: some categories of input points may be disproportionately affected by the outlier removal algorithm. We study the problem of k-clustering with fair outlier removal and provide the first polynomial-time algorithm for it, showing it has strong theoretical guarantees. We complement this result with an empirical evaluation showing that, while standard methods for outlier removal have a disproportionate impact across categories of input points, our algorithm equalizes the impact while retaining strong experimental performances on multiple real-world datasets.

References

[1]
Mohsen Abbasi, Aditya Bhaskara, and Suresh Venkatasubramanian. 2020. Fair clustering via equitable group representations. arXiv:2006.11009 (2020).
[2]
Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. 2009. Adaptive sampling for k-means clustering. In APPROX. Springer, 15--28.
[3]
Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. 2019. Clustering without over-representation. In KDD. 267--275.
[4]
Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. 2020. Fair Correlation clustering. In AISTATS .
[5]
Georg Anegg, Haris Angelidakis, Adam Kurpisz, and Rico Zenklusen. 2020. A Technique for Obtaining True Approximations for k-Center with Covering Constraints. In IPCO . Springer, 52--65.
[6]
David Arthur and Sergei Vassilvitskii. 2006. k-means++: The advantages of careful seeding. Technical Report. Stanford.
[7]
Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. 2019. Scalable fair clustering. arXiv:1902.03519 (2019).
[8]
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Scalable k-means++. arXiv:1203.6402 (2012).
[9]
Tanvi Bajpai, Deeparnab Chakrabarty, Chandra Chekuri, and Maryam Negahbani. 2021. Revisiting Priority $ k $-Center: Fairness and Outliers. arXiv preprint arXiv:2103.03337 (2021).
[10]
Sayan Bandyapadhyay, Tanmay Inamdar, Shreyas Pai, and Kasturi Varadarajan. 2019. A Constant Approximation for Colorful k-Center. In ESA .
[11]
Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2019. Fairness and Machine Learning .fairmlbook.org. http://www.fairmlbook.org .
[12]
Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. 2019. Fair algorithms for clustering. In NeurIPS. 4954--4965.
[13]
Suman K Bera, Shalmoli Gupta, Amit Kumar, and Sambuddha Roy. 2014. Approximation algorithms for the partition vertex cover problem. Theor. Comput. Sci., Vol. 555 (2014), 2--8.
[14]
Alina Beygelzimer, Sham Kakade, and John Langford. 2006. Cover trees for nearest neighbor. In ICML. 97--104.
[15]
Aditya Bhaskara and Aravinda Kanchana Rwanpathirana. 2020. Robust Algorithms for Online $ k $-means Clustering. In ALT. 148--173.
[16]
Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. 2019. Greedy Sampling for Approximate Clustering in the Presence of Outliers. In NeurIPS . 11146--11155.
[17]
Nitin Bhatia et al. 2010. Survey of nearest neighbor techniques. arXiv:1007.0085 (2010).
[18]
Dan Biddle. 2006. Adverse impact and test validation: A practitioner's guide to valid and defensible employment testing .Gower Pub. Ltd.
[19]
Johannes Blömer, Christiane Lammersen, Melanie Schmidt, and Christian Sohler. 2016. Theoretical Analysis of the k-Means Algorithm -- A Survey .Springer, 81--116.
[20]
TH Hubert Chan, Arnaud Guerqin, and Mauro Sozio. 2018. Fully dynamic k-center clustering. In WWW. 579--587.
[21]
Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. 2001. Algorithms for facility location problems with outliers. In SODA. SIAM, 642--651.
[22]
Sanjay Chawla and Aristides Gionis. 2013. k-means--: A unified approach to clustering and outlier detection. In SDM. SIAM, 189--197.
[23]
Jiecao Chen, Erfan S Azer, and Qin Zhang. 2018. A practical algorithm for distributed clustering and outlier detection. In NeurIPS . 2248--2256.
[24]
Ke Chen. 2008. A constant factor approximation algorithm for k-median clustering with outliers. In SODA . 826--835.
[25]
Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. 2017. Fair clustering through fairlets. In NIPS. 5029--5037.
[26]
Sam Corbett-Davies and Sharad Goel. 2018. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv:1808.00023 (2018).
[27]
Amit Deshpande, Praneeth Kacham, and Rameshwar Pratap. 2020. Robust k-means++. In UAI. PMLR, 799?808.
[28]
Hu Ding, Haikuo Yu, and Zixiu Wang. 2019. Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction. In ESA .
[29]
Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[30]
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In ITCS. 214--226.
[31]
Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and removing disparate impact. In KDD .
[32]
Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R Salavatipour. 2019. Approximation schemes for clustering with outliers. TALG (2019).
[33]
Mehrdad Ghadiri, Samira Samadi, and Santosh Vempala. 2021. Socially fair k-means clustering. In FAccT. 438--448.
[34]
Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. 2017. Local search methods for k-means with outliers. VLDB, Vol. 10, 7 (2017), 757--768.
[35]
Sariel Har-Peled. 2011. Geometric approximation algorithms . Number 173. AMS.
[36]
Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. NIPS, Vol. 29 (2016), 3315--3323.
[37]
David G Harris, Thomas Pensyl, Aravind Srinivasan, and Khoa Trinh. 2019. A lottery model for center-type problems with outliers. TALG, Vol. 15, 3 (2019), 1--25.
[38]
Sungjin Im, Mahshid M Qaem, Benjamin Moseley, Xiaorui Sun, and Rudy Zhou. 2020. Fast Noise Removal for k-Means Clustering (PMLR, Vol. 108). 456--466.
[39]
Russell Impagliazzo and Ramamohan Paturi. 2001. On the complexity of k-SAT. J. Comput. System Sci., Vol. 62, 2 (2001), 367--375.
[40]
Tanmay Inamdar and Kasturi Varadarajan. 2018. On the Partition Set Cover Problem. arXiv:1809.06506 (2018).
[41]
Anil K Jain. 2010. Data clustering: 50 years beyond K-means. Pattern recognition letters, Vol. 31, 8 (2010), 651--666.
[42]
Xinrui Jia, Kshiteej Sheth, and Ola Svensson. 2020. Fair Colorful k-Center Clustering. In IPCO. Springer, 209--222.
[43]
Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. 2002. A local search approximation algorithm for k-means clustering. In SOCG. 10--18.
[44]
Michael Kearns and Aaron Roth. 2019. The ethical algorithm: The science of socially aware algorithm design .Oxford University Press.
[45]
Aria Khademi, Sanghack Lee, David Foley, and Vasant Honavar. 2019. Fairness in algorithmic decision making: An excursion through the lens of causality. In WWW . 2907--2914.
[46]
Matth"aus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. 2019. Fair k-center clustering for data summarization. arXiv:1901.08628 (2019).
[47]
Emmanouil Krasanakis, Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, and Yiannis Kompatsiaris. 2018. Adaptive sensitive reweighting to mitigate bias in fairness-aware classification. In WWW . 853--862.
[48]
Robert Krauthgamer and James R Lee. 2004. Navigating nets: Simple algorithms for proximity search. In SODA. Citeseer, 798--807.
[49]
Silvio Lattanzi and Sergei Vassilvitskii. 2017. Consistent k-clustering. In ICML .
[50]
Shi Li and Ola Svensson. 2016. Approximating k-median via pseudo-approximation. SIAM J. Comput., Vol. 45, 2 (2016), 530--547.
[51]
Daniel Lokshtanov, Chinmay Sonar, Subhash Suri, and Jie Xue. 2020. Fair Covering of Points by Balls. (2020).
[52]
Sepideh Mahabadi and Ali Vakilian. 2020. Individual fairness for k-clustering. In ICML. PMLR, 6586--6596.
[53]
Adam Meyerson, Liadan O'callaghan, and Serge Plotkin. 2004. A k-median algorithm with running time independent of data size. Machine Learning (2004).
[54]
Burt L Monroe. 1995. Fully proportional representation. American Political Science Review (1995), 925--940.
[55]
Sérgio Moro, Paulo Cortez, and Paulo Rita. 2014. A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst., Vol. 62 (2014), 22 -- 31.
[56]
Bruno Ordozgoiti and Aristides Gionis. 2019. Reconciliation k-median: Clustering with Non-polarized Representatives. In WWW . 1387--1397.
[57]
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. 2017. On Fairness and Calibration. NeurIPS, Vol. 30 (2017), 5680--5689.
[58]
Jean Pouget-Abadie, Vahab Mirrokni, David C Parkes, and Edoardo M Airoldi. 2018. Optimizing cluster-based randomized experiments under monotonicity. In KDD . 2090--2099.
[59]
Dongmei Ren, Imad Rahal, William Perrizo, and Kirk Scott. 2004. A vertical distance-based outlier detection method with local pruning. In CIKM . 279--284.
[60]
Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. 2018. Fair coresets and streaming algorithms for fair k-means clustering. arXiv:1812.10854 (2018).
[61]
Anurag Shandilya, Kripabandhu Ghosh, and Saptarshi Ghosh. 2018. Fairness of Extractive Text Summarization. In WWW. 97--98.
[62]
J Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip K Chan. 2000. Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection. Results from the JAM Project by Salvatore (2000), 1--15.
[63]
Beata Strack, Jonathan Deshazo, Chris Gennings, Juan Luis Olmo Ortiz, Sebastian Ventura, Krzysztof Cios, and John Clore. 2014. Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records. BioMed research international, Vol. 2014 (04 2014), 781670.
[64]
David P Williamson and David B Shmoys. 2011. The design of approximation algorithms .Cambridge University Press.
[65]
Peter N Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, Vol. 93. 311--21.
[66]
Hongjing Zhang and Ian Davidson. 2021. Towards Fair Deep Anomaly Detection. In FAccT. 138--148.
[67]
Ying Zhao and George Karypis. 2005. Data clustering in life sciences. Mol. biotechnol., Vol. 31, 1 (2005), 55--80.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
February 2022
1690 pages
ISBN:9781450391320
DOI:10.1145/3488560
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 February 2022

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. clustering
  2. coreset
  3. disparate impact
  4. fairness
  5. k-means
  6. outliers

Qualifiers

  • Research-article

Funding Sources

  • MIUR PRIN
  • MIUR
  • Google
  • BiCi ? Bertinoro international Center for informatics

Conference

WSDM '22

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 1,090
    Total Downloads
  • Downloads (Last 12 months)304
  • Downloads (Last 6 weeks)34
Reflects downloads up to 05 Jan 2025

Other Metrics

Citations

Cited By

View all

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media