research-article
Open access

Sampling near neighbors in search for fairness

Published: 21 July 2022

Abstract

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability to be returned. The problem is of special interest in high dimensions, where Locality Sensitive Hashing (LSH), the theoretically leading approach to similarity search, does not provide any fairness guarantee. In this work, we show that LSH-based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also carried out an experimental evaluation that highlights the inherent unfairness of existing NN data structures.
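The abstract's more general formulation, sampling uniformly from the union of a sub-collection of sets, can be illustrated with a small rejection-sampling sketch. This is a minimal illustration under stated assumptions, not the data structure proposed in the paper: the function name `sample_uniform_from_union` and the toy `buckets` are hypothetical, and the sets stand in for, say, the hash buckets a query maps to. An element that appears in several sets is proposed more often, so it is accepted with probability inversely proportional to the number of sets containing it.

```python
import random


def sample_uniform_from_union(sets, rng=random):
    """Draw one element uniformly at random from the union of `sets`.

    Rejection sampling: an element is proposed with probability
    proportional to the number of sets containing it (its degree), then
    accepted with probability 1/degree, so every element of the union
    ends up equally likely.
    """
    # Keep only non-empty sets, as sorted lists so they can be indexed.
    pools = [sorted(s) for s in sets if s]
    if not pools:
        return None
    sizes = [len(p) for p in pools]
    while True:
        # Pick a set with probability proportional to its size, then a
        # uniform element of it: x is proposed w.p. degree(x) / sum(sizes).
        pool = rng.choices(pools, weights=sizes, k=1)[0]
        x = rng.choice(pool)
        # Naive degree computation (assumption: a linear scan over the
        # sets is acceptable for illustration purposes).
        degree = sum(1 for p in pools if x in p)
        # Accept with probability 1/degree to cancel the proposal bias.
        if rng.random() * degree < 1.0:
            return x


# Toy example: element 3 lies in all three sets but is not favored.
buckets = [{1, 2, 3}, {3, 4}, {3}]
print(sample_uniform_from_union(buckets, random.Random(0)))
```

Without the acceptance step, element 3 would be returned half of the time in this toy collection, which is the kind of bias toward points that collide often that the abstract attributes to standard LSH-based search.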



Information & Contributors

Information

Published In

Communications of the ACM  Volume 65, Issue 8
August 2022
91 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3550455
Editor: James Larus
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed
