[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

A Faster Subquadratic Algorithm for Finding Outlier Correlations

Published: 16 June 2018 Publication History

Abstract

We study the problem of detecting outlier pairs of strongly correlated variables among a collection of n variables with otherwise weak pairwise correlations. After normalization, this task amounts to the geometric task where we are given as input a set of n vectors with unit Euclidean norm and dimension d, and for some constants 0<τ < ρ < 1, we are asked to find all the outlier pairs of vectors whose inner product is at least ρ in absolute value, subject to the promise that all but at most q pairs of vectors have inner product at most τ in absolute value.
Improving on an algorithm of Valiant [FOCS 2012; J. ACM 2015], we present a randomized algorithm that for Boolean inputs ({ −1,1}-valued data normalized to unit Euclidean length) runs in time Õ((nmax,{ 1−γ +M(Δ γ,γ),M(1−γ,2 Δ γ)}+qdn), where 0<γ < 1 is a constant tradeoff parameter and M(μ, ν) is the exponent to multiply an ⌊ nμ ⌋ × ⌊ nν ⌋ matrix with an ⌊ nν ⌋ × ⌊ nμ ⌋ matrix and Δ =1/(1−logτ ρ). As corollaries we obtain randomized algorithms that run in time Õ( (n2/ω 3−logτ ρ + qdn2/(1−logτ ρ)3=logττ ρ) and in time õ( (n4 / 2+α (1−logτ ρ)+qdn2/α (1−logτ ρ)2+α (1−logτ ρ)>), where 2≤ ω <2.38 is the exponent for square matrix multiplication and 0.3<α ≤ 1 is the exponent for  rectangular matrix multiplication. The notation Õ(ṡ) hides polylogarithmic factors in n and d whose degree may depend on ρ and τ. We present further corollaries for the light bulb problem and for learning sparse Boolean functions.

References

[1]
Amir Abboud, Richard Ryan Williams, and Huacheng Yu. 2015. More applications of the polynomial method to algorithm design. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’15). Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 218--230.
[2]
Panagiotis Achlioptas, Bernhard Schölkopf, and Karsten M. Borgwardt. 2011. Two-locus association mapping in subquadratic time. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, 726--734.
[3]
Thomas D. Ahle, Rasmus Pagh, Ilya Razenshteyn, and Francesco Silvestri. 2016. On the complexity of inner product similarity join. In Proceedings of the 35th ACM Symposium on Principles of Database Systems (PODS’16). ACM, New York, NY, 151--164.
[4]
Josh Alman and Ryan Williams. 2015. Probabilistic polynomials and Hamming nearest neighbors. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS’15). IEEE Computer Society, Los Alamitos, CA, 136--150.
[5]
Alexandr Andoni and Piotr Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). IEEE Computer Society, Los Alamitos, CA, 459--468.
[6]
Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. 2015. Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc., Red Hook, NY, 1225--1233.
[7]
Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya Razenshteyn. 2014. Beyond locality-sensitive hashing. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’14). Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1018--1028.
[8]
Alexandr Andoni and Ilya Razenshteyn. 2015. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the 47th Annual ACM on Symposium on Theory of Computing (STOC’15). ACM, New York, NY, 793--801.
[9]
Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web (WWW’07). ACM, New York, NY, 131--140.
[10]
Austin R. Benson and Grey Ballard. 2015. A framework for practical parallel fast matrix multiplication. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’15). ACM, New York, NY, 42--53.
[11]
Avrim Blum, Adam Kalai, and Hal Wasserman. 2003. Noise-tolerant learning, the parity problem, and the statistical query model. J. ACM 50, 4 (2003), 506--519.
[12]
Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings on 34th Annual ACM Symposium on Theory of Computing (STOC’02). ACM, New York, NY, 380--388.
[13]
Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A primitive operator for similarity joins in data cleaning. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). IEEE Computer Society, Los Alamitos, CA, 5.
[14]
Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr Indyk, Rajeev Motwani, Jeffrey D. Ullman, and Cheng Yang. 2001. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng. 13, 1 (2001), 64--78.
[15]
Ryan R. Curtin, Alexander G. Gray, and Parikshit Ram. 2013. Fast exact max-kernel search. In Proceedings of the 13th SIAM International Conference on Data Mining (SDM’13). Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1--9.
[16]
Abhinandan Das, Mayur Datar, Ashutosh Garg, and ShyamSundar Rajaram. 2007. Google news personalization: Scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web (WWW’07). ACM, New York, NY, 271--280.
[17]
Petros Drineas, Ravi Kannan, and Michael W. Mahoney. 2006. Fast Monte Carlo algorithms for matrices. I. Approximating matrix multiplication. SIAM J. Comput. 36, 1 (2006), 132--157.
[18]
Moshe Dubiner. 2010. Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans. Info. Theory 56, 8 (2010), 4166--4179.
[19]
Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and Ashok Kumar Ponnuswami. 2009. On agnostic learning of parities, monomials, and halfspaces. SIAM J. Comput. 39, 2 (2009), 606--645.
[20]
Francis Galton. 1888. Co-relations and their measurement, chiefly from anthropometric data. Proc. Roy. Soc. London 45 (1888), 135--145.
[21]
Spencer Greenberg and Mehryar Mohri. 2014. Tight lower bound on the probability of a binomial exceeding its expectation. Stat. Probabil. Lett. 86 (2014), 91--98.
[22]
Elena Grigorescu, Lev Reyzin, and Santosh Vempala. 2011. On noise-tolerant learning of sparse parities and related problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT’11). Springer, Berlin, Germany, 413--424.
[23]
Wassily Hoeffding. 1963. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 (1963), 13--30.
[24]
Jianyu Huang, Leslie Rice, Devin A. Matthews, and Robert A. van de Geijn. 2017. Generating families of practical fast matrix multiplication algorithms. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS’17). IEEE Computer Society, Los Alamitos, CA, 656--667.
[25]
Russell Impagliazzo and Ramamohan Paturi. 2001. On the complexity of -SAT. J. Comput. Syst. Sci. 62, 2 (2001), 367--375.
[26]
Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. 2001. Which problems have strongly exponential complexity?J. Comput. Syst. Sci. 63, 4 (2001), 512--530.
[27]
Piotr Indyk. 2004. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry (2nd ed.). Chapman & Hall/CRC, Boca Raton, FL, 877--892.
[28]
Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on the Theory of Computing (STOC’98). ACM, New York, NY, 604--613.
[29]
Michael Kapralov. 2015. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In Proceedings of the 34th ACM Symposium on Principles of Database Systems (PODS’15). ACM, New York, NY, 329--342.
[30]
Matti Karppa, Petteri Kaski, Jukka Kohonen, and Padraig Ó. Catháin. 2016. Explicit correlation amplifiers for finding outlier correlations in deterministic subquadratic time. In Proceedings of the 24th Annual European Symposium on Algorithms (ESA’16). Schloss Dagstuhl--Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 52:1--52:17.
[31]
Noam Koenigstein, Parikshit Ram, and Yuval Shavitt. 2012. Efficient retrieval of recommendations in a matrix factorization framework. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM’12). ACM, New York, NY, 535--544.
[32]
Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani. 1998. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the 30th Annual ACM Symposium on the Theory of Computing (STOC’98). ACM, New York, NY, 614--623.
[33]
François Le Gall. 2012. Faster algorithms for rectangular matrix multiplication. In Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS’12). IEEE Computer Society, Los Alamitos, CA, 514--523.
[34]
François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC’14). ACM, New York, NY, 296--303.
[35]
Dongjoo Lee, Jaehui Park, Junho Shim, and Sang-goo Lee. 2010. An efficient similarity join algorithm with cosine similarity predicate. In Proceedings of the 21st International Conference on Database and Expert Systems Applications (DEXA’10). Springer, Berlin, Germany, 422--436.
[36]
Yucheng Low and Alice X. Zheng. 2012. Fast top-k similarity queries via matrix compression. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM’12). ACM, New York, NY, 2070--2074.
[37]
Vadim Lyubashevsky. 2005. The parity problem in the presence of noise, decoding random linear codes, and the subset sum problem. In Proceedings of the 8th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX’05) and the 9th International Workshop on Randomization and Computation (RANDOM’05). Springer, Berlin, 378--389.
[38]
Alexander May and Ilya Ozerov. 2015. On computing nearest neighbors with applications to decoding of binary linear codes. In Proceedings of Advances in Cryptology (EUROCRYPT’15) and the 34th Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, Berlin, 203--228.
[39]
Stefan Meiser. 1993. Point location in arrangements of hyperplanes. Inf. Comput. 106, 2 (1993), 286--303.
[40]
Marvin L. Minsky and Seymour A. Papert. 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press, Boston, MA.
[41]
Elchanan Mossel, Ryan O’Donnell, and Rocco A. Servedio. 2004. Learning functions of k relevant variables. J. Comput. Syst. Sci. 69, 3 (2004), 421--434.
[42]
Rajeev Motwani, Assaf Naor, and Rina Panigrahy. 2007. Lower bounds on locality sensitive hashing. SIAM J. Discrete Math. 21, 4 (2007), 930--935.
[43]
Behnam Neyshabur and Nathan Srebro. 2015. On symmetric and asymmetric LSHs for inner product search. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15). Journal of Machine Learning Research, 1926--1934.
[44]
Ryan O’Donnell, Yi Wu, and Yuan Zhou. 2014. Optimal lower bounds for locality-sensitive hashing (except when q is tiny). ACM TOCT 6, 1 (2014), Article 5.
[45]
Rasmus Pagh. 2013. Compressed matrix multiplication. ACM TOCT 5, 3 (2013), Article 9.
[46]
Rasmus Pagh. 2016. Locality-sensitive hashing without false negatives. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’16). Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1--9.
[47]
Ramamohan Paturi, Sanguthevar Rajasekaran, and John H. Reif. 1989. The light bulb problem. In Proceedings of the Second Annual Workshop on Computational Learning Theory (COLT’89). ACM, New York, NY, 261--268.
[48]
Karl Pearson. 1920. Notes on the history of correlation. Biometrika 13, 1 (1920), 25--45.
[49]
Parikshit Ram and Alexander G. Gray. 2012. Maximum inner-product search using cone trees. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12). ACM, New York, NY, 931--939.
[50]
Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS’14). Curran Associates, Inc., Red Hook, NY, 2321--2329.
[51]
Anshumali Shrivastava and Ping Li. 2015. Asymmetric minwise hashing for indexing binary inner products and set containment. In Proceedings of the 24th International Conference on World Wide Web (WWW’15). ACM, New York, NY, 981--991.
[52]
Yasin N. Silva, Walid G. Aref, and Mohamed H. Ali. 2010. The similarity join database operator. In Proceedings of the 26th International Conference on Data Engineering (ICDE’10). IEEE Computer Society, Los Alamitos, CA, 892--903.
[53]
Stephen M. Stigler. 1989. Francis galton’s account of the invention of correlation. Statist. Sci. 4, 2 (1989), 73--86.
[54]
Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk. 2015. LEMP: Fast retrieval of large entries in a matrix product. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 107--122.
[55]
Gregory Valiant. 2015. Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem. J. ACM 62, 2 (2015), Article 13.
[56]
Leslie G. Valiant. 1988. Functionality in neural nets. In Proceedings of the 1st Annual Workshop on Computational Learning Theory (COLT’88). ACM, New York, NY, 28--39.
[57]
Jiannan Wang, Guoliang Li, and Jianhua Feng. 2011. Fast-join: An efficient method for fuzzy token matching based string similarity join. In Proceedings of the 27th International Conference on Data Engineering (ICDE’11). IEEE Computer Society, Los Alamitos, CA, 458--469.
[58]
Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’12). ACM, New York, NY, 85--96.
[59]
Ye Wang, Ahmed Metwally, and Srinivasan Parthasarathy. 2013. Scalable all-pairs similarity search in metric spaces. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’13). ACM, New York, NY, 829--837.
[60]
Roger Weber, Hans-Jörg Schek, and Stephen Blott. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of 24rd International Conference on Very Large Data Bases (VLDB’98). Morgan Kaufmann Publishers, Burlington, MA, 194--205.
[61]
Ryan Williams. 2005. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci. 348, 2--3 (2005), 357--365.
[62]
Ryan Williams and Huacheng Yu. 2014. Finding orthogonal vectors in discrete structures. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’14). Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1867--1877.
[63]
Chenyi Xia, Hongjun Lu, Beng Chin Ooi, and Jin Hu. 2004. Gorder: An efficient method for KNN join processing. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB’04). Morgan Kaufmann Publishers, Burlington, MA, 756--767.
[64]
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. 2011. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36, 3 (2011), Article 15.
[65]
Reza Bosagh Zadeh and Ashish Goel. 2013. Dimension independent similarity computation. J. Mach. Learn. Res. 14, 1 (2013), 1605--1626.
[66]
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. 2006. Similarity Search. The Metric Space Approach. Advances in Database Systems, Vol. 32. Springer, New York, NY.

Cited By

View all
  • (2024)A subquadratic time algorithm for robust sparse mean estimationProceedings of the 41st International Conference on Machine Learning10.5555/3692070.3693708(40371-40396)Online publication date: 21-Jul-2024
  • (2023)Generalizations of Matrix Multiplication can solve the Light Bulb Problem2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS57990.2023.00090(1471-1495)Online publication date: 6-Nov-2023
  • (2021)Convex transform order of Beta distributions with some consequencesStatistica Neerlandica10.1111/stan.1223375:3(238-256)Online publication date: 25-Jan-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Algorithms
ACM Transactions on Algorithms  Volume 14, Issue 3
Special Issue on SODA’16 and Regular Papers
July 2018
393 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/3233176
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 June 2018
Accepted: 01 December 2017
Received: 01 March 2016
Published in TALG Volume 14, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Correlation
  2. fast matrix multiplication
  3. light bulb problem
  4. rectangular matrix multiplication
  5. similarity search

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • European Research Council under the European Union's Seventh Framework Programme
  • ERC

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)22
  • Downloads (Last 6 weeks)7
Reflects downloads up to 04 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)A subquadratic time algorithm for robust sparse mean estimationProceedings of the 41st International Conference on Machine Learning10.5555/3692070.3693708(40371-40396)Online publication date: 21-Jul-2024
  • (2023)Generalizations of Matrix Multiplication can solve the Light Bulb Problem2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS57990.2023.00090(1471-1495)Online publication date: 6-Nov-2023
  • (2021)Convex transform order of Beta distributions with some consequencesStatistica Neerlandica10.1111/stan.1223375:3(238-256)Online publication date: 25-Jan-2021
  • (2021)Approximate Top-k Inner Product Join with a Proximity Graph2021 IEEE International Conference on Big Data (Big Data)10.1109/BigData52589.2021.9671858(4468-4471)Online publication date: 15-Dec-2021
  • (2019)Prediction of Oilfield‐Increased Production Using Adaptive Neurofuzzy Inference System with Smoothing TreatmentMathematical Problems in Engineering10.1155/2019/48657122019:1Online publication date: 19-Dec-2019

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media