More Web Proxy on the site http://driver.im/

article

A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations

Authors:

Kenneth W. ChurchAuthors Info & Claims

Computational Linguistics, Volume 33, Issue 3

Pages 305 - 354

https://doi.org/10.1162/coli.2007.33.3.305

Published: 01 September 2007 Publication History

Abstract

We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are strongly associated or not. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward scaling. However, one can do better by taking advantage of the margins (also known as document frequencies). The proposed method cuts the errors roughly in half over Broder's sketches.

References

[1]

Achlioptas, Dimitris. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687.

[2]

Aggarwal, Charu C., Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Park. 1999. Fast algorithms for projected clustering. In SIGMOD, pages 61-72, Philadelphia, PA.

[3]

Aggarwal, Charu C. and Joel L. Wolf. 1999. A new method for similarity indexing of market basket data. In SIGMOD, pages 407-418, Philadelphia, PA.

[4]

Agrawal, Rakesh, Tomasz Imielinski, and Arun Swami. 1993. Mining association rules between sets of items in large databases. In SIGMOD, pages 207-216, Washington, DC.

[5]

Agrawal, Rakesh, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. 1996. Fast discovery of association rules. In U. M. Fayyad, G. Pratetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, pages 307-328, Cambridge, MA.

[6]

Agrawal, Rakesh and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules in large databases. In VLDB, pages 487-499, Santiago de Chile, Chile.

[7]

Agresti, Alan. 2002. Categorical Data Analysis. John Wiley & Sons, Inc., Hoboken, NJ, second edition.

[8]

Alon, Noga, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In STOC, pages 20-29, Philadelphia, PA.

[9]

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press, New York, NY.

[10]

Boyd, Stephen and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press, Cambridge, UK.

[11]

Brin, Sergey, James Davis, and Hector Garcia-Molina. 1995. Copy detection mechanisms for digital documents. In SIGMOD, pages 398-409, San Jose, CA.

[12]

Brin, Sergey and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107-117, Brisbane, Australia.

[13]

Brin, Sergy, Rajeev Motwani, and Craig Silverstein. 1997. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD, pages 265-276, Tucson, AZ.

[14]

Brin, Sergy, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. 1997. Dynamic itemset counting and implication rules for market basket data. In SIGMOD, pages 265-276, Tucson, AZ.

[15]

Broder, Andrei Z. 1997. On the resemblance and containment of documents. In The Compression and Complexity of Sequences, pages 21-29, Positano, Italy.

[16]

Broder, Andrei Z. 1998. Filtering near-duplicate documents. In FUN, Isola d'Elba, Italy.

[17]

Broder, Andrei Z., Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. 1998. Min-wise independent permutations (extended abstract). In STOC, pages 327-336, Dallas, TX.

[18]

Broder, Andrei Z., Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. 2000. Min-wise independent permutations. Journal of Computer Systems and Sciences, 60(3):630-659.

[19]

Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. In WWW, pages 1157-1166, Santa Clara, CA.

[20]

Charikar, Moses S. 2002. Similarity estimation techniques from rounding algorithms. In STOC, pages 380-388, Montreal, Canada.

[21]

Chaudhuri Surajit, Rajeev Motwani, and Vivek R. Narasayya. 1998. Random sampling for histogram construction: How much is enough? In SIGMOD, pages 436-447, Seattle, WA.

[22]

Chaudhuri, Surajit, Rajeev Motwani, and Vivek R. Narasayya. 1999. On random sampling over joins. In SIGMOD, pages 263-274, Philadelphia, PA.

[23]

Chen, Bin, Peter Haas, and Peter Scheuermann. 2002. New two-phase sampling based algorithm for discovering association rules. In KDD, pages 462-468, Edmonton, Canada.

[24]

Church, Kenneth and Patrick Hanks. 1991. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1):22-29.

[25]

Cover, Thomas M. and Joy A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons, Inc., New York, NY.

[26]

David, Herbert A. 1981. Order Statistics. John Wiley & Sons, Inc., New York, NY, second edition.

[27]

Deming, W. Edwards and Frederick F. Stephan. 1940. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics, 11(4):427-444.

[28]

Drineas, Petros and Michael W. Mahoney. 2005. Approximating a gram matrix for improved kernel-based learning. In COLT, pages 323-337, Bertinoro, Italy.

[29]

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

[30]

Etzioni, Oren, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2004. Web-scale information extraction in knowitall (preliminary results). In WWW, pages 100-110, New York, NY.

[31]

Garcia-Molina, Hector, Jeffrey D. Ullman, and Jennifer Widom. 2002. Database Systems: The Complete Book. Prentice Hall, New York, NY.

[32]

Gilbert, Anna C., Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. 2003. One-pass wavelet decompositions of data streams. IEEE Transactions on Knowledge and Data Engineering, 15(3):541-554.

[33]

Guha Sudipto, Rajeev Rastogi, and Kyuseok Shim. 1998. Cure: An efficient clustering algorithm for large databases. In SIGMOD, pages 73-84, Seattle, WA.

[34]

Hastie, T., R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, NY.

[35]

Haveliwala, Taher H., Aristides Gionis, and Piotr Indyk. 2000. Scalable techniques for clustering the Web. In WebDB, pages 129-134, Dallas, TX.

[36]

Haveliwala, Taher H., Aristides Gionis, Dan Klein, and Piotr Indyk. 2002. Evaluating strategies for similarity search on the web. In WWW, pages 432-442, Honolulu, HI.

[37]

Hidber, Christian. 1999. Online association rule mining. In SIGMOD, pages 145-156, Philadelphia, PA.

[38]

Hornby, Albert Sydney, editor. 1989. Oxford Advanced Learner's Dictionary of Current English. Oxford University Press, Oxford, UK, fourth edition.

[39]

Indyk, Piotr. 2001. A small approximately min-wise independent family of hash functions. Journal of Algorithm, 38(1):84-90.

[40]

Indyk, Piotr and Rajeev Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604-613, Dallas, TX.

[41]

Itoh, Toshiya, Yoshinori Takei, and Jun Tarui. 2003. On the sample size of k-restricted min-wise independent permutations and other k-wise distributions. In STOC, pages 710-718, San Diego, CA.

[42]

Lehmann, Erich L. and George Casella. 1998. Theory of Point Estimation. Springer, New York, NY, second edition.

[43]

Li, Ping. 2006. Very sparse stable random projections, estimators and tail bounds for stable random projections. Technical report, available from http://arxiv.org/PS_cache/cs/pdf/ 0611/0611114v2.pdf.

[44]

Li, Ping and Kenneth W. Church. 2005. Using sketches to estimate two-way and multi-way associations. Technical Report TR-2005-115, Microsoft Research, Redmond, WA, September.

[45]

Li, Ping, Kenneth W. Church, and Trevor J. Hastie. 2006. Conditional random sampling: A sketched-based sampling technique for sparse data. Technical Report 2006-08, Department of Statistics, Stanford University.

[46]

Li, Ping, Kenneth W. Church, and Trevor J. Hastie. 2007. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873-880. Vancouver, BC, Canada.

[47]

Li, Ping, Trevor J. Hastie, and Kenneth W. Church. 2006a. Improving random projections using marginal information. In COLT, pages 635-649, Pittsburgh, PA.

[48]

Li, Ping, Trevor J. Hastie, and Kenneth W. Church. 2006b. Very sparse random projections. In KDD, pages 287-296, Philadelphia, PA.

[49]

Li, Ping, Trevor J. Hastie, and Kenneth W. Church. 2007. Nonlinear estimators and tail bounds for dimensional reduction in l₁ using Cauchy random projections. In COLT, pages 514-529, San Diego, CA.

[50]

Manku, Gurmeet Singh, Sridhar Rajagopalan, and Bruce G. Lindsay. 1999. Random sampling techniques for space efficient online computation of order statistics of large datasets. In SIGCOMM, pages 251-262, Philadelphia, PA.

[51]

Manning, Chris D. and Hinrich Schutze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA.

[52]

Matias, Yossi, Jeffrey Scott Vitter, and Min Wang. 1998. Wavelet-based histograms for selectivity estimation. In SIGMOD, pages 448-459, Seattle, WA.

[53]

Moore, Robert C. 2004. On log-likelihood-ratios and the significance of rare events. In EMNLP, pages 333-340, Barcelona, Spain.

[54]

Pearsall, Judy, editor. 1998. The New Oxford Dictionary of English. Oxford University Press, Oxford, UK.

[55]

Ravichandran, Deepak, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: Using locality sensitive hash function for high speed noun clustering. In ACL, pages 622-629, Ann Arbor, MI.

[56]

Rosen, Bengt. 1972a. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373-397.

[57]

Rosen, Bengt. 1972b. Asymptotic theory for successive sampling with varying probabilities without replacement, II. The Annals of Mathematical Statistics, 43(3):748-776.

[58]

Salton, Gerard. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, New York, NY.

[59]

Stephan, Frederick F. 1942. An iterative method of adjusting sample frequency tables when expected marginal totals are known. The Annals of Mathematical Statistics, 13(2):166-178.

[60]

Strehl, Alexander and Joydeep Ghosh. 2000. A scalable approach to balanced, high-dimensional clustering of market-baskets. In HiPC, pages 525-536, Bangalore, India.

[61]

Toivonen, Hannu. 1996. Sampling large databases for association rules. In VLDB, pages 134-145, Bombay, India.

[62]

Vempala, Santosh. 2004. The Random Projection Method. American Mathematical Society, Providence, RI.

[63]

Witten, Ian H., Alstair Moffat, and Timothy C. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishing, San Francisco, CA, second edition.

Cited By

Nazi ADing BNarasayya VChaudhuri S(2018)Efficient estimation of inclusion coefficient using hyperloglog sketchesProceedings of the VLDB Endowment10.14778/3231751.323175911:10(1097-1109)Online publication date: 1-Jun-2018
https://dl.acm.org/doi/10.14778/3231751.3231759
Rachkovskij D(2017)Binary Vectors for Fast Distance and Similarity EstimationCybernetics and Systems Analysis10.1007/s10559-017-9914-x53:1(138-156)Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1007/s10559-017-9914-x
Casteleiro JLopes GSilva J(2014)Bilingually Learning Word Senses for TranslationProceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing - Volume 840410.1007/978-3-642-54903-8_24(283-295)Online publication date: 6-Apr-2014
https://dl.acm.org/doi/10.1007/978-3-642-54903-8_24
Show More Cited By

Recommendations

Synchronization by two-way message exchanges: Cramér-Rao bounds, approximate maximum likelihood, and offshore submarine positioning

Accurate clock synchronization is vital to many applications of wireless sensor networks (WSNs). The availability of a mathematical tool that at an early design stage can provide insight into the theoretically achievable performance of the clock ...
Estimating parameters in one-way analysis of covariance model with short-tailed symmetric error distributions

We consider one-way analysis of covariance (ANCOVA) model with a single covariate when the distribution of error terms are short-tailed symmetric. The maximum likelihood (ML) estimators of the parameters are intractable. We, therefore, employ a simple ...
Joint estimation of the channel and I/Q imbalance for two-way relay networks
Abstract
In this work, we consider amplify-and-forward two-way relay networks operating under the imperfect conditions of in-phase and quadrature-phase (I/Q) imbalance. We propose novel, efficient solutions for the problem of joint channel and I/Q ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Computational Linguistics

Computational Linguistics Volume 33, Issue 3

September 2007

154 pages

ISSN:0891-2017

EISSN:1530-9312

Issue’s Table of Contents

Publisher

MIT Press

Cambridge, MA, United States

Publication History

Published: 01 September 2007

Published in COLI Volume 33, Issue 3

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

22
Total Citations
View Citations
442
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 10 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Nazi ADing BNarasayya VChaudhuri S(2018)Efficient estimation of inclusion coefficient using hyperloglog sketchesProceedings of the VLDB Endowment10.14778/3231751.323175911:10(1097-1109)Online publication date: 1-Jun-2018
https://dl.acm.org/doi/10.14778/3231751.3231759
Rachkovskij D(2017)Binary Vectors for Fast Distance and Similarity EstimationCybernetics and Systems Analysis10.1007/s10559-017-9914-x53:1(138-156)Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1007/s10559-017-9914-x
Casteleiro JLopes GSilva J(2014)Bilingually Learning Word Senses for TranslationProceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing - Volume 840410.1007/978-3-642-54903-8_24(283-295)Online publication date: 6-Apr-2014
https://dl.acm.org/doi/10.1007/978-3-642-54903-8_24
Kjersten BVan Durme BChu-Carroll J(2012)Space efficiencies in discourse modeling via conditional random samplingProceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies10.5555/2382029.2382103(513-517)Online publication date: 3-Jun-2012
https://dl.acm.org/doi/10.5555/2382029.2382103
Wang YLi HWang HZhu K(2012)Concept-based web searchProceedings of the 31st international conference on Conceptual Modeling10.1007/978-3-642-34002-4_35(449-462)Online publication date: 15-Oct-2012
https://dl.acm.org/doi/10.1007/978-3-642-34002-4_35
Li PKönig A(2011)Theory and applications of b-bit minwise hashingCommunications of the ACM10.1145/1978542.197856654:8(101-109)Online publication date: 1-Aug-2011
https://dl.acm.org/doi/10.1145/1978542.1978566
Li PMahoney MShe Y(2010)Approximating higher-order distances using random projectionsProceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence10.5555/3023549.3023586(312-321)Online publication date: 8-Jul-2010
https://dl.acm.org/doi/10.5555/3023549.3023586
Li PKönig AGui W(2010)b-Bit minwise hashing for estimating three-way similaritiesProceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 210.5555/2997046.2997051(1387-1395)Online publication date: 6-Dec-2010
https://dl.acm.org/doi/10.5555/2997046.2997051
Goyal AJagarlamudi JDaumé HVenkatasubramanian SKilgarriff ALin DSharoff S(2010)Sketching techniques for large scale NLPProceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop10.5555/1868765.1868768(17-25)Online publication date: 5-Jun-2010
https://dl.acm.org/doi/10.5555/1868765.1868768
Yang MWang HLim LWang MElmagarmid AAgrawal D(2010)Optimizing content freshness of relations extracted from the web using keyword searchProceedings of the 2010 ACM SIGMOD International Conference on Management of data10.1145/1807167.1807256(819-830)Online publication date: 6-Jun-2010
https://dl.acm.org/doi/10.1145/1807167.1807256
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents