[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations

Published: 01 September 2007 Publication History

Abstract

We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are strongly associated or not. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward scaling. However, one can do better by taking advantage of the margins (also known as document frequencies). The proposed method cuts the errors roughly in half over Broder's sketches.

References

[1]
Achlioptas, Dimitris. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687.
[2]
Aggarwal, Charu C., Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Park. 1999. Fast algorithms for projected clustering. In SIGMOD, pages 61-72, Philadelphia, PA.
[3]
Aggarwal, Charu C. and Joel L. Wolf. 1999. A new method for similarity indexing of market basket data. In SIGMOD, pages 407-418, Philadelphia, PA.
[4]
Agrawal, Rakesh, Tomasz Imielinski, and Arun Swami. 1993. Mining association rules between sets of items in large databases. In SIGMOD, pages 207-216, Washington, DC.
[5]
Agrawal, Rakesh, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. 1996. Fast discovery of association rules. In U. M. Fayyad, G. Pratetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, pages 307-328, Cambridge, MA.
[6]
Agrawal, Rakesh and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules in large databases. In VLDB, pages 487-499, Santiago de Chile, Chile.
[7]
Agresti, Alan. 2002. Categorical Data Analysis. John Wiley & Sons, Inc., Hoboken, NJ, second edition.
[8]
Alon, Noga, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In STOC, pages 20-29, Philadelphia, PA.
[9]
Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press, New York, NY.
[10]
Boyd, Stephen and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press, Cambridge, UK.
[11]
Brin, Sergey, James Davis, and Hector Garcia-Molina. 1995. Copy detection mechanisms for digital documents. In SIGMOD, pages 398-409, San Jose, CA.
[12]
Brin, Sergey and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107-117, Brisbane, Australia.
[13]
Brin, Sergy, Rajeev Motwani, and Craig Silverstein. 1997. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD, pages 265-276, Tucson, AZ.
[14]
Brin, Sergy, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. 1997. Dynamic itemset counting and implication rules for market basket data. In SIGMOD, pages 265-276, Tucson, AZ.
[15]
Broder, Andrei Z. 1997. On the resemblance and containment of documents. In The Compression and Complexity of Sequences, pages 21-29, Positano, Italy.
[16]
Broder, Andrei Z. 1998. Filtering near-duplicate documents. In FUN, Isola d'Elba, Italy.
[17]
Broder, Andrei Z., Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. 1998. Min-wise independent permutations (extended abstract). In STOC, pages 327-336, Dallas, TX.
[18]
Broder, Andrei Z., Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. 2000. Min-wise independent permutations. Journal of Computer Systems and Sciences, 60(3):630-659.
[19]
Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. In WWW, pages 1157-1166, Santa Clara, CA.
[20]
Charikar, Moses S. 2002. Similarity estimation techniques from rounding algorithms. In STOC, pages 380-388, Montreal, Canada.
[21]
Chaudhuri Surajit, Rajeev Motwani, and Vivek R. Narasayya. 1998. Random sampling for histogram construction: How much is enough? In SIGMOD, pages 436-447, Seattle, WA.
[22]
Chaudhuri, Surajit, Rajeev Motwani, and Vivek R. Narasayya. 1999. On random sampling over joins. In SIGMOD, pages 263-274, Philadelphia, PA.
[23]
Chen, Bin, Peter Haas, and Peter Scheuermann. 2002. New two-phase sampling based algorithm for discovering association rules. In KDD, pages 462-468, Edmonton, Canada.
[24]
Church, Kenneth and Patrick Hanks. 1991. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1):22-29.
[25]
Cover, Thomas M. and Joy A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons, Inc., New York, NY.
[26]
David, Herbert A. 1981. Order Statistics. John Wiley & Sons, Inc., New York, NY, second edition.
[27]
Deming, W. Edwards and Frederick F. Stephan. 1940. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics, 11(4):427-444.
[28]
Drineas, Petros and Michael W. Mahoney. 2005. Approximating a gram matrix for improved kernel-based learning. In COLT, pages 323-337, Bertinoro, Italy.
[29]
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.
[30]
Etzioni, Oren, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2004. Web-scale information extraction in knowitall (preliminary results). In WWW, pages 100-110, New York, NY.
[31]
Garcia-Molina, Hector, Jeffrey D. Ullman, and Jennifer Widom. 2002. Database Systems: The Complete Book. Prentice Hall, New York, NY.
[32]
Gilbert, Anna C., Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. 2003. One-pass wavelet decompositions of data streams. IEEE Transactions on Knowledge and Data Engineering, 15(3):541-554.
[33]
Guha Sudipto, Rajeev Rastogi, and Kyuseok Shim. 1998. Cure: An efficient clustering algorithm for large databases. In SIGMOD, pages 73-84, Seattle, WA.
[34]
Hastie, T., R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, NY.
[35]
Haveliwala, Taher H., Aristides Gionis, and Piotr Indyk. 2000. Scalable techniques for clustering the Web. In WebDB, pages 129-134, Dallas, TX.
[36]
Haveliwala, Taher H., Aristides Gionis, Dan Klein, and Piotr Indyk. 2002. Evaluating strategies for similarity search on the web. In WWW, pages 432-442, Honolulu, HI.
[37]
Hidber, Christian. 1999. Online association rule mining. In SIGMOD, pages 145-156, Philadelphia, PA.
[38]
Hornby, Albert Sydney, editor. 1989. Oxford Advanced Learner's Dictionary of Current English. Oxford University Press, Oxford, UK, fourth edition.
[39]
Indyk, Piotr. 2001. A small approximately min-wise independent family of hash functions. Journal of Algorithm, 38(1):84-90.
[40]
Indyk, Piotr and Rajeev Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604-613, Dallas, TX.
[41]
Itoh, Toshiya, Yoshinori Takei, and Jun Tarui. 2003. On the sample size of k-restricted min-wise independent permutations and other k-wise distributions. In STOC, pages 710-718, San Diego, CA.
[42]
Lehmann, Erich L. and George Casella. 1998. Theory of Point Estimation. Springer, New York, NY, second edition.
[43]
Li, Ping. 2006. Very sparse stable random projections, estimators and tail bounds for stable random projections. Technical report, available from http://arxiv.org/PS_cache/cs/pdf/ 0611/0611114v2.pdf.
[44]
Li, Ping and Kenneth W. Church. 2005. Using sketches to estimate two-way and multi-way associations. Technical Report TR-2005-115, Microsoft Research, Redmond, WA, September.
[45]
Li, Ping, Kenneth W. Church, and Trevor J. Hastie. 2006. Conditional random sampling: A sketched-based sampling technique for sparse data. Technical Report 2006-08, Department of Statistics, Stanford University.
[46]
Li, Ping, Kenneth W. Church, and Trevor J. Hastie. 2007. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873-880. Vancouver, BC, Canada.
[47]
Li, Ping, Trevor J. Hastie, and Kenneth W. Church. 2006a. Improving random projections using marginal information. In COLT, pages 635-649, Pittsburgh, PA.
[48]
Li, Ping, Trevor J. Hastie, and Kenneth W. Church. 2006b. Very sparse random projections. In KDD, pages 287-296, Philadelphia, PA.
[49]
Li, Ping, Trevor J. Hastie, and Kenneth W. Church. 2007. Nonlinear estimators and tail bounds for dimensional reduction in l1 using Cauchy random projections. In COLT, pages 514-529, San Diego, CA.
[50]
Manku, Gurmeet Singh, Sridhar Rajagopalan, and Bruce G. Lindsay. 1999. Random sampling techniques for space efficient online computation of order statistics of large datasets. In SIGCOMM, pages 251-262, Philadelphia, PA.
[51]
Manning, Chris D. and Hinrich Schutze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA.
[52]
Matias, Yossi, Jeffrey Scott Vitter, and Min Wang. 1998. Wavelet-based histograms for selectivity estimation. In SIGMOD, pages 448-459, Seattle, WA.
[53]
Moore, Robert C. 2004. On log-likelihood-ratios and the significance of rare events. In EMNLP, pages 333-340, Barcelona, Spain.
[54]
Pearsall, Judy, editor. 1998. The New Oxford Dictionary of English. Oxford University Press, Oxford, UK.
[55]
Ravichandran, Deepak, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: Using locality sensitive hash function for high speed noun clustering. In ACL, pages 622-629, Ann Arbor, MI.
[56]
Rosen, Bengt. 1972a. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373-397.
[57]
Rosen, Bengt. 1972b. Asymptotic theory for successive sampling with varying probabilities without replacement, II. The Annals of Mathematical Statistics, 43(3):748-776.
[58]
Salton, Gerard. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, New York, NY.
[59]
Stephan, Frederick F. 1942. An iterative method of adjusting sample frequency tables when expected marginal totals are known. The Annals of Mathematical Statistics, 13(2):166-178.
[60]
Strehl, Alexander and Joydeep Ghosh. 2000. A scalable approach to balanced, high-dimensional clustering of market-baskets. In HiPC, pages 525-536, Bangalore, India.
[61]
Toivonen, Hannu. 1996. Sampling large databases for association rules. In VLDB, pages 134-145, Bombay, India.
[62]
Vempala, Santosh. 2004. The Random Projection Method. American Mathematical Society, Providence, RI.
[63]
Witten, Ian H., Alstair Moffat, and Timothy C. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishing, San Francisco, CA, second edition.

Cited By

View all
  • (2018)Efficient estimation of inclusion coefficient using hyperloglog sketchesProceedings of the VLDB Endowment10.14778/3231751.323175911:10(1097-1109)Online publication date: 1-Jun-2018
  • (2017)Binary Vectors for Fast Distance and Similarity EstimationCybernetics and Systems Analysis10.1007/s10559-017-9914-x53:1(138-156)Online publication date: 1-Jan-2017
  • (2014)Bilingually Learning Word Senses for TranslationProceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing - Volume 840410.1007/978-3-642-54903-8_24(283-295)Online publication date: 6-Apr-2014
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Computational Linguistics
Computational Linguistics  Volume 33, Issue 3
September 2007
154 pages
ISSN:0891-2017
EISSN:1530-9312
Issue’s Table of Contents

Publisher

MIT Press

Cambridge, MA, United States

Publication History

Published: 01 September 2007
Published in COLI Volume 33, Issue 3

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 10 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2018)Efficient estimation of inclusion coefficient using hyperloglog sketchesProceedings of the VLDB Endowment10.14778/3231751.323175911:10(1097-1109)Online publication date: 1-Jun-2018
  • (2017)Binary Vectors for Fast Distance and Similarity EstimationCybernetics and Systems Analysis10.1007/s10559-017-9914-x53:1(138-156)Online publication date: 1-Jan-2017
  • (2014)Bilingually Learning Word Senses for TranslationProceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing - Volume 840410.1007/978-3-642-54903-8_24(283-295)Online publication date: 6-Apr-2014
  • (2012)Space efficiencies in discourse modeling via conditional random samplingProceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies10.5555/2382029.2382103(513-517)Online publication date: 3-Jun-2012
  • (2012)Concept-based web searchProceedings of the 31st international conference on Conceptual Modeling10.1007/978-3-642-34002-4_35(449-462)Online publication date: 15-Oct-2012
  • (2011)Theory and applications of b-bit minwise hashingCommunications of the ACM10.1145/1978542.197856654:8(101-109)Online publication date: 1-Aug-2011
  • (2010)Approximating higher-order distances using random projectionsProceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence10.5555/3023549.3023586(312-321)Online publication date: 8-Jul-2010
  • (2010)b-Bit minwise hashing for estimating three-way similaritiesProceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 210.5555/2997046.2997051(1387-1395)Online publication date: 6-Dec-2010
  • (2010)Sketching techniques for large scale NLPProceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop10.5555/1868765.1868768(17-25)Online publication date: 5-Jun-2010
  • (2010)Optimizing content freshness of relations extracted from the web using keyword searchProceedings of the 2010 ACM SIGMOD International Conference on Management of data10.1145/1807167.1807256(819-830)Online publication date: 6-Jun-2010
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media