More Web Proxy on the site http://driver.im/

research-article

Entropy of search logs: how hard is search? with personalization? with backoff?

Authors:

Kenneth ChurchAuthors Info & Claims

WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining

Pages 45 - 54

https://doi.org/10.1145/1341531.1341540

Published: 11 February 2008 Publication History

Abstract

How many pages are there on the Web? 5B? 20B? More? Less? Big bets on clusters in the clouds could be wiped out if a small cache of a few million urls could capture much of the value. Language modeling techniques are applied to MSN's search logs to estimate entropy. The perplexity is surprisingly small: millions, not billions.

Entropy is a powerful tool for sizing challenges and opportunities. How hard is search? How hard are query suggestion mechanisms like auto-complete? How much does personalization help? All these difficult questions can be answered by estimation of entropy from search logs.

What is the potential opportunity for personalization? In this paper, we propose a new way to personalize search, personalization with backoff. If we have relevant data for a particular user, we should use it. But if we don't, back off to larger and larger classes of similar users. As a proof of concept, we use the first few bytes of the IP address to define classes. The coefficients of each backoff class are estimated with an EM algorithm. Ideally, classes would be defined by market segments, demographics and surrogate variables such as time and geography

References

[1]

C. Anderson. The Long Tail: Why the Future of Business is Selling Less of More. Hyperion, 2006.

Digital Library

[2]

K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the seventh international conference on World Wide Web 7, pages 379--388, 1998.

Digital Library

[3]

S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 310--318, 1996.

Digital Library

[4]

N. Chomsky. Syntactic Structures. The Hague/Paris: Mouton, 1957.

[5]

K. Church and W. Gale. A comparison of the enhanced good-turing and deleted estimation methods for estimating probabilities of english bigrams. Computer Speech and Language, 5(1):19--54, 1991.

[6]

K. Church and R. Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1--24, 1993.

Digital Library

[7]

K. Church and B. Thiesson. The wild thing! In Proceedings of the ACL 2005, pages 93--96, 2005.

Digital Library

[8]

C. Cortes and D. Pregibon. Signature-based methods for data streams. Data Min. Knowl. Discov., 5(3):167--182, 2001.

Digital Library

[9]

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of Royal Statist. Soc. B, 39:1--38, 1977.

[10]

D. Dhyani, W. K. Ng, and S. S. Bhowmick. A survey of web metrics. ACM Comput. Surv., 34(4):469--503, 2002.

Digital Library

[11]

R. Gallager. Claude E. Shannon: A retrospective on his life, work, and impact. IEEE Transactions on Information Theory, 47(7):2681--2695, 2001.

Digital Library

[12]

A. Gulli and A. Signorini. The indexable web is more than 11.5 billion pages. In Special interest tracks and posters of the 14th international conference on World Wide Web, pages 902--903, 2005.

Digital Library

[13]

G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pages 271--279, 2003.

Digital Library

[14]

T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133--142, 2002.

Digital Library

[15]

R. Jones and D. C. Fain. Query word deletion prediction. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 435-436, 2003.

Digital Library

[16]

R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In Proceedings of the 15th international conference on World Wide Web, pages 387--396, 2006.

Digital Library

[17]

S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speeech and Signal Processing, 35(3):400--401, 1987.

[18]

S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280(5360):98--100, 1998.

[19]

S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400(6740):107--109, 1999.

[20]

S. Lawrence and C. L. Giles. Accessibility of information on the web. Intelligence, 11(1):32--39, 2000.

Digital Library

[21]

F. Radlinski and T. Joachims. Query chains: learning to rank from implicit feedback. In Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 239--248, 2005.

Digital Library

[22]

C. Sagan. Billions and Billions: Thoughts on Life and Death at the Brink of the Millennium. New York: Random House, 1997.

[23]

C. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379--423, 623-656, 1948.

[24]

C. Shannon. Prediction and entropy of printed english. Bell Systems Technical Journal, 30:50--64, 1951.

[25]

X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval using implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 43--50, 2005.

Digital Library

[26]

X. Shen, B. Tan, and C. Zhai. Ucair: a personalized search toolbar. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 681--681, 2005.

Digital Library

[27]

K. Sugiyama, K. Hatano, and M. Yoshikawa. Adaptive web search based on user profile constructed without any effort from users. In Proceedings of the 13th international conference on World Wide Web, pages 675--684, 2004.

Digital Library

[28]

B. Tan, X. Shen, and C. Zhai. Mining long-term search history to improve search accuracy. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 718--723, 2006.

Digital Library

[29]

J. Teevan, S. T. Dumais, and E. Horvitz. Personalizing search via automated analysis of interests and activities. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 449--456, New York, NY, USA, 2005. ACM Press.

Digital Library

[30]

R. W. White and G. Marchionini. Examining the effectiveness of real-time query expansion. Information Processing and Management (IPM), 43(3):685--704, 2007.

Digital Library

[31]

C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 334--342, 2001.

Digital Library

Cited By

Sangamithra BManjunath Swamy BSunil Kumar M(2022)Personalized Ranking Mechanism Using Yandex Dataset on Machine Learning ApproachesProceedings of the International Conference on Cognitive and Intelligent Computing10.1007/978-981-19-2350-0_61(629-639)Online publication date: 1-Nov-2022
https://doi.org/10.1007/978-981-19-2350-0_61
Pescetelli NRutherford ARahwan I(2021)Modularity and composite diversity affect the collective gathering of information onlineNature Communications10.1038/s41467-021-23424-112:1Online publication date: 27-May-2021
https://doi.org/10.1038/s41467-021-23424-1
Liu JLiu CBelkin N(2020)Personalization in text information retrievalJournal of the Association for Information Science and Technology10.1002/asi.2423471:3(349-369)Online publication date: 28-Jan-2020
https://dl.acm.org/doi/10.1002/asi.24234
Show More Cited By

Index Terms

Entropy of search logs: how hard is search? with personalization? with backoff?
1. Information systems
  1. Information retrieval

Recommendations

An entropy-based neighbor selection approach for collaborative filtering

Collaborative filtering is an emerging technology to deal with information overload problem guiding customers by offering recommendations on products of possible interest. Forming neighborhood of a user/item is the crucial part of the recommendation ...
Research on Entropy-based Collaborative Filtering Algorithm
ICEBE '07: Proceedings of the IEEE International Conference on e-Business Engineering

Based on the brief introduction to the user-based and item-based collaborative filtering algorithms, the problems related to the two algorithms are analyzed, and a new entropy-based recommendation algorithm is proposed. Aimed at the drawbacks of ...
OLAP on search logs: an infrastructure supporting data-driven applications in search engines
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Search logs, which contain rich and up-to-date information about users' needs and preferences, have become a critical data source for search engines. Recently, more and more data-driven applications are being developed in search engines based on search ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining

February 2008

270 pages

ISBN:9781595939272

DOI:10.1145/1341531

General Chair:
Marc Najork
Microsoft, USA
,
Program Chairs:
Andrei Broder
Yahoo!, USA
,
Soumen Chakrabarti
IIT Bombay, India

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 February 2008

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

44
Total Citations
View Citations
960
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Sangamithra BManjunath Swamy BSunil Kumar M(2022)Personalized Ranking Mechanism Using Yandex Dataset on Machine Learning ApproachesProceedings of the International Conference on Cognitive and Intelligent Computing10.1007/978-981-19-2350-0_61(629-639)Online publication date: 1-Nov-2022
https://doi.org/10.1007/978-981-19-2350-0_61
Pescetelli NRutherford ARahwan I(2021)Modularity and composite diversity affect the collective gathering of information onlineNature Communications10.1038/s41467-021-23424-112:1Online publication date: 27-May-2021
https://doi.org/10.1038/s41467-021-23424-1
Liu JLiu CBelkin N(2020)Personalization in text information retrievalJournal of the Association for Information Science and Technology10.1002/asi.2423471:3(349-369)Online publication date: 28-Jan-2020
https://dl.acm.org/doi/10.1002/asi.24234
Steiner M(2019)The Influence of Backstories on Queries with Varying Levels of Intent in Task-Based Specialised Information RetrievalAdvances in Information Retrieval10.1007/978-3-030-15719-7_52(375-379)Online publication date: 7-Apr-2019
https://doi.org/10.1007/978-3-030-15719-7_52
Mutlu BVeas ETrattner CDolog PVojtas PBonchi FHelic D(2017)Tags, Titles or Q&As?Proceedings of the 28th ACM Conference on Hypertext and Social Media10.1145/3078714.3078741(265-274)Online publication date: 4-Jul-2017
https://dl.acm.org/doi/10.1145/3078714.3078741
Ma SYang HShi M(2017)Creative Travel Idea Generation Based on Semantic Web and Lateral Thinking2017 14th International Symposium on Pervasive Systems, Algorithms and Networks & 2017 11th International Conference on Frontier of Computer Science and Technology & 2017 Third International Symposium of Creative Computing (ISPAN-FCST-ISCC)10.1109/ISPAN-FCST-ISCC.2017.91(462-468)Online publication date: Jun-2017
https://doi.org/10.1109/ISPAN-FCST-ISCC.2017.91
Ma SLi SYang H(2016)Utilising Creative Computing and data mining techniques to analyse queries in a meta-search system2016 22nd International Conference on Automation and Computing (ICAC)10.1109/IConAC.2016.7604953(402-407)Online publication date: Sep-2016
https://doi.org/10.1109/IConAC.2016.7604953
Popescu PSluşanschi EIancu VPop F(2016)A new upper bound for Shannon entropy. A novel approach in modeling of Big Data applicationsConcurrency and Computation: Practice & Experience10.1002/cpe.344428:2(351-359)Online publication date: 1-Feb-2016
https://dl.acm.org/doi/10.1002/cpe.3444
Lee CCraswell NMurdock VBaeza-Yates RLalmas MMoffat ARibeiro-Neto B(2015)Inter-Category Variation in Location SearchProceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/2766462.2767797(863-866)Online publication date: 9-Aug-2015
https://dl.acm.org/doi/10.1145/2766462.2767797
Shokouhi MGuo QBaeza-Yates RLalmas MMoffat ARibeiro-Neto B(2015)From Queries to CardsProceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/2766462.2767705(695-704)Online publication date: 9-Aug-2015
https://dl.acm.org/doi/10.1145/2766462.2767705
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten