[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1341531.1341540acmconferencesArticle/Chapter ViewAbstractPublication PageswsdmConference Proceedingsconference-collections
research-article

Entropy of search logs: how hard is search? with personalization? with backoff?

Published: 11 February 2008 Publication History

Abstract

How many pages are there on the Web? 5B? 20B? More? Less? Big bets on clusters in the clouds could be wiped out if a small cache of a few million urls could capture much of the value. Language modeling techniques are applied to MSN's search logs to estimate entropy. The perplexity is surprisingly small: millions, not billions.
Entropy is a powerful tool for sizing challenges and opportunities. How hard is search? How hard are query suggestion mechanisms like auto-complete? How much does personalization help? All these difficult questions can be answered by estimation of entropy from search logs.
What is the potential opportunity for personalization? In this paper, we propose a new way to personalize search, personalization with backoff. If we have relevant data for a particular user, we should use it. But if we don't, back off to larger and larger classes of similar users. As a proof of concept, we use the first few bytes of the IP address to define classes. The coefficients of each backoff class are estimated with an EM algorithm. Ideally, classes would be defined by market segments, demographics and surrogate variables such as time and geography

References

[1]
C. Anderson. The Long Tail: Why the Future of Business is Selling Less of More. Hyperion, 2006.
[2]
K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the seventh international conference on World Wide Web 7, pages 379--388, 1998.
[3]
S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 310--318, 1996.
[4]
N. Chomsky. Syntactic Structures. The Hague/Paris: Mouton, 1957.
[5]
K. Church and W. Gale. A comparison of the enhanced good-turing and deleted estimation methods for estimating probabilities of english bigrams. Computer Speech and Language, 5(1):19--54, 1991.
[6]
K. Church and R. Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1--24, 1993.
[7]
K. Church and B. Thiesson. The wild thing! In Proceedings of the ACL 2005, pages 93--96, 2005.
[8]
C. Cortes and D. Pregibon. Signature-based methods for data streams. Data Min. Knowl. Discov., 5(3):167--182, 2001.
[9]
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of Royal Statist. Soc. B, 39:1--38, 1977.
[10]
D. Dhyani, W. K. Ng, and S. S. Bhowmick. A survey of web metrics. ACM Comput. Surv., 34(4):469--503, 2002.
[11]
R. Gallager. Claude E. Shannon: A retrospective on his life, work, and impact. IEEE Transactions on Information Theory, 47(7):2681--2695, 2001.
[12]
A. Gulli and A. Signorini. The indexable web is more than 11.5 billion pages. In Special interest tracks and posters of the 14th international conference on World Wide Web, pages 902--903, 2005.
[13]
G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pages 271--279, 2003.
[14]
T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133--142, 2002.
[15]
R. Jones and D. C. Fain. Query word deletion prediction. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 435-436, 2003.
[16]
R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In Proceedings of the 15th international conference on World Wide Web, pages 387--396, 2006.
[17]
S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speeech and Signal Processing, 35(3):400--401, 1987.
[18]
S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280(5360):98--100, 1998.
[19]
S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400(6740):107--109, 1999.
[20]
S. Lawrence and C. L. Giles. Accessibility of information on the web. Intelligence, 11(1):32--39, 2000.
[21]
F. Radlinski and T. Joachims. Query chains: learning to rank from implicit feedback. In Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 239--248, 2005.
[22]
C. Sagan. Billions and Billions: Thoughts on Life and Death at the Brink of the Millennium. New York: Random House, 1997.
[23]
C. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379--423, 623-656, 1948.
[24]
C. Shannon. Prediction and entropy of printed english. Bell Systems Technical Journal, 30:50--64, 1951.
[25]
X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval using implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 43--50, 2005.
[26]
X. Shen, B. Tan, and C. Zhai. Ucair: a personalized search toolbar. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 681--681, 2005.
[27]
K. Sugiyama, K. Hatano, and M. Yoshikawa. Adaptive web search based on user profile constructed without any effort from users. In Proceedings of the 13th international conference on World Wide Web, pages 675--684, 2004.
[28]
B. Tan, X. Shen, and C. Zhai. Mining long-term search history to improve search accuracy. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 718--723, 2006.
[29]
J. Teevan, S. T. Dumais, and E. Horvitz. Personalizing search via automated analysis of interests and activities. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 449--456, New York, NY, USA, 2005. ACM Press.
[30]
R. W. White and G. Marchionini. Examining the effectiveness of real-time query expansion. Information Processing and Management (IPM), 43(3):685--704, 2007.
[31]
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 334--342, 2001.

Cited By

View all
  • (2022)Personalized Ranking Mechanism Using Yandex Dataset on Machine Learning ApproachesProceedings of the International Conference on Cognitive and Intelligent Computing10.1007/978-981-19-2350-0_61(629-639)Online publication date: 1-Nov-2022
  • (2021)Modularity and composite diversity affect the collective gathering of information onlineNature Communications10.1038/s41467-021-23424-112:1Online publication date: 27-May-2021
  • (2020)Personalization in text information retrievalJournal of the Association for Information Science and Technology10.1002/asi.2423471:3(349-369)Online publication date: 28-Jan-2020
  • Show More Cited By

Index Terms

  1. Entropy of search logs: how hard is search? with personalization? with backoff?

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining
    February 2008
    270 pages
    ISBN:9781595939272
    DOI:10.1145/1341531
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 February 2008

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. demographics
    2. entropy
    3. personalization with backoff
    4. search difficulty
    5. search log

    Qualifiers

    • Research-article

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)8
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 26 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Personalized Ranking Mechanism Using Yandex Dataset on Machine Learning ApproachesProceedings of the International Conference on Cognitive and Intelligent Computing10.1007/978-981-19-2350-0_61(629-639)Online publication date: 1-Nov-2022
    • (2021)Modularity and composite diversity affect the collective gathering of information onlineNature Communications10.1038/s41467-021-23424-112:1Online publication date: 27-May-2021
    • (2020)Personalization in text information retrievalJournal of the Association for Information Science and Technology10.1002/asi.2423471:3(349-369)Online publication date: 28-Jan-2020
    • (2019)The Influence of Backstories on Queries with Varying Levels of Intent in Task-Based Specialised Information RetrievalAdvances in Information Retrieval10.1007/978-3-030-15719-7_52(375-379)Online publication date: 7-Apr-2019
    • (2017)Tags, Titles or Q&As?Proceedings of the 28th ACM Conference on Hypertext and Social Media10.1145/3078714.3078741(265-274)Online publication date: 4-Jul-2017
    • (2017)Creative Travel Idea Generation Based on Semantic Web and Lateral Thinking2017 14th International Symposium on Pervasive Systems, Algorithms and Networks & 2017 11th International Conference on Frontier of Computer Science and Technology & 2017 Third International Symposium of Creative Computing (ISPAN-FCST-ISCC)10.1109/ISPAN-FCST-ISCC.2017.91(462-468)Online publication date: Jun-2017
    • (2016)Utilising Creative Computing and data mining techniques to analyse queries in a meta-search system2016 22nd International Conference on Automation and Computing (ICAC)10.1109/IConAC.2016.7604953(402-407)Online publication date: Sep-2016
    • (2016)A new upper bound for Shannon entropy. A novel approach in modeling of Big Data applicationsConcurrency and Computation: Practice & Experience10.1002/cpe.344428:2(351-359)Online publication date: 1-Feb-2016
    • (2015)Inter-Category Variation in Location SearchProceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/2766462.2767797(863-866)Online publication date: 9-Aug-2015
    • (2015)From Queries to CardsProceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/2766462.2767705(695-704)Online publication date: 9-Aug-2015
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media