DOI: 10.1145/2566486.2567965
research-article

Fast topic discovery from web search streams

Published: 07 April 2014

Abstract

Web search involves voluminous data streams that record millions of users' interactions with the search engine. Recently, latent topics in web search data have been found to be critical for a wide range of search engine applications, such as search personalization and search history warehousing. However, existing methods usually discover latent topics from web search data in an offline, retrospective fashion. Hence, they are increasingly ineffective in the face of ever-growing web search data that accumulate in the form of online streams. In this paper, we propose a novel probabilistic topic model, the Web Search Stream Model (WSSM), which is carefully calibrated to handle two salient features of web search data: the data arrive as streams, and they arrive in massive volume. We further propose an efficient parameter inference method, Stream Parameter Inference (SPI), to train WSSM on massive web search streams. Using a large-scale search engine query log, we conduct extensive experiments to verify the effectiveness and efficiency of WSSM and SPI. We observe that WSSM together with SPI discovers latent topics from web search streams faster than state-of-the-art methods while retaining comparable topic modeling accuracy.

    Published In

    WWW '14: Proceedings of the 23rd international conference on World wide web
    April 2014
    926 pages
    ISBN: 9781450327442
    DOI: 10.1145/2566486

    Sponsors

    • IW3C2: International World Wide Web Conference Committee

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. probabilistic topic model
    2. query log
    3. web search

    Qualifiers

    • Research-article

    Conference

    WWW '14
    Sponsor: IW3C2

    Acceptance Rates

    WWW '14 Paper Acceptance Rate 84 of 645 submissions, 13%;
    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 8
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 13 Dec 2024

    Cited By

    • (2019) A Knowledge-Based Semisupervised Hierarchical Online Topic Detection Framework. IEEE Transactions on Cybernetics 49:9 (3307-3321). DOI: 10.1109/TCYB.2018.2841504. Online publication date: Sep-2019
    • (2019) A new anchor word selection method for the separable topic discovery. WIREs Data Mining and Knowledge Discovery 9:5. DOI: 10.1002/widm.1313. Online publication date: 13-May-2019
    • (2017) Learning Latent Topics from the Word Co-occurrence Network. Theoretical Computer Science (18-30). DOI: 10.1007/978-981-10-6893-5_2. Online publication date: 14-Oct-2017
    • (2016) Dynamic Query Intent Prediction from a Search Log Stream. International Journal of Information Retrieval Research 6:2 (66-85). DOI: 10.4018/IJIRR.2016040104. Online publication date: 1-Apr-2016
    • (2016) Cross-Lingual Topic Discovery From Multilingual Search Engine Query Log. ACM Transactions on Information Systems 35:2 (1-28). DOI: 10.1145/2956235. Online publication date: 21-Sep-2016
    • (2016) Word network topic model. Knowledge and Information Systems 48:2 (379-398). DOI: 10.1007/s10115-015-0882-z. Online publication date: 1-Aug-2016
    • (2015) Scalable Parallel EM Algorithms for Latent Dirichlet Allocation in Multi-Core Systems. Proceedings of the 24th International Conference on World Wide Web (669-679). DOI: 10.1145/2736277.2741106. Online publication date: 18-May-2015
