research-article

Web Search Clustering and Labeling with Hidden Topics

Authors:

Xuan-Hieu Phan,

Susumu Horiguchi,

Thu-Trang Nguyen,

Quang-Thuy Ha

ACM Transactions on Asian Language Information Processing (TALIP), Volume 8, Issue 3

Article No.: 12, Pages 1 - 40

https://doi.org/10.1145/1568292.1568295

Published: 01 August 2009

Abstract

Web search clustering reorganizes search results (also called “snippets”) so that they are more convenient to browse. Such post-retrieval clustering systems have three key requirements: (1) the clustering algorithm should group similar documents together; (2) clusters should be labeled with descriptive phrases; and (3) the clustering system should provide high-quality clustering without downloading whole Web pages.

This article introduces a novel framework for clustering Web search results in Vietnamese that addresses the three requirements above. The main motivation is that, by enriching short snippets with hidden topics drawn from large document collections on the Internet, the framework can cluster and label such snippets effectively in a topic-oriented manner without needing the whole Web pages. Our approach is based on recent successful topic analysis models, such as Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. The underlying idea is to collect a very large external data collection, called the “universal dataset,” and then build a clustering system on both the original snippets and a rich set of hidden topics discovered from that collection. This can be seen as a richer representation of the snippets to be clustered. We carefully evaluate our method and show that it can yield impressive clustering quality.
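The enrichment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the topic-word table `TOPIC_WORD`, the helper names, and the scoring rule in `topic_proportions` are all hypothetical, and the scoring rule is only a crude stand-in for proper topic inference with a model such as Latent Dirichlet Allocation. The sketch shows how two snippets that share no words can still become similar once topic features are appended to their word vectors.

```python
import math
from collections import Counter

# Hypothetical topic-word probabilities, standing in for a topic model
# estimated offline on a large external collection (the "universal dataset").
TOPIC_WORD = {
    "topic_sports": {"football": 0.4, "match": 0.3, "goal": 0.2, "league": 0.1},
    "topic_finance": {"bank": 0.4, "stock": 0.3, "market": 0.2, "loan": 0.1},
}


def bag_of_words(snippet):
    """Sparse word-count vector for a snippet (whitespace tokenization)."""
    return {w: float(c) for w, c in Counter(snippet.lower().split()).items()}


def topic_proportions(tokens):
    """Crude stand-in for topic inference: score each topic by the summed
    probability of the snippet's words, then normalize the scores."""
    scores = {t: sum(probs.get(w, 0.0) for w in tokens)
              for t, probs in TOPIC_WORD.items()}
    total = sum(scores.values())
    if total == 0.0:
        return {t: 0.0 for t in scores}
    return {t: s / total for t, s in scores.items()}


def enrich(snippet, topic_weight=1.0):
    """Word counts plus one extra feature per topic with nonzero weight."""
    vector = bag_of_words(snippet)
    for topic, p in topic_proportions(snippet.lower().split()).items():
        if p > 0.0:
            vector[topic] = topic_weight * p
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


a = "football match tonight"
b = "league goal highlights"
c = "stock market news"

plain_ab = cosine(bag_of_words(a), bag_of_words(b))  # no shared words: 0.0
rich_ab = cosine(enrich(a), enrich(b))               # shared topic feature: 0.25
rich_ac = cosine(enrich(a), enrich(c))               # different topics: 0.0
```

Any standard clustering algorithm could then be run on the enriched vectors. The `topic_weight` parameter is an assumed knob for balancing word features against topic features.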

Index Terms

Web Search Clustering and Labeling with Hidden Topics
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Information & Contributors

Information

Published In

ACM Transactions on Asian Language Information Processing Volume 8, Issue 3

August 2009

81 pages

ISSN:1530-0226

EISSN:1558-3430

DOI:10.1145/1568292

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 August 2009

Accepted: 01 May 2009

Revised: 01 April 2009

Received: 01 September 2008

Published in TALIP Volume 8, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 18
Total Downloads: 949
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 11 Dec 2024

Citations

Cited By

Puspitaningrum D, Fauzi Susilo B, Pagua J, Erlansari A, Andreswari D, Efendi R, Prasetya I. (2016). An MDL-Based Frequent Itemset Hierarchical Clustering Technique to Improve Query Search Results of an Individual Search Engine. Information Retrieval Technology, 279-291. DOI: 10.1007/978-3-319-28940-3_22. Online publication date: 22-Jan-2016.
https://doi.org/10.1007/978-3-319-28940-3_22
Wei B, Liu J, Zheng Q, Zhang W, Wang C, Wu B. (2015). DF-Miner. Knowledge-Based Systems 77:C, 80-91. DOI: 10.1016/j.knosys.2015.01.001. Online publication date: 1-Mar-2015.
https://dl.acm.org/doi/10.1016/j.knosys.2015.01.001
Li Z, Li J, Liao Y, Wen S, Tang J. (2015). Labeling clusters from both linguistic and statistical perspectives. Knowledge-Based Systems 76:1, 219-227. DOI: 10.1016/j.knosys.2014.12.019. Online publication date: 1-Mar-2015.
https://dl.acm.org/doi/10.1016/j.knosys.2014.12.019
Alghamdi H, Selamat A, Abdul Karim N. (2014). Arabic web pages clustering and annotation using semantic class features. Journal of King Saud University - Computer and Information Sciences 26:4, 388-397. DOI: 10.1016/j.jksuci.2014.06.002. Online publication date: 1-Dec-2014.
https://dl.acm.org/doi/10.1016/j.jksuci.2014.06.002
Di Marco A, Navigli R. (2013). Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction. Computational Linguistics 39:3, 709-754. DOI: 10.1162/COLI_a_00148. Online publication date: Sep-2013.
https://doi.org/10.1162/COLI_a_00148
Nguyen C, Kaothanthong N, Tokuyama T, Phan X. (2013). A feature-word-topic model for image annotation and retrieval. ACM Transactions on the Web 7:3, 1-24. DOI: 10.1145/2516633.2516634. Online publication date: 30-Sep-2013.
https://dl.acm.org/doi/10.1145/2516633.2516634
Mirylenka D, Passerini A, He Q, Iyengar A, Nejdl W, Pei J, Rastogi R. (2013). Navigating the topical structure of academic search results via the Wikipedia category network. Proceedings of the 22nd ACM international conference on Information & Knowledge Management, 891-896. DOI: 10.1145/2505515.2505621. Online publication date: 27-Oct-2013.
https://dl.acm.org/doi/10.1145/2505515.2505621
Hung L, Anh N, Dang N, Trong G, Hluchy L, Quyet T, Castelli E, Duc K, Chi M, Tran V. (2012). Improving Vietnamese web page clustering by combining neighbors' content and using iterative feature selection. Proceedings of the 3rd Symposium on Information and Communication Technology, 47-54. DOI: 10.1145/2350716.2350726. Online publication date: 23-Aug-2012.
https://dl.acm.org/doi/10.1145/2350716.2350726
Feng S, Wang D, Yu G, Gao W, Wong K. (2011). Extracting common emotions from blogs based on fine-grained sentiment clustering. Knowledge and Information Systems 27:2, 281-302. DOI: 10.5555/3225632.3225758. Online publication date: 1-May-2011.
https://dl.acm.org/doi/10.5555/3225632.3225758
Di Marco A, Navigli R. (2011). Clustering web search results with maximum spanning trees. Proceedings of the 12th international conference on Artificial intelligence around man and beyond, 201-212. DOI: 10.5555/2041977.2042002. Online publication date: 15-Sep-2011.
https://dl.acm.org/doi/10.5555/2041977.2042002
