More Web Proxy on the site http://driver.im/

article

Towards a unified approach to document similarity search using manifold-ranking of blocks

Authors:

Jianguo XiaoAuthors Info & Claims

Information Processing and Management: an International Journal, Volume 44, Issue 3

Pages 1032 - 1048

https://doi.org/10.1016/j.ipm.2007.07.012

Published: 01 May 2008 Publication History

Abstract

Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document in a text corpus or on the Web. Most existing approaches to similarity search first compute the pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. Cosine), and then rank the documents by the similarity scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (i.e. a block of coherent text about a subtopic) to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach can make full use of the intrinsic global manifold structure of the document blocks by propagating the ranking scores between the blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are respectively employed to segment text documents and web pages into blocks. Then, each block is assigned with a ranking score by the manifold-ranking algorithm. Lastly, a document gets its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach can significantly improve the retrieval performances over baseline approaches. Document block is validated to be a better unit than the whole document in the manifold-ranking process.

References

[1]

Allan, J., Carbonell, J., Doddington, G., Yamron, J. P. & Yang, Y. (1998). Topic detection and tracking pilot study: Final report. In Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (pp. 194-218).

[2]

Modern information retrival. ACM Press and Addison Wesley.

Digital Library

[3]

Cai, D., Yu, S., Wen, J. -R., & Ma, W. -Y. (2003). VIPS: A vision based page segmentation algorithm. Microsoft technical report, MSR-TR-2003-79.

[4]

Cai, D., He, X., Wen, J. -R., & Ma, W. -Y. (2004). Block-level link analysis. In Proceedings of the 27th annual international ACM SIGIR conference (SIGIR'2004).

Digital Library

[5]

Callan, J. (1994). Passage-level evidence in document retrieval, In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (pp. 302-310).

Digital Library

[6]

Chen, J., Zhou, B., Shi, J., Zhang, H. -J., & Qiu, F. (2001). Function-based object model towards website adaptation. In Proceedings of the 10th world wide web conference (WWW10).

Digital Library

[7]

Choi, F. (1999). JTextTile: A free platform independent text segmentation algorithm. http://www.cs.man.ac.uk/~choif.

[8]

Language modeling for information retrieval. Kluwer Academic Publishers.

Digital Library

[9]

Cruz, I. F., Borisov, S., Marks, M. A., Webb, T. R. (1998). Measuring structural similarity among web documents: Preliminary results. In Proceedings of the 7th international conference on electronic publishing (pp. 513-524).

Digital Library

[10]

Dean, J., & Henzinger, M. R. (1999). Finding related pages in the World Wide Web. In Proceedings of the eighth international conference on world wide web (pp. 1467-1479).

Digital Library

[11]

Diaz, F. (2005). Regularizing ad hoc retrieval scores. In Proceedings of the 14th ACM international conference on information and knowledge management (CIKM'2005).

Digital Library

[12]

LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR). v22. 457-479.

Digital Library

[13]

Fogaras, D., & Rácz, B. (2004). Scaling link-based similarity search. Technical report.

[14]

Haveliwala, T.H., Gionis, A., Klein, D., Indyk, P. (2002). Evaluating strategies for similarity search on the Web. In Proceedings of WWW2002 (pp. 432-442).

Digital Library

[15]

Hearst, M. A. (1994). Multi-paragraph segmentation of expository text. In Proceedings of the 32PndPmeeting of the association for computational linguistics, Los Cruces, NM.

Digital Library

[16]

TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics. v23 i1. 33-64.

Digital Library

[17]

Hearst, M. A., & Pedersen, O. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR1996 (pp. 76-84).

Digital Library

[18]

Hearst, M. A., & Plaunt, C. (1993). Subtopic structuring for full-length document access. In Proceedings of the 16PthPannual international ACM/SIGIR conference, Pittsburgh, PA.

Digital Library

[19]

Iwayama, M., Fujii, A., Kando, N., Marukawa, Y. (2003). An empirical study on retrieval models for different document genres: Patents and newspaper articles. In Proceedings of SIGIR2003.

Digital Library

[20]

Jeh, G., & Widom, J. (2002). SimRank: A measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining.

Digital Library

[21]

Joshi, S., Agrawal, N., Krishnapuram, R., Negi, S. (2003). A bag of paths model for measuring structural similarity in web documents. In Proceedings of the 9th ACM SIGKDD conference (pp. 577-582).

Digital Library

[22]

Passage retrieval revisited. ACM SIGIR Forum. v31 iSI. 178-185.

Digital Library

[23]

Kaufmann, S. (1999). Cohesion and collocation: Using context vectors in text segmentation, In Proceedings of the 37th conference on association for computational linguistics (pp. 591-595).

Digital Library

[24]

Authoritative sources in a hyperlinked environment. Journal of the ACM. v46 i5. 604-632.

Digital Library

[25]

Kovacevic, M., Diligenti, M., Gori, M., & Milutinovic, V. (2002). Recognition of common areas in a web page using visual information: A possible application in a page classification. In Proceedings of 2002 IEEE international conference on data mining (ICDM'02), Maebashi City, Japan.

Digital Library

[26]

Kurland, O., Lee, L. (2005). PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of SIGIR2005 (pp. 306-313).

Digital Library

[27]

Kurland, O., Lee, L. (2006). Respect my authority! HITS without hyperlinks: utilizing cluster-based language models. In Proceedings of SIGIR2006.

Digital Library

[28]

Kurland, O., Lee, L., Domshlak, C. (2005). Better than the real thing? Iterative pseudo-query processing using cluster-based language models. In Proceedings of SIGIR2005.

Digital Library

[29]

Lin, Z., Lyu, M.R., King, I. (2006). PageSim: A novel link-based measure of web page similarity. In Proceeding of the 15th international world wide web conference.

Digital Library

[30]

Mihalcea, R., Tarau, P. (2004). Textrank: Bringing order into texts. In Proceedings of EMNLP 2004 (pp. 404-411).

[31]

Otterbacher, J., Erkan, G., Radev, D. (2005). Using random walks for question-focused sentence retrieval. In Proc. HLT/EMNLP 2005.

Digital Library

[32]

Page, L., Brin, S., Motwani, R., Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web, Technical report. Stanford, CA: Stanford University.

[33]

An algorithm for suffix stripping. Program. v14 i3. 130-137.

[34]

Qin, T., Liu, T. -Y., Zhang, X. -D., Chen, Z., Ma, W. -Y. (2005). A study of relevance propagation for web search. In Proceedings of SIGIR2005.

Digital Library

[35]

Robertson, S., Walker, S. (1994). Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proc. of the 17th international ACM/SIGIR conference on research and development in information retrieval (pp. 232-241).

Digital Library

[36]

Robertson, S., Walker, S., Beaulieu, M. (1999) Okapi at TREC-7: Automatic ad hoc, filtering, VLC and filtering tracks. In Proceedings of TREC'99.

[37]

The SMART retrieval system: Experiments in automatic document processing. Prentice-Hall.

Digital Library

[38]

A vector space model for automatic indexing. Communications of the ACM. v18 i11. 613-620.

Digital Library

[39]

Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted document length normalization. In Proceedings of SIGIR'96.

Digital Library

[40]

Smucker, M.D., & Allan, J. (2006). Find-similar: Similarity browsing as a search tool. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'2006) (pp. 461-468).

Digital Library

[41]

Song, R., Liu, H., Wen, J. -R., & Ma, W. -Y. (2004). Learning block importance models for web pages. In Proceeding of the thirteenth world wide web conference (WWW 2004) (pp. 203-211).

Digital Library

[42]

Tombros, A., & Ali, Z. (2005). Factors affecting web page similarity. In Proceedings of ECIR2005.

Digital Library

[43]

Information retrieval. Butterworths, London.

[44]

Xue, G. -R., Zeng, H. -J., Chen, Z., & Yu, Y. (2004). MRSSA: An iterative algorithm for similarity spreading over interrelated objects. In Proceedings of CIKM2004.

Digital Library

[45]

Yu, S., Cai, D., Wen, J. -R., Ma, W. -Y. (2003). Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In Proceedings of the twelfth international world wide web conference (WWW2003).

Digital Library

[46]

Zhang, B., Li, H., Liu, Y., Ji, L., Xi, W., Fan, W., Chen, Z., Ma, W. -Y. (2005). Improving web search results using affinity graph. In Proceedings of SIGIR2005.

Digital Library

[47]

Zhou, D., Weston, J., Gretton, A., Bousquet, O., Schölkopf, B. (2003). Ranking on data manifolds. In Proceedings of NIPS-2003.

[48]

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., Schölkopf, B. (2003). Learning with local and global consistency. In Proceedings of NIPS-2003.

Cited By

Lin DWei VWong R(2023)Effective and Scalable Manifold Ranking-Based Image Retrieval with Output BoundACM Transactions on Knowledge Discovery from Data10.1145/356557417:5(1-31)Online publication date: 7-Apr-2023
https://dl.acm.org/doi/10.1145/3565574
Sheetrit EShtok AKurland O(2020)A passage-based approach to learning to rank documentsInformation Retrieval10.1007/s10791-020-09369-x23:2(159-186)Online publication date: 1-Apr-2020
https://dl.acm.org/doi/10.1007/s10791-020-09369-x
Chen LWu CDai Y(2015)Find relationship among applicationsInternational Journal of Networking and Virtual Organisations10.1504/IJNVO.2015.06930015:1(80-98)Online publication date: 1-May-2015
https://dl.acm.org/doi/10.1504/IJNVO.2015.069300
Show More Cited By

Index Terms

Towards a unified approach to document similarity search using manifold-ranking of blocks

Recommendations

Document similarity search based on manifold-ranking of texttiles
AIRS'06: Proceedings of the Third Asia conference on Information Retrieval Technology

Document similarity search aims to find documents similar to a query document in a text corpus and return a ranked list of similar documents. Most existing approaches to document similarity search compute similarity scores between the query and the ...
Block-based similarity search on the web using manifold-ranking
WISE'06: Proceedings of the 7th international conference on Web Information Systems

Similarity search on the web aims to find web pages similar to a query page and return a ranked list of similar web pages. The popular approach to web page similarity search is to calculate the pairwise similarity between web pages using the Cosine ...
Re-ranking search results using query logs
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

This work addresses two common problems in search, frequently occurring with underspecified user queries: the top-ranked results for such queries may not contain documents relevant to the user's search intent, and fresh and relevant pages may not get ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Information Processing and Management: an International Journal

Information Processing and Management: an International Journal Volume 44, Issue 3

May, 2008

407 pages

ISSN:0306-4573

Issue’s Table of Contents

Copyright © Elsevier Ltd © 2007.

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 01 May 2008

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

16
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 18 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Lin DWei VWong R(2023)Effective and Scalable Manifold Ranking-Based Image Retrieval with Output BoundACM Transactions on Knowledge Discovery from Data10.1145/356557417:5(1-31)Online publication date: 7-Apr-2023
https://dl.acm.org/doi/10.1145/3565574
Sheetrit EShtok AKurland O(2020)A passage-based approach to learning to rank documentsInformation Retrieval10.1007/s10791-020-09369-x23:2(159-186)Online publication date: 1-Apr-2020
https://dl.acm.org/doi/10.1007/s10791-020-09369-x
Chen LWu CDai Y(2015)Find relationship among applicationsInternational Journal of Networking and Virtual Organisations10.1504/IJNVO.2015.06930015:1(80-98)Online publication date: 1-May-2015
https://dl.acm.org/doi/10.1504/IJNVO.2015.069300
Eichmann PZgraggen EBrdiczka OChau PCarenini GPan SKristensson P(2015)Evaluating Subjective Accuracy in Time Series Pattern-Matching Using Human-Annotated RankingsProceedings of the 20th International Conference on Intelligent User Interfaces10.1145/2678025.2701379(28-37)Online publication date: 18-Mar-2015
https://dl.acm.org/doi/10.1145/2678025.2701379
Ming Liu Chong Wu Xiang-Nan Zhao Chin-Yew Lin Xiao-Long Wang (2015)APP Relationship Calculation: An Iterative ProcessIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2015.240555727:8(2049-2063)Online publication date: 1-Aug-2015
https://dl.acm.org/doi/10.1109/TKDE.2015.2405557
Tagarelli AKarypis G(2013)A segment-based approach to clustering multi-topic documentsKnowledge and Information Systems10.1007/s10115-012-0556-z34:3(563-595)Online publication date: 1-Mar-2013
https://dl.acm.org/doi/10.1007/s10115-012-0556-z
Wang BPan FHu KPaul J(2012)Manifold-ranking based retrieval using k-regular nearest neighbor graphPattern Recognition10.1016/j.patcog.2011.09.00645:4(1569-1577)Online publication date: 1-Apr-2012
https://dl.acm.org/doi/10.1016/j.patcog.2011.09.006
Jiang QSun MLin D(2011)Semi-supervised SimHash for efficient document similarity searchProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 110.5555/2002472.2002485(93-101)Online publication date: 19-Jun-2011
https://dl.acm.org/doi/10.5555/2002472.2002485
Krikon EKurland OMa WNie JBaeza-Yates RChua TCroft W(2011)Utilizing minimal relevance feedback for ad hoc retrievalProceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval10.1145/2009916.2010068(1099-1100)Online publication date: 24-Jul-2011
https://dl.acm.org/doi/10.1145/2009916.2010068
Wang JSarma ACataldo Mde Souza CDittrich YHoda RSharp H(2011)Which bug should I fixProceedings of the 4th International Workshop on Cooperative and Human Aspects of Software Engineering10.1145/1984642.1984661(76-79)Online publication date: 21-May-2011
https://dl.acm.org/doi/10.1145/1984642.1984661
Show More Cited By

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents