[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

Towards a unified approach to document similarity search using manifold-ranking of blocks

Published: 01 May 2008 Publication History

Abstract

Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document in a text corpus or on the Web. Most existing approaches to similarity search first compute the pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. Cosine), and then rank the documents by the similarity scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (i.e. a block of coherent text about a subtopic) to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach can make full use of the intrinsic global manifold structure of the document blocks by propagating the ranking scores between the blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are respectively employed to segment text documents and web pages into blocks. Then, each block is assigned with a ranking score by the manifold-ranking algorithm. Lastly, a document gets its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach can significantly improve the retrieval performances over baseline approaches. Document block is validated to be a better unit than the whole document in the manifold-ranking process.

References

[1]
Allan, J., Carbonell, J., Doddington, G., Yamron, J. P. & Yang, Y. (1998). Topic detection and tracking pilot study: Final report. In Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (pp. 194-218).
[2]
Modern information retrival. ACM Press and Addison Wesley.
[3]
Cai, D., Yu, S., Wen, J. -R., & Ma, W. -Y. (2003). VIPS: A vision based page segmentation algorithm. Microsoft technical report, MSR-TR-2003-79.
[4]
Cai, D., He, X., Wen, J. -R., & Ma, W. -Y. (2004). Block-level link analysis. In Proceedings of the 27th annual international ACM SIGIR conference (SIGIR'2004).
[5]
Callan, J. (1994). Passage-level evidence in document retrieval, In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (pp. 302-310).
[6]
Chen, J., Zhou, B., Shi, J., Zhang, H. -J., & Qiu, F. (2001). Function-based object model towards website adaptation. In Proceedings of the 10th world wide web conference (WWW10).
[7]
Choi, F. (1999). JTextTile: A free platform independent text segmentation algorithm. http://www.cs.man.ac.uk/~choif.
[8]
Language modeling for information retrieval. Kluwer Academic Publishers.
[9]
Cruz, I. F., Borisov, S., Marks, M. A., Webb, T. R. (1998). Measuring structural similarity among web documents: Preliminary results. In Proceedings of the 7th international conference on electronic publishing (pp. 513-524).
[10]
Dean, J., & Henzinger, M. R. (1999). Finding related pages in the World Wide Web. In Proceedings of the eighth international conference on world wide web (pp. 1467-1479).
[11]
Diaz, F. (2005). Regularizing ad hoc retrieval scores. In Proceedings of the 14th ACM international conference on information and knowledge management (CIKM'2005).
[12]
LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR). v22. 457-479.
[13]
Fogaras, D., & Rácz, B. (2004). Scaling link-based similarity search. Technical report.
[14]
Haveliwala, T.H., Gionis, A., Klein, D., Indyk, P. (2002). Evaluating strategies for similarity search on the Web. In Proceedings of WWW2002 (pp. 432-442).
[15]
Hearst, M. A. (1994). Multi-paragraph segmentation of expository text. In Proceedings of the 32PndPmeeting of the association for computational linguistics, Los Cruces, NM.
[16]
TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics. v23 i1. 33-64.
[17]
Hearst, M. A., & Pedersen, O. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR1996 (pp. 76-84).
[18]
Hearst, M. A., & Plaunt, C. (1993). Subtopic structuring for full-length document access. In Proceedings of the 16PthPannual international ACM/SIGIR conference, Pittsburgh, PA.
[19]
Iwayama, M., Fujii, A., Kando, N., Marukawa, Y. (2003). An empirical study on retrieval models for different document genres: Patents and newspaper articles. In Proceedings of SIGIR2003.
[20]
Jeh, G., & Widom, J. (2002). SimRank: A measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining.
[21]
Joshi, S., Agrawal, N., Krishnapuram, R., Negi, S. (2003). A bag of paths model for measuring structural similarity in web documents. In Proceedings of the 9th ACM SIGKDD conference (pp. 577-582).
[22]
Passage retrieval revisited. ACM SIGIR Forum. v31 iSI. 178-185.
[23]
Kaufmann, S. (1999). Cohesion and collocation: Using context vectors in text segmentation, In Proceedings of the 37th conference on association for computational linguistics (pp. 591-595).
[24]
Authoritative sources in a hyperlinked environment. Journal of the ACM. v46 i5. 604-632.
[25]
Kovacevic, M., Diligenti, M., Gori, M., & Milutinovic, V. (2002). Recognition of common areas in a web page using visual information: A possible application in a page classification. In Proceedings of 2002 IEEE international conference on data mining (ICDM'02), Maebashi City, Japan.
[26]
Kurland, O., Lee, L. (2005). PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of SIGIR2005 (pp. 306-313).
[27]
Kurland, O., Lee, L. (2006). Respect my authority! HITS without hyperlinks: utilizing cluster-based language models. In Proceedings of SIGIR2006.
[28]
Kurland, O., Lee, L., Domshlak, C. (2005). Better than the real thing? Iterative pseudo-query processing using cluster-based language models. In Proceedings of SIGIR2005.
[29]
Lin, Z., Lyu, M.R., King, I. (2006). PageSim: A novel link-based measure of web page similarity. In Proceeding of the 15th international world wide web conference.
[30]
Mihalcea, R., Tarau, P. (2004). Textrank: Bringing order into texts. In Proceedings of EMNLP 2004 (pp. 404-411).
[31]
Otterbacher, J., Erkan, G., Radev, D. (2005). Using random walks for question-focused sentence retrieval. In Proc. HLT/EMNLP 2005.
[32]
Page, L., Brin, S., Motwani, R., Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web, Technical report. Stanford, CA: Stanford University.
[33]
An algorithm for suffix stripping. Program. v14 i3. 130-137.
[34]
Qin, T., Liu, T. -Y., Zhang, X. -D., Chen, Z., Ma, W. -Y. (2005). A study of relevance propagation for web search. In Proceedings of SIGIR2005.
[35]
Robertson, S., Walker, S. (1994). Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proc. of the 17th international ACM/SIGIR conference on research and development in information retrieval (pp. 232-241).
[36]
Robertson, S., Walker, S., Beaulieu, M. (1999) Okapi at TREC-7: Automatic ad hoc, filtering, VLC and filtering tracks. In Proceedings of TREC'99.
[37]
The SMART retrieval system: Experiments in automatic document processing. Prentice-Hall.
[38]
A vector space model for automatic indexing. Communications of the ACM. v18 i11. 613-620.
[39]
Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted document length normalization. In Proceedings of SIGIR'96.
[40]
Smucker, M.D., & Allan, J. (2006). Find-similar: Similarity browsing as a search tool. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'2006) (pp. 461-468).
[41]
Song, R., Liu, H., Wen, J. -R., & Ma, W. -Y. (2004). Learning block importance models for web pages. In Proceeding of the thirteenth world wide web conference (WWW 2004) (pp. 203-211).
[42]
Tombros, A., & Ali, Z. (2005). Factors affecting web page similarity. In Proceedings of ECIR2005.
[43]
Information retrieval. Butterworths, London.
[44]
Xue, G. -R., Zeng, H. -J., Chen, Z., & Yu, Y. (2004). MRSSA: An iterative algorithm for similarity spreading over interrelated objects. In Proceedings of CIKM2004.
[45]
Yu, S., Cai, D., Wen, J. -R., Ma, W. -Y. (2003). Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In Proceedings of the twelfth international world wide web conference (WWW2003).
[46]
Zhang, B., Li, H., Liu, Y., Ji, L., Xi, W., Fan, W., Chen, Z., Ma, W. -Y. (2005). Improving web search results using affinity graph. In Proceedings of SIGIR2005.
[47]
Zhou, D., Weston, J., Gretton, A., Bousquet, O., Schölkopf, B. (2003). Ranking on data manifolds. In Proceedings of NIPS-2003.
[48]
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., Schölkopf, B. (2003). Learning with local and global consistency. In Proceedings of NIPS-2003.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Information Processing and Management: an International Journal
Information Processing and Management: an International Journal  Volume 44, Issue 3
May, 2008
407 pages

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 01 May 2008

Author Tags

  1. Document segmentation
  2. Document similarity search
  3. Manifold-ranking
  4. Web page segmentation
  5. Web similarity search

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 18 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Effective and Scalable Manifold Ranking-Based Image Retrieval with Output BoundACM Transactions on Knowledge Discovery from Data10.1145/356557417:5(1-31)Online publication date: 7-Apr-2023
  • (2020)A passage-based approach to learning to rank documentsInformation Retrieval10.1007/s10791-020-09369-x23:2(159-186)Online publication date: 1-Apr-2020
  • (2015)Find relationship among applicationsInternational Journal of Networking and Virtual Organisations10.1504/IJNVO.2015.06930015:1(80-98)Online publication date: 1-May-2015
  • (2015)Evaluating Subjective Accuracy in Time Series Pattern-Matching Using Human-Annotated RankingsProceedings of the 20th International Conference on Intelligent User Interfaces10.1145/2678025.2701379(28-37)Online publication date: 18-Mar-2015
  • (2015)APP Relationship Calculation: An Iterative ProcessIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2015.240555727:8(2049-2063)Online publication date: 1-Aug-2015
  • (2013)A segment-based approach to clustering multi-topic documentsKnowledge and Information Systems10.1007/s10115-012-0556-z34:3(563-595)Online publication date: 1-Mar-2013
  • (2012)Manifold-ranking based retrieval using k-regular nearest neighbor graphPattern Recognition10.1016/j.patcog.2011.09.00645:4(1569-1577)Online publication date: 1-Apr-2012
  • (2011)Semi-supervised SimHash for efficient document similarity searchProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 110.5555/2002472.2002485(93-101)Online publication date: 19-Jun-2011
  • (2011)Utilizing minimal relevance feedback for ad hoc retrievalProceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval10.1145/2009916.2010068(1099-1100)Online publication date: 24-Jul-2011
  • (2011)Which bug should I fixProceedings of the 4th International Workshop on Cooperative and Human Aspects of Software Engineering10.1145/1984642.1984661(76-79)Online publication date: 21-May-2011
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media