Abstract
Document similarity search aims to find documents similar to a query document in a text corpus and return them as a ranked list. Most existing approaches compute similarity scores between the query and the documents using a retrieval function (e.g., Cosine) and then rank the documents by those scores. In this paper, we propose a novel retrieval approach that re-ranks the initially retrieved documents by manifold-ranking of their TextTiles. The proposed approach makes full use of the intrinsic global manifold structure of the documents' TextTiles during re-ranking. Experimental results demonstrate that the proposed approach significantly improves retrieval performance across different retrieval functions. The TextTile is also shown to be a better unit than the whole document in the manifold-ranking process.
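The re-ranking idea can be illustrated with a minimal sketch of the standard manifold-ranking iteration, f ← αSf + (1 − α)y, run over a graph whose nodes are text segments. Several details here are assumptions and are not taken from the paper: segments are supplied already split (no TextTiling is performed), edge weights are plain cosine similarities over raw term counts, `alpha` and `iters` are arbitrary, and a document's score is taken to be the sum of its segments' scores. The function names `cosine`, `manifold_rank`, and `rerank` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def manifold_rank(vectors, query_idx, alpha=0.5, iters=100):
    """Iterate f <- alpha * S f + (1 - alpha) * y over the segment graph."""
    n = len(vectors)
    # Affinity matrix with a zero diagonal (no self-reinforcement).
    W = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}; isolated nodes get zero rows.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] and deg[j] else 0.0
          for j in range(n)] for i in range(n)]
    # Prior vector: 1 for query segments, 0 for everything else.
    y = [1.0 if i in query_idx else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

def rerank(query_segments, doc_segments):
    """doc_segments maps doc id -> list of segment texts; returns ids, best first."""
    texts = list(query_segments)
    owners = [None] * len(query_segments)
    for doc_id, segs in doc_segments.items():
        texts.extend(segs)
        owners.extend([doc_id] * len(segs))
    vectors = [Counter(t.lower().split()) for t in texts]
    f = manifold_rank(vectors, set(range(len(query_segments))))
    scores = {}
    for owner, score in zip(owners, f):
        if owner is not None:
            # Assumed aggregation: a document's score is the sum over its segments.
            scores[owner] = scores.get(owner, 0.0) + score
    return sorted(scores, key=scores.get, reverse=True)
```

Because scores propagate along graph edges, a segment can be ranked highly through its similarity to other highly ranked segments even when it shares few terms with the query, which is the sense in which the ranking uses global structure rather than only pairwise query-document similarity.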
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wan, X., Yang, J., Xiao, J. (2006). Document Similarity Search Based on Manifold-Ranking of TextTiles. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_2
DOI: https://doi.org/10.1007/11880592_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45780-0
Online ISBN: 978-3-540-46237-8
eBook Packages: Computer Science, Computer Science (R0)