Abstract
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype system that makes use of spex. Our experiments with several document collections demonstrate the effectiveness of the approach.
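To make the fingerprinting idea concrete, the following is a minimal sketch of generic chunk-hash matching. It is not the spex algorithm or the deco system, whose details are not given in the abstract. It assumes that chunks are overlapping runs of three words and that every chunk is retained rather than a selected subset. The names `chunks`, `chunk_hash`, and `shared_chunk_counts` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every run of `size` consecutive words (overlapping chunks)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Hash a chunk to a 64-bit integer (first 8 bytes of its SHA-1 digest)."""
    digest = hashlib.sha1(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shared_chunk_counts(docs, size=3):
    """Map each document pair to the number of distinct chunk hashes it shares."""
    # Inverted index: chunk hash -> set of documents containing that chunk.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for c in chunks(text, size):
            postings[chunk_hash(c)].add(doc_id)

    # Every hash that occurs in two or more documents links those documents.
    counts = defaultdict(int)
    for ids in postings.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Illustrative toy collection (assumed data, not from the paper).
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "an unrelated sentence about hash tables",
}
result = shared_chunk_counts(docs)
print(result)
```

In this toy collection only documents `a` and `b` share chunks, so they are the only pair reported. Most entries in the inverted index belong to a single document and contribute nothing to the result; this is the kind of unhelpful information that the abstract says fingerprinting struggles to isolate.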
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bernstein, Y., Zobel, J. (2004). A Scalable System for Identifying Co-derivative Documents. In: Apostolico, A., Melucci, M. (eds) String Processing and Information Retrieval. SPIRE 2004. Lecture Notes in Computer Science, vol 3246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30213-1_6
DOI: https://doi.org/10.1007/978-3-540-30213-1_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23210-0
Online ISBN: 978-3-540-30213-1
eBook Packages: Springer Book Archive