DOI: 10.1145/1081870.1081916
Article

Finding similar files in large document repositories

Published: 21 August 2005

Abstract

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.

The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.

We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
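The pipeline the abstract describes (chunk the byte stream, derive a signature per chunk, then analyze the file-chunk graph) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the boundary test (an MD5 hash of a trailing window, recomputed at each byte rather than rolled), the parameters `window`, `mask`, `min_size`, and `max_size`, the use of SHA-1 as the chunk signature, and the choice to treat connected components of the file-chunk graph as clusters. The function names `chunk_bytes` and `cluster_files` are hypothetical.

```python
import hashlib
from collections import defaultdict


def chunk_bytes(data: bytes, window: int = 16, mask: int = 0x3F,
                min_size: int = 32, max_size: int = 1024) -> list[bytes]:
    """Split data at content-defined boundaries (illustrative parameters).

    A boundary is declared where a hash of the trailing `window` bytes has
    its low bits equal to zero, so cut points follow the content rather
    than absolute offsets. `max_size` forces a cut in long boundary-free runs.
    """
    chunks = []
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue
        # Hash of the trailing window, recomputed each step for clarity.
        tail = data[max(0, i + 1 - window):i + 1]
        h = int.from_bytes(hashlib.md5(tail).digest()[:4], "big")
        if (h & mask) == 0 or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder, may be short
    return chunks


def cluster_files(files: dict[str, bytes]) -> list[set[str]]:
    """Group files connected through at least one shared chunk signature."""
    # Bipartite file-chunk graph: chunk signature -> files containing it.
    chunk_to_files = defaultdict(set)
    for name, data in files.items():
        for chunk in chunk_bytes(data):
            chunk_to_files[hashlib.sha1(chunk).hexdigest()].add(name)

    # Union-find over file names; files sharing a signature are merged.
    parent = {name: name for name in files}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for names in chunk_to_files.values():
        first, *rest = sorted(names)
        for other in rest:
            parent[find(other)] = find(first)

    clusters = defaultdict(set)
    for name in files:
        clusters[find(name)].add(name)
    return list(clusters.values())
```

Calling `cluster_files` with a mapping of file names to byte contents returns a list of sets of file names. Under these assumptions, two files land in the same cluster as soon as they share a single chunk; a practical system would more likely weight the link by the number of shared bytes and apply a similarity threshold before merging.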

References

[1] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 398-409, San Jose, CA, 1995.
[2] A.Z. Broder, S.C. Glassman, M.S. Manasse, and G. Zweig. Syntactic Clustering of the Web. Computer Networks and ISDN Systems, 29(8-13):1157-1166, 1997.
[3] K. Eshghi and H.K. Tang. A Framework for Analyzing and Improving Content-Based Chunking Algorithms. Hewlett-Packard Labs Technical Report TR 2005-30.
[4] R.A. Finkel, A. Zaslavsky, K. Monostori, and H. Schmidt. Signature extraction for overlap detection in documents. In Proceedings of the 25th Australasian Conference on Computer Science, v4, pages 59-64, Melbourne, Australia, 2002.
[5] V. Henson and R. Henderson. Guidelines for Using Compare-by-Hash. Forthcoming, 2005. http://infohost.nmt.edu/~val/review/hash2.html
[6] U. Manber. Finding similar files in a large file system. In Proceedings of the Winter 1994 USENIX Technical Conference, San Francisco, CA, January 1994.
[7] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), pages 174-187, Banff, Canada, October 2001.
[8] M.O. Rabin. Fingerprinting by Random Polynomials. Tech. Rep. TR-15-81, Center for Research in Computing Technology, Harvard Univ., Cambridge, Mass., 1981.



Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
August 2005
844 pages
ISBN: 159593135X
DOI: 10.1145/1081870
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. content management
  2. document management
  3. near duplicate detection
  4. scalability
  5. similarity

Qualifiers

  • Article

Conference

KDD05

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) The Design of a Lossless Deduplication Scheme to Eliminate Fine-Grained Redundancy for JPEG Image Storage Systems. IEEE Transactions on Computers 73:5 (1385-1399). DOI: 10.1109/TC.2024.3363456. Online publication date: May-2024
  • (2024) A Survey on Forensics and Compliance Auditing for Critical Infrastructure Protection. IEEE Access 12 (2409-2444). DOI: 10.1109/ACCESS.2023.3348552. Online publication date: 2024
  • (2024) Chunk2vec: A novel resemblance detection scheme based on Sentence‐BERT for post‐deduplication delta compression in network transmission. IET Communications. DOI: 10.1049/cmu2.12719. Online publication date: 4-Jan-2024
  • (2023) The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression. ACM Transactions on Storage 19:3 (1-30). DOI: 10.1145/3584663. Online publication date: 19-Jun-2023
  • (2022) Cross-domain Resemblance Detection based on Meta-learning for Cloud Storage. 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC) (374-379). DOI: 10.1109/IPCCC55026.2022.9894349. Online publication date: 11-Nov-2022
  • (2022) imDedup: A Lossless Deduplication Scheme to Eliminate Fine-grained Redundancy among Images. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (1071-1084). DOI: 10.1109/ICDE53745.2022.00085. Online publication date: May-2022
  • (2022) Context-aware Resemblance Detection based Deduplication Ratio Prediction for Cloud Storage. 2022 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT) (21-29). DOI: 10.1109/BDCAT56447.2022.00011. Online publication date: Dec-2022
  • (2021) MinervaFS: A User-Space File System for Generalised Deduplication: (Practical experience report). 2021 40th International Symposium on Reliable Distributed Systems (SRDS) (254-264). DOI: 10.1109/SRDS53918.2021.00033. Online publication date: Sep-2021
  • (2021) Fast Variable-Grained Resemblance Data Deduplication For Cloud Storage. 2021 IEEE International Conference on Networking, Architecture and Storage (NAS) (1-8). DOI: 10.1109/NAS51552.2021.9605398. Online publication date: Oct-2021
  • (2021) Odess: Speeding up Resemblance Detection for Redundancy Elimination by Fast Content-Defined Sampling. 2021 IEEE 37th International Conference on Data Engineering (ICDE) (480-491). DOI: 10.1109/ICDE51399.2021.00048. Online publication date: Apr-2021
