Article

Finding similar files in large document repositories

Authors:

George Forman,

Kave Eshghi,

Stephane Chiocchetti

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

Pages 394 - 400

https://doi.org/10.1145/1081870.1081916

Published: 21 August 2005

Abstract

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.

The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.

We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
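The pipeline described in the abstract (chunk the byte stream, hash each chunk into a signature, then analyze which files share signatures) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `chunk_bytes` and `cluster_files`, the CRC32-based boundary test, the window, mask, and minimum-size values, and the use of connected components as the graph analysis are all assumptions chosen for illustration. The optional bipartite partitioning step is not shown.

```python
import hashlib
import zlib
from collections import defaultdict


def chunk_bytes(data: bytes, window: int = 16, mask: int = 0x3F, min_size: int = 32):
    """Split data at content-defined boundaries.

    Assumption: a boundary is declared wherever the CRC32 of the last
    `window` bytes has its low bits (selected by `mask`) all zero. This is
    a simple stand-in for a rolling fingerprint; it recomputes the hash at
    every position for clarity, not speed.
    """
    chunks = []
    start = 0
    for i in range(len(data)):
        # Enforce a minimum chunk length (and a full window) before testing.
        if i + 1 - start < max(min_size, window):
            continue
        if zlib.crc32(data[i + 1 - window:i + 1]) & mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a boundary
    return chunks


def cluster_files(files):
    """Group files that share at least one chunk signature.

    `files` maps a (hypothetical) file name to its bytes. The file-chunk
    graph is stored as signature -> set of file names, and clusters are
    its connected components, found with union-find.
    """
    chunk_to_files = defaultdict(set)
    for name, data in files.items():
        for chunk in chunk_bytes(data):
            signature = hashlib.sha256(chunk).hexdigest()
            chunk_to_files[signature].add(name)

    parent = {name: name for name in files}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Every signature seen in several files links those files together.
    for names in chunk_to_files.values():
        ordered = sorted(names)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])

    clusters = defaultdict(list)
    for name in files:
        clusters[find(name)].append(name)
    return sorted(sorted(group) for group in clusters.values())


if __name__ == "__main__":
    docs = {
        "manual_v1.txt": b"Install the driver, then reboot the printer. " * 40,
        "manual_copy.txt": b"Install the driver, then reboot the printer. " * 40,
        "unrelated.txt": b"0123456789" * 200,
    }
    for group in cluster_files(docs):
        print(group)
```

Because the boundary test depends only on the bytes inside the window, an insertion or deletion tends to disturb only nearby chunks, which is why content-defined chunking suits near-duplicate detection better than fixed-size blocks. A practical variant would likely require more than one shared signature, or a minimum fraction of shared bytes, before linking two files.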

Index Terms

Finding similar files in large document repositories
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Document representation

Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

August 2005

844 pages

ISBN: 159593135X

DOI: 10.1145/1081870

General Chair: Robert Grossman (University of Illinois at Chicago & Open Data Partners, USA)

Program Chairs: Roberto Bayardo (IBM Almaden Research, USA), Kristin Bennett (RPI, USA)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD05: The Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2005

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
