DOI:10.1145/584792.584835
Article

Detecting similar documents using salient terms

Published: 04 November 2002

Abstract

We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system. We store these terms in a database and compute document similarity using a simple database query. If the number of terms not contained in both documents is less than some predetermined threshold relative to the total number of terms in the document, the documents are judged to be very similar. We compare this method to the shingles approach.
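
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table layout, the function names, the symmetric check in `very_similar`, and the default threshold of 0.1 are all assumptions made for illustration, and the salient terms are supplied by hand rather than by a phrase recognizer. A small word-level shingle resemblance (Jaccard overlap of w-word windows) is included as a baseline of the kind the abstract compares against.

```python
import sqlite3


def build_db(doc_terms):
    """Store each document's salient terms in an in-memory SQLite table.

    doc_terms maps a document id to a list of terms (hypothetical input;
    a real system would obtain them from a phrase recognizer).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_terms (doc_id TEXT, term TEXT, "
        "PRIMARY KEY (doc_id, term))"
    )
    rows = [(d, t.lower()) for d, terms in doc_terms.items() for t in terms]
    conn.executemany("INSERT OR IGNORE INTO doc_terms VALUES (?, ?)", rows)
    return conn


def unshared_fraction(conn, doc_a, doc_b):
    """Fraction of doc_a's terms that do not occur in doc_b."""
    total = conn.execute(
        "SELECT COUNT(*) FROM doc_terms WHERE doc_id = ?", (doc_a,)
    ).fetchone()[0]
    if total == 0:
        return 1.0  # no terms recorded: treat as not comparable
    missing = conn.execute(
        "SELECT COUNT(*) FROM doc_terms WHERE doc_id = ? AND term NOT IN "
        "(SELECT term FROM doc_terms WHERE doc_id = ?)",
        (doc_a, doc_b),
    ).fetchone()[0]
    return missing / total


def very_similar(conn, doc_a, doc_b, threshold=0.1):
    """Assumed rule: both directions must fall below the threshold."""
    return (
        unshared_fraction(conn, doc_a, doc_b) < threshold
        and unshared_fraction(conn, doc_b, doc_a) < threshold
    )


def shingles(text, w=4):
    """Set of contiguous w-word windows (word-level shingles)."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(text_a, text_b, w=4):
    """Jaccard overlap of the two shingle sets (baseline for comparison)."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

For example, after `conn = build_db({"a": [...], "b": [...]})`, calling `very_similar(conn, "a", "b", threshold=0.25)` reports whether the two term lists differ in fewer than a quarter of their entries in each direction. The term-based check needs only the stored term lists, whereas the shingle baseline needs the full text of both documents.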

Information & Contributors

Information

Published In

ACM Conferences
CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
November 2002
704 pages
ISBN:1581134924
DOI:10.1145/584792
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. databases
  2. document similarity
  3. duplicate documents
  4. shingles
  5. text mining

Qualifiers

  • Article

Conference

CIKM02

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 3
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 12 Dec 2024

Cited By

  • (2021) Qualitative and quantitative research in the humanities and social sciences: how natural language processing (NLP) can help. Quality & Quantity 56:4 (2751-2781). DOI: 10.1007/s11135-021-01235-2. Online publication date: 23-Sep-2021
  • (2021) Reducing the Number of Queries in the Search Text Fuzzy Duplicates. Advances in Automation II (460-467). DOI: 10.1007/978-3-030-71119-1_45. Online publication date: 19-Mar-2021
  • (2020) Identification of Key Sentences in the Task of Text Duplicate Detection. 2020 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF) (1-5). DOI: 10.1109/WECONF48837.2020.9131465. Online publication date: Jun-2020
  • (2019) Resource-Efficient Index Shard Replication in Large Scale Search Engines. IEEE Transactions on Parallel and Distributed Systems 30:12 (2820-2835). DOI: 10.1109/TPDS.2019.2924423. Online publication date: 1-Dec-2019
  • (2013) An Efficient Pretopological Approach for Document Clustering. Proceedings of the 2013 5th International Conference on Intelligent Networking and Collaborative Systems (114-120). DOI: 10.1109/INCoS.2013.25. Online publication date: 9-Sep-2013
  • (2011) Partial duplicate detection for large book collections. Proceedings of the 20th ACM international conference on Information and knowledge management (469-474). DOI: 10.1145/2063576.2063647. Online publication date: 24-Oct-2011
  • (2011) A system for the proactive, continuous, and efficient collection of digital forensic evidence. Digital Investigation: The International Journal of Digital Forensics & Incident Response 8 (S3-S13). DOI: 10.1016/j.diin.2011.05.002. Online publication date: 1-Aug-2011
  • (2010) Web Conferencing Software in University-Level, E-Learning-Based, Technical Courses. Journal of Educational Technology Systems 38:3 (367-381). DOI: 10.2190/ET.38.3.f. Online publication date: 28-May-2010
  • (2010) Detection and optimized disposal of near-duplicate pages. 2010 2nd International Conference on Future Computer and Communication (V2-604-V2-607). DOI: 10.1109/ICFCC.2010.5497544. Online publication date: May-2010
  • (2010) Analysis of Duplicated Web Pages Identification Methods in Search Engine. 2010 2nd International Workshop on Database Technology and Applications (1-5). DOI: 10.1109/DBTA.2010.5659105. Online publication date: Nov-2010
