DOI: 10.1145/1401890.1401917

De-duping URLs via rewrite rules

Published: 24 August 2008

Abstract

A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.
In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.
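As a rough illustration of the idea of applying rewrite rules to unseen URLs without fetching their content, the following Python sketch maps URLs to a canonical form and groups those that collide. The three rules, the `sessionid` parameter name, and the example URLs are hypothetical and hand-written; they are not rules learned by the paper's algorithm, and the sketch does not reproduce the paper's rule framework.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_www(url):
    # Hypothetical rule: treat "www.example.com" and "example.com" as the same host.
    return re.sub(r"^(https?://)www\.", r"\1", url)


def drop_param(name):
    # Hypothetical rule factory: remove a query parameter assumed not to affect content.
    def rule(url):
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k != name]
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(kept), parts.fragment))
    return rule


def drop_index(url):
    # Hypothetical rule: "/index.html" at the end of the URL serves the directory page.
    return re.sub(r"/index\.html$", "/", url)


# Rules are applied in order; in the paper's setting they would be mined from
# URLs already partitioned into equivalence classes by content.
RULES = [drop_www, drop_param("sessionid"), drop_index]


def canonicalize(url):
    for rule in RULES:
        url = rule(url)
    return url


def group_duplicates(urls):
    # URLs sharing a canonical form are predicted duplicates; no content is fetched.
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return dict(groups)


urls = [
    "http://www.example.com/a?id=7&sessionid=abc",
    "http://example.com/a?id=7",
    "http://example.com/index.html",
    "http://example.com/",
]
groups = group_duplicates(urls)
```

Here the first two URLs collapse to one canonical form and the last two to another, so a crawler using these rules would fetch two pages rather than four. Whether such a rule is actually correct for a given site is exactly what the learning step has to establish.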

References

[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proc. 28th VLDB, pages 586--597, 2002.
[2] D. Angluin. Finding patterns common to a set of strings (extended abstract). In Proc. of the 11th STOC, pages 130--141, 1979.
[3] D. Angluin. Inference of reversible languages. J. ACM, 29(3):741--765, 1982.
[4] D. Angluin and C. H. Smith. Inductive inference: Theory and methods. ACM Comput. Surv., 15(3):237--269, 1983.
[5] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: different URLs with similar text. In Proc. 16th WWW, pages 111--120, 2007.
[6] M. Bognar. A survey of abstract rewriting, 1995. www.di.ubi.pt/~desousa/1998-1999/logica/mb.ps.
[7] A. Broder. On the resemblance and containment of documents. In SEQS: Sequences '91, 1998.
[8] A. Broder, S. C. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997.
[9] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th STOC, pages 380--388, 2002.
[10] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Proc. 21st ICDE, pages 865--876, 2005.
[11] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Adaptive graphical approach to entity resolution. In Proc. of the ACM/IEEE Joint Conference on Digital Libraries, pages 204--213, 2007.
[12] D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In Proc. of the 1st Conference on Latin American Web Congress, page 37, 2003.
[13] H. Garcia-Molina. Pair-wise entity resolution: Overview and challenges. In Proc. CIKM, 2006.
[14] M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proc. 29th SIGIR, pages 284--291, 2006.
[15] G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. of the 16th International Conference on World Wide Web, pages 141--150, 2007.
[16] M. Najork. Systems and methods for inferring uniform resource locator (URL) normalization rules, 2006. US Patent Application Publication, 2006/0218143.
[17] A. Pereira, R. Baeza-Yates, and N. Ziviani. Where and how duplicates occur in the web. In Proc. of the 4th Latin American Web Congress, pages 127--134, 2006.



Information & Contributors

Information

Published In

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN:9781605581934
DOI:10.1145/1401890
  • General Chair: Ying Li
  • Program Chairs: Bing Liu, Sunita Sarawagi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. URL normalization
  2. de-duping
  3. rewrite rules

Qualifiers

  • Research-article

Conference

KDD08

Acceptance Rates

KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 11 Dec 2024

Citations

Cited By

  • (2024) Novel UGA Homologous URL Recognition in Real-World Financial Cybercrimes: Self-supervised Deep Learning of URL Semantics. Database Systems for Advanced Applications, pages 300-312. DOI: 10.1007/978-981-97-5575-2_22. Online publication date: 2-Sep-2024
  • (2020) Removing Dust By Metacrawler. 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184), pages 540-544. DOI: 10.1109/ICOEI48184.2020.9142922. Online publication date: Jun-2020
  • (2019) Analysis of Metacrawler approach for URL based DUST removal by knowledge engineering systems. 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pages 699-701. DOI: 10.1109/ICCMC.2019.8819753. Online publication date: Mar-2019
  • (2018) iDUSTER: Improved Method for Removing DUST Based on Efficient Multiple Sequence Alignment Technique. 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), pages 1450-1454. DOI: 10.1109/ICIRCA.2018.8597326. Online publication date: Jul-2018
  • (2018) Parallel Crawling for Detection and Removal of DUST Using DUSTER. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pages 1-5. DOI: 10.1109/ICCUBEA.2018.8697837. Online publication date: Aug-2018
  • (2017) De-duping URLs with Sequence-to-Sequence Neural Networks. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1157-1160. DOI: 10.1145/3077136.3080746. Online publication date: 7-Aug-2017
  • (2017) Website replica detection with distant supervision. Information Retrieval Journal 21:4, pages 253-272. DOI: 10.1007/s10791-017-9320-z. Online publication date: 29-Nov-2017
  • (2016) A review on techniques for optimizing web crawler results. 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), pages 1-4. DOI: 10.1109/STARTUP.2016.7583952. Online publication date: Feb-2016
  • (2016) Canonization rules for detecting different URLs. 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence), pages 88-94. DOI: 10.1109/CONFLUENCE.2016.7508093. Online publication date: Jan-2016
  • (2015) Removing DUST Using Multiple Alignment of Sequences. IEEE Transactions on Knowledge and Data Engineering 27:8, pages 2261-2274. DOI: 10.1109/TKDE.2015.2407354. Online publication date: 1-Aug-2015
