research-article

De-duping URLs via rewrite rules

Authors:

Anirban Dasgupta,

Amit Sasturkar

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 186 - 194

https://doi.org/10.1145/1401890.1401917

Published: 24 August 2008

Abstract

A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.
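To make the idea of applying rewrite rules concrete, here is a minimal sketch of canonicalizing URLs without fetching them. The rules, the `sid` parameter, and the `example.com` URLs are hypothetical and hand-written for illustration; in the setting the abstract describes, such rules would be learned from the equivalence classes rather than hard-coded.

```python
import re

# Hypothetical rewrite rules, hand-written for illustration only.
# Each rule is (compiled regex, replacement) and is applied in order.
RULES = [
    # Drop a session-id query parameter assumed not to affect content.
    (re.compile(r"([?&])sid=[^&]*&?"), r"\1"),
    # Remove a trailing '?' or '&' left behind by the rule above.
    (re.compile(r"[?&]$"), ""),
    # Treat /index.html as equivalent to the bare directory.
    (re.compile(r"/index\.html$"), "/"),
]


def canonicalize(url: str) -> str:
    """Apply every rewrite rule in order and return the canonical form."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


def dedupe(urls):
    """Return the distinct canonical forms, in order of first appearance."""
    seen = {}
    for url in urls:
        seen.setdefault(canonicalize(url), url)
    return list(seen)


if __name__ == "__main__":
    crawl_frontier = [
        "http://example.com/a?sid=123&page=2",
        "http://example.com/a?page=2&sid=999",
        "http://example.com/docs/index.html",
    ]
    for canonical in dedupe(crawl_frontier):
        print(canonical)
```

Because the rules operate on the URL string alone, two newly discovered URLs that map to the same canonical form can be treated as duplicates before either is downloaded.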

In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.
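As a rough illustration of what mining rules from equivalence classes could look like, the sketch below uses a deliberately naive heuristic: it flags query parameters whose values differ between duplicate URLs that otherwise share scheme, host, and path. This is not the algorithm from the paper and carries none of its completeness guarantees; the function names, the `min_support` threshold, and the sample URLs are all assumptions made for this example.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def split_url(url):
    """Return (URL without its query string, dict of query parameters)."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return base, dict(parse_qsl(parts.query, keep_blank_values=True))


def candidate_irrelevant_params(classes, min_support=2):
    """Return parameter names whose value varies inside at least
    `min_support` equivalence classes of same-path duplicate URLs."""
    support = Counter()
    for urls in classes:
        parsed = [split_url(u) for u in urls]
        bases = {base for base, _ in parsed}
        if len(parsed) < 2 or len(bases) != 1:
            continue  # toy heuristic: only handles query-only differences
        keys = set().union(*(params.keys() for _, params in parsed))
        for key in keys:
            values = {params.get(key) for _, params in parsed}
            if len(values) > 1:
                support[key] += 1
    return {key for key, count in support.items() if count >= min_support}


if __name__ == "__main__":
    # Hypothetical equivalence classes: URLs in one list have similar content.
    classes = [
        ["http://example.com/a?id=1&sid=x", "http://example.com/a?id=1&sid=y"],
        ["http://example.com/b?id=2&sid=p", "http://example.com/b?id=2&sid=q"],
        ["http://example.com/c?id=3", "http://example.com/d?id=3"],
    ]
    print(candidate_irrelevant_params(classes))
```

Requiring support from several classes is one simple way to avoid generalizing from a single coincidence; a real system would also need to check that a candidate rule does not merge URLs from different classes.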

References

[1] R. Ananthakrishnan, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proc. 28th VLDB, pages 586--597, 2002.

[2] D. Angluin. Finding patterns common to a set of strings (extended abstract). In Proc. of the 11th STOC, pages 130--141, 1979.

[3] D. Angluin. Inference of reversible languages. J. ACM, 29(3):741--765, 1982.

[4] D. Angluin and C. H. Smith. Inductive inference: Theory and methods. ACM Comput. Surv., 15(3):237--269, 1983.

[5] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: different URLs with similar text. In Proc. 16th WWW, pages 111--120, 2007.

[6] M. Bognar. A survey of abstract rewriting, 1995. www.di.ubi.pt/~desousa/1998-1999/logica/mb.ps.

[7] A. Broder. On the resemblance and containment of documents. In SEQS: Sequences '91, 1998.

[8] A. Broder, S. C. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997.

[9] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th STOC, pages 380--388, 2002.

[10] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Proc. 21st ICDE, pages 865--876, 2005.

[11] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Adaptive graphical approach to entity resolution. In Proc. of the ACM/IEEE Joint Conference on Digital Libraries, pages 204--213, 2007.

[12] D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In Proc. of the 1st Conference on Latin American Web Congress, page 37, 2003.

[13] H. Garcia-Molina. Pair-wise entity resolution: Overview and challenges. In Proc. CIKM, 2006.

[14] M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proc. 29th SIGIR, pages 284--291, 2006.

[15] G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. of the 16th International Conference on World Wide Web, pages 141--150, 2007.

[16] M. Najork. Systems and methods for inferring uniform resource locator (URL) normalization rules, 2006. US Patent Application Publication, 2006/0218143.

[17] A. Pereira, R. Baeza-Yates, and N. Ziviani. Where and how duplicates occur in the web. In Proc. of the 4th Latin American Web Congress, pages 127--134, 2006.

Cited By

Shao G, Xu Z, He X, Rao H, Huang W, Duan W (2024). Novel UGA Homologous URL Recognition in Real-World Financial Cybercrimes: Self-supervised Deep Learning of URL Semantics. Database Systems for Advanced Applications, pages 300-312. Online publication date: 2-Sep-2024. https://doi.org/10.1007/978-981-97-5575-2_22
Deshmukh S, Chittekar P (2020). Removing Dust By Metacrawler. 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184), pages 540-544. Online publication date: Jun-2020. https://doi.org/10.1109/ICOEI48184.2020.9142922
Chittekar P, Deshmukh S (2019). Analysis of Metacrawler approach for URL based DUST removal by knowledge engineering systems. 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pages 699-701. Online publication date: Mar-2019. https://doi.org/10.1109/ICCMC.2019.8819753

Index Terms

De-duping URLs via rewrite rules
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. World Wide Web
    1. Web applications
    2. Web services

Recommendations

A pattern tree-based approach to learning URL normalization rules
WWW '10: Proceedings of the 19th international conference on World wide web

Duplicate URLs have brought serious troubles to the whole pipeline of a search engine, from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs to a canonical form using a set of rewrite rules. Nowadays URL ...
Identifying Equivalent URLs Using URL Signatures
SITIS '08: Proceedings of the 2008 IEEE International Conference on Signal Image Technology and Internet Based Systems

In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose to enhance the standard URL normalization by incorporating the semantically meaningful metadata of the Web pages. The ...
Do not crawl in the DUST: Different URLs with similar text

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. ...

Information & Contributors

Information

Published In

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2008

1116 pages

ISBN:9781605581934

DOI:10.1145/1401890

General Chair: Ying Li (Microsoft adCenter Labs)

Program Chairs: Bing Liu (University of Illinois at Chicago), Sunita Sarawagi (Indian Institute of Technology, Bombay)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD08: The 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2008

Las Vegas, Nevada, USA

Acceptance Rates

KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Total Citations: 26

Total Downloads: 719

Downloads (Last 12 months): 5

Downloads (Last 6 weeks): 0

Reflects downloads up to 11 Dec 2024

Citations

Cited By

Rane P, Dalal M (2018). iDUSTER: Improved Method for Removing DUST Based on Efficient Multiple Sequence Alignment Technique. 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), pages 1450-1454. Online publication date: Jul-2018. https://doi.org/10.1109/ICIRCA.2018.8597326
Langhi J, Jadhav S (2018). Parallel Crawling for Detection and Removal of DUST Using DUSTER. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pages 1-5. Online publication date: Aug-2018. https://doi.org/10.1109/ICCUBEA.2018.8697837
Xu K, Liu Z, Callan J, Kando N, Sakai T, Joho H, Li H, de Vries A, White R (2017). De-duping URLs with Sequence-to-Sequence Neural Networks. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1157-1160. Online publication date: 7-Aug-2017. https://dl.acm.org/doi/10.1145/3077136.3080746
Carvalho C, de Moura E, Veloso A, Ziviani N (2017). Website replica detection with distant supervision. Information Retrieval Journal 21:4, pages 253-272. Online publication date: 29-Nov-2017. https://doi.org/10.1007/s10791-017-9320-z
Lawankar A, Mangrulkar N (2016). A review on techniques for optimizing web crawler results. 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), pages 1-4. Online publication date: Feb-2016. https://doi.org/10.1109/STARTUP.2016.7583952
Kumari C, Joshi D, Singh S (2016). Canonization rules for detecting different URLs. 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence), pages 88-94. Online publication date: Jan-2016. https://doi.org/10.1109/CONFLUENCE.2016.7508093
Rodrigues K, Cristo M, S de Moura E, da Silva A (2015). Removing DUST Using Multiple Alignment of Sequences. IEEE Transactions on Knowledge and Data Engineering 27:8, pages 2261-2274. Online publication date: 1-Aug-2015. https://dl.acm.org/doi/10.1109/TKDE.2015.2407354
