research-article

A pattern tree-based approach to learning URL normalization rules

Authors:

Jiang-Ming Yang,

Lei Zhang

WWW '10: Proceedings of the 19th international conference on World wide web

Pages 611 - 620

https://doi.org/10.1145/1772690.1772753

Published: 26 April 2010

Abstract

Duplicate URLs cause serious trouble across the whole pipeline of a search engine, from crawling and indexing to result serving. URL normalization transforms duplicate URLs into a canonical form using a set of rewrite rules. It has attracted significant attention because it is lightweight and can be flexibly integrated into both the online (e.g., crawling) and the offline (e.g., index compression) parts of a search engine. To handle a large number of websites, automatic approaches are highly desirable for learning rewrite rules for various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach, which is remarkably different from existing approaches. Most current approaches learn rewrite rules by iteratively inducing more general forms from local duplicate pairs; they inevitably suffer from noisy training data and are inefficient in practice. Given a training set of URLs partitioned into duplicate clusters for a targeted website, we develop a simple yet efficient algorithm to automatically construct a URL pattern tree. With the pattern tree, the statistical information from all the training samples is leveraged to make the learning process more robust and reliable. The learning process is also accelerated, as rules are directly summarized based on pattern tree nodes. In addition, from an engineering perspective, the pattern tree helps select deployable rules by removing conflicts and redundancies. An evaluation on more than 70 million duplicate URLs from 200 websites showed that the proposed approach achieves very promising performance in terms of both de-duping effectiveness and computational efficiency.
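
To make the notion of a rewrite rule concrete, the following is a minimal Python sketch of applying one hand-written rule so that several duplicate URLs collapse to one canonical form. It is not the paper's pattern tree algorithm and does not attempt the learning step; the rule itself (the `IGNORED_KEYS` set), the `normalize` function, and the example URLs are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule: query keys assumed not to affect the page content.
IGNORED_KEYS = {"sid", "ref"}

def normalize(url: str) -> str:
    """Rewrite a URL to a canonical form: drop ignored query keys,
    sort the remaining ones, and lowercase the scheme and host."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_KEYS
    )
    # The fragment is dropped as well, since it is not sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )

# Three made-up URLs assumed to serve the same page.
duplicates = [
    "http://Example.com/item?id=7&sid=abc",
    "http://example.com/item?sid=xyz&id=7",
    "http://example.com/item?id=7&ref=home",
]
canonical = {normalize(u) for u in duplicates}
```

Under these assumptions, all three URLs map to the single canonical form `http://example.com/item?id=7`. An automatic approach such as the one described in the abstract would aim to learn which keys or path segments can be rewritten this way from clusters of known duplicates, rather than relying on a hand-picked set.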

Index Terms

A pattern tree-based approach to learning URL normalization rules
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Information & Contributors

Information

Published In

WWW '10: Proceedings of the 19th international conference on World wide web

April 2010

1407 pages

ISBN:9781605587998

DOI:10.1145/1772690

General Chairs: Michael Rappa (North Carolina State University, USA), Paul Jones (University of North Carolina at Chapel Hill, USA)

Program Chairs: Juliana Freire (University of Utah, USA), Soumen Chakrabarti (Indian Institute of Technology, India)

Copyright © 2010 International World Wide Web Conference Committee (IW3C2).

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WWW '10

WWW '10: The 19th International World Wide Web Conference

April 26 - 30, 2010

Raleigh, North Carolina, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 25

Total Downloads: 553

Downloads (Last 12 months): 11

Downloads (Last 6 weeks): 3

Reflects downloads up to 21 Dec 2024

Citations

Cited By

Cheng Z, Cui B, Qi T, Yang W, Fu J (2021). An Improved Feature Extraction Approach for Web Anomaly Detection Based on Semantic Structure. Security and Communication Networks, 2021. Online publication date: 1-Jan-2021.
https://dl.acm.org/doi/10.1155/2021/6661124

Deshmukh S, Chittekar P (2020). Removing Dust By Metacrawler. 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184), pp. 540-544. Online publication date: Jun-2020.
https://doi.org/10.1109/ICOEI48184.2020.9142922

Cheng Z, Cui B, Fu J (2020). A Novel Web Anomaly Detection Approach Based on Semantic Structure. Security and Privacy in Social Networks and Big Data, pp. 20-33. Online publication date: 22-Sep-2020.
https://doi.org/10.1007/978-981-15-9031-3_2

Chittekar P, Deshmukh S (2019). Analysis of Metacrawler approach for URL based DUST removal by knowledge engineering systems. 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pp. 699-701. Online publication date: Mar-2019.
https://doi.org/10.1109/ICCMC.2019.8819753

Rane P, Dalal M (2018). iDUSTER: Improved Method for Removing DUST Based on Efficient Multiple Sequence Alignment Technique. 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), pp. 1450-1454. Online publication date: Jul-2018.
https://doi.org/10.1109/ICIRCA.2018.8597326

Langhi J, Jadhav S (2018). Parallel Crawling for Detection and Removal of DUST Using DUSTER. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1-5. Online publication date: Aug-2018.
https://doi.org/10.1109/ICCUBEA.2018.8697837

Xu K, Liu Z, Callan J, Kando N, Sakai T, Joho H, Li H, de Vries A, White R (2017). De-duping URLs with Sequence-to-Sequence Neural Networks. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1157-1160. Online publication date: 7-Aug-2017.
https://dl.acm.org/doi/10.1145/3077136.3080746

Agrawal R, Golshan B, Papalexakis E (2017). Homogeneity in Web Search Results. ACM Transactions on Intelligent Systems and Technology, 8(5):1-35. Online publication date: 12-Jul-2017.
https://dl.acm.org/doi/10.1145/3057731

Carvalho C, de Moura E, Veloso A, Ziviani N (2017). Website replica detection with distant supervision. Information Retrieval Journal, 21(4):253-272. Online publication date: 29-Nov-2017.
https://doi.org/10.1007/s10791-017-9320-z

Kumari C, Joshi D, Singh S (2016). Canonization rules for detecting different URLs. 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence), pp. 88-94. Online publication date: Jan-2016.
https://doi.org/10.1109/CONFLUENCE.2016.7508093
