DOI: 10.1145/1772690.1772753
research-article

A pattern tree-based approach to learning URL normalization rules

Published: 26 April 2010

Abstract

Duplicate URLs cause serious problems throughout the search engine pipeline, from crawling and indexing to result serving. URL normalization transforms duplicate URLs into a canonical form using a set of rewrite rules. It has attracted significant attention because it is lightweight and can be integrated flexibly into both the online (e.g., crawling) and the offline (e.g., index compression) parts of a search engine. To handle a large number of websites, automatic approaches are needed to learn rewrite rules for various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach that differs markedly from existing approaches. Most current approaches learn rewrite rules by iteratively generalizing local duplicate pairs into more general forms; they inevitably suffer from noisy training data and are inefficient in practice. Given a training set of URLs partitioned into duplicate clusters for a targeted website, we develop a simple yet efficient algorithm to automatically construct a URL pattern tree. The pattern tree leverages statistical information from all the training samples, which makes the learning process more robust and reliable. It also accelerates learning, because rules are summarized directly from pattern tree nodes. In addition, from an engineering perspective, the pattern tree helps select deployable rules by removing conflicts and redundancies. An evaluation on more than 70 million duplicate URLs from 200 websites showed that the proposed approach achieves very promising performance in terms of both de-duping effectiveness and computational efficiency.
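
To make the notion of a rewrite rule concrete, the following is a minimal sketch of learning and applying one simple kind of rule: dropping query parameters that vary inside a duplicate cluster. It is a deliberately simplified illustration and not the pattern tree algorithm the abstract describes; the function names `irrelevant_params` and `normalize`, the `example.com` URLs, and the choice to treat any parameter that differs within a single cluster as irrelevant are all assumptions made for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def irrelevant_params(clusters):
    """Return query keys whose value differs (or is missing) inside any duplicate cluster."""
    noisy = set()
    for cluster in clusters:
        # One {key: value} dict per URL in the cluster.
        queries = [
            dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
            for url in cluster
        ]
        for key in set().union(*queries):
            # q.get(key) is None when the key is absent, so absence counts as a difference.
            if len({q.get(key) for q in queries}) > 1:
                noisy.add(key)
    return noisy


def normalize(url, drop_keys):
    """Apply the learned rule: drop the given keys, sort the rest, discard the fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in drop_keys
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical training data: each inner list is one cluster of duplicate URLs.
clusters = [
    ["http://example.com/item?id=1&sid=abc", "http://example.com/item?id=1&sid=xyz"],
    ["http://example.com/item?id=2&ref=home", "http://example.com/item?id=2"],
]

drop = irrelevant_params(clusters)
canonical = [{normalize(url, drop) for url in cluster} for cluster in clusters]
```

In this sketch every cluster collapses to a single canonical URL, while URLs from different clusters stay distinct. A rule derived from one cluster in isolation is sensitive to a single mislabeled pair, which is the kind of noise the abstract says a global view over all training samples is meant to reduce.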

      Published In

      WWW '10: Proceedings of the 19th international conference on World wide web
      April 2010
      1407 pages
      ISBN:9781605587998
      DOI:10.1145/1772690

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. url deduplication
      2. url normalization
      3. url pattern

      Qualifiers

      • Research-article

      Conference

      WWW '10
      WWW '10: The 19th International World Wide Web Conference
      April 26 - 30, 2010
      Raleigh, North Carolina, USA

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Cited By

      • (2021) An Improved Feature Extraction Approach for Web Anomaly Detection Based on Semantic Structure. Security and Communication Networks 2021. DOI: 10.1155/2021/6661124. Online publication date: 1-Jan-2021
      • (2020) Removing Dust By Metacrawler. 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI) (48184), pages 540-544. DOI: 10.1109/ICOEI48184.2020.9142922. Online publication date: Jun-2020
      • (2020) A Novel Web Anomaly Detection Approach Based on Semantic Structure. Security and Privacy in Social Networks and Big Data, pages 20-33. DOI: 10.1007/978-981-15-9031-3_2. Online publication date: 22-Sep-2020
      • (2019) Analysis of Metacrawler approach for URL based DUST removal by knowledge engineering systems. 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pages 699-701. DOI: 10.1109/ICCMC.2019.8819753. Online publication date: Mar-2019
      • (2018) iDUSTER: Improved Method for Removing DUST Based on Efficient Multiple Sequence Alignment Technique. 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), pages 1450-1454. DOI: 10.1109/ICIRCA.2018.8597326. Online publication date: Jul-2018
      • (2018) Parallel Crawling for Detection and Removal of DUST Using DUSTER. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pages 1-5. DOI: 10.1109/ICCUBEA.2018.8697837. Online publication date: Aug-2018
      • (2017) De-duping URLs with Sequence-to-Sequence Neural Networks. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1157-1160. DOI: 10.1145/3077136.3080746. Online publication date: 7-Aug-2017
      • (2017) Homogeneity in Web Search Results. ACM Transactions on Intelligent Systems and Technology 8:5, pages 1-35. DOI: 10.1145/3057731. Online publication date: 12-Jul-2017
      • (2017) Website replica detection with distant supervision. Information Retrieval Journal 21:4, pages 253-272. DOI: 10.1007/s10791-017-9320-z. Online publication date: 29-Nov-2017
      • (2016) Canonization rules for detecting different URLs. 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence), pages 88-94. DOI: 10.1109/CONFLUENCE.2016.7508093. Online publication date: Jan-2016
