DOI: 10.1145/1645953.1646283
poster

URL normalization for de-duplication of web pages

Published: 02 November 2009

Abstract

The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing, and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and use these learnt rules for de-duplication based on URL strings alone, without explicitly fetching the content. Our technique mines the crawl logs and uses clusters of similar pages to extract specific rules from the URLs belonging to each cluster. Preserving every mined rule for de-duplication is inefficient because of the large number of specific rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against website-specific URL conventions. We demonstrate the effectiveness of our techniques through experimental evaluation.
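
To make the idea of mining a rule from a cluster of duplicate URLs concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: it assumes a cluster of URLs already known to return the same content, and it treats a query parameter as irrelevant when its value varies or when it is missing from some URLs in the cluster. The function names `mine_rule` and `normalize` and the example URLs are hypothetical, and the generalization step described in the abstract is not shown.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def mine_rule(cluster):
    """Return query keys that look irrelevant to content within one duplicate cluster.

    Assumption (not from the paper): a key is irrelevant if its value differs
    between URLs of the cluster, or if some URLs of the cluster omit it.
    """
    values = {}  # key -> set of values observed across the cluster
    counts = {}  # key -> number of URLs that carry the key
    for url in cluster:
        params = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
        for key, value in params.items():
            values.setdefault(key, set()).add(value)
            counts[key] = counts.get(key, 0) + 1
    return {
        key
        for key in values
        if len(values[key]) > 1 or counts[key] < len(cluster)
    }


def normalize(url, drop_keys):
    """Rewrite a URL to a canonical form using a mined rule.

    Drops the irrelevant keys, sorts the remaining parameters, lowercases the
    host, and discards the fragment. The page content is never fetched.
    """
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop_keys
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


# Hypothetical cluster of URLs assumed to serve identical content.
cluster = [
    "http://example.com/item?id=7&sid=abc",
    "http://example.com/item?sid=xyz&id=7",
    "http://example.com/item?id=7&sid=q1&ref=home",
]
rule = mine_rule(cluster)
canonical = normalize("http://Example.com/item?sid=new&id=7", rule)
```

With this cluster, the mined rule marks `sid` and `ref` as irrelevant and keeps `id`, so an unseen URL with a fresh `sid` maps to the same canonical string as the cluster members. Storing one such rule per cluster is exactly the footprint problem the abstract raises, which is why the authors generalize the rules rather than keep each one.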

Published In

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN: 9781605585123
DOI: 10.1145/1645953

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. decision tree
  2. page importance
  3. search engines
  4. url de-duplication

Qualifiers

  • Poster

Conference

CIKM '09

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2021) DSDD. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 2527-2536. DOI: 10.1145/3459637.3482427. Online publication date: 26-Oct-2021.
  • (2021) CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2398-2404. DOI: 10.1145/3404835.3463246. Online publication date: 11-Jul-2021.
  • (2020) Removing Dust By Metacrawler. 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184), pp. 540-544. DOI: 10.1109/ICOEI48184.2020.9142922. Online publication date: Jun-2020.
  • (2019) Feature Enhancement via User Similarities Networks for Improved Click Prediction in Yahoo Gemini Native. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2557-2565. DOI: 10.1145/3357384.3357821. Online publication date: 3-Nov-2019.
  • (2019) Analysis of Metacrawler approach for URL based DUST removal by knowledge engineering systems. 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pp. 699-701. DOI: 10.1109/ICCMC.2019.8819753. Online publication date: Mar-2019.
  • (2018) iDUSTER: Improved Method for Removing DUST Based on Efficient Multiple Sequence Alignment Technique. 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), pp. 1450-1454. DOI: 10.1109/ICIRCA.2018.8597326. Online publication date: Jul-2018.
  • (2018) Parallel Crawling for Detection and Removal of DUST Using DUSTER. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1-5. DOI: 10.1109/ICCUBEA.2018.8697837. Online publication date: Aug-2018.
  • (2017) Design of a Migrating Crawler Based on a Novel URL Scheduling Mechanism using AHP. International Journal of Rough Sets and Data Analysis, 4:1, pp. 95-110. DOI: 10.4018/IJRSDA.2017010106. Online publication date: Jan-2017.
  • (2017) Website replica detection with distant supervision. Information Retrieval Journal, 21:4, pp. 253-272. DOI: 10.1007/s10791-017-9320-z. Online publication date: 29-Nov-2017.
  • (2016) CLUE: Clustering for Mining Web URLs. 2016 28th International Teletraffic Congress (ITC 28), pp. 286-294. DOI: 10.1109/ITC-28.2016.146. Online publication date: Sep-2016.