The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing, and relevance, the core building blocks of web search. In this paper, we present a set of techniques that mine rules from URLs and use the learnt rules for de-duplication based on URL strings alone, without fetching page content. Our technique mines crawl logs and uses clusters of similar pages to extract specific rules from the URLs belonging to each cluster. Preserving every mined rule for de-duplication is inefficient because the number of specific rules is large. We therefore present a machine learning technique that generalizes the set of rules, reducing the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site-specific URL conventions. We demonstrate the effectiveness of our techniques through experimental evaluation.
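
As a rough illustration of the idea of mining rules from clusters of duplicate URLs, the sketch below treats a query parameter as irrelevant for a host when its value changes (or it is missing) among URLs already known to point at the same content. This is a simplified assumption made for illustration, not the rule-mining or generalization procedure of the paper; the function names `mine_rules` and `normalize` and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def mine_rules(clusters):
    """Map each host to the query parameters that vary inside a duplicate cluster.

    `clusters` is an iterable of lists of URLs; every list is assumed to hold
    URLs whose pages were already judged to be duplicates of one another.
    """
    rules = defaultdict(set)
    for cluster in clusters:
        # Group the parsed query strings of this cluster by host.
        by_host = defaultdict(list)
        for url in cluster:
            parts = urlsplit(url)
            params = dict(parse_qsl(parts.query, keep_blank_values=True))
            by_host[parts.netloc.lower()].append(params)
        for host, param_dicts in by_host.items():
            keys = set().union(*param_dicts)
            for key in keys:
                # A missing parameter shows up as None, so presence/absence
                # also counts as variation.
                if len({d.get(key) for d in param_dicts}) > 1:
                    rules[host].add(key)
    return rules


def normalize(url, rules):
    """Drop the mined irrelevant parameters and sort the remaining ones."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    drop = rules.get(host, set())
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in drop
    )
    return urlunsplit((parts.scheme, host, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    clusters = [[
        "http://example.com/item?id=7&sid=abc",
        "http://example.com/item?id=7&sid=xyz",
        "http://example.com/item?id=7",
    ]]
    rules = mine_rules(clusters)
    print(normalize("http://example.com/item?sid=q1&id=9", rules))
```

Two URLs that normalize to the same string can then be treated as duplicate candidates without fetching either page. A per-parameter rule of this kind is deliberately coarse; it says nothing about path-level conventions or about how many specific rules would need to be stored at web scale.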

Index Terms

URL normalization for de-duplication of web pages
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

November 2009

2162 pages

ISBN:9781605585123

DOI:10.1145/1645953

General Chairs: David Cheung (University of Hong Kong, Hong Kong), Il-Yeol Song (Drexel University, USA)

Program Chairs: Wesley Chu (UCLA, USA), Xiaohua Hu (Drexel University, USA), Jimmy Lin (University of Maryland, USA)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 November 2009

Qualifiers

Poster

Conference

CIKM '09

CIKM '09: Conference on Information and Knowledge Management

November 2 - 6, 2009

Hong Kong, China

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
