DOI: 10.1145/1363686.1363953
research-article

The impact of term selection in genre-aware focused crawling

Published: 16 March 2008

Abstract

The genre-aware approach to focused crawling aims at crawling pages related to specific topics that can be expressed in terms of both genre and content information. Such an approach requires an expert to specify a set of terms that describe the genre and the content of the pages of interest. In this paper, we analyze the impact of term selection on this approach. To this end, we performed an experimental study in which we vary the number of genre and content terms used in focused crawling processes that target two kinds of pages: syllabi (genre) of computer science courses (subject) and sale offers (genre) of computer equipment (subject). This study showed that a small set of terms selected by an expert is usually enough to produce good results. In addition, we propose and experimentally evaluate a strategy for the semi-automatic generation of terms to be used in such an approach. The results of these experiments showed that the strategy is very effective and can assist an expert in specifying the required sets of terms.
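
As an illustration of how separate genre and content term sets might drive a relevance decision, the following is a minimal sketch and not the procedure evaluated in the paper. It assumes cosine similarity between a page's term frequencies and each term set, combined by a weighted sum; the abstract states neither choice. The names `tokenize`, `cosine`, `page_score`, and `genre_weight` are hypothetical, and term sets are assumed to hold single lowercase words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(page_tokens, terms):
    """Cosine similarity between a page's term-frequency vector
    and a binary vector that marks the given term set."""
    counts = Counter(page_tokens)
    dot = sum(counts[t] for t in terms)
    page_norm = math.sqrt(sum(c * c for c in counts.values()))
    term_norm = math.sqrt(len(terms))
    if page_norm == 0 or term_norm == 0:
        return 0.0
    return dot / (page_norm * term_norm)


def page_score(text, genre_terms, content_terms, genre_weight=0.5):
    """Weighted sum of genre and content similarity.
    The weighted sum is an assumption made for this sketch."""
    tokens = tokenize(text)
    genre_sim = cosine(tokens, genre_terms)
    content_sim = cosine(tokens, content_terms)
    return genre_weight * genre_sim + (1 - genre_weight) * content_sim
```

Under these assumptions, a crawler would follow links from pages whose score exceeds a chosen threshold. Varying the size of `genre_terms` and `content_terms` mirrors, in spirit, the kind of experiment the abstract describes.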


Cited By

  • (2021) DSDD. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 2527-2536. DOI: 10.1145/3459637.3482427. Online publication date: 26-Oct-2021
  • (2013) A user-oriented web crawler for selectively acquiring online content in e-health research. Bioinformatics 30:1, pp. 104-114. DOI: 10.1093/bioinformatics/btt571. Online publication date: 29-Sep-2013
  • (2010) A Crawler for Local Search. Proceedings of the 2010 Fourth International Conference on Digital Society, pp. 86-91. DOI: 10.1109/ICDS.2010.23. Online publication date: 10-Feb-2010

    Published In

    SAC '08: Proceedings of the 2008 ACM symposium on Applied computing
    March 2008
    2586 pages
    ISBN: 9781595937537
    DOI: 10.1145/1363686
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. focused crawling
    2. terms selection
    3. web crawling

    Qualifiers

    • Research-article

    Conference

    SAC '08
    SAC '08: The 2008 ACM Symposium on Applied Computing
    March 16 - 20, 2008
    Fortaleza, Ceara, Brazil

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

    Upcoming Conference

    SAC '25
    The 40th ACM/SIGAPP Symposium on Applied Computing
    March 31 - April 4, 2025
    Catania, Italy
