Abstract
The Web is a vast resource for business intelligence. However, its size and dynamic nature make it challenging to forage for relevant information. General-purpose search engines and business portals can be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then build on this basic intelligence to support in-depth and up-to-date research. In this paper we investigate the use of topical crawlers to create a small document collection that helps locate relevant business entities. The problem of locating business entities arises when an organization looks for competitors, partners, or acquisitions. We formalize the problem, create a test bed, introduce metrics to measure crawler performance, and compare the results of four different crawlers. Our results underscore the importance of identifying good hubs and of exploiting link contexts based on tag trees, both for accelerating the crawl and for improving the overall results.
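To make the idea of a crawler guided by link context concrete, the following is a minimal illustrative sketch of a best-first topical crawler. It is not one of the four crawlers evaluated in the paper. The names best_first_crawl, fetch, and TOY_WEB are hypothetical, the "Web" is a small in-memory dictionary, and the link context is simply assumed to be a short text string attached to each link rather than something derived from an HTML tag tree.

```python
import heapq
import math
from collections import Counter


def tokens(text):
    """Bag-of-words term counts for a piece of text (no stemming)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_first_crawl(seeds, topic, fetch, max_pages):
    """Visit up to max_pages URLs, always expanding the best-scored link.

    fetch(url) is an assumed callback returning (link_context, target_url)
    pairs for the page at url. Each unseen target is scored by the
    similarity between the topic and the link's context text.
    """
    topic_vec = tokens(topic)
    # heapq is a min-heap, so scores are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for context, target in fetch(url):
            if target not in seen:
                seen.add(target)
                score = cosine(topic_vec, tokens(context))
                heapq.heappush(frontier, (-score, target))
    return visited


# Hypothetical toy graph: page -> list of (link context, target page).
TOY_WEB = {
    "hub": [("biotech company partners", "a"), ("sports news today", "b")],
    "a": [("biotech company products", "c")],
    "b": [("football scores", "d")],
    "c": [],
    "d": [],
}

order = best_first_crawl(
    ["hub"], "biotech company", lambda u: TOY_WEB.get(u, []), max_pages=3
)
print(order)
```

With a budget of three pages, the sketch follows the two links whose contexts mention the topic and never spends a fetch on the off-topic branch, which is the behavior a topical crawler aims for when page fetches are the scarce resource.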
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pant, G., Menczer, F. (2003). Topical Crawling for Business Intelligence. In: Koch, T., Sølvberg, I.T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2003. Lecture Notes in Computer Science, vol 2769. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45175-4_22
DOI: https://doi.org/10.1007/978-3-540-45175-4_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40726-3
Online ISBN: 978-3-540-45175-4
eBook Packages: Springer Book Archive