DOI: 10.1145/2492517.2500328
research-article

An analyst-adaptive approach to focused crawlers

Published: 25 August 2013

Abstract

The paper presents a general methodology for implementing a flexible Focused Crawler for investigation, monitoring, and Open Source Intelligence (OSINT). The resulting tool is specifically designed to meet the operational requirements of law-enforcement agencies and intelligence analysts. The architecture of the semantic Focused Crawler offers static flexibility in the definition of the desired concepts, the metrics used, and the crawling strategy; in addition, the method can learn, and adapt to, the analyst's expectations at runtime. The user can give the crawler binary (yes/no) feedback on the current performance of the surfing process, and the crawling engine progressively refines the expected targets accordingly. The implementation is based on an existing text-mining environment, integrated with semantic networks and ontologies. Experimental results confirm the effectiveness of the adaptive mechanism.
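
The abstract does not describe how the feedback is turned into refined targets, so the following Python sketch is only a rough illustration of the general idea, not the authors' method. It assumes that pages are scored by weighted concept counts and that a yes/no answer from the analyst nudges the weights of the concepts found in the judged page. The names `AdaptiveScorer`, `crawl_order`, and `learning_rate`, the additive update rule, and the sample pages are all hypothetical.

```python
import heapq
from collections import Counter


class AdaptiveScorer:
    """Scores text against weighted concepts; weights shift with yes/no feedback."""

    def __init__(self, concepts, learning_rate=0.5):
        # Every concept starts with the same weight (an arbitrary assumption).
        self.weights = {c: 1.0 for c in concepts}
        self.learning_rate = learning_rate

    def score(self, text):
        # Relevance is the weighted count of concept words in the text.
        counts = Counter(text.lower().split())
        return sum(w * counts[c] for c, w in self.weights.items())

    def feedback(self, text, relevant):
        """Raise (yes) or lower (no) the weight of each concept found in text."""
        words = set(text.lower().split())
        step = self.learning_rate if relevant else -self.learning_rate
        for c in self.weights:
            if c in words:
                # Weights are clamped at zero so a concept never counts negatively.
                self.weights[c] = max(0.0, self.weights[c] + step)


def crawl_order(pages, scorer):
    """Return page ids from most to least relevant under the current weights."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    heap = [(-scorer.score(text), page_id) for page_id, text in pages.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Hypothetical pages standing in for fetched documents.
pages = {
    "a": "bank fraud report fraud",
    "b": "forum thread about bank fees",
    "c": "bank bank bank bank holiday",
}
scorer = AdaptiveScorer(["fraud", "bank"])
before = crawl_order(pages, scorer)          # page "c" ranks first
scorer.feedback(pages["c"], relevant=False)  # analyst answers "no" for page "c"
after = crawl_order(pages, scorer)           # page "a" now ranks first
```

In this toy run a single "no" halves the weight of `bank`, which is enough to move the fraud-related page ahead of the page that merely repeats `bank`. A real system would presumably score pages with the semantic networks and ontologies the abstract mentions rather than raw word counts.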


Published In

ASONAM '13: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
August 2013
1558 pages
ISBN:9781450322409
DOI:10.1145/2492517
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OSINT
  2. analyst-adaptation
  3. focused crawler

Qualifiers

  • Research-article

Conference

ASONAM '13: Advances in Social Networks Analysis and Mining 2013
August 25-28, 2013
Niagara, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 116 of 549 submissions, 21%

Cited By

  • (2024) A systematic review on research utilising artificial intelligence for open source intelligence (OSINT) applications. International Journal of Information Security 23:4 (2911-2938). DOI: 10.1007/s10207-024-00868-2. Online publication date: 1-Aug-2024
  • (2017) A survey of Web crawlers for information retrieval. WIREs Data Mining and Knowledge Discovery 7:6. DOI: 10.1002/widm.1218. Online publication date: 7-Aug-2017
  • (2015) Real-time monitoring of Twitter traffic by using semantic networks. Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (966-969). DOI: 10.1145/2808797.2809371. Online publication date: 25-Aug-2015
  • (2015) Content-Adaptive Analysis and Filtering of Microblogs Traffic for Event-Monitoring Applications. Proceedings of the 18th Asia Pacific Symposium on Intelligent and Evolutionary Systems, Volume 1 (155-170). DOI: 10.1007/978-3-319-13359-1_13. Online publication date: 2015
