Research article, DOI: 10.1145/2245276.2231954

An architecture-centered framework for developing blog crawlers

Published: 26 March 2012

Abstract

Blogs have become important tools for generating and sharing knowledge. Activity on blogs doubles every two hundred days. Numerous applications could use this massive daily flow of information to derive interesting interpretations. However, the dynamic nature of the blogosphere hinders manual information extraction, which motivates the development of new automated approaches. In this paper, we propose a component-based framework, grounded in software architecture, for creating blog crawlers. The framework provides useful services for blog analysis, including preprocessing, indexing, content extraction, classification, and tag recommendation. In addition, we report a case study: a blog recommendation system that supports student interactions in educational forums. This work also aims to demonstrate how the proposed framework reduces the effort of creating a blog analysis application. Finally, other aspects of the developed application, such as the impact of system evolution, reusability, and instantiation cost, are discussed qualitatively.
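
As a rough illustration of the component-based idea described in the abstract, the sketch below chains a few of the named services (content extraction, preprocessing, indexing, tag recommendation) as interchangeable components. It is not the authors' framework: the names `Post`, `Pipeline`, `Indexer`, `extract_content`, `preprocess`, and `recommend_tags` are hypothetical, and each service is a deliberately naive stand-in chosen only to keep the example self-contained.

```python
import re
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class Post:
    """Hypothetical record passed between components."""
    url: str
    raw_html: str
    text: str = ""
    tokens: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


# Assumption: a component is any callable that takes a Post and returns it.
Component = Callable[[Post], Post]


def extract_content(post: Post) -> Post:
    # Naive stand-in for content extraction: strip tags, collapse whitespace.
    stripped = re.sub(r"<[^>]+>", " ", post.raw_html)
    post.text = " ".join(stripped.split())
    return post


def preprocess(post: Post) -> Post:
    # Naive stand-in for preprocessing: lowercase alphabetic tokens.
    post.tokens = re.findall(r"[a-z]+", post.text.lower())
    return post


class Indexer:
    """Naive inverted index: token -> set of post URLs."""

    def __init__(self) -> None:
        self.index: Dict[str, Set[str]] = defaultdict(set)

    def __call__(self, post: Post) -> Post:
        for token in post.tokens:
            self.index[token].add(post.url)
        return post


def recommend_tags(post: Post, top_n: int = 2) -> Post:
    # Naive stand-in for tag recommendation: most frequent longer tokens.
    counts = Counter(t for t in post.tokens if len(t) > 3)
    post.tags = [word for word, _ in counts.most_common(top_n)]
    return post


class Pipeline:
    """Runs components in order; any component can be swapped or removed."""

    def __init__(self, components: Iterable[Component]) -> None:
        self.components = list(components)

    def run(self, post: Post) -> Post:
        for component in self.components:
            post = component(post)
        return post


indexer = Indexer()
pipeline = Pipeline([extract_content, preprocess, indexer, recommend_tags])
result = pipeline.run(
    Post(
        url="http://example.org/p1",
        raw_html="<p>Crawlers index blogs. Crawlers classify blogs.</p>",
    )
)
print(result.tags)
```

Because every service shares the same call signature, replacing the naive tag recommender with a real classifier, or dropping the indexer, only changes the list passed to `Pipeline`. This is one simple way to picture the reuse and low instantiation cost the abstract refers to.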


Published In

SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing
March 2012
2179 pages
ISBN:9781450308571
DOI:10.1145/2245276
Conference Chairs: Sascha Ossowski, Paola Lecca
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. architecture
  2. blog crawler
  3. component-based development
  4. framework
  5. recommendation system

Qualifiers

  • Research-article

Conference

SAC 2012
SAC 2012: ACM Symposium on Applied Computing
March 26 - 30, 2012
Trento, Italy

Acceptance Rates

SAC '12 Paper Acceptance Rate 270 of 1,056 submissions, 26%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
