DOI: 10.1145/1378889.1378952
research-article

Perception-oriented online news extraction

Published: 16 June 2008

Abstract

This paper presents a novel approach to online news extraction based on human perception. The approach simulates how a human perceives and identifies online news content. It first detects news areas based on the content function, spatial continuity, and formatting continuity of news information. It then identifies the detailed news content based on the position, format, and semantics of the detected news areas. Experimental results show that the approach performs much better (an F1 score above 99% on average) than previous approaches such as those based on Tree Edit Distance and Visual Wrappers. Furthermore, the approach does not assume that the tested Web pages share Web templates, as the Tree Edit Distance based approach requires, nor does it need training sets, as the Visual Wrapper based approach requires. The success of the approach demonstrates the strength of the perception-oriented Web information extraction methodology, which is promising for automatic information extraction from sources whose presentation is designed for humans.
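The abstract reports results as an F1 value, the harmonic mean of precision and recall. The following is a minimal sketch of how such a score could be computed for one extracted article. The token-level comparison, the function name `f1_score`, and the sample strings are illustrative assumptions, not the evaluation protocol used in the paper.

```python
from collections import Counter


def f1_score(extracted_tokens, reference_tokens):
    """Token-level F1 between extracted text and a hand-labeled reference.

    Assumption: tokens are compared as multisets; the paper does not state
    the granularity of its evaluation.
    """
    extracted = Counter(extracted_tokens)
    reference = Counter(reference_tokens)
    # Multiset intersection: tokens present in both, with multiplicity.
    overlap = sum((extracted & reference).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(extracted.values())
    recall = overlap / sum(reference.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the extractor kept some navigation text
# ("subscribe now") and missed the end of the sentence.
extracted = "markets rallied on friday subscribe now".split()
reference = "markets rallied on friday after the report".split()
score = f1_score(extracted, reference)
print(f"F1 = {score:.3f}")
```

Here four tokens overlap, so precision is 4/6 and recall is 4/7. Extra boilerplate in the output lowers precision, while missed article text lowers recall, so a high F1 requires both to be high.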

References

[1] Chen, J., Zhou, B., Shi, J., Zhang, H., and Wu, Q. 2001. Function-based Object Model Towards Website Adaptation. Proc. of WWW-10. 587--596.
[2] Freitag, D., and Kushmerick, N. 2000. Boosted wrapper induction. AAAI/IAAI 2000. 577--583.
[3] Geng, J., and Yang, J. 2004. Automatic extraction and integration of bibliographic information on the Web. IDEAS '04. 193--04.
[4] Gu, X., Chen, J., Ma, W., and Chen, G. 2002. Visual Based Content Understanding towards Web Adaptation. 2nd Intl. Conf. on Adaptive Hypermedia and Adaptive Web Based Systems. 164--173.
[5] Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., and Teixeira, J. S. 2002. A brief survey of web data extraction tools. SIGMOD Record 31(2). 84--93.
[6] Muslea, I., Minton, S., and Knoblock, C. 2001. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems 4(1/2). 93--114.
[7] Reis, D. C., Golgher, P. B., Silva, A. S., and Laender, A. F. 2004. Automatic web news extraction using tree edit distance. WWW2004. 502--511.
[8] Skounakis, M., Craven, M., and Ray, S. 2003. Hierarchical hidden Markov models for information extraction. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence.
[9] Zheng, S., Song, R., and Wen, J. 2007. Template-independent news extraction based on visual consistency. AAAI-2007. 1507--1513.
[10] Zhong, P., and Chen, J. 2008. Web Information Extraction Using Web-Specific Features. Journal of Digital Information Management (to appear).

      Published In

      JCDL '08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
      June 2008
      490 pages
      ISBN:9781595939982
      DOI:10.1145/1378889
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. information extraction
      2. online news
      3. web

      Qualifiers

      • Research-article

      Conference

      JCDL08
      JCDL08: Joint Conference on Digital Libraries
      June 16 - 20, 2008
      Pittsburgh, PA, USA

      Acceptance Rates

      JCDL '08 Paper Acceptance Rate 33 of 117 submissions, 28%;
      Overall Acceptance Rate 415 of 1,482 submissions, 28%

      Cited By

      • (2023) Newsgist: video generation from news stories. Automatika 64:4 (1026-1037). DOI: 10.1080/00051144.2023.2241774. Online publication date: 2-Aug-2023
      • (2022) Collaborative Approach Toward Information Retrieval System to Get Relevant News Articles Over Web: IRS-Web. Computational Intelligence and Data Analytics (461-474). DOI: 10.1007/978-981-19-3391-2_35. Online publication date: 2-Sep-2022
      • (2018) CADEN: A Context-Aware Deep Embedding Network for Financial Opinions Mining. 2018 IEEE International Conference on Data Mining (ICDM) (757-766). DOI: 10.1109/ICDM.2018.00091. Online publication date: Nov-2018
      • (2016) Specification and discovery of web patterns. Information Sciences: an International Journal 328:C (528-545). DOI: 10.1016/j.ins.2015.08.052. Online publication date: 20-Jan-2016
      • (2015) Extracting news content with visual unit of web pages. 2015 IEEE/ACIS 16th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) (1-5). DOI: 10.1109/SNPD.2015.7176203. Online publication date: Jun-2015
      • (2015) Extracting News from Server Side Databases by Query Interfaces. Journal of Computer Information Systems 54:2 (57-65). DOI: 10.1080/08874417.2014.11645686. Online publication date: 10-Dec-2015
      • (2012) Web Interface Interpretation Using Graph Grammars. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42:4 (590-602). DOI: 10.1109/TSMCC.2011.2171335. Online publication date: 1-Jul-2012
      • (2012) Automatic web page segmentation and information extraction using conditional random fields. Proceedings of the 2012 IEEE 16th International Conference on Computer Supported Cooperative Work in Design (CSCWD) (334-340). DOI: 10.1109/CSCWD.2012.6221840. Online publication date: May-2012
      • (2009) An automatic web news article contents extraction system based on RSS feeds. Journal of Web Engineering 8:3 (268-284). DOI: 10.5555/2011294.2011297. Online publication date: 1-Sep-2009
      • (2009) A fast and simple method for extracting relevant content from news webpages. Proceedings of the 18th ACM conference on Information and knowledge management (1685-1688). DOI: 10.1145/1645953.1646204. Online publication date: 2-Nov-2009
