DOI: 10.1145/1526709.1526735
research-article

Incorporating site-level knowledge to extract structured data from web forums

Published: 20 April 2009 Publication History

Abstract

Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages remains challenging because of both complex page layouts and unrestricted user-created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our goal is a solution general enough to extract structured data, such as post title, post author, post time, and post content, from any forum site. In contrast to most existing information extraction methods, which leverage only the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to integrate all useful evidence by learning its importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. Experimental results on 20 forums show very encouraging information extraction performance and demonstrate the effectiveness of the proposed approach on various forums. We also show that performance is limited when only page-level knowledge is used, and that incorporating site-level knowledge significantly improves both precision and recall.
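To make the idea of combining page-level and site-level evidence in an MLN concrete, the following is a minimal illustrative sketch, not the formulas, weights, or inference procedure used in the paper. It assumes two hypothetical candidate nodes (`n1`, `n2`), a hypothetical page-level cue (a profile link), and a hypothetical site-level cue (the two nodes occupy aligned positions on pages of the same thread). Weights are set by hand here, whereas the paper learns them, and the marginal is computed by brute-force enumeration of worlds, which is only feasible for toy examples.

```python
import itertools
import math

# Query atoms (unknown labels). All names are hypothetical, for illustration only.
ATOMS = ["author_n1", "author_n2"]

# Observed evidence, also hypothetical.
EVIDENCE = {
    "profile_link_n1": True,    # page-level cue present for n1
    "profile_link_n2": False,   # page-level cue missing for n2
    "aligned_n1_n2": True,      # site-level cue: n1 and n2 are aligned across pages
}


def implies(a, b):
    """Logical implication a => b."""
    return (not a) or b


# Each formula is a (weight, predicate-over-world) pair; weights are hand-set here.
PAGE_LEVEL = [
    (1.5, lambda w: implies(w["profile_link_n1"], w["author_n1"])),
    (1.5, lambda w: implies(w["profile_link_n2"], w["author_n2"])),
]
SITE_LEVEL = [
    # Aligned nodes should receive the same label.
    (2.0, lambda w: implies(w["aligned_n1_n2"], w["author_n1"] == w["author_n2"])),
]


def marginal(query, atoms, formulas, evidence):
    """P(query = True | evidence) under the log-linear MLN distribution,
    where each world scores exp(sum of weights of satisfied formulas)."""
    total = 0.0
    matching = 0.0
    for values in itertools.product([False, True], repeat=len(atoms)):
        world = {**evidence, **dict(zip(atoms, values))}
        score = math.exp(sum(weight for weight, f in formulas if f(world)))
        total += score
        if world[query]:
            matching += score
    return matching / total


p_page_only = marginal("author_n2", ATOMS, PAGE_LEVEL, EVIDENCE)
p_with_site = marginal("author_n2", ATOMS, PAGE_LEVEL + SITE_LEVEL, EVIDENCE)
print(f"page-level only: {p_page_only:.3f}")
print(f"page + site level: {p_with_site:.3f}")
```

In this toy setting, the page-level formulas alone say nothing about `n2` because its cue is missing, so its marginal stays at chance. Adding the site-level formula lets the evidence for `n1` propagate to `n2`, which mirrors the abstract's claim that site-level knowledge helps where page-level knowledge is insufficient.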


        Information & Contributors

        Information

        Published In

        WWW '09: Proceedings of the 18th international conference on World wide web
        April 2009
        1280 pages
        ISBN:9781605584874
        DOI:10.1145/1526709

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Markov logic networks (MLNs)
        2. information extraction
        3. site-level knowledge
        4. structured data
        5. web forums

        Qualifiers

        • Research-article

        Conference

        WWW '09
        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months): 7
        • Downloads (Last 6 weeks): 3
        Reflects downloads up to 09 Jan 2025

        Citations

        Cited By

        • (2021) A Prescriptive Approach For Structured Information Extraction From Web Forums And Social Media. 2021 International Symposium on Computer Science and Intelligent Controls (ISCSIC), pages 95-101. DOI: 10.1109/ISCSIC54682.2021.00028. Online publication date: Nov-2021
        • (2020) Integrating researchers’ scientific production information through Ogmios. Knowledge and Information Systems. DOI: 10.1007/s10115-020-01479-8. Online publication date: 16-Jun-2020
        • (2019) Automatic keyphrase extraction using word embeddings. Soft Computing, 24:8, pages 5593-5608. DOI: 10.1007/s00500-019-03963-y. Online publication date: 29-Mar-2019
        • (2018) Web Forum Retrieval and Text Analytics. Foundations and Trends in Information Retrieval, 12:1, pages 1-163. DOI: 10.1561/1500000062. Online publication date: 3-Jan-2018
        • (2018) Extraction of Data from Mass Media Web Sites. Programming and Computing Software, 44:5, pages 344-352. DOI: 10.1134/S0361768818050092. Online publication date: 1-Sep-2018
        • (2017) Web forum crawling using text-based filters. 2017 IEEE International Conference on Power, Control, Signals and Instrumentation Engineering (ICPCSI), pages 1856-1859. DOI: 10.1109/ICPCSI.2017.8392037. Online publication date: Sep-2017
        • (2016) A survey of methods for the extraction of information from Web resources. Programming and Computing Software, 42:5, pages 279-291. DOI: 10.1134/S0361768816050078. Online publication date: 1-Sep-2016
        • (2016) MAVE. IEEE Transactions on Knowledge and Data Engineering, 28:9, pages 2393-2406. DOI: 10.1109/TKDE.2016.2573302. Online publication date: 1-Sep-2016
        • (2016) Then and Now. IEEE Transactions on Emerging Topics in Computing, 4:1, pages 35-46. DOI: 10.1109/TETC.2015.2397395. Online publication date: 1-Jan-2016
        • (2016) Identifying the role of individual user messages in an online discussion and its use in thread retrieval. Journal of the Association for Information Science and Technology, 67:2, pages 276-288. DOI: 10.1002/asi.23373. Online publication date: 1-Feb-2016
