research-article

Mining Web Pages for Data Records

Published: 01 November 2004

Abstract

Much information on the Web is contained in regularly structured objects, or data records. Data records often present their host pages' essential information, such as lists of products and services. Mining data records to extract this information can help you provide value-added services. Existing approaches to data extraction on the Web include supervised learning and automatic techniques. Supervised learning requires substantial human effort, and current automatic techniques provide poor results. To solve this problem, the MDR (mining data records) system exploits two key observations about the layout of data records in Web pages and employs a string-matching algorithm. Experiments show that this new automatic technique significantly outperforms existing methods. In addition, it mines both contiguous and noncontiguous data records.
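The abstract says that MDR employs a string-matching algorithm but does not give its details. As a rough illustration of how string matching can flag repeated structure in a page, the sketch below compares the HTML tag sequences of candidate regions using a normalized edit distance. The function names, the 0.3 threshold, and the sample tag sequences are assumptions made for this example and are not taken from the article.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, lists of tag names)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer sequence's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def looks_like_same_record_type(tags_a, tags_b, threshold=0.3):
    """Hypothetical test: similar tag sequences suggest similar data records."""
    return normalized_distance(tags_a, tags_b) <= threshold


# Illustrative tag sequences (invented for this sketch).
record_1 = ["tr", "td", "img", "td", "a", "td", "b"]
record_2 = ["tr", "td", "img", "td", "a", "td", "i"]
nav_bar = ["div", "ul", "li", "a", "li", "a"]

print(looks_like_same_record_type(record_1, record_2))  # similar layout
print(looks_like_same_record_type(record_1, nav_bar))   # different layout
```

Comparing tag sequences rather than visible text reflects the idea that records of the same kind tend to share layout even when their content differs. The threshold would need tuning on real pages.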


Cited By

  • (2012) The HiLeX system for semantic information extraction. Transactions on Large-Scale Data- and Knowledge-Centered Systems V, pp. 91-125. DOI: 10.5555/2184170.2184175. Online publication date: 1-Jan-2012.
  • (2009) RENS: Enabling a Robot to Identify a Person. Proceedings of the 2nd International Conference on Intelligent Robotics and Applications, pp. 43-54. DOI: 10.1007/978-3-642-10817-4_5. Online publication date: 16-Dec-2009.
  • (2007) Extraction of user-defined data blocks using the regularity of dynamic web pages. Proceedings of the intelligent computing 3rd international conference on Advanced intelligent computing theories and applications, pp. 123-133. DOI: 10.5555/1777454.1777470. Online publication date: 21-Aug-2007.
  • (2006) Automated extraction of hit numbers from search result pages. Proceedings of the 7th international conference on Advances in Web-Age Information Management, pp. 73-84. DOI: 10.1007/11775300_7. Online publication date: 17-Jun-2006.

    Published In

    IEEE Intelligent Systems, Volume 19, Issue 6
    November 2004
    93 pages

    Publisher

    IEEE Educational Activities Department

    United States

    Author Tags

    1. Web data
    2. Web data extraction
    3. Web mining
    4. data mining
    5. databases
