[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

STEM: a suffix tree-based method for web data records extraction

Published: 01 May 2018 Publication History

Abstract

To automatically extract data records from Web pages, the data record extraction algorithm is required to be robust and efficient. However, most of existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The promising experimental results have demonstrated that the proposed STEM is superior to the state-of-the-art algorithms like MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear to the total number of HTML tags contained in Web pages, which indicates the potential applicability of STEM in a wide range of Web-scale data record extraction applications.

References

[1]
Laender A, Riberro-Neto B, Silva A, Teixira J (2002) A brief survey of web data extraction tools. In: SIGMOD record, vol 31(2). ACM, New York, pp 84---93
[2]
Chang C-H, Kayed M, Girgis MR, Shaala KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng. 18(10):1411---1428
[3]
Sleiman H, Corchuelo R et al (2013) A survey on region extractors from web documents. IEEE Trans Knowl Data Eng 25(9):1960---1981
[4]
Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl Based Syst 70:301---323
[5]
Miao G, Tatemura J, Hsiung W-P, Sawires A, Moser L (2009) Extracting data records from the web using tag path clustering. In: Proceedings of the 18th international conference on World Wide Web (WWW). ACM, New York, pp 981---990
[6]
Su W, Wang J, Wang J, Lochovsky FH, Liu Y (2012) Combing tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186---1200
[7]
Liu B, Grossman R, Zhai Y (2003) Mining data records in web pages. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 601---606
[8]
Baumgartner R, Gottlob G, Herzog M (2009) Scalable web data extraction for online market intelligence. Proc VLDB Endow 2(2):1512---1523
[9]
Liu L, Pu C, Han W (2000) Xwrap: an xml-enabled wrapper construction system for web information sources. In: Proceedings of the IEEE 16th international conference on data engineering, pp 611---621
[10]
Gulhane P, Madaan A, Mehta R, et al (2011) Web-scale information extraction with vertex. In: ICDE, pp 209---220
[11]
Etzioni O, Cafarella M, Downey D, Kok S, Popescu A-M, Shaked T, Soderland S, Weld DS, Yates A (2004) Web-scale information extraction in knowitall (preliminary results). In: Proceedings of the 13th international conference on World Wide Web. ACM, New York, pp 100---110
[12]
Gupta S, Kaiser G, Neistadt D, Grimm P (2003) Dom-based content extraction of html documents. In: Proceedings of the 12th international conference on World Wide Web, pp 207---214
[13]
Gupta S, Kaiser GE, Grimm P, Chiang MF, Starren J (2005) Automating content extraction of html documents. In: Proceedings of the 14th international conference on World Wide Web, vol 8, no. 2, pp 179---224, 6
[14]
Zheng S, Song R, Wen J-R, Giles CL (2009) Efficient record-level wrapper induction. In: Proceedings of the 18th international conference on information and knowledge management (CIKM). ACM, New York, pp 47---56
[15]
Dalvi N, Kumar R, Soliman M (2011) Automatic wrappers for large scale web extraction. Proc VLDB Endow 4(4):219---230
[16]
Buttler D, Liu L, Pu C (2001) A fully automated object extraction system for the world wide web. In: Proceedings of IEEE 21st international conference on distributed computing systems, pp 361---370
[17]
Chang C, Lui S (2001) Iepad: information extraction based on pattern discovery. In: Proceedings of the 10th international conference on World Wide Web. ACM, New York, pp 681---688
[18]
Kayed M (2012) Peer matrix alignment: a new algorithm. In: Advances in knowledge discovery and data mining, pp 268---279
[19]
Zhai Y, Liu B (2005) Web data extraction based on partial tree alignment. In: Proceedings of the 14th international conference on World Wide Web (WWW). ACM, New York, pp 76---85
[20]
Liu B, Zhai Y (2005) Net-a system for extracting web data from flat and nested data records. In: WISE, vol 2005. Springer, pp 487---495
[21]
Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: Proceedings of ACM SIGMOD international conference on management of data, pp 337---348
[22]
Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614---1628
[23]
Zhao H, Meng W, Yu C (2006) Automatic extraction of dynamic record sections from search engine result pages. In: Proceedings of the 32nd international conference on very large data bases, pp 989---1000
[24]
Bing L, Lam W, Gu Y (2011) Towards a unified solution: Data record region detection and segmentation. In: Proceedings of the 20th ACM international conference on information and knowledge management, pp 1265---1274
[25]
Gatterbauer W, Bohunsky P, Herzog M, Krüpl B, Pollak B (2007) Towards domain-independent information extraction from web tables. In: Proceedings of the 16th international conference on World Wide Web. ACM, New York, pp 71---80
[26]
Cai D, Yu S, Wen J-R, Ma W-Y (2003) Extracting content structure for web pages based on visual representation. In: Web technologies and applications, pp 406---417
[27]
Simon K, Lausen G (2005) Viper: augmenting automatic information extraction with visual perceptions. In: Proceedings of the 14th ACM international conference on information and knowledge management, pp 381---388
[28]
Bing L, Lam W, Wong T-L (2013) Robust detection of semi-structured web records using a dom structure-knowledge-driven model. ACM Trans Web 7(4):21:1---21:32
[29]
Zhao H, Meng W, Wu Z, Raghavan V, Yu C (2005) Fully automatic wrapper generation for search engines. In: Proceedings of the 14th international conference on World Wide Web (WWW). ACM, New York, pp 66---75
[30]
Liu W, Meng X, Meng W (2010) Vide: a vision-based approach for deep web data extraction. IEEE Trans Knowl Data Eng 22(3):447---460
[31]
Fumarola F, Weninger T, Barber R, Malerba D, Han J (2011) Hylien: a hybrid approach to general list extraction on the web. In: Proceedings of the 20th international conference on World Wide Web (WWW), pp 35---36
[32]
Furche T, Gottlob G, Grasso G, Guo X, Orsi G, Schallhart C, Wang C (2014) Diadem: thousands of websites to a single database. Proc VLDB Endow 7(14):1845---1856
[33]
Crescenzi V, Merialdo P, Qiu D (2013) Alfred: crowd assisted data extraction. In: Proceedings of the 22th international conference on World Wide Web (WWW), pp 297---300
[34]
Kondreddi SK, Triantafillou P, Weikum G (2014) Combining information extraction and human computing for crowdsourced knowledge acquisition. In: IEEE 30th international conference on data engineering, pp 988---999
[35]
Pasternack J, Roth D (2009) Extracting article text from the web with maximum subsequence segmentation. In: Proceedings of the 18th international conference on World Wide Web (WWW), pp 971---980
[36]
Weninger T, Hsu WH (2008) Text extraction from the web via text-to-tag ratio. In: 19th international workshop on database and expert systems application, pp 23---28
[37]
Weninger T, Hsu WH, Han J (2010) Cetr: content extraction via tag ratios. In: WWW, pp 971---980
[38]
Sun F, Song D, Liao L (2011) Dom based content extraction via text density. In: SIGIR, pp 245---254
[39]
Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: Proceedings of the 24th international conference on World Wide Web (WWW), pp 1264---1274
[40]
Song D, Sun F, Liao L (2015) A hybrid approach for content extraction with text density and visual importance of dom nodes. Knowl Inf Syst 42(1):75---96
[41]
Chiu D-Y, Wu Y-H, Chen A (2009) Efficient frequent sequence mining by a dynamic strategy switching algorithm. VLDB J 18(1):303---327
[42]
Loh W-K, Ahn H (2011) A storage-efficient suffix tree construction algorithm for human genome sequences. IEICE Trans 94---D(12):2557---2560
[43]
Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu M-C (2001) Prefixspan: mining sequential patterns efficiently by prefix-projected pattern growth. In: ICCN, pp 215---224
[44]
Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu M-C (2004) Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowl Data Eng, vol 16
[45]
Xie X, Fang Y, Zhang Z, Li L (2012) Extracting data records from web using suffix tree. In: Proceedings of the 18th SIGKDD workshop on mining data semantics. ACM
[46]
Maaß M (1999) Suffix trees and their applications. Ferienakademie
[47]
Grossi R, Italiano GF (1993) Suffix trees and their applications in string algorithms. In: Proceedings of the 1st south American workshop on string processing, pp 57---76
[48]
Zamir O, Etzioni O (1998) Web document clustering: a feasibility demonstration. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 46---54
[49]
Chim H, Deng X (2007) A new suffix tree similarity measure for document clustering. In Proceedings of the 16th international conference on World Wide Web. ACM, pp 121---130
[50]
Fang Y, Zhang H, Ye Y, Li X (2014) Detecting hot topics from twitter: a multiview approach. J Inf Sci 40(5):578---593
[51]
Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249---260
[52]
Farach M (1997) Optimal suffix tree construction with large alphabets. In: FOCS, pp 137---143
[53]
Greenberg RI (2003) Bounds on the number of longest common subsequences. CoRR, vol. cs.DM/0301030. {Online}. Available: http://arxiv.org/abs/cs.DM/0301030
[54]
Knuth DE, Morris JH, Pratt VR (1977) Fast pattern matching in strings. SAIM J Comput 6(2):323---350
[55]
Yamada Y, Craswell N, Nakatoh T, Hirokawa S (2004) Testbed for information extraction from deep web. In: Proceedings of the 13th international conference on World Wide Web (WWW). ACM, New York, pp 346---347
[56]
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31---88

Cited By

View all
  • (2024)All in One Place: Ensuring Usable Access to Online Shopping Items for Blind UsersProceedings of the ACM on Human-Computer Interaction10.1145/36646398:EICS(1-25)Online publication date: 17-Jun-2024
  • (2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
  • (2023)Enabling Efficient Web Data-Record Interaction for People with Visual Impairments via Proxy InterfacesACM Transactions on Interactive Intelligent Systems10.1145/357936413:3(1-27)Online publication date: 10-Jan-2023
  • Show More Cited By
  1. STEM: a suffix tree-based method for web data records extraction

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Knowledge and Information Systems
    Knowledge and Information Systems  Volume 55, Issue 2
    May 2018
    249 pages

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Publication History

    Published: 01 May 2018

    Author Tags

    1. Data Record pattern
    2. HTML tag path
    3. Suffix tree
    4. Web data extraction

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 03 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)All in One Place: Ensuring Usable Access to Online Shopping Items for Blind UsersProceedings of the ACM on Human-Computer Interaction10.1145/36646398:EICS(1-25)Online publication date: 17-Jun-2024
    • (2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
    • (2023)Enabling Efficient Web Data-Record Interaction for People with Visual Impairments via Proxy InterfacesACM Transactions on Interactive Intelligent Systems10.1145/357936413:3(1-27)Online publication date: 10-Jan-2023
    • (2022)Web Record Extraction with InvariantsProceedings of the VLDB Endowment10.14778/3574245.357427616:4(959-972)Online publication date: 1-Dec-2022
    • (2022)Customizable Tabular Access to Web Data Records for Convenient Low-vision Screen Magnifier InteractionACM Transactions on Accessible Computing10.1145/351704415:2(1-22)Online publication date: 19-May-2022
    • (2021)Semantic table-of-contents for efficient web screen readingProceedings of the 36th Annual ACM Symposium on Applied Computing10.1145/3412841.3442066(1941-1949)Online publication date: 22-Mar-2021
    • (2020)TableView: Enabling Efficient Access to Web Data Records for Screen-Magnifier UsersProceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility10.1145/3373625.3417030(1-12)Online publication date: 26-Oct-2020

    View Options

    View options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media