More Web Proxy on the site http://driver.im/

article

STEM: a suffix tree-based method for web data records extraction

Authors:

Xiaofeng Zhang,

Zhiqiang ZhangAuthors Info & Claims

Knowledge and Information Systems, Volume 55, Issue 2

Pages 305 - 331

https://doi.org/10.1007/s10115-017-1062-0

Published: 01 May 2018 Publication History

Abstract

To automatically extract data records from Web pages, the data record extraction algorithm is required to be robust and efficient. However, most of existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The promising experimental results have demonstrated that the proposed STEM is superior to the state-of-the-art algorithms like MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear to the total number of HTML tags contained in Web pages, which indicates the potential applicability of STEM in a wide range of Web-scale data record extraction applications.

References

[1]

Laender A, Riberro-Neto B, Silva A, Teixira J (2002) A brief survey of web data extraction tools. In: SIGMOD record, vol 31(2). ACM, New York, pp 84---93

Digital Library

[2]

Chang C-H, Kayed M, Girgis MR, Shaala KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng. 18(10):1411---1428

Digital Library

[3]

Sleiman H, Corchuelo R et al (2013) A survey on region extractors from web documents. IEEE Trans Knowl Data Eng 25(9):1960---1981

Digital Library

[4]

Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl Based Syst 70:301---323

Digital Library

[5]

Miao G, Tatemura J, Hsiung W-P, Sawires A, Moser L (2009) Extracting data records from the web using tag path clustering. In: Proceedings of the 18th international conference on World Wide Web (WWW). ACM, New York, pp 981---990

Digital Library

[6]

Su W, Wang J, Wang J, Lochovsky FH, Liu Y (2012) Combing tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186---1200

Digital Library

[7]

Liu B, Grossman R, Zhai Y (2003) Mining data records in web pages. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 601---606

Digital Library

[8]

Baumgartner R, Gottlob G, Herzog M (2009) Scalable web data extraction for online market intelligence. Proc VLDB Endow 2(2):1512---1523

Digital Library

[9]

Liu L, Pu C, Han W (2000) Xwrap: an xml-enabled wrapper construction system for web information sources. In: Proceedings of the IEEE 16th international conference on data engineering, pp 611---621

Digital Library

[10]

Gulhane P, Madaan A, Mehta R, et al (2011) Web-scale information extraction with vertex. In: ICDE, pp 209---220

Digital Library

[11]

Etzioni O, Cafarella M, Downey D, Kok S, Popescu A-M, Shaked T, Soderland S, Weld DS, Yates A (2004) Web-scale information extraction in knowitall (preliminary results). In: Proceedings of the 13th international conference on World Wide Web. ACM, New York, pp 100---110

Digital Library

[12]

Gupta S, Kaiser G, Neistadt D, Grimm P (2003) Dom-based content extraction of html documents. In: Proceedings of the 12th international conference on World Wide Web, pp 207---214

Digital Library

[13]

Gupta S, Kaiser GE, Grimm P, Chiang MF, Starren J (2005) Automating content extraction of html documents. In: Proceedings of the 14th international conference on World Wide Web, vol 8, no. 2, pp 179---224, 6

Digital Library

[14]

Zheng S, Song R, Wen J-R, Giles CL (2009) Efficient record-level wrapper induction. In: Proceedings of the 18th international conference on information and knowledge management (CIKM). ACM, New York, pp 47---56

Digital Library

[15]

Dalvi N, Kumar R, Soliman M (2011) Automatic wrappers for large scale web extraction. Proc VLDB Endow 4(4):219---230

Digital Library

[16]

Buttler D, Liu L, Pu C (2001) A fully automated object extraction system for the world wide web. In: Proceedings of IEEE 21st international conference on distributed computing systems, pp 361---370

Digital Library

[17]

Chang C, Lui S (2001) Iepad: information extraction based on pattern discovery. In: Proceedings of the 10th international conference on World Wide Web. ACM, New York, pp 681---688

Digital Library

[18]

Kayed M (2012) Peer matrix alignment: a new algorithm. In: Advances in knowledge discovery and data mining, pp 268---279

Digital Library

[19]

Zhai Y, Liu B (2005) Web data extraction based on partial tree alignment. In: Proceedings of the 14th international conference on World Wide Web (WWW). ACM, New York, pp 76---85

Digital Library

[20]

Liu B, Zhai Y (2005) Net-a system for extracting web data from flat and nested data records. In: WISE, vol 2005. Springer, pp 487---495

Digital Library

[21]

Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: Proceedings of ACM SIGMOD international conference on management of data, pp 337---348

Digital Library

[22]

Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614---1628

Digital Library

[23]

Zhao H, Meng W, Yu C (2006) Automatic extraction of dynamic record sections from search engine result pages. In: Proceedings of the 32nd international conference on very large data bases, pp 989---1000

Digital Library

[24]

Bing L, Lam W, Gu Y (2011) Towards a unified solution: Data record region detection and segmentation. In: Proceedings of the 20th ACM international conference on information and knowledge management, pp 1265---1274

Digital Library

[25]

Gatterbauer W, Bohunsky P, Herzog M, Krüpl B, Pollak B (2007) Towards domain-independent information extraction from web tables. In: Proceedings of the 16th international conference on World Wide Web. ACM, New York, pp 71---80

Digital Library

[26]

Cai D, Yu S, Wen J-R, Ma W-Y (2003) Extracting content structure for web pages based on visual representation. In: Web technologies and applications, pp 406---417

Digital Library

[27]

Simon K, Lausen G (2005) Viper: augmenting automatic information extraction with visual perceptions. In: Proceedings of the 14th ACM international conference on information and knowledge management, pp 381---388

Digital Library

[28]

Bing L, Lam W, Wong T-L (2013) Robust detection of semi-structured web records using a dom structure-knowledge-driven model. ACM Trans Web 7(4):21:1---21:32

Digital Library

[29]

Zhao H, Meng W, Wu Z, Raghavan V, Yu C (2005) Fully automatic wrapper generation for search engines. In: Proceedings of the 14th international conference on World Wide Web (WWW). ACM, New York, pp 66---75

Digital Library

[30]

Liu W, Meng X, Meng W (2010) Vide: a vision-based approach for deep web data extraction. IEEE Trans Knowl Data Eng 22(3):447---460

Digital Library

[31]

Fumarola F, Weninger T, Barber R, Malerba D, Han J (2011) Hylien: a hybrid approach to general list extraction on the web. In: Proceedings of the 20th international conference on World Wide Web (WWW), pp 35---36

Digital Library

[32]

Furche T, Gottlob G, Grasso G, Guo X, Orsi G, Schallhart C, Wang C (2014) Diadem: thousands of websites to a single database. Proc VLDB Endow 7(14):1845---1856

Digital Library

[33]

Crescenzi V, Merialdo P, Qiu D (2013) Alfred: crowd assisted data extraction. In: Proceedings of the 22th international conference on World Wide Web (WWW), pp 297---300

Digital Library

[34]

Kondreddi SK, Triantafillou P, Weikum G (2014) Combining information extraction and human computing for crowdsourced knowledge acquisition. In: IEEE 30th international conference on data engineering, pp 988---999

[35]

Pasternack J, Roth D (2009) Extracting article text from the web with maximum subsequence segmentation. In: Proceedings of the 18th international conference on World Wide Web (WWW), pp 971---980

Digital Library

[36]

Weninger T, Hsu WH (2008) Text extraction from the web via text-to-tag ratio. In: 19th international workshop on database and expert systems application, pp 23---28

Digital Library

[37]

Weninger T, Hsu WH, Han J (2010) Cetr: content extraction via tag ratios. In: WWW, pp 971---980

Digital Library

[38]

Sun F, Song D, Liao L (2011) Dom based content extraction via text density. In: SIGIR, pp 245---254

Digital Library

[39]

Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: Proceedings of the 24th international conference on World Wide Web (WWW), pp 1264---1274

Digital Library

[40]

Song D, Sun F, Liao L (2015) A hybrid approach for content extraction with text density and visual importance of dom nodes. Knowl Inf Syst 42(1):75---96

Digital Library

[41]

Chiu D-Y, Wu Y-H, Chen A (2009) Efficient frequent sequence mining by a dynamic strategy switching algorithm. VLDB J 18(1):303---327

Digital Library

[42]

Loh W-K, Ahn H (2011) A storage-efficient suffix tree construction algorithm for human genome sequences. IEICE Trans 94---D(12):2557---2560

[43]

Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu M-C (2001) Prefixspan: mining sequential patterns efficiently by prefix-projected pattern growth. In: ICCN, pp 215---224

Digital Library

[44]

Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu M-C (2004) Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowl Data Eng, vol 16

Digital Library

[45]

Xie X, Fang Y, Zhang Z, Li L (2012) Extracting data records from web using suffix tree. In: Proceedings of the 18th SIGKDD workshop on mining data semantics. ACM

Digital Library

[46]

Maaß M (1999) Suffix trees and their applications. Ferienakademie

[47]

Grossi R, Italiano GF (1993) Suffix trees and their applications in string algorithms. In: Proceedings of the 1st south American workshop on string processing, pp 57---76

[48]

Zamir O, Etzioni O (1998) Web document clustering: a feasibility demonstration. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 46---54

Digital Library

[49]

Chim H, Deng X (2007) A new suffix tree similarity measure for document clustering. In Proceedings of the 16th international conference on World Wide Web. ACM, pp 121---130

Digital Library

[50]

Fang Y, Zhang H, Ye Y, Li X (2014) Detecting hot topics from twitter: a multiview approach. J Inf Sci 40(5):578---593

Digital Library

[51]

Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249---260

Digital Library

[52]

Farach M (1997) Optimal suffix tree construction with large alphabets. In: FOCS, pp 137---143

Digital Library

[53]

Greenberg RI (2003) Bounds on the number of longest common subsequences. CoRR, vol. cs.DM/0301030. {Online}. Available: http://arxiv.org/abs/cs.DM/0301030

[54]

Knuth DE, Morris JH, Pratt VR (1977) Fast pattern matching in strings. SAIM J Comput 6(2):323---350

[55]

Yamada Y, Craswell N, Nakatoh T, Hirokawa S (2004) Testbed for information extraction from deep web. In: Proceedings of the 13th international conference on World Wide Web (WWW). ACM, New York, pp 346---347

Digital Library

[56]

Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31---88

Digital Library

Cited By

Prakash YNayak ASunkara MJayarathna SLee HAshok V(2024)All in One Place: Ensuring Usable Access to Online Shopping Items for Blind UsersProceedings of the ACM on Human-Computer Interaction10.1145/36646398:EICS(1-25)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1145/3664639
Prakash YSunkara MLee HJayarathna SAshok V(2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
https://dl.acm.org/doi/10.1145/3581641.3584049
Ferdous JLee HJayarathna SAshok V(2023)Enabling Efficient Web Data-Record Interaction for People with Visual Impairments via Proxy InterfacesACM Transactions on Interactive Intelligent Systems10.1145/357936413:3(1-27)Online publication date: 10-Jan-2023
https://dl.acm.org/doi/10.1145/3579364
Show More Cited By

STEM: a suffix tree-based method for web data records extraction
1. Information systems

Recommendations

AutoRM

A Web database typically responds to a query with a Web page, which encodes the query results into semi-structured data objects using HTML tags. We call such data objects Web data records or data records. Mining Web data records is very important for ...
ViDE: A Vision-Based Approach for Deep Web Data Extraction

Deep Web contents are accessed by queries submitted to Web databases and the returned data records are enwrapped in dynamically generated Web pages (they will be called deep Web pages in this paper). Extracting structured data from deep Web pages is a ...
The affix array data structure and its applications to RNA secondary structure analysis

Efficient string-processing in large data sets like complete genomes is strongly connected to the suffix tree and similar index data structures. With respect to complex structural string analysis like the search for RNA secondary structure patterns, ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Knowledge and Information Systems

Knowledge and Information Systems Volume 55, Issue 2

May 2018

249 pages

ISSN:0219-1377

Issue’s Table of Contents

Copyright © Copyright © 2018 Springer-Verlag London Ltd., part of Springer Nature.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 May 2018

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 03 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Prakash YNayak ASunkara MJayarathna SLee HAshok V(2024)All in One Place: Ensuring Usable Access to Online Shopping Items for Blind UsersProceedings of the ACM on Human-Computer Interaction10.1145/36646398:EICS(1-25)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1145/3664639
Prakash YSunkara MLee HJayarathna SAshok V(2023)AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind UsersProceedings of the 28th International Conference on Intelligent User Interfaces10.1145/3581641.3584049(32-45)Online publication date: 27-Mar-2023
https://dl.acm.org/doi/10.1145/3581641.3584049
Ferdous JLee HJayarathna SAshok V(2023)Enabling Efficient Web Data-Record Interaction for People with Visual Impairments via Proxy InterfacesACM Transactions on Interactive Intelligent Systems10.1145/357936413:3(1-27)Online publication date: 10-Jan-2023
https://dl.acm.org/doi/10.1145/3579364
Chen ZMeng WDragut E(2022)Web Record Extraction with InvariantsProceedings of the VLDB Endowment10.14778/3574245.357427616:4(959-972)Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.14778/3574245.3574276
Lee HAshok V(2022)Customizable Tabular Access to Web Data Records for Convenient Low-vision Screen Magnifier InteractionACM Transactions on Accessible Computing10.1145/351704415:2(1-22)Online publication date: 19-May-2022
https://dl.acm.org/doi/10.1145/3517044
Ferdous JUddin SAshok VHung CHong JBechini ASong E(2021)Semantic table-of-contents for efficient web screen readingProceedings of the 36th Annual ACM Symposium on Applied Computing10.1145/3412841.3442066(1941-1949)Online publication date: 22-Mar-2021
https://dl.acm.org/doi/10.1145/3412841.3442066
Lee HUddin SAshok V(2020)TableView: Enabling Efficient Access to Web Data Records for Screen-Magnifier UsersProceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility10.1145/3373625.3417030(1-12)Online publication date: 26-Oct-2020
https://dl.acm.org/doi/10.1145/3373625.3417030

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents