Abstract
To automatically extract data records from Web pages, the data record extraction algorithm is required to be robust and efficient. However, most of existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The promising experimental results have demonstrated that the proposed STEM is superior to the state-of-the-art algorithms like MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear to the total number of HTML tags contained in Web pages, which indicates the potential applicability of STEM in a wide range of Web-scale data record extraction applications.
Similar content being viewed by others
Notes
Without further explanation, the DOM tree nodes mentioned in this paper are element nodes.
For TB2 and TB3, the number of websites are not specified in [28], and so we put “–” there.
References
Laender A, Riberro-Neto B, Silva A, Teixira J (2002) A brief survey of web data extraction tools. In: SIGMOD record, vol 31(2). ACM, New York, pp 84–93
Chang C-H, Kayed M, Girgis MR, Shaala KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng. 18(10):1411–1428
Sleiman H, Corchuelo R et al (2013) A survey on region extractors from web documents. IEEE Trans Knowl Data Eng 25(9):1960–1981
Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl Based Syst 70:301–323
Miao G, Tatemura J, Hsiung W-P, Sawires A, Moser L (2009) Extracting data records from the web using tag path clustering. In: Proceedings of the 18th international conference on World Wide Web (WWW). ACM, New York, pp 981–990
Su W, Wang J, Wang J, Lochovsky FH, Liu Y (2012) Combing tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186–1200
Liu B, Grossman R, Zhai Y (2003) Mining data records in web pages. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 601–606
Baumgartner R, Gottlob G, Herzog M (2009) Scalable web data extraction for online market intelligence. Proc VLDB Endow 2(2):1512–1523
Liu L, Pu C, Han W (2000) Xwrap: an xml-enabled wrapper construction system for web information sources. In: Proceedings of the IEEE 16th international conference on data engineering, pp 611–621
Gulhane P, Madaan A, Mehta R, et al (2011) Web-scale information extraction with vertex. In: ICDE, pp 209–220
Etzioni O, Cafarella M, Downey D, Kok S, Popescu A-M, Shaked T, Soderland S, Weld DS, Yates A (2004) Web-scale information extraction in knowitall (preliminary results). In: Proceedings of the 13th international conference on World Wide Web. ACM, New York, pp 100–110
Gupta S, Kaiser G, Neistadt D, Grimm P (2003) Dom-based content extraction of html documents. In: Proceedings of the 12th international conference on World Wide Web, pp 207–214
Gupta S, Kaiser GE, Grimm P, Chiang MF, Starren J (2005) Automating content extraction of html documents. In: Proceedings of the 14th international conference on World Wide Web, vol 8, no. 2, pp 179–224, 6
Zheng S, Song R, Wen J-R, Giles CL (2009) Efficient record-level wrapper induction. In: Proceedings of the 18th international conference on information and knowledge management (CIKM). ACM, New York, pp 47–56
Dalvi N, Kumar R, Soliman M (2011) Automatic wrappers for large scale web extraction. Proc VLDB Endow 4(4):219–230
Buttler D, Liu L, Pu C (2001) A fully automated object extraction system for the world wide web. In: Proceedings of IEEE 21st international conference on distributed computing systems, pp 361–370
Chang C, Lui S (2001) Iepad: information extraction based on pattern discovery. In: Proceedings of the 10th international conference on World Wide Web. ACM, New York, pp 681–688
Kayed M (2012) Peer matrix alignment: a new algorithm. In: Advances in knowledge discovery and data mining, pp 268–279
Zhai Y, Liu B (2005) Web data extraction based on partial tree alignment. In: Proceedings of the 14th international conference on World Wide Web (WWW). ACM, New York, pp 76–85
Liu B, Zhai Y (2005) Net-a system for extracting web data from flat and nested data records. In: WISE, vol 2005. Springer, pp 487–495
Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: Proceedings of ACM SIGMOD international conference on management of data, pp 337–348
Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614–1628
Zhao H, Meng W, Yu C (2006) Automatic extraction of dynamic record sections from search engine result pages. In: Proceedings of the 32nd international conference on very large data bases, pp 989–1000
Bing L, Lam W, Gu Y (2011) Towards a unified solution: Data record region detection and segmentation. In: Proceedings of the 20th ACM international conference on information and knowledge management, pp 1265–1274
Gatterbauer W, Bohunsky P, Herzog M, Krüpl B, Pollak B (2007) Towards domain-independent information extraction from web tables. In: Proceedings of the 16th international conference on World Wide Web. ACM, New York, pp 71–80
Cai D, Yu S, Wen J-R, Ma W-Y (2003) Extracting content structure for web pages based on visual representation. In: Web technologies and applications, pp 406–417
Simon K, Lausen G (2005) Viper: augmenting automatic information extraction with visual perceptions. In: Proceedings of the 14th ACM international conference on information and knowledge management, pp 381–388
Bing L, Lam W, Wong T-L (2013) Robust detection of semi-structured web records using a dom structure-knowledge-driven model. ACM Trans Web 7(4):21:1–21:32
Zhao H, Meng W, Wu Z, Raghavan V, Yu C (2005) Fully automatic wrapper generation for search engines. In: Proceedings of the 14th international conference on World Wide Web (WWW). ACM, New York, pp 66–75
Liu W, Meng X, Meng W (2010) Vide: a vision-based approach for deep web data extraction. IEEE Trans Knowl Data Eng 22(3):447–460
Fumarola F, Weninger T, Barber R, Malerba D, Han J (2011) Hylien: a hybrid approach to general list extraction on the web. In: Proceedings of the 20th international conference on World Wide Web (WWW), pp 35–36
Furche T, Gottlob G, Grasso G, Guo X, Orsi G, Schallhart C, Wang C (2014) Diadem: thousands of websites to a single database. Proc VLDB Endow 7(14):1845–1856
Crescenzi V, Merialdo P, Qiu D (2013) Alfred: crowd assisted data extraction. In: Proceedings of the 22th international conference on World Wide Web (WWW), pp 297–300
Kondreddi SK, Triantafillou P, Weikum G (2014) Combining information extraction and human computing for crowdsourced knowledge acquisition. In: IEEE 30th international conference on data engineering, pp 988–999
Pasternack J, Roth D (2009) Extracting article text from the web with maximum subsequence segmentation. In: Proceedings of the 18th international conference on World Wide Web (WWW), pp 971–980
Weninger T, Hsu WH (2008) Text extraction from the web via text-to-tag ratio. In: 19th international workshop on database and expert systems application, pp 23–28
Weninger T, Hsu WH, Han J (2010) Cetr: content extraction via tag ratios. In: WWW, pp 971–980
Sun F, Song D, Liao L (2011) Dom based content extraction via text density. In: SIGIR, pp 245–254
Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: Proceedings of the 24th international conference on World Wide Web (WWW), pp 1264–1274
Song D, Sun F, Liao L (2015) A hybrid approach for content extraction with text density and visual importance of dom nodes. Knowl Inf Syst 42(1):75–96
Chiu D-Y, Wu Y-H, Chen A (2009) Efficient frequent sequence mining by a dynamic strategy switching algorithm. VLDB J 18(1):303–327
Loh W-K, Ahn H (2011) A storage-efficient suffix tree construction algorithm for human genome sequences. IEICE Trans 94–D(12):2557–2560
Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu M-C (2001) Prefixspan: mining sequential patterns efficiently by prefix-projected pattern growth. In: ICCN, pp 215–224
Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu M-C (2004) Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowl Data Eng, vol 16
Xie X, Fang Y, Zhang Z, Li L (2012) Extracting data records from web using suffix tree. In: Proceedings of the 18th SIGKDD workshop on mining data semantics. ACM
Maaß M (1999) Suffix trees and their applications. Ferienakademie
Grossi R, Italiano GF (1993) Suffix trees and their applications in string algorithms. In: Proceedings of the 1st south American workshop on string processing, pp 57–76
Zamir O, Etzioni O (1998) Web document clustering: a feasibility demonstration. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 46–54
Chim H, Deng X (2007) A new suffix tree similarity measure for document clustering. In Proceedings of the 16th international conference on World Wide Web. ACM, pp 121–130
Fang Y, Zhang H, Ye Y, Li X (2014) Detecting hot topics from twitter: a multiview approach. J Inf Sci 40(5):578–593
Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260
Farach M (1997) Optimal suffix tree construction with large alphabets. In: FOCS, pp 137–143
Greenberg RI (2003) Bounds on the number of longest common subsequences. CoRR, vol. cs.DM/0301030. [Online]. Available: http://arxiv.org/abs/cs.DM/0301030
Knuth DE, Morris JH, Pratt VR (1977) Fast pattern matching in strings. SAIM J Comput 6(2):323–350
Yamada Y, Craswell N, Nakatoh T, Hirokawa S (2004) Testbed for information extraction from deep web. In: Proceedings of the 13th international conference on World Wide Web (WWW). ACM, New York, pp 346–347
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
Acknowledgements
Reynold Cheng and Yixiang Fang were supported by the Research Grants Council of Hong Kong (RGC Projects HKU 17229116 and 17205115) and the University of Hong Kong (Projects 102009508 and 104004129). Xiaoqin Xie was supported by the China Scholarship Council and the National Natural Science Foundation of China (Nos. 61202090, 61370084, 61272184), the Science and Technology Innovation Talents Special Fund of Harbin under Grant (No. 2015RQQXJ067), the Fundamental Research Funds for the Central Universities under Grant (No. HEUCF10060). Xiaofeng Zhang was partially supported by the National Science Foundation of China under Grant No. 61370213, National Key Technology Support Program of China under Grant No. 2014BAL05B06, and Shenzhen Science and Technology Program under Grant No. JCYJ20160330163900579. We would like to thank the reviewers for their insightful comments.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Fang, Y., Xie, X., Zhang, X. et al. STEM: a suffix tree-based method for web data records extraction. Knowl Inf Syst 55, 305–331 (2018). https://doi.org/10.1007/s10115-017-1062-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-017-1062-0