DOI: 10.1145/3132847.3132875
research-article

Extracting Records from the Web Using a Signal Processing Approach

Published: 06 November 2017

Abstract

Extracting records from web pages enables a number of important applications and is highly valuable given the amount and diversity of information available for extraction. Although widely studied, the problem remains open because it is not trivial. Given the scale of the data, a feasible approach must be automatic and efficient as well as effective. We present a novel approach, fully automatic and computationally efficient, that uses signal processing techniques to detect regularities and patterns in the structure of web pages. Our approach segments the web page, detects the data regions within it, identifies the record boundaries, and aligns the records. Results show a high F-score and linearithmic time complexity.
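
The abstract does not spell out which signal processing techniques are used, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a page region has already been flattened into a sequence of tag names, encodes that sequence as an integer signal, and uses normalized autocorrelation (a technique chosen here for illustration) to estimate the repetition period that would separate one record from the next. All function names (`encode_tags`, `autocorrelation`, `estimate_period`, `split_records`) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def encode_tags(tags: Sequence[str]) -> List[int]:
    """Map each distinct tag name to an integer code, in order of first appearance."""
    codes: Dict[str, int] = {}
    return [codes.setdefault(tag, len(codes)) for tag in tags]


def autocorrelation(signal: Sequence[float], lag: int) -> float:
    """Autocorrelation of the mean-centered signal at `lag`, divided by its energy."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return 0.0  # a constant signal carries no periodic structure
    overlap = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return overlap / energy


def estimate_period(signal: Sequence[float], max_lag: Optional[int] = None) -> int:
    """Return the lag in [1, max_lag] with the highest autocorrelation.

    Ties are resolved in favour of the smallest lag. `max_lag` defaults to
    half the signal length (an assumption made for this sketch).
    """
    if len(signal) < 2:
        raise ValueError("signal must contain at least two samples")
    if max_lag is None:
        max_lag = len(signal) // 2
    best_lag, best_score = 1, float("-inf")
    for lag in range(1, max_lag + 1):
        score = autocorrelation(signal, lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def split_records(tags: Sequence[str], period: int) -> List[List[str]]:
    """Cut the tag sequence into consecutive chunks of `period` tags.

    A trailing chunk shorter than `period` is dropped.
    """
    return [list(tags[i:i + period])
            for i in range(0, len(tags) - period + 1, period)]
```

For a perfectly regular region such as `["li", "a", "span", "span"]` repeated several times, `estimate_period(encode_tags(tags))` recovers the record length, and `split_records` then yields one chunk per record. Real pages are noisier: records vary in length and contain optional fields, which is why a complete system also needs the segmentation and alignment steps the abstract mentions.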

Cited By

  • (2022) Web Record Extraction with Invariants. Proceedings of the VLDB Endowment 16:4 (959-972). DOI: 10.14778/3574245.3574276. Online publication date: 1-Dec-2022
  • (2022) Multimodal Web Page Segmentation Using Self-organized Multi-objective Clustering. ACM Transactions on Information Systems 40:3 (1-49). DOI: 10.1145/3480966. Online publication date: 7-Mar-2022
  • (2019) Web Page Structured Content Detection Using Supervised Machine Learning. Web Engineering (3-18). DOI: 10.1007/978-3-030-19274-7_1. Online publication date: 26-Apr-2019

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN:9781450349185
DOI:10.1145/3132847
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. information retrieval
  2. record alignment
  3. record extraction
  4. structure detection
  5. web mining

Qualifiers

  • Research-article

Conference

CIKM '17

Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
