DOI:10.1145/2479787.2479798
research-article

Web object identification for web automation and meta-search

Published: 12 June 2013

Abstract

Web object identification plays an important role in research fields such as information extraction, web automation, and web form understanding for building meta-search engines. In contrast to other work, we approach this problem by analyzing various spatial, visual, functional, and textual characteristics of web pages. We compute 49 unique features for all visible web page elements and feed them to machine learning classifiers in order to identify similar elements on previously unexamined web pages. We evaluate our approach in several scenarios by analyzing the relevance of the chosen features and the classification rate of the applied classifiers. These scenarios focus on understanding search forms from the transportation domain, particularly for flight, train, and bus connections. The results of the evaluation are very promising.
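
The general idea of the abstract (describe each visible element by a numeric feature vector, then let a classifier label elements on an unseen page) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not list the 49 features or name the classifiers, so the `Element` fields, the six toy features, and the 1-nearest-neighbour rule are hypothetical stand-ins rather than the authors' method.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Element:
    """A visible page element; all fields are hypothetical stand-ins."""
    tag: str
    x: float
    y: float
    width: float
    height: float
    label_text: str


def features(el: Element) -> list[float]:
    """Map an element to a small numeric feature vector (toy features)."""
    words = el.label_text.lower().split()
    return [
        el.x / 1000.0,                                   # spatial: horizontal position, scaled
        el.y / 1000.0,                                   # spatial: vertical position, scaled
        el.width / el.height,                            # visual: aspect ratio
        1.0 if el.tag in ("input", "select") else 0.0,   # functional: accepts user input
        1.0 if "from" in words else 0.0,                 # textual: label mentions "from"
        1.0 if "to" in words else 0.0,                   # textual: label mentions "to"
    ]


def nearest_label(train, query):
    """1-nearest-neighbour: return the label of the closest training vector."""
    def dist(a, b):
        return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return min(train, key=lambda pair: dist(pair[0], query))[1]


# Labelled elements from already examined pages (invented example data).
training = [
    (features(Element("input", 100, 200, 200, 30, "From")), "departure"),
    (features(Element("input", 100, 260, 200, 30, "To")), "arrival"),
    (features(Element("button", 100, 400, 120, 40, "Search")), "submit"),
]

# An element from a page that was not part of the training data.
unseen = Element("input", 140, 180, 220, 32, "Flying from")
predicted = nearest_label(training, features(unseen))
print(predicted)
```

The nearest-neighbour rule is chosen here only because it fits in a few lines of standard-library Python; any classifier that accepts fixed-length numeric vectors could take its place.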




Published In

WIMS '13: Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics
June 2013
408 pages
ISBN:9781450318501
DOI:10.1145/2479787
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • UAM: Autonomous University of Madrid

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. machine learning
  2. web accessibility
  3. web automation
  4. web object identification
  5. web page visual representation

Qualifiers

  • Research-article

Conference

WIMS '13
Sponsor:
  • UAM

Acceptance Rates

WIMS '13 Paper Acceptance Rate 28 of 72 submissions, 39%;
Overall Acceptance Rate 140 of 278 submissions, 50%

Article Metrics

  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 02 Mar 2025

Cited By

  • (2022) Landmarks and regions: a robust approach to data extraction. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 993-1009. DOI: 10.1145/3519939.3523705. Online publication date: 9-Jun-2022
  • (2022) RLBrowse: Generating Realistic Packet Traces with Reinforcement Learning. NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium, pages 1-6. DOI: 10.1109/NOMS54207.2022.9789851. Online publication date: 25-Apr-2022
  • (2020) Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1967-1978. DOI: 10.1145/3318464.3380608. Online publication date: 11-Jun-2020
  • (2019) Large-scale holistic approach to Web block classification. World Wide Web, 22:5, pages 1999-2015. DOI: 10.1007/s11280-018-0634-6. Online publication date: 1-Sep-2019
  • (2018) Browserless Web Data Extraction. Proceedings of the 2018 World Wide Web Conference, pages 1095-1104. DOI: 10.1145/3178876.3186008. Online publication date: 10-Apr-2018
  • (2018) BERyL: A System for Web Block Classification. Transactions on Computational Science XXXIII, pages 61-78. DOI: 10.1007/978-3-662-58039-4_4. Online publication date: 16-Sep-2018
  • (2018) Web Page Representations and Data Extraction with BERyL. Current Trends in Web Engineering, pages 22-30. DOI: 10.1007/978-3-030-03056-8_3. Online publication date: 29-Nov-2018
  • (2015) Models and Approaches for Web Information Extraction and Web Page Understanding. The Evolution of the Internet in the Business Sector, pages 25-50. DOI: 10.4018/978-1-4666-7262-8.ch002. Online publication date: 2015
  • (2015) The Augmented Web. ACM Transactions on the Web, 9:2, pages 1-30. DOI: 10.1145/2735633. Online publication date: 19-May-2015
