research-article

Understanding query interfaces by statistical parsing

Published: 29 May 2013

Abstract

Users submit queries to an online database via its query interface. Query interface parsing, which is important for many applications, understands the query capabilities of a query interface. Since most query interfaces are organized hierarchically, we present a novel query interface parsing method, StatParser (Statistical Parser), to automatically extract the hierarchical query capabilities of query interfaces. StatParser automatically learns from a set of parsed query interfaces and parses new query interfaces. StatParser starts from a small grammar and enhances the grammar with a set of probabilities learned from parsed query interfaces under the maximum-entropy principle. Given a new query interface, the probability-enhanced grammar identifies the parse tree with the largest global probability to be the query capabilities of the query interface. Experimental results show that StatParser very accurately extracts the query capabilities and can effectively overcome the problems of existing query interface parsers.
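The idea of choosing the parse tree with the largest global probability can be illustrated with a toy sketch. This is not StatParser's grammar, feature set, or training procedure: the rules, the symbol names (`Form`, `Attr`, `Range`, `Label`, `Field`), the token types, and the weights below are all made up for illustration. Rule probabilities take a log-linear (maximum-entropy style) form, here a softmax over hand-set weights rather than weights learned from parsed interfaces, and a Viterbi CKY search returns the most probable tree.

```python
import math
from collections import defaultdict

# Hypothetical rule weights, invented for this sketch (not learned from data).
# A rule is (lhs, rhs); a one-element rhs is a terminal token type.
WEIGHTS = {
    ("Form", ("Attr", "Form")): 0.0,
    ("Form", ("Attr", "Attr")): 0.0,
    ("Attr", ("Label", "Field")): 1.0,   # labeled single field
    ("Attr", ("Label", "Range")): 1.0,   # labeled pair of fields, e.g. from/to
    ("Attr", ("textbox",)): -1.0,        # unlabeled field, made unlikely here
    ("Range", ("Field", "Field")): 0.0,
    ("Label", ("text",)): 0.0,
    ("Field", ("textbox",)): 0.0,
    ("Field", ("select",)): 0.0,
}


def rule_probabilities(weights):
    """Softmax over all rules that share a left-hand side (log-linear form)."""
    z = defaultdict(float)
    for (lhs, _), w in weights.items():
        z[lhs] += math.exp(w)
    return {rule: math.exp(w) / z[rule[0]] for rule, w in weights.items()}


def best_parse(tokens, probs, root="Form"):
    """Viterbi CKY: return (log_prob, tree) for the best parse, or None."""
    n = len(tokens)
    chart = defaultdict(dict)  # (start, end) -> {symbol: (log_prob, tree)}

    # Width-1 spans: apply terminal rules.
    for i, tok in enumerate(tokens):
        for (lhs, rhs), p in probs.items():
            if rhs == (tok,):
                chart[i, i + 1][lhs] = (math.log(p), (lhs, tok))

    # Wider spans: combine two adjacent sub-spans with a binary rule,
    # keeping only the highest-scoring tree per symbol and span.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (lhs, rhs), p in probs.items():
                    if len(rhs) != 2:
                        continue
                    left = chart[i, k].get(rhs[0])
                    right = chart[k, j].get(rhs[1])
                    if left is None or right is None:
                        continue
                    score = math.log(p) + left[0] + right[0]
                    if lhs not in chart[i, j] or score > chart[i, j][lhs][0]:
                        chart[i, j][lhs] = (score, (lhs, left[1], right[1]))

    return chart[0, n].get(root)


if __name__ == "__main__":
    probs = rule_probabilities(WEIGHTS)
    tokens = ["text", "textbox", "textbox", "text", "select"]
    result = best_parse(tokens, probs)
    if result is not None:
        log_p, tree = result
        print(tree)
```

The token sequence in the usage example is ambiguous under this toy grammar: the two text boxes can form one labeled range, or the second box can be an unlabeled attribute of its own. With the made-up weights above, the range reading has the higher overall probability, so it is the tree returned.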



      Published In

      ACM Transactions on the Web, Volume 7, Issue 2
      May 2013
      244 pages
      ISSN: 1559-1131
      EISSN: 1559-114X
      DOI: 10.1145/2460383

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 May 2013
      Accepted: 01 January 2013
      Revised: 01 October 2012
      Received: 01 March 2012
      Published in TWEB Volume 7, Issue 2


      Author Tags

      1. Query interface
      2. maximum entropy

      Qualifiers

      • Research-article
      • Research
      • Refereed

