[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Inference of concise regular expressions and DTDs

Published: 03 May 2010 Publication History

Abstract

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions—the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)—that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small datasets.

Supplementary Material

Bex Appendix (a11-bex-apndx.pdf)
Online appendix to inference of concise regular expressions and DTDs on article 11.

References

[1]
Abiteboul, S., Buneman, P., and Suciu, D. 1999. Data on the Web. Morgan Kaufmann Publishers.
[2]
Ahonen, H. 1996. Generating grammars for structured documents using grammatical inference methods. Ph.D. thesis, Report A-1996-4. Department of Computer Science, University of Helsinki.
[3]
Angluin, D. and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15, 3, 237--269.
[4]
Barbosa, D., Mendelzon, A. O., Keenleyside, J., and Lyons, K. A. 2002. ToXgene: An extensible template-based data generator for XML. In Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002). 49--54.
[5]
Barbosa, D., Mignet, L., and Veltri, P. 2006. Studying the XML web: Gathering statistics from an XML sample. World Wide Web 9, 2, 187--212.
[6]
Benedikt, M., Fan, W., and Geerts, F. 2008. XPath satisfiability in the presence of DTDs. J. ACM 55, 2, 1--79.
[7]
Bernstein, P. A. 2003. Applying model management to classical meta data problems. In Online Proceedings of the 1st Biennal Conference on Innovative Data Systems Research (CIDR'03).
[8]
Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. Learning deterministic regular expressions for the inference of schemas from XML data. http://arxiv.org/abs/1004.2372.
[9]
Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. 2008. Learning deterministic regular expressions for the inference of schemas from XML data. In Proceeding of the 17th International Conference on World Wide Web (WWW'08). 825--834.
[10]
Bex, G. J., Neven, F., and den Bussche, J. V. 2004. DTDs versus XML Schema: A practical study. In Proceedings of the International Workshop on Web and Database (WebDB). S. Amer-Yahia and L. Gravano, Eds. 79--84.
[11]
Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the International Conference on Database Theory (VLDB). U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, Eds. ACM, 115--126.
[12]
Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 998--1009.
[13]
Brāzma, A. 1993. Efficient identification of regular expressions from representative examples. In Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT'93). ACM Press, 236--242.
[14]
Brüggeman-Klein, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2, 197--213.
[15]
Brüggemann-Klein, A. and Wood, D. 1998. One-Unambiguous regular languages. Inform. Comput. 140, 2, 229--253.
[16]
Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory (ICDT'97). Lecture Notes in Computer Science, vol. 1186. Springer, 336--350.
[17]
Caron, P. and Ziadi, D. 2000. Characterization of Glushkov automata. Theor. Comput. Sci. 233, 1--2, 75--90.
[18]
Castor. The Castor project. www.castor.org.
[19]
Chidlovskii, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB'01). CEUR Workshop Proceedings, vol. 45.
[20]
Clark, J. Trang: Multi-Format schema converter based on RELAX NG. www.thaiopensource.com/relaxng/trang.html.
[21]
Cover, R. 2003. The Cover Pages. xml.coverpages.org.
[22]
Delgado, M. and Morais, J. 2004. Approximation to the smallest regular expression for a given regular language. In Proceedings of the, 9th International Conference on Implementation and Application of Automata. Lecture Notes in Computer Science, vol. 3317. Springer, 312--314.
[23]
Deutsch, A., Fernandez, M. F., and Suciu, D. 1999. Storing semistructured data with STORED. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 431--442.
[24]
Ehrenfeucht, A. and Zeiger, P. 1976. Complexity measures for regular expressions. J. Comput. Syst. Sci. 12, 134--146.
[25]
Fernandez, M. F. and Suciu, D. 1998. Optimizing regular path expressions using graph schemas. In Proceedings of the 14th International Conference on Data Engineering (ICDE'98). 14--23.
[26]
Fernau, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the 7th International Colloquium on Grammatical Inference: Algorithms and Applications. Lecture Notes in Artificial Intelligence, vol. 3264. Springer, 277--278.
[27]
Fernau, H. 2009. Algorithms for learning regular expressions from positive data. Inform. Comput. 207, 4, 521--541.
[28]
Florescu, D. 2005. Managing semi-structured data. ACMQueue 3, 8, 18--24.
[29]
García, P. and Vidal, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920--925.
[30]
Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., and Shim, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Mining Knowl. Discov. 7, 23--56.
[31]
Gelade, W. and Neven, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS'08). Dagstuhl Seminar Proceedings, vol. 08001. 325--336.
[32]
Gold, E. 1967. Language identification in the limit. Inform. Control 10, 5, 447--474.
[33]
Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97). 436--445.
[34]
Gruber, H. and Holzer, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the 35th International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 5126. Springer, 39--50.
[35]
Han, Y.-S. and Wood, D. 2007. Obtaining shorter regular expressions from finite-state automata. Theor. Comput. Sci. 370, 1--3, 110--120.
[36]
Hegewald, J., Naumann, F., and Weis, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06). IEEE Computer Society, 81--97.
[37]
Hinkelman, S. 2005. Business integration—Information conformance statements (BI-ICS). Tech. rep., IBM DeveloperWorks.
[38]
Hopcroft, J. and Ullman, J. 1979. Introduction to Automata Theory, Languages and computation. Addison-Wesley.
[39]
Huet, G. 1980. Confluent reductions: Abstract properties and applications to term rewriting systems. J. ACM 27, 4, 797--821.
[40]
Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-Based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04). 228--239.
[41]
Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB'01). 241--250.
[42]
Martens, W., Neven, F., and Schwentick, T. 2007. Simple off the shelf abstractions for XML schema. SIGMOD Rec. 36, 3, 15--22.
[43]
Martens, W., Neven, F., Schwentick, T., and Bex, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Data. Syst. 31, 3.
[44]
McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J. 1997. Lore: A database management system for semistructured data. SIGMOD Rec. 26, 3, 54--66.
[45]
Melnik, S. 2004. Generic model management: Concepts and algorithms. Ph.D. thesis, University of Leipzig.
[46]
Mignet, L., Barbosa, D., and Veltri, P. 2003. The XML web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500--510.
[47]
Miklau, G. 2002. XMLData repository. www.cs.washington.edu/research/xmldatasets.
[48]
Min, J.-K., Ahn, J.-Y., and Chung, C.-W. 2003. Efficient extraction of schemas for XML documents. Inform. Process. Lett. 85, 1, 7--12.
[49]
Nestorov, S., Abiteboul, S., and Motwani, R. 1998. Extracting schema from semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 295--306.
[50]
Nestorov, S., Ullman, J. D., Wiener, J. L., and Chawathe, S. S. 1997. Representative objects: Concise representations of semistructured, hierarchial data. In Proceedings of the 13th International Conference on Data Engineering. IEEE Computer Society, 79--90.
[51]
Neven, F. and Schwentick, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logical Methods Comput. Sci. 2, 3.
[52]
Ngu, A. H. H., Rocco, D., Critchlow, T., and Buttler, D. 2005. Automatic discovery and inferencing of complex bioinformatics web interfaces. World Wide Web 8, 4, 463--493.
[53]
Oaks, P. and ter Hofstede, A. H. M. 2007. Guided interaction: A mechanism to enable ad hoc service interaction. Inform. Syst. Frontiers 9, 1, 29--51.
[54]
Ohlebusch, E. 2001. Implementing conditional term rewriting by graph rewriting. Theor. Comput. Sci. 262, 1, 311--331.
[55]
Open Web Application Security Project Consortium. 2004. The top ten most critical web application security vulnerabilities—2004 update. www.owasp.org.
[56]
Pitt, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference (AII'89). Springer-Verlag, 18--44.
[57]
Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2, 257--286.
[58]
Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334--350.
[59]
Sahuguet, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (extended abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases, (WebDB'00), Selected Papers. 171--183.
[60]
Sakakibara, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15--45.
[61]
Sankey, J. and Wong, R. K. 2001. Structural inference for semistructured data. In Proceedings of the International Conference on Information and Knowledge Management. ACM Press, 159--166.
[62]
Sun. Sun JAXB. java.sun.com/webservices/jaxb.
[63]
Thompson, H. S., Beech, D., Maloney, M., and Mendelsohn, N. 2004. XML Schema part 1: Structures 2nd Ed. World Wide Web Consortium, Recommendation REC-xmlschema-1-20041028.
[64]
W3C. 2002. XHTML 1.0 The Extensible HyperText Markup Language, 2nd Ed. W3C.
[65]
Wang, G., Liu, M., Yu, J. X., Sun, B., Yu, G., Lv, J., and Lu, H. 2003. Effective schema-based XML query optimization techniques. In Proceedings of the 7th International Database Engineering and Applications Symposium. 230--235.

Cited By

View all
  • (2024)Detection and treatment of string events in the limitJournal of Computer Languages10.1016/j.cola.2024.10129981(101299)Online publication date: Nov-2024
  • (2024)Automatic regex synthesis methods for english: a comparative analysisKnowledge and Information Systems10.1007/s10115-024-02232-1Online publication date: 3-Oct-2024
  • (2023)Learning from Uncurated Regular Expressions for Semantic Type ClassificationProceedings of the 1st Workshop on Simplicity in Management of Data10.1145/3596225.3596226(1-5)Online publication date: 23-Jun-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 35, Issue 2
April 2010
336 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1735886
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 May 2010
Accepted: 01 November 2009
Revised: 01 July 2009
Received: 01 January 2009
Published in TODS Volume 35, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Regular expressions
  2. XML
  3. schema inference

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)21
  • Downloads (Last 6 weeks)4
Reflects downloads up to 04 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Detection and treatment of string events in the limitJournal of Computer Languages10.1016/j.cola.2024.10129981(101299)Online publication date: Nov-2024
  • (2024)Automatic regex synthesis methods for english: a comparative analysisKnowledge and Information Systems10.1007/s10115-024-02232-1Online publication date: 3-Oct-2024
  • (2023)Learning from Uncurated Regular Expressions for Semantic Type ClassificationProceedings of the 1st Workshop on Simplicity in Management of Data10.1145/3596225.3596226(1-5)Online publication date: 23-Jun-2023
  • (2023)Search-Based Regular Expression Inference on a GPUProceedings of the ACM on Programming Languages10.1145/35912747:PLDI(1317-1339)Online publication date: 6-Jun-2023
  • (2023)InfeRE: Step-by-Step Regex Generation via Chain of InferenceProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering10.1109/ASE56229.2023.00111(1505-1515)Online publication date: 11-Nov-2023
  • (2022)A universal approach for multi-model schema inferenceJournal of Big Data10.1186/s40537-022-00645-99:1Online publication date: 11-Aug-2022
  • (2022)Schema inference for multi-model dataProceedings of the 25th International Conference on Model Driven Engineering Languages and Systems10.1145/3550355.3552400(13-23)Online publication date: 23-Oct-2022
  • (2022)Towards Theory for Real-World DataProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3526066(261-276)Online publication date: 12-Jun-2022
  • (2022)SemMT: A Semantic-Based Testing Approach for Machine Translation SystemsACM Transactions on Software Engineering and Methodology10.1145/349048831:2(1-36)Online publication date: 1-Apr-2022
  • (2022)Membership Algorithm for Single-Occurrence Regular Expressions with Shuffle and CountingDatabase Systems for Advanced Applications10.1007/978-3-031-00123-9_41(526-542)Online publication date: 11-Apr-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media