[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Published: 01 September 2010 Publication History

Abstract

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.

References

[1]
}}Adriaans, P. and Vitányi, P. 2006. The power and perils of MDL. In Proceedings of the International Symposium on Information Theory.
[2]
}}Ahonen, H. 1996. Generating Grammars for structured documents using grammatical inference methods. Tech. rep. A-1996-4, Department of Computer Science, University of Finland.
[3]
}}Angluin, D. and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15, 3, 237--269.
[4]
}}Barbosa, D., Mignet, L., and Veltri, P. 2005. Studying the XML Web: Gathering statistics from an XML sample. World Wide Web 8, 4, 413--438.
[5]
}}Benedikt, M., Fan, W., and Geerts, F. 2005. XPath satisfiability in the presence of DTDs. In Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 25--36.
[6]
}}Bernstein, P. A. 2003. Applying model management to classical meta data problems. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research.
[7]
}}Bex, G. J., Neven, F., and Van den Bussche, J. 2004. DTDs versus XML Schema: A practical study. In Proceedings of the 7th International Workshop on the Web and Databases. 79--84.
[8]
}}Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the 32nd International Conference on Very Large Data Bases. 115--126.
[9]
}}Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML Schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Databases. 998--1009.
[10]
}}Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. 2008. Learning deterministic regular expressions for the inference of schemas from XML data. In Proceedings of the International Conference on the World Wide Web (WWW’08). 825--834.
[11]
}}Bex, G., Neven, F., Schwentick, T., and Vansummeren, S. 2010. Inference of concise regular expressions and DTDs. ACM Trans. Datab. Syst. 35, 2.
[12]
}}Brāzma, A. 1993. Efficient identification of regular expressions from representative examples. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory. ACM Press, 236--242.
[13]
}}Brüggeman-Klein, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2, 197--213.
[14]
}}Brüggemann-Klein, A. and Wood, D. 1998. One-unambiguous regular languages. Inform. Comput. 140, 2, 229--253.
[15]
}}Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to unstructured data. In Proceedings of the 6th International Conference on Database Theory (ICDT’97). F. N. Afrati and P. G. Kolaitis, Eds. Lecture Notes in Computer Science, vol. 1186. Springer, 336--350.
[16]
}}Che, D., Aberer, K., and Özsu, M. T. 2006. Query optimization in XML structured-document databases. VLDB J. 15, 3, 263--289.
[17]
}}Chidlovskii, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases.
[18]
}}Clark, J. Trang: Multi-format schema converter based on RELAX NG. http://www.thaiopensource.com/relaxng/trang.html.
[19]
}}Clark, J. and Murata, M. 2001. RELAX NG specification. OASIS.
[20]
}}Cover, R. 2003. The Cover Pages. http://xml.coverpages.org/.
[21]
}}Du, F., Amer-Yahia, S., and Freire, J. 2004. ShreX: Managing XML documents in relational databases. In Proceedings of the 30th International Conference on Very Large Data Bases. 1297--1300.
[22]
}}Ehrenfeucht, A. and Zeiger, P. 1976. Complexity measures for regular expressions. J. Comput. Syst. Sci. 12, 134--146.
[23]
}}Fernau, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the International Colloquium on Grammatical Inference (ICGI). 277--278.
[24]
}}Fernau, H. 2005. Algorithms for Learning Regular Expressions. In Proceedings of the 16th International Conference on Algorithmic Learning Theory. 297--311.
[25]
}}Finn, R., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., et al. 2006. Pfam: clans, Web tools and services. Nucleic Acids Resear. 34, D247--D251.
[26]
}}Florescu, D. 2005. Managing semi-structured data. ACM Queue 3, 8.
[27]
}}François, J.-M. 2006. Jahmm. http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/.
[28]
}}Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., and Siméon, J. 2002. StatiX: making XML count. In Proceedings of the SIGMOD Conference. 181--191.
[29]
}}Freitag, D. and McCallum, A. 2000. Information extraction with HMM structures learned by stochastic optimization. In Proceedings of the 17th National Conference on Artificial Intelligence. AAAI Press/The MIT Press, 584--589.
[30]
}}Garcia, P. and Vidal, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920--925.
[31]
}}Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., and Shim, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Min. Knowl. Disc. 7, 23--56.
[32]
}}Gelade, W. and Neven, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the Annual Symposium on Theoretical Aspects of Computer Science (STACS). 325--336.
[33]
}}Gold, E. 1967. Language identification in the limit. Inform. Cont. 10, 5, 447--474.
[34]
}}Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases. 436--445.
[35]
}}Gruber, H. and Holzer, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 39--50.
[36]
}}Hegewald, J., Naumann, F., and Weis, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the ICDE Workshops. 81.
[37]
}}Hopcroft, J. and Ullman, J. 2007. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.
[38]
}}Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases. 228--239.
[39]
}}Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of 27th International Conference on Very Large Data Bases. 241--250.
[40]
}}Martens, W., Neven, F., Schwentick, T., and Bex, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Datab. Syst. 31, 3, 770--813.
[41]
}}Mignet, L., Barbosa, D., and Veltri, P. 2003. The XML Web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500--510.
[42]
}}Nestorov, S., Abiteboul, S., and Motwani, R. 1998. Extracting schema from semistructured data. In Proceedings of the International Conference on Management of Data. ACM Press, 295--306.
[43]
}}Neven, F. and Schwentick, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logic. Meth. Comput. Sci. 2, 3.
[44]
}}Pitt, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference. K. P. Jantke, Ed. Lecture Notes in Computer Science, vol. 397. Springer-Verlag, 18--44.
[45]
}}Quass, D., Widom, J., Goldman, R., et al. 1996. LORE: A Lightweight Object REpository for semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 549.
[46]
}}Rabiner, L. 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 77, 2, 257--286.
[47]
}}Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334--350.
[48]
}}Sahuguet, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (Extended Abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases. D. Suciu and G. Vossen, Eds. Lecture Notes in Computer Science, vol. 1997. Springer, 171--183.
[49]
}}Sakakibara, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15--45.
[50]
}}Sankey, J. and Wong, R. K. 2001. Structural inference for semistructured data. In Proceedings of the 10th International Conference on Information and Knowledge Management. ACM Press, 159--166.
[51]
}}Thompson, H., Beech, D., Maloney, M., and Mendelsohn, N. 2001. XML Schema part 1: Structures. W3C.
[52]
}}Young-Lai, M. and Tompa, F. W. 2000. Stochastic grammatical inference of text database structure. Mach. Learn. 40, 2, 111--137.

Cited By

View all
  • (2025)Efficient interaction-based offline runtime verification of distributed systems with lifeline removalScience of Computer Programming10.1016/j.scico.2024.103230241(103230)Online publication date: Apr-2025
  • (2024)ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize FrameworkProceedings of the VLDB Endowment10.14778/3681954.368201917:11(3538-3550)Online publication date: 30-Aug-2024
  • (2023)InfeRE: Step-by-Step Regex Generation via Chain of InferenceProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering10.1109/ASE56229.2023.00111(1505-1515)Online publication date: 11-Nov-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on the Web
ACM Transactions on the Web  Volume 4, Issue 4
September 2010
173 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/1841909
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2010
Accepted: 01 July 2010
Revised: 01 February 2010
Received: 01 November 2008
Published in TWEB Volume 4, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Regular expressions
  2. XML
  3. schema inference

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)15
  • Downloads (Last 6 weeks)5
Reflects downloads up to 02 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2025)Efficient interaction-based offline runtime verification of distributed systems with lifeline removalScience of Computer Programming10.1016/j.scico.2024.103230241(103230)Online publication date: Apr-2025
  • (2024)ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize FrameworkProceedings of the VLDB Endowment10.14778/3681954.368201917:11(3538-3550)Online publication date: 30-Aug-2024
  • (2023)InfeRE: Step-by-Step Regex Generation via Chain of InferenceProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering10.1109/ASE56229.2023.00111(1505-1515)Online publication date: 11-Nov-2023
  • (2022)A universal approach for multi-model schema inferenceJournal of Big Data10.1186/s40537-022-00645-99:1Online publication date: 11-Aug-2022
  • (2022)Schema inference for multi-model dataProceedings of the 25th International Conference on Model Driven Engineering Languages and Systems10.1145/3550355.3552400(13-23)Online publication date: 23-Oct-2022
  • (2022)Self-Adapting Design and Maintenance of Multi-Model DatabasesProceedings of the 26th International Database Engineered Applications Symposium10.1145/3548785.3548810(9-15)Online publication date: 22-Aug-2022
  • (2022)Towards Theory for Real-World DataProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3526066(261-276)Online publication date: 12-Jun-2022
  • (2022)SemMT: A Semantic-Based Testing Approach for Machine Translation SystemsACM Transactions on Software Engineering and Methodology10.1145/349048831:2(1-36)Online publication date: 1-Apr-2022
  • (2022)Learning Disjunctive Multiplicity Expressions and Disjunctive Generalize Multiplicity Expressions From Both Positive and Negative ExamplesThe Computer Journal10.1093/comjnl/bxac03766:7(1733-1748)Online publication date: 18-Apr-2022
  • (2021)TransRegexProceedings of the 43rd International Conference on Software Engineering10.1109/ICSE43902.2021.00111(1210-1222)Online publication date: 22-May-2021
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media