[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.5555/1182635.1164139acmconferencesArticle/Chapter ViewAbstractPublication PagesvldbConference Proceedingsconference-collections
Article

Inference of concise DTDs from XML data

Published: 01 September 2006 Publication History

Abstract

We consider the problem to infer a concise Document Type Definition (DTD) for a given set of XML-documents, a problem which basically reduces to learning of concise regular expressions from positive example strings. We identify two such classes: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the far majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD) that learns SOREs from strings by first inferring an automaton by known techniques and then translating that automaton to a corresponding SORE, possibly by repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness and speed. In a scenario where only a very small amount of XML data is available, for instance when generated by Web service requests or by answers to queries, iDTD produces regular expressions which are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas.

References

[1]
{1} Top ten most critical web application vulnerabilities. Technical report, Open Web Application Security Project, Jan. 2004.
[2]
{2} S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web. Morgan Kaufmann, 1999.
[3]
{3} H. Ahonen. Generating grammars for structured documents using grammatical inference methods. Report A-1996-4, Univ. of Finland, 1996.
[4]
{4} D. Angluin and C. H. Smith. Inductive inference: theory and methods. ACM Computing Surveys, 15(3):237-269, 1983.
[5]
{5} D. Barbosa, A. Mendelzon, J. Keenleyside, and K. Lyons. ToXgene: An extensible template-based data generator for XML. WebDB, 49-54, 2002.
[6]
{6} D. Barbosa, L. Mignet, and P. Veltri. Studying the XML Web: gathering statistics from an XML sample. WWW, 8(4):413-438, 2005.
[7]
{7} M. Benedikt, W. Fan, and F. Geerts. XPath satisfiability in the presence of DTDs. PODS, 25-36, 2005.
[8]
{8} P. Bernstein. Applying model management to classical meta data problems. CIDR, 2003.
[9]
{9} W. Martens, F. Neven, and T. Schwentick and G.J. Bex. Expressiveness and Complexity of XML Schema. ACM TODS, 31(3), 2006.
[10]
{10} G.J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: a practical study. WebDB, 79-84, 2004.
[11]
{11} A. Brazma. Efficient identification of regular expressions from representative examples. COLT, 236-242, 1993.
[12]
{12} A. Brüggemann-Klein and D. Wood. One-unambiguous regular languages. Information and computation, 140(2):229-253, 1998.
[13]
{13} P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. ICDT, LNCS 1186, 336-350, 1997.
[14]
{14} B. Chidlovskii. Schema extraction from XML: a grammatical inference approach. KRDB, 2001.
[15]
{15} J. Clark and M. Murata. RELAX NG Specification. OASIS, Dec. 2001.
[16]
{16} M. Delgado and J. Morais. Approximation to the smallest regular expression for a given regular language. CIAA, LNCS 3317, 312-314, 2004.
[17]
{17} A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. SIGMOD, 431-442, 1999.
[18]
{18} A. Ehrenfeucht and P. Zeiger. Complexity measures for regular expressions. Journal of computer and system sciences, 12:134-146, 1976.
[19]
{19} M. Fernandez and D. Suciu. Optimizing regular path expressions using graph schemas. ICDE, 14-23, 1998.
[20]
{20} H. Fernau. Extracting minimum length Document Type Definitions is NP-hard. ICGI, LNAI 3264, 277-278, 2004.
[21]
{21} H. Fernau. Algorithms for learning regular expressions. ALT, LNCS 3734, 297-311, 2005.
[22]
{22} D. Florescu. Managing semistructured data. ACM Queue, 3(8), Oct. 2005.
[23]
{23} P. Garcia and E. Vidal. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. Patt. Anal. and Mach. Int., 12(9):920-925, 1990.
[24]
{24} M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: learning document type descriptors from XML document collections. Data mining and knowledge discovery, 7:23-56, 2003.
[25]
{25} E. M. Gold. Language identification in the limit. Information and Control, 10(5):447-474, 1967.
[26]
{26} R. Goldman and J. Widom. DataGuides: enabling query formulation and optimization in semistructured databases. In VLDB, 436-445, 1997.
[27]
{27} Y.-S. Han and D. Wood. Shorter regular expressions from finite-state automata. In CIAA, 141-152, 2005.
[28]
{28} S. Hinkelman. Business Integration - Information Conformance Statements (BI-ICS). IBM, 2005. http://www.ibm.com/developerworks/xml/library/x-biics/
[29]
{29} J.E. Hopcroft and J.D. Ullman. Introduction to automata theory, languages and computation. Addison-Wesley, 1979.
[30]
{30} C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. VLDB, 228-239, 2004.
[31]
{31} I. Manolescu, D. Florescu, and D. Kossmann. Answering XML queries on heterogeneous data sources. VLDB, 241-250, 2001.
[32]
{32} S. Melnik. Generic model management: concepts and algorithms. Ph.D., LNCS 2967, Univ. of Leipzig, 2004.
[33]
{33} L. Mignet, D. Barbosa, and P. Veltri. The XML web: a first study. WWW, 500-510, 2003.
[34]
{34} G. Miklau. XMLData Repository, Nov. 2002. http://www.cs.washington.edu/research/xmldatasets/
[35]
{35} J.K. Min, J.Y. Ahn, and C.W. Chung. Efficient extraction of schemas for XML documents. Inf. Process. Lett., 85(1):7-12, 2003.
[36]
{36} S. Nestorov, S. Abiteboul, and R. Motwani. Extracting schema from semistructured data. SIGMOD, 295-306, 1998.
[37]
{37} S. Nestorov, J. D. Ullman, J. Wiener, and S. Chawathe. Representative objects: concise representations of semistructured, hierarchial data. ICDE, 79-90, 1997.
[38]
{38} F. Neven and T. Schwentick. XPath containment in the presence of disjunction, DTDs, and variables. ICDT, 315-329, 2003.
[39]
{39} A. Ngu, D. Rocco, T. Critchlow, and D. Buttler. Automatic discovery and inferencing of complex bioinformatics web interfaces. WWW, 463-493, 2005.
[40]
{40} P. Oaks and A. ter Hofstede. Guided Interaction: a language and method for incremental revelation of software interfaces for ad hoc interaction. Business Process Management, LNCS 3812, 3-17, 2005.
[41]
{41} L. Pitt. Inductive inference, DFAs, and computational complexity. AII, 18-44, 1989.
[42]
{42} D. Quass et al. LORE: a Lightweight Object Repository for semistructured data. SIGMOD, 549, 1996.
[43]
{43} E. Rahm and P. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10, 2001.
[44]
{44} A. Sahuguet. Everything you ever wanted to know about DTDs, but were afraid to ask. WebDB, 2000.
[45]
{45} Y. Sakakibara. Recent advances of grammatical inference. Theor. Comput. Sci., 185(1):15-45, 1997.
[46]
{46} J. Clark. Trang. Multi-format schema converter. http://www.thaiopensource.com/relaxng/trang.html
[47]
{47} J. Sankey and R.K. Wong. Structural inference for semistructured data. CIKM, 159-166, 2001.
[48]
{48} H.S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema part 1: structures. 2001.
[49]
{49} E. van der Vlist. XML Schema. O'Reilly, 2002.
[50]
{50} G. Wang, M. Liu, J. Yu, B. Sun, G. Yu, J. Lv, and H. Lu. Effective schema-based XML query optimization techniques. In IDEAS, 230-235, 2003.

Cited By

View all
  • (2021)Reducing Ambiguity in Json Schema DiscoveryProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452801(1732-1744)Online publication date: 9-Jun-2021
  • (2019)An effective algorithm for learning single occurrence regular expressions with interleavingProceedings of the 23rd International Database Applications & Engineering Symposium10.1145/3331076.3331100(1-10)Online publication date: 10-Jun-2019
  • (2018)Practical Study of Deterministic Regular Expressions from Large-scale XML and Schema DataProceedings of the 22nd International Database Engineering & Applications Symposium10.1145/3216122.3216126(45-53)Online publication date: 18-Jun-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
VLDB '06: Proceedings of the 32nd international conference on Very large data bases
September 2006
1269 pages

Sponsors

  • SIGMOD: ACM Special Interest Group on Management of Data
  • K.I.S.S. SIG on Databases
  • AJU Information Technology Co., Ltd
  • US Army ITC-PAC Asian Research Office
  • Google Inc.
  • The Database Society of Japan
  • Samsung SOS
  • Advanced Information Technology Research Center
  • Naver
  • Microsoft: Microsoft
  • Korea Info Sci Society: Korea Information Science Society
  • SK telecom
  • Systems Applications Products
  • ORACLE: ORACLE
  • International Business Management
  • Air Force Office of Scientific Research/Asian Office of Aerospace R&D
  • Kosef
  • Kaist
  • LG Electronics
  • CCF-DBS

Publisher

VLDB Endowment

Publication History

Published: 01 September 2006

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 02 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Reducing Ambiguity in Json Schema DiscoveryProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452801(1732-1744)Online publication date: 9-Jun-2021
  • (2019)An effective algorithm for learning single occurrence regular expressions with interleavingProceedings of the 23rd International Database Applications & Engineering Symposium10.1145/3331076.3331100(1-10)Online publication date: 10-Jun-2019
  • (2018)Practical Study of Deterministic Regular Expressions from Large-scale XML and Schema DataProceedings of the 22nd International Database Engineering & Applications Symposium10.1145/3216122.3216126(45-53)Online publication date: 18-Jun-2018
  • (2018)Auto-DetectProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3196889(1377-1392)Online publication date: 27-May-2018
  • (2016)A Generic Framework for Engaging Online Data Sources in Introductory Programming CoursesProceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education10.1145/2899415.2899437(136-141)Online publication date: 11-Jul-2016
  • (2013)Conservative type extensions for XML dataTransactions on Large-Scale Data- and Knowledge-centered systems IX10.5555/2554635.2554639(65-94)Online publication date: 1-Jan-2013
  • (2013)Almost-linear inclusion for XML regular expression typesACM Transactions on Database Systems10.1145/2508020.250802238:3(1-45)Online publication date: 5-Sep-2013
  • (2013)Optimizing XML querying using type-based document projectionACM Transactions on Database Systems10.1145/2445583.244558738:1(1-45)Online publication date: 26-Apr-2013
  • (2013)Learning regular expressions to template-based FAQ retrieval systemsKnowledge-Based Systems10.1016/j.knosys.2013.08.01853(108-128)Online publication date: 1-Nov-2013
  • (2012)Finding optimal probabilistic generators for XML collectionsProceedings of the 15th International Conference on Database Theory10.1145/2274576.2274591(127-139)Online publication date: 26-Mar-2012
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media