Abstract
XML Indexing and Storage (XMIS) techniques are crucial for the functionality and the overall performance of an XML database management system (XDBMS). Because of the complexity of XQuery and performance demands of XML query processing, efficient path processing operators—including those for tree-pattern queries (so-called twigs)—are urgently needed for which tailor-made indexes and their flexible use are indispensable. Although XML indexing and storage are standard problems and, of course, manifold approaches have been proposed in the last decade, adaptive and broad-enough solutions for satisfactory query evaluation support of all path processing operators are missing in the XDBMS context. Therefore, we think that it is worthwhile to take a step back and look at the complete picture to derive a salient and holistic solution. To do so, we first compile an XMIS wish list containing what—in our opinion—are essential functional storage and indexing requirements in a modern XDBMS. With these desiderata in mind, we then develop a new XMIS scheme, which—by reconsidering previous work—can be seen as a practical and general approach to XML storage and indexing. Interestingly, by working on both problems at the same time, we can make the storage and index managers live in a kind of symbiotic partnership, because the document store re-uses ideas originally proposed by the indexing community and vice versa. The XMIS scheme is implemented in XTC, an XDBMS used for empirical tests.
Similar content being viewed by others
Notes
To avoid an application-specific focus, these properties are not ranked. For a universal approach, all properties should have the same importance.
All variations of document stores and index types are implemented using B*-trees. With code reuse for the base structures, the tree entries only differ in the representation of keys and values.
In the following, we are only interested in the path up to the root for a given PCR. Therefore, the relative order among siblings is not relevant, e.g., all permutations of elements issn, name, publisher, and issue as children of journal may appear in the document.
For uniprot, the structure-related saving of the pc and po formats consists of 465 and 731 Mbytes w.r.t. the naive format. Content compression would reduce the content part by 23–35%, in addition.
Although we simplified the path specification for presentation purposes, their XPath equivalent is used to express twig queries as well by simply combining several path specifications (join) and, thereby, allowing their application for twig operators.
Note, subscript D and type T are omitted where non-ambiguous.
References
Arion A, Bonifati A, Manolescu I, Pugliese A (2008) Path summaries and path partitioning in modern XML databases. World Wide Web 11(1):117–151
Balmin A, Özcan F, Beyer KS, Chochrane RJ, Pirahesh H (2004) A framework for using materialized XPath views in XML query processing. In: Proc VLDB, pp 60–71
Beyer K, Cochrane R, Josifovski V, Kleewein J, Lapis G, Lohman GM, Lyle R, Özcan F, Pirahesh H, Seemann N, Truong TC, Van der Linden B, Vickery B, Zhang C, System RX (2005) One part relational, one part XML. In: Proc SIGMOD, pp 358–374
Boncz P, Grust T, van Keulen M, Manegold S, Rittinger J, Teubner J (2006) MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In: Proc SIGMOD, pp 479–490
Bruno N, Koudas N, Srivastava D (2002) Holistic twig joins: optimal XML pattern matching. In: Proc SIGMOD, pp 310–321
Chen Q, Lim A, Ong KW (2003) D(k)-index: an adaptive structural summary for graph-structured data. In: Proc SIGMOD, pp 134–144
Draper D, Frankhauser P, Fernandéz M, Malhotra A, Rose K, Rys M, Siméon J, Wadler P (2004) XQuery 1.0 and XPath 2.0 formal semantics
Fomichev A, Grinev M, Kuznetsov S (2006) Sedna: a native XML DBMS. In: Proc SOFSEM, pp 272–281
Goldman R, Widom J (1997) DataGuides: enabling query formulation and optimization in semistructured databases. In: Proc VLDB, pp 436–445
Graefe G, Larson P-A (2001) B-tree indexes and CPU caches. In: Proc ICDE, pp 349–358
Grust T, van Keulen M, Teubner J (2003) Staircase join: teach a relational DBMS to watch its (axis) steps. In: Proc VLDB, pp 524–525
Härder T, Haustein MP, Mathis C, Wagner M (2007) Node labeling schemes for dynamic XML documents reconsidered. Data Knowl Eng 60(1):126–149
Härder T, Mathis C, Schmidt K (2007) Comparison of complete and elementless native storage of XML documents. In: Proc IDEAS, pp 102–113
Haustein MP, Härder T, Mathis C, Wagner M (2005) DeweyIDs—the key to fine-grained management of XML documents. In: Proc 20th Brazilian symposium on databases, pp 85–99
Haustein MP, Härder T (2007) An efficient infrastructure for native transactional XML processing. Data Knowl Eng 61(3):500–523
Jiang H, Wang W, Lu H, Yu Xu J (2003) Holistic twig joins on indexed XML documents. In: Proc VLDB, pp 273–284
Kaushik R, Bohannon P, Naughton JF, Korth HF (2002) Covering indexes for branching path queries. In: Proc SIGMOD, pp 133–144
Kaushik R, Shenoy P, Bohannon P, Gudes E (2002) Exploiting local similarity for indexing paths in graph-structured data. In: Proc ICDE, pp 129–140
Kaushik R, Krishnamurthy R, Naughton JF, Ramakrishnan R (2004) On the integration of structure indexes and inverted lists. In: Proc SIGMOD, pp 779–790
Li H-G, Aghili SA, Agrawal D, El Abbadi A (2006) FLUX: content-and-structure matching of XPath queries with range predicates. In: Proc XSym. LNCS, vol 4156, pp 61–76
Mathis C (2009) Storing, indexing, and querying XML documents in native XML database management systems. PhD thesis, Verlag Dr Hut
Mathis C, Härder T, Schmidt K (2009) Storing and indexing XML documents upside down. Comput Sci Res Dev 24(1–2):51–68
McHugh J, Abiteboul S (1997) Lore: a database management system for semistructured data. SIGMOD Rec 26:54–66
Meier W (2002) eXist: an open source native xml database. Proc Web, Web-services, and database systems. Lect Notes Comput Sci 2593:169–183
Miklau G. XML data repository. www.cs.washington.edu/research/xmldatasets
Milo T, Suciu D (1999) Index structures for path expressions. In: Proc ICDT, pp 277–295
O’Neil PE, Pal S, Cseri I, Schaller G, Westbury N (2004) ORDPATHs: insert-friendly XML node labels. In: Proc SIGMOD, pp 903–908
Prakash S, Bhowmick SS, Madria S (2006) Efficient recursive XML query processing using relational database systems. Data Knowl Eng 58(3):207–242
Prasad KH, Kumar PS (2005) Efficient indexing and querying of XML data using modified prüfer sequences. In: Proc CIKM, pp 397–404
Sample N, Cooper BF, Franklin MJ, Hjaltason GR, Shadmon M, Cohen L (2002) Managing complex and varied data with the IndexFabric(tm). In: Proc ICDE, pp 492–493
Schmidt AR, Waas F, Kersten ML, Carey MJ, Manolescu I, Busse R (2002) XMark: a benchmark for XML data management. In: Proc VLDB, pp 974–985
Document Object Model (DOM) Level 3 core specification, W3C recommendations (Jan 2004)
Brownell D (2002) SAX2. O’Reilly Media
Wang H, Park S, Fan W, PS Yu (2003) ViST: a dynamic index method for querying XML data by tree structures. In: Proc SIGMOD, pp 110–121
Wang W, Jiang H, Wang H, Lin X, Lu H, Li J (2005) Efficient processing of XML path queries using the disk-based F&B-index. In: Proc VLDB, pp 145–165
XQuery 1.0 (2007) An XML query language. W3C recommendation (Jan 2007)
XQuery Update Facility 1.0 (2011) W3C recommendation (17 March 2011)
Yoshikawa M, Amagasa T, Shimura T, Uemura S (2001) XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM TOIT 1(1):110–141
Zhang N, Kacholia V, Özsu T (2004) A succinct physical storage scheme for efficient evaluation of path queries in XML. In: Proc ICDE, pp 54–63
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Mathis, C., Härder, T., Schmidt, K. et al. XML indexing and storage: fulfilling the wish list. Comput Sci Res Dev 30, 51–68 (2015). https://doi.org/10.1007/s00450-012-0204-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00450-012-0204-6