T2: a customizable parallel database for multi-dimensional data

Published: 01 March 1998

Abstract

As computational power and storage capacity increase, processing and analyzing large volumes of data play an increasingly important part in many domains of scientific research. Typical examples of large scientific datasets include long-running simulations of time-dependent phenomena that periodically generate snapshots of their state (e.g. hydrodynamics and chemical transport simulation for estimating pollution impact on water bodies [4, 6, 20], magnetohydrodynamics simulation of planetary magnetospheres [32], simulation of a flame sweeping through a volume [28], airplane wake simulations [21]), archives of raw and processed remote sensing data (e.g. AVHRR [25], Thematic Mapper [17], MODIS [22]), and archives of medical images (e.g. confocal light microscopy, CT imaging, MRI, sonography).
These datasets are usually multi-dimensional. The data dimensions can be spatial coordinates, time, or experimental conditions such as temperature, velocity or magnetic field. The importance of such datasets has been recognized by several database research groups and vendors, and several systems have been developed for managing and/or visualizing them [2, 7, 14, 19, 26, 27, 29, 31].
These systems, however, focus on lineage management, retrieval and visualization of multi-dimensional datasets. They provide little or no support for analyzing or processing these datasets -- the assumption is that this is too application-specific to warrant common support. As a result, applications that process these datasets are usually decoupled from data storage and management, resulting in inefficiency due to copying and loss of locality. Furthermore, every application developer has to implement complex support for managing and scheduling the processing.
Over the past three years, we have been working with several scientific research groups to understand the processing requirements for such applications [1, 5, 6, 10, 18, 23, 24, 28]. Our study of a large set of applications indicates that the processing for such datasets is often highly stylized and shares several important characteristics. Usually, both the input dataset and the result being computed have underlying multi-dimensional grids, and queries into the dataset take the form of ranges within each dimension of the grid. The basic processing step usually consists of transforming individual input items, mapping the transformed items to the output grid, and computing each output item by aggregating, in some way, all the transformed input items mapped to the corresponding grid point. For example, remote-sensing earth images are often generated by performing atmospheric correction on several days' worth of raw telemetry data, mapping all the data to a latitude-longitude grid, and selecting those measurements that provide the clearest view.
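The processing pattern described above (range selection, per-item transformation, mapping to an output grid, aggregation per grid point) can be illustrated with a minimal Python sketch. It is not T2 code: the function `process`, its parameters, the two-dimensional coordinates, the toy readings and the use of `max` as a stand-in for "clearest view" are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical types for illustration; none of these names come from T2.
Point = Tuple[float, float]   # input coordinates, e.g. (lat, lon)
Cell = Tuple[int, int]        # index of a point on the output grid
Item = Tuple[Point, float]    # (coordinates, measured value)


def process(
    items: Iterable[Item],
    query: Tuple[Point, Point],
    transform: Callable[[float], float],
    to_cell: Callable[[Point], Cell],
    aggregate: Callable[[float, float], float],
) -> Dict[Cell, float]:
    """Range-select, transform, map, and aggregate input items."""
    (lo_x, lo_y), (hi_x, hi_y) = query
    out: Dict[Cell, float] = {}
    for (x, y), value in items:
        # Range query: keep only items inside the box in every dimension.
        if not (lo_x <= x <= hi_x and lo_y <= y <= hi_y):
            continue
        v = transform(value)      # per-item transformation
        cell = to_cell((x, y))    # map the item to an output grid point
        # Combine with whatever has already been mapped to this grid point.
        out[cell] = aggregate(out[cell], v) if cell in out else v
    return out


# Toy data: three readings, two of which fall in the same 1-degree cell.
readings = [((10.2, 20.7), 0.4), ((10.9, 20.1), 0.9), ((55.0, 60.0), 0.5)]
result = process(
    readings,
    query=((0.0, 0.0), (30.0, 30.0)),
    transform=lambda v: v * 2.0,                # stand-in for a correction step
    to_cell=lambda p: (int(p[0]), int(p[1])),   # 1-degree output grid
    aggregate=max,                              # keep the largest value per cell
)
print(result)  # {(10, 20): 1.8}
```

The third reading lies outside the query box and is skipped; the other two map to the same output cell and are reduced to a single value by the aggregation function.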
In this paper, we present T2, a customizable parallel database that integrates storage, retrieval and processing of multi-dimensional datasets. T2 provides support for many operations including index generation, data retrieval, memory management, scheduling of processing across a parallel machine and user interaction. It achieves its primary advantage from the ability to seamlessly integrate data retrieval and processing for a wide variety of applications and from the ability to maintain and process multiple datasets with different underlying grids. Most other systems for multi-dimensional data have focused on uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. Many real datasets, however, are non-uniform or unstructured. For example, satellite data is a two-dimensional strip embedded in a three-dimensional space, and water contamination studies use unstructured meshes to selectively simulate regions. T2 can handle both uniform and non-uniform datasets.
T2 has been developed as a set of modular services. Since its structure mirrors that of a wide variety of applications, T2 is easy to customize for different types of processing. To build a version of T2 customized for a particular application, a user has to provide functions to pre-process the input data, map input data to elements in the output data, and aggregate multiple input data items that map to the same output element.
T2 presents a uniform interface to the end users (the clients of the database system). Users specify the dataset(s) of interest, a region of interest within the dataset(s), and the desired format and resolution of the output. In addition, they select the mapping and aggregation functions to be used. T2 analyzes the user request, builds a suitable plan to retrieve and process the datasets, executes the plan and presents the results in the desired format.
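As a rough illustration of the request described above, the following Python sketch models a client request and one simple planning step: choosing which stored chunks must be retrieved because their bounding boxes intersect the region of interest. The abstract does not specify T2's request format or planner, so every name here (`Request`, `Chunk`, `build_plan`, the field names, the dataset labels) is hypothetical, and bounding-box intersection is only one plausible way to select data.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Axis-aligned box: ((lo_x, lo_y), (hi_x, hi_y)). Two dimensions for brevity.
Box = Tuple[Tuple[float, float], Tuple[float, float]]


@dataclass(frozen=True)
class Chunk:
    """Hypothetical unit of stored data with a bounding box."""
    dataset: str
    chunk_id: int
    bounds: Box


@dataclass(frozen=True)
class Request:
    """Hypothetical client request; field names are illustrative only."""
    datasets: Tuple[str, ...]
    region: Box
    resolution: float   # output cell size, in input coordinate units
    mapping: str        # name of the selected mapping function
    aggregation: str    # name of the selected aggregation function


def intersects(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes overlap in every dimension."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(a_lo[d] <= b_hi[d] and b_lo[d] <= a_hi[d] for d in range(2))


def build_plan(request: Request, catalog: List[Chunk]) -> List[Chunk]:
    """Select the chunks that must be retrieved to answer the request."""
    return [
        c for c in catalog
        if c.dataset in request.datasets and intersects(c.bounds, request.region)
    ]


catalog = [
    Chunk("avhrr", 0, ((0.0, 0.0), (10.0, 10.0))),
    Chunk("avhrr", 1, ((10.0, 0.0), (20.0, 10.0))),
    Chunk("avhrr", 2, ((40.0, 40.0), (50.0, 50.0))),
    Chunk("other", 3, ((0.0, 0.0), (10.0, 10.0))),
]
request = Request(("avhrr",), ((5.0, 5.0), (12.0, 8.0)), 1.0, "lat_lon", "max")
plan_ids = [c.chunk_id for c in build_plan(request, catalog)]
print(plan_ids)  # [0, 1]
```

Only the two chunks of the requested dataset that overlap the region are selected; the distant chunk and the chunk from the other dataset are never read.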
In Section 2 we first present several motivating applications and illustrate their common structure. Section 3 then presents an overview of T2, including its distinguishing features and a running example. Section 4 describes each database service in some detail. An example of how to customize several of the database services for a particular application is given in Section 5. T2 is a system in evolution. We conclude in Section 6 with a description of the current status of both the T2 design and the implementation of various applications with T2.

Published In

ACM SIGMOD Record, Volume 27, Issue 1
March 1998
103 pages
ISSN: 0163-5808
DOI: 10.1145/273244

Publisher

Association for Computing Machinery, New York, NY, United States

Cited By

  • (2024) High-Performance Spatial Data Analytics: Systematic R&D for Scale-Out and Scale-Up Solutions from the Past to Now. Proceedings of the VLDB Endowment 17:12 (4507-4520). DOI: 10.14778/3685800.3685912. Online publication date: 1-Aug-2024
  • (2021) Introduction to Digital Pathology from Historical Perspectives to Emerging Pathomics. Whole Slide Imaging (1-22). DOI: 10.1007/978-3-030-83332-9_1. Online publication date: 30-Oct-2021
  • (2015) Formal representation of the SS-DB benchmark and experimental evaluation in EXTASCID. Distributed and Parallel Databases 33:3 (277-317). DOI: 10.1007/s10619-014-7149-7. Online publication date: 1-Sep-2015
  • (2013) Astronomical data processing in EXTASCID. Proceedings of the 25th International Conference on Scientific and Statistical Database Management (1-4). DOI: 10.1145/2484838.2484875. Online publication date: 29-Jul-2013
  • (2013) The open connectome project data cluster. Proceedings of the 25th International Conference on Scientific and Statistical Database Management (1-11). DOI: 10.1145/2484838.2484870. Online publication date: 29-Jul-2013
  • (2013) Optimize Multidimensional Arrays Queries with Heterogeneous Replica Method. Proceedings of the 2013 IEEE Eighth International Conference on Networking, Architecture and Storage (272-276). DOI: 10.1109/NAS.2013.43. Online publication date: 17-Jul-2013
  • (2013) Time travel in a scientific array database. Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013) (98-109). DOI: 10.1109/ICDE.2013.6544817. Online publication date: 8-Apr-2013
  • (2011) ArrayStore. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (253-264). DOI: 10.1145/1989323.1989351. Online publication date: 12-Jun-2011
  • (2011) Hybrid merge/overlap execution technique for parallel array processing. Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases (20-30). DOI: 10.1145/1966895.1966898. Online publication date: 25-Mar-2011
  • (2011) Monte Carlo query processing of uncertain multidimensional array data. Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (936-947). DOI: 10.1109/ICDE.2011.5767887. Online publication date: 11-Apr-2011
