More Web Proxy on the site http://driver.im/

research-article

Simulation of database-valued markov chains using SimSQL

Authors:

Zografoula Vagena,

Subramanian Arumugam,

Christopher JermaineAuthors Info & Claims

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 637 - 648

https://doi.org/10.1145/2463676.2465283

Published: 22 June 2013 Publication History

Abstract

This paper describes the SimSQL system, which allows for SQLbased specification, simulation, and querying of database-valued Markov chains, i.e., chains whose value at any time step comprises the contents of an entire database. SimSQL extends the earlier Monte Carlo database system (MCDB), which permitted Monte Carlo simulation of static database-valued random variables. Like MCDB, SimSQL uses user-specified "VG functions" to generate the simulated data values that are the building blocks of a simulated database. The enhanced functionality of SimSQL is enabled by the ability to parametrize VG functions using stochastic tables, so that one stochastic database can be used to parametrize the generation of another stochastic database, which can parametrize another, and so on. Other key extensions include the ability to explicitly define recursive versions of a stochastic table and the ability to execute the simulation in a MapReduce environment. We focus on applying SimSQL to Bayesian machine learning.

References

[1]

20Newsgroups Dataset. Available at: http://people.csail.mit.edu/jrennie/20newsgroups/.

[2]

A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1):922--933, 2009.

Digital Library

[3]

B. Bahmani, K. Chakrabarti, and D. Xin. Fast Personalized Pagerank on MapReduce. In SIGMOD, 2011.

Digital Library

[4]

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Digital Library

[5]

D. M. Blei, A. N. Ng, and M. I. Jordan. Latent Dirichlet Allocation. JMLR, 3:993--1022, 2003.

Digital Library

[6]

Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. Haloop: efficient iterative data processing on large clusters. PVLDB, 3(1-2):285--296, 2010.

Digital Library

[7]

R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008.

Digital Library

[8]

C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. Flumejava: easy, efficient data-parallel pipelines. In PLDA, 2010.

Digital Library

[9]

B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing a sql implementation on the mapreduce framework. PVLDB, 4(12):1318--1327, 2011.

Digital Library

[10]

C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2006.

[11]

T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. McGraw-Hill, 2003.

Digital Library

[12]

S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating r and hadoop. In SIGMOD, 2010.

Digital Library

[13]

D. Deutch, C. Koch, and T. Milo. On probabilistic fixpoint and markov chain query languages. In PODS, pages 215--226, 2010.

Digital Library

[14]

A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on mapreduce. In ICDE, 2011.

Digital Library

[15]

J. Goldstein and P. Larson. Optimizing Queries Using Materialized Views: A practical, scalable solution. In SIGMOD, 2001.

Digital Library

[16]

T. J. Green and V. Tannen. Models for incomplete and probabilistic information. IEEE Data Eng. Bull., 29(1):17--24, 2006.

[17]

Y. E. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. TODS, 18(4):709--748, 1993.

Digital Library

[18]

R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. The Monte Carlo Database System: Stochastic analysis close to the data. TODS, 36(3):1--41, 2011.

Digital Library

[19]

U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system - implementation and observations. In ICDM, 2009.

Digital Library

[20]

O. Kennedy and S. Nath. Jigsaw: Efficient optimization over uncertain enterprise data. In SIGMOD, 2011.

Digital Library

[21]

C. Koch. On query algebras for probabilistic databases. SIGMOD Record, 37(4):78--85, 2008.

Digital Library

[22]

Y. Kwon, D. Nunley, J. D. Gardner, M. Balazinska, B. Howe, and S. Loebman. Scalable clustering for n-body simulations in a shared-nothing cluster. In SSDBM, 2010.

Digital Library

[23]

Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. Plda+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM TIST, 2(3):26:1--26:18, 2011.

Digital Library

[24]

Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A New Parallel Framework for Machine Learning. In UAI, 2010.

Digital Library

[25]

A. Mahout. Available at: http://mahout.apache.org/.

[26]

T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011.

Digital Library

[27]

B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. Planet: Massively parallel learning of tree ensembles with mapreduce. PVLDB, 2(2):1426--1437, 2009.

Digital Library

[28]

S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining. In ICDM, 2008.

Digital Library

[29]

I. Porteous, D. Newman, A. T. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for Latent Dirichlet Allocation. In SIGKDD, 2008.

Digital Library

[30]

RHIPE: R and Hadoop Integrated Programming Environment. Available at: http://ml.stat.purdue.edu/rhipe/.

[31]

C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2010.

Digital Library

[32]

S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Distributed MAP inference for undirected graphical models. In NIPS Workshops, 2010.

[33]

A. J. Smola and S. Narayanamurthy. An Architecture for Parallel Topic models. PVLDB, 3(1):703--710, 2010.

Digital Library

[34]

A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009.

Digital Library

[35]

G. Wang, M. V. Salles, B. Sowell, X. Wang, T. Cao, A. Demers, J. Gehrke, and W. White. Behavioral simulations in mapreduce. PVLDB, 3(1-2):952--963, 2010.

Digital Library

[36]

M. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and MCMC. In VLDB, 2010.

Digital Library

[37]

Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, 2008.

Digital Library

[38]

D. Z. Zhang, M. J. Franklin, M. Garofalakis, J. M. Hellerstein, and M. L. Wick. Hybrid in-database inference for declarative information extraction. In SIGMOD, 2011.

Digital Library

Cited By

Huot MGhavami MLew ASchaechtle UFreer CShelby ZRinard MSaad FMansinghka V(2024)GenSQL: A Probabilistic Programming System for Querying Generative Models of Database TablesProceedings of the ACM on Programming Languages10.1145/36564098:PLDI(790-815)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656409
Ben Amara OHadouaj SMeneghetti N(2024)StarfishDB: A Query Execution Engine for Relational Probabilistic ProgrammingProceedings of the ACM on Management of Data10.1145/36549882:3(1-31)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654988
Xu LQiu SYuan BJiang JRenggli CGan SKara KLi GLiu JWu WYe JZhang C(2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 12-Apr-2024
https://doi.org/10.1007/s00778-024-00845-0
Show More Cited By

Index Terms

Simulation of database-valued markov chains using SimSQL
1. Information systems

Recommendations

Measure-Valued Differentiation for Stationary Markov Chains

We study general state-space Markov chains that depend on a parameter, say, <italic/>. Sufficient conditions are established for the stationary performance of such a Markov chain to be differentiable with respect to <italic/>. Specifically, we study the ...
Markov chains aggregation using discrete event optimization via simulation
SummerSim '19: Proceedings of the 2019 Summer Simulation Conference

Markov chains are an important form of stochastic system representation. Recent modeling techniques supporting discrete-event Markov model composition make it easy to build large Markov chains that are difficult to analyze due to state space explosion. ...
Variance reduction for additive functionals of Markov chains via martingale representations
Abstract
In this paper, we propose an efficient variance reduction approach for additive functionals of Markov chains relying on a novel discrete-time martingale representation. Our approach is fully non-asymptotic and does not require the knowledge of the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN:9781450320375

DOI:10.1145/2463676

General Chairs:
Kenneth Ross
Columbia University
,
Divesh Srivastava
AT&T Research
,
Program Chair:
Dimitris Papadias
HKUST

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 June 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

53
Total Citations
View Citations
758
Total Downloads

Downloads (Last 12 months)20
Downloads (Last 6 weeks)5

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Huot MGhavami MLew ASchaechtle UFreer CShelby ZRinard MSaad FMansinghka V(2024)GenSQL: A Probabilistic Programming System for Querying Generative Models of Database TablesProceedings of the ACM on Programming Languages10.1145/36564098:PLDI(790-815)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656409
Ben Amara OHadouaj SMeneghetti N(2024)StarfishDB: A Query Execution Engine for Relational Probabilistic ProgrammingProceedings of the ACM on Management of Data10.1145/36549882:3(1-31)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654988
Xu LQiu SYuan BJiang JRenggli CGan SKara KLi GLiu JWu WYe JZhang C(2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 12-Apr-2024
https://doi.org/10.1007/s00778-024-00845-0
Boehm MInterlandi MJermaine CDas SPandis ISelçuk Candan KAmer-Yahia S(2023)Optimizing Tensor Computations: From Applications to Compilation and Runtime TechniquesCompanion of the 2023 International Conference on Management of Data10.1145/3555041.3589407(53-59)Online publication date: 4-Jun-2023
https://dl.acm.org/doi/10.1145/3555041.3589407
Grohe MKaminski BKatoen JLindner P(2022)Generative Datalog with Continuous DistributionsJournal of the ACM10.1145/355910269:6(1-52)Online publication date: 30-Aug-2022
https://dl.acm.org/doi/10.1145/3559102
Grohe MLindner P(2022)Independence in Infinite Probabilistic DatabasesJournal of the ACM10.1145/354952569:5(1-42)Online publication date: 10-Aug-2022
https://dl.acm.org/doi/10.1145/3549525
Yuan BJankov DZou JTang YBourgeois DJermaine C(2021)Tensor relational algebra for distributed machine learning system designProceedings of the VLDB Endowment10.14778/3457390.345739914:8(1338-1350)Online publication date: 1-Apr-2021
https://dl.acm.org/doi/10.14778/3457390.3457399
Carmeli NGrohe MLindner PStandke CLibkin LPichler RGuagliardo P(2021)Tuple-Independent Representations of Infinite Probabilistic DatabasesProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458315(388-401)Online publication date: 20-Jun-2021
https://dl.acm.org/doi/10.1145/3452021.3458315
Luo SJankov DYuan BJermaine CLi GLi ZIdreos SSrivastava D(2021)Automatic Optimization of Matrix Implementations for Distributed Machine Learning and Linear AlgebraProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457317(1222-1234)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3457317
Gao JXu YAgarwal PYang JLi GLi ZIdreos SSrivastava D(2021)Efficiently Answering Durability Prediction QueriesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457305(591-604)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3457305
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten