More Web Proxy on the site http://driver.im/

research-article

The MADlib analytics library: or MAD skills, the SQL

Authors:

Joseph M. Hellerstein,

Christoper Ré,

Florian Schoppmann,

Daisy Zhe Wang,

Eugene Fratkin,

Aleksander Gorajek,

Arun KumarAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 5, Issue 12

Pages 1700 - 1711

https://doi.org/10.14778/2367502.2367510

Published: 01 August 2012 Publication History

Abstract

MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind.

In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open-source nature. We provide an overview of the library's architecture and design patterns, and provide a description of various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals.

MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods, and ports to additional database platforms.

References

[1]

D. Aloise, A. Deshpande, P. Hansen, et al. NP-hardness of euclidean sum-of-squares clustering. Machine Learning, 75(2):245--248, 2009.

[2]

E. Anderson, Z. Bai, C. Bischof, et al. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, third edition, 1999.

[3]

Apache Mahout. http://mahout.apache.org/.

[4]

D. Arthur, B. Manthey, and H. Roglin. k-means has polynomial smoothed complexity. In FOCS, pages 405--414, 2009.

[5]

D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In SODA, pages 1027--1035, 2007.

[6]

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.

[7]

V. Borkar, M. Carey, R. Grover, et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pages 1151--1162, 2011.

[8]

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[9]

J. Choi, J. Demmel, I. Dhillon, et al. ScaLAPACK: A portable linear algebra library for distributed memory computers -- design issues and performance. Computer Physics Communications, 97(1):1--15, 1996.

[10]

C.-T. Chu, S. K. Kim, Y.-A. Lin, et al. Map-reduce for machine learning on multicore. In NIPS, pages 281--288, 2006.

[11]

J. Cohen, B. Dolan, M. Dunlap, et al. MAD Skills: New analysis practices for big data. PVLDB, 2(2):1481--1492, 2009.

[12]

R. Feldman and J. Sanger. The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, 2007.

[13]

X. Feng, A. Kumar, B. Recht, et al. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, pages 325--336, 2012.

[14]

G. Forney Jr. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268--278, 1973.

[15]

A. Ghoting, R. Krishnamurthy, E. Pednault, et al. SystemML: Declarative machine learning on MapReduce. In ICDE, pages 231--242, 2011.

[16]

L. Gravano, P. Ipeirotis, H. Jagadish, et al. Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28--34, 2001.

[17]

G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

[18]

D. Jurafsky and M. J. H. Speech and Language Processing. Pearson Prentice Hall, 2008.

[19]

J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282--289, 2001.

[20]

J. Langford. http://hunch.net/~vw/.

[21]

S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129--137, 1982. Technical Report appeared much earlier in: Bell Telephone Laboratories Paper (1957).

[22]

Y. Low, J. Gonzalez, A. Kyrola, et al. GraphLab: A new framework for parallel machine learning. In UAI, pages 340--349, 2010.

[23]

M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. WALCOM: Algorithms and Computation, pages 274--285, 2009.

[24]

G. Malewicz, M. H. Austern, A. J. Bik, et al. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135--146, 2010.

[25]

G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, Mar. 2001.

[26]

A. Nedic and D. P. Bertsekas. Convergence rate of incremental subgradient algorithms. In S. Uryasev and P. M. Pardalos, editors, Stochastic Optimization: Algorithms and Applications, pages 263--304. Kluwer Academic Publishers, 2000.

[27]

N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI, pages 89--100, 2007.

[28]

Oracle R Enterprise. http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html.

[29]

C. Ordonez. Integrating k-means clustering with a relational DBMS using SQL. TKDE, 18(2):188--201, 2006.

[30]

C. Ordonez. Statistical model computation with UDFs. TKDE, 22(12):1752--1765, 2010.

[31]

C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In SIGMOD, pages 559--570, 2000.

[32]

A. Pavlo, E. Paulson, A. Rasin, et al. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165--178. ACM, 2009.

[33]

Revloution Analytics. http://www.revolutionanalytics.com/.

[34]

B. Ripley. The R project in statistical computing. MSOR Connections, 1(1):23--25, 2001.

[35]

H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400--407, 1951.

[36]

R. T. Rockafellar. Convex Analysis (Princeton Landmarks in Mathematics and Physics). Princeton University Press, 1996.

[37]

C. Sanderson. Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report, NICTA, 2010.

[38]

M. Stonebraker, P. Brown, A. Poliakov, et al. The architecture of SciDB. In SSDBM, pages 1--16, 2011.

[39]

The PostgreSQL Global Development Group. PostgreSQL 9.1.4 Documentation, 2011.

[40]

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267--288, 1994.

[41]

L. Tierney, A. J. Rossini, and N. Li. Snow: a parallel computing framework for the r system. IJPP, 37(1):78--90, Feb. 2009.

[42]

H. M. Wallach. Conditional random fields: An introduction. Technical report, Dept. of CIS, Univ. of Pennsylvania, 2004.

[43]

D. Wang, M. Franklin, M. Garofalakis, et al. Hybrid in-database inference for declarative information extraction. In SIGMOD, pages 517--528, 2011.

[44]

D. Z. Wang, M. J. Franklin, M. N. Garofalakis, et al. Querying probabilistic information extraction. PVLDB, 3(1):1057--1067, 2010.

[45]

M. Weimer, T. Condie, R. Ramakrishnan, et al. Machine learning in ScalOps, a higher order cloud computing language. In NIPS Workshop on Parallel and Large-Scale Machine Learning (BigLearn), pages 389--396, 2011.

[46]

M. Zaharia, M. Chowdhury, T. Das, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. Technical Report UCB/EECS-2011-82, EECS Department, University of California, Berkeley, Jul 2011.

[47]

M. Zinkevich, M. Weimer, A. Smola, et al. Parallelized stochastic gradient descent. NIPS, 23(23):1--9, 2010.

Cited By

Xing NCai SChen GLuo ZOoi BPei J(2024)Database Native Model Selection: Harnessing Deep Neural Networks in Database SystemsProceedings of the VLDB Endowment10.14778/3641204.364121217:5(1020-1033)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.14778/3641204.3641212
Perini MNikolic M(2024)In-Database Data ImputationProceedings of the ACM on Management of Data10.1145/36393262:1(1-27)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639326
Koo KKim SKim WChoi YHan JKim BMoon B(2024)PreVision: An Out-of-Core Matrix Computation System with Optimal Buffer ReplacementProceedings of the ACM on Management of Data10.1145/36392972:1(1-25)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639297
Show More Cited By

The MADlib analytics library: or MAD skills, the SQL
1. Information systems
  1. Data management systems

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 5, Issue 12

August 2012

340 pages

ISSN:2150-8097

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 August 2012

Published in PVLDB Volume 5, Issue 12

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

167
Total Citations
View Citations
900
Total Downloads

Downloads (Last 12 months)39
Downloads (Last 6 weeks)1

Reflects downloads up to 14 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Xing NCai SChen GLuo ZOoi BPei J(2024)Database Native Model Selection: Harnessing Deep Neural Networks in Database SystemsProceedings of the VLDB Endowment10.14778/3641204.364121217:5(1020-1033)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.14778/3641204.3641212
Perini MNikolic M(2024)In-Database Data ImputationProceedings of the ACM on Management of Data10.1145/36393262:1(1-27)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639326
Koo KKim SKim WChoi YHan JKim BMoon B(2024)PreVision: An Out-of-Core Matrix Computation System with Optimal Buffer ReplacementProceedings of the ACM on Management of Data10.1145/36392972:1(1-25)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639297
Chen ZGuan HSong SHuang XWang CWang J(2024)Determining Exact Quantiles with Randomized SummariesProceedings of the ACM on Management of Data10.1145/36392802:1(1-26)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639280
Xu LQiu SYuan BJiang JRenggli CGan SKara KLi GLiu JWu WYe JZhang C(2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1007/s00778-024-00845-0
Kara ANikolic MOlteanu DZhang H(2024)F-IVM: analytics over relational databases under updatesThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00817-w33:4(903-929)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1007/s00778-023-00817-w
Lin QWu SZhao JDai JShi MChen GLi F(2023)SmartLite: A DBMS-Based Serving System for DNN Inference in Resource-Constrained EnvironmentsProceedings of the VLDB Endowment10.14778/3632093.363209517:3(278-291)Online publication date: 1-Nov-2023
https://dl.acm.org/doi/10.14778/3632093.3632095
Schäfer NGjurovski DDavitkova AMichel S(2023)To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling TasksProceedings of the VLDB Endowment10.14778/3611540.361161016:12(4018-4021)Online publication date: 1-Aug-2023
https://dl.acm.org/doi/10.14778/3611540.3611610
Peng YChen KShou LJiang DChen G(2023)AQUA: Automatic Collaborative Query Processing in Analytical DatabaseProceedings of the VLDB Endowment10.14778/3611540.361160716:12(4006-4009)Online publication date: 1-Aug-2023
https://dl.acm.org/doi/10.14778/3611540.3611607
Wang JChai CTang NLiu JLi G(2022)Coresets over multiple tables for feature-rich and data-efficient machine learningProceedings of the VLDB Endowment10.14778/3561261.356126716:1(64-76)Online publication date: 1-Sep-2022
https://dl.acm.org/doi/10.14778/3561261.3561267
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents