[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

The MADlib analytics library: or MAD skills, the SQL

Published: 01 August 2012 Publication History

Abstract

MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind.
In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open-source nature. We provide an overview of the library's architecture and design patterns, and provide a description of various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals.
MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods, and ports to additional database platforms.

References

[1]
D. Aloise, A. Deshpande, P. Hansen, et al. NP-hardness of euclidean sum-of-squares clustering. Machine Learning, 75(2):245--248, 2009.
[2]
E. Anderson, Z. Bai, C. Bischof, et al. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, third edition, 1999.
[3]
Apache Mahout. http://mahout.apache.org/.
[4]
D. Arthur, B. Manthey, and H. Roglin. k-means has polynomial smoothed complexity. In FOCS, pages 405--414, 2009.
[5]
D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In SODA, pages 1027--1035, 2007.
[6]
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
[7]
V. Borkar, M. Carey, R. Grover, et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pages 1151--1162, 2011.
[8]
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9]
J. Choi, J. Demmel, I. Dhillon, et al. ScaLAPACK: A portable linear algebra library for distributed memory computers -- design issues and performance. Computer Physics Communications, 97(1):1--15, 1996.
[10]
C.-T. Chu, S. K. Kim, Y.-A. Lin, et al. Map-reduce for machine learning on multicore. In NIPS, pages 281--288, 2006.
[11]
J. Cohen, B. Dolan, M. Dunlap, et al. MAD Skills: New analysis practices for big data. PVLDB, 2(2):1481--1492, 2009.
[12]
R. Feldman and J. Sanger. The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, 2007.
[13]
X. Feng, A. Kumar, B. Recht, et al. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, pages 325--336, 2012.
[14]
G. Forney Jr. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268--278, 1973.
[15]
A. Ghoting, R. Krishnamurthy, E. Pednault, et al. SystemML: Declarative machine learning on MapReduce. In ICDE, pages 231--242, 2011.
[16]
L. Gravano, P. Ipeirotis, H. Jagadish, et al. Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28--34, 2001.
[17]
G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.
[18]
D. Jurafsky and M. J. H. Speech and Language Processing. Pearson Prentice Hall, 2008.
[19]
J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282--289, 2001.
[20]
J. Langford. http://hunch.net/~vw/.
[21]
S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129--137, 1982. Technical Report appeared much earlier in: Bell Telephone Laboratories Paper (1957).
[22]
Y. Low, J. Gonzalez, A. Kyrola, et al. GraphLab: A new framework for parallel machine learning. In UAI, pages 340--349, 2010.
[23]
M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. WALCOM: Algorithms and Computation, pages 274--285, 2009.
[24]
G. Malewicz, M. H. Austern, A. J. Bik, et al. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135--146, 2010.
[25]
G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, Mar. 2001.
[26]
A. Nedic and D. P. Bertsekas. Convergence rate of incremental subgradient algorithms. In S. Uryasev and P. M. Pardalos, editors, Stochastic Optimization: Algorithms and Applications, pages 263--304. Kluwer Academic Publishers, 2000.
[27]
N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI, pages 89--100, 2007.
[28]
Oracle R Enterprise. http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html.
[29]
C. Ordonez. Integrating k-means clustering with a relational DBMS using SQL. TKDE, 18(2):188--201, 2006.
[30]
C. Ordonez. Statistical model computation with UDFs. TKDE, 22(12):1752--1765, 2010.
[31]
C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In SIGMOD, pages 559--570, 2000.
[32]
A. Pavlo, E. Paulson, A. Rasin, et al. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165--178. ACM, 2009.
[33]
Revloution Analytics. http://www.revolutionanalytics.com/.
[34]
B. Ripley. The R project in statistical computing. MSOR Connections, 1(1):23--25, 2001.
[35]
H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400--407, 1951.
[36]
R. T. Rockafellar. Convex Analysis (Princeton Landmarks in Mathematics and Physics). Princeton University Press, 1996.
[37]
C. Sanderson. Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report, NICTA, 2010.
[38]
M. Stonebraker, P. Brown, A. Poliakov, et al. The architecture of SciDB. In SSDBM, pages 1--16, 2011.
[39]
The PostgreSQL Global Development Group. PostgreSQL 9.1.4 Documentation, 2011.
[40]
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267--288, 1994.
[41]
L. Tierney, A. J. Rossini, and N. Li. Snow: a parallel computing framework for the r system. IJPP, 37(1):78--90, Feb. 2009.
[42]
H. M. Wallach. Conditional random fields: An introduction. Technical report, Dept. of CIS, Univ. of Pennsylvania, 2004.
[43]
D. Wang, M. Franklin, M. Garofalakis, et al. Hybrid in-database inference for declarative information extraction. In SIGMOD, pages 517--528, 2011.
[44]
D. Z. Wang, M. J. Franklin, M. N. Garofalakis, et al. Querying probabilistic information extraction. PVLDB, 3(1):1057--1067, 2010.
[45]
M. Weimer, T. Condie, R. Ramakrishnan, et al. Machine learning in ScalOps, a higher order cloud computing language. In NIPS Workshop on Parallel and Large-Scale Machine Learning (BigLearn), pages 389--396, 2011.
[46]
M. Zaharia, M. Chowdhury, T. Das, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. Technical Report UCB/EECS-2011-82, EECS Department, University of California, Berkeley, Jul 2011.
[47]
M. Zinkevich, M. Weimer, A. Smola, et al. Parallelized stochastic gradient descent. NIPS, 23(23):1--9, 2010.

Cited By

View all
  • (2024)Database Native Model Selection: Harnessing Deep Neural Networks in Database SystemsProceedings of the VLDB Endowment10.14778/3641204.364121217:5(1020-1033)Online publication date: 1-Jan-2024
  • (2024)In-Database Data ImputationProceedings of the ACM on Management of Data10.1145/36393262:1(1-27)Online publication date: 26-Mar-2024
  • (2024)PreVision: An Out-of-Core Matrix Computation System with Optimal Buffer ReplacementProceedings of the ACM on Management of Data10.1145/36392972:1(1-25)Online publication date: 26-Mar-2024
  • Show More Cited By
  1. The MADlib analytics library: or MAD skills, the SQL

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 12
    August 2012
    340 pages

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 August 2012
    Published in PVLDB Volume 5, Issue 12

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)39
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 14 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Database Native Model Selection: Harnessing Deep Neural Networks in Database SystemsProceedings of the VLDB Endowment10.14778/3641204.364121217:5(1020-1033)Online publication date: 1-Jan-2024
    • (2024)In-Database Data ImputationProceedings of the ACM on Management of Data10.1145/36393262:1(1-27)Online publication date: 26-Mar-2024
    • (2024)PreVision: An Out-of-Core Matrix Computation System with Optimal Buffer ReplacementProceedings of the ACM on Management of Data10.1145/36392972:1(1-25)Online publication date: 26-Mar-2024
    • (2024)Determining Exact Quantiles with Randomized SummariesProceedings of the ACM on Management of Data10.1145/36392802:1(1-26)Online publication date: 26-Mar-2024
    • (2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 1-Sep-2024
    • (2024)F-IVM: analytics over relational databases under updatesThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00817-w33:4(903-929)Online publication date: 1-Jul-2024
    • (2023)SmartLite: A DBMS-Based Serving System for DNN Inference in Resource-Constrained EnvironmentsProceedings of the VLDB Endowment10.14778/3632093.363209517:3(278-291)Online publication date: 1-Nov-2023
    • (2023)To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling TasksProceedings of the VLDB Endowment10.14778/3611540.361161016:12(4018-4021)Online publication date: 1-Aug-2023
    • (2023)AQUA: Automatic Collaborative Query Processing in Analytical DatabaseProceedings of the VLDB Endowment10.14778/3611540.361160716:12(4006-4009)Online publication date: 1-Aug-2023
    • (2022)Coresets over multiple tables for feature-rich and data-efficient machine learningProceedings of the VLDB Endowment10.14778/3561261.356126716:1(64-76)Online publication date: 1-Sep-2022
    • Show More Cited By

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media