Research article
DOI: 10.1145/3035918.3054775

Data Management in Machine Learning: Challenges, Techniques, and Systems

Published: 09 May 2017

Abstract

Large-scale data analytics using statistical machine learning (ML), popularly called advanced analytics, underpins many modern data-driven applications. The data management community has been working for over a decade on tackling data management-related challenges that arise in ML workloads, and has built several systems for advanced analytics. This tutorial provides a comprehensive review of such systems and analyzes key data management challenges and techniques. We focus on three complementary lines of work: (1) integrating ML algorithms and languages with existing data systems such as RDBMSs, (2) adapting data management-inspired techniques such as query optimization, partitioning, and compression to new systems that target ML workloads, and (3) combining data management and ML ideas to build systems that improve ML lifecycle-related tasks. Finally, we identify key open data management challenges for future research in this important area.

Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN: 9781450341974
DOI: 10.1145/3035918
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. data management
2. machine learning

Conference

SIGMOD/PODS'17

Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions (20%)
