
research-article

Data Management in Machine Learning: Challenges, Techniques, and Systems

Authors:

Matthias Boehm,

Jun Yang

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 1717 - 1722

https://doi.org/10.1145/3035918.3054775

Published: 09 May 2017

Abstract

Large-scale data analytics using statistical machine learning (ML), popularly called advanced analytics, underpins many modern data-driven applications. The data management community has been working for over a decade on tackling data management-related challenges that arise in ML workloads, and has built several systems for advanced analytics. This tutorial provides a comprehensive review of such systems and analyzes key data management challenges and techniques. We focus on three complementary lines of work: (1) integrating ML algorithms and languages with existing data systems such as RDBMSs, (2) adapting data management-inspired techniques such as query optimization, partitioning, and compression to new systems that target ML workloads, and (3) combining data management and ML ideas to build systems that improve ML lifecycle-related tasks. Finally, we identify key open data management challenges for future research in this important area.



Index Terms

Data Management in Machine Learning: Challenges, Techniques, and Systems
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Data management systems


Information & Contributors

Information

Published In


SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

May 2017

1810 pages

ISBN: 9781450341974

DOI: 10.1145/3035918

General Chairs:
Rada Chirkova, North Carolina State University, USA
Jun Yang, Duke University, USA

Program Chair:
Dan Suciu, University of Washington, USA

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGMOD/PODS'17

Sponsor:

SIGMOD

SIGMOD/PODS'17: International Conference on Management of Data

May 14 - 19, 2017

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%


Bibliometrics

Total Citations: 89
Total Downloads: 3,801
Downloads (Last 12 months): 325
Downloads (Last 6 weeks): 23

Reflects downloads up to 02 Mar 2025


Citations

Cited By

He M, Qian Q, Liu X, Zhang J, Curry J (2024). Recent Progress on Surface Water Quality Models Utilizing Machine Learning Techniques. Water, 16:24 (3616). Online publication date: 15-Dec-2024.
https://doi.org/10.3390/w16243616
Hong W, Choi S, Oh S (2024). Parametric seasonal-trend autoregressive neural network for long-term crop price forecasting. PLOS ONE, 19:9 (e0311199). Online publication date: 26-Sep-2024.
https://doi.org/10.1371/journal.pone.0311199
Heo S, Na S (2024). Developing WasteSAM: A novel approach for accurate construction waste image segmentation to facilitate efficient recycling. Waste Management & Research: The Journal for a Sustainable Circular Economy. Online publication date: 18-Nov-2024.
https://doi.org/10.1177/0734242X241290743
Bachinger F, Ehrlinger L, Kronberger G, Wöss W (2024). Data Validation Utilizing Expert Knowledge and Shape Constraints. Journal of Data and Information Quality, 16:2 (1-27). Online publication date: 25-Jun-2024.
https://dl.acm.org/doi/10.1145/3661826
Zhang C, Farouk T (2024). Sharing Queries with Nonequivalent User-defined Aggregate Functions. ACM Transactions on Database Systems, 49:2 (1-46). Online publication date: 10-Apr-2024.
https://dl.acm.org/doi/10.1145/3649133
Perini M, Nikolic M (2024). In-Database Data Imputation. Proceedings of the ACM on Management of Data, 2:1 (1-27). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639326
Alves I, Leite L, Meirelles P, Kon F, Aguiar C (2024). Practices for Managing Machine Learning Products: A Multivocal Literature Review. IEEE Transactions on Engineering Management, 71 (7425-7455). Online publication date: 2024.
https://doi.org/10.1109/TEM.2023.3287759
Schelter S, Guha S, Grafberger S (2024). Automated Provenance-Based Screening of ML Data Preparation Pipelines. Datenbank-Spektrum. Online publication date: 30-Sep-2024.
https://doi.org/10.1007/s13222-024-00483-4
Jiang W, Banna V, Vivek N, Goel A, Synovic N, Thiruvathukal G, Davis J (2024). Challenges and practices of deep learning model reengineering: A case study on computer vision. Empirical Software Engineering, 29:6. Online publication date: 20-Aug-2024.
https://doi.org/10.1007/s10664-024-10521-0
Zhao Z, Chen Y, Bangash A, Adams B, Hassan A (2024). An empirical study of challenges in machine learning asset management. Empirical Software Engineering, 29:4. Online publication date: 15-Jun-2024.
https://doi.org/10.1007/s10664-024-10474-4
