research-article

ARDA: automatic relational data augmentation for machine learning

Authors:

Nadiia Chepurko,

Emanuel Zgraggen,

Raul Castro Fernandez,

David Karger

Proceedings of the VLDB Endowment, Volume 13, Issue 9

Pages 1373 - 1387

https://doi.org/10.14778/3397230.3397235

Published: 01 May 2020

Abstract

Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement.

We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.
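The two components named in the abstract can be illustrated with a minimal sketch. This is not ARDA's implementation: the abstract does not specify the search, join, or selection algorithms, so the code below assumes a single candidate table, an exact-match key, and a plain Pearson-correlation filter standing in for the paper's feature selection. The names `left_join`, `pearson`, `select_features`, the `threshold` parameter, and the toy `zip`/`price`/`income`/`noise` columns are all hypothetical.

```python
from math import sqrt


def left_join(base, extra, key):
    """Left-join `extra` onto `base` by `key`.

    Rows of `base` with no match get None for every new column.
    Returns the joined rows and the list of newly added column names.
    """
    index = {row[key]: row for row in extra}
    new_cols = [c for c in extra[0] if c != key] if extra else []
    joined = []
    for row in base:
        match = index.get(row[key], {})
        merged = dict(row)
        for col in new_cols:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined, new_cols


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_features(rows, candidates, target, threshold=0.5):
    """Keep candidate columns whose |correlation| with the target meets the threshold.

    A simple filter used here only as a stand-in for a real feature selector.
    """
    kept = []
    for col in candidates:
        pairs = [(r[col], r[target]) for r in rows if r[col] is not None]
        if len(pairs) < 2:
            continue  # too few matched rows to estimate a correlation
        xs, ys = zip(*pairs)
        if abs(pearson(xs, ys)) >= threshold:
            kept.append(col)
    return kept


# Toy input dataset (target: price) and one candidate table from a repository.
base = [
    {"zip": 1, "price": 10.0},
    {"zip": 2, "price": 20.0},
    {"zip": 3, "price": 30.0},
    {"zip": 4, "price": 40.0},
]
extra = [
    {"zip": 1, "income": 1.0, "noise": 5.0},
    {"zip": 2, "income": 2.1, "noise": 1.0},
    {"zip": 3, "income": 2.9, "noise": 1.0},
    {"zip": 4, "income": 4.2, "noise": 5.0},
]

joined, new_cols = left_join(base, extra, "zip")
kept = select_features(joined, new_cols, "price")
print(kept)
```

In this toy data the `income` column tracks `price` closely while `noise` is uncorrelated with it, so only `income` survives the filter. A left join is used so that the input dataset keeps all of its rows even when the candidate table has no match for a key.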

Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 9

May 2020

295 pages

ISSN: 2150-8097

Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)

Publisher

VLDB Endowment

Cited By

Yan M, Fan W, Wang Y, Xie M (2024). Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment 17:11 (3109-3123). Online publication date: 30-Aug-2024.
https://doi.org/10.14778/3681954.3681987
Deng Y, Chai C, Cao L, Yuan Q, Chen S, Yu Y, Sun Z, Wang J, Li J, Cao Z, Jin K, Zhang C, Jiang Y, Zhang Y, Wang Y, Yuan Y, Wang G, Tang N (2024). LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes. Proceedings of the VLDB Endowment 17:8 (1925-1938). Online publication date: 1-Apr-2024.
https://dl.acm.org/doi/10.14778/3659437.3659448
Wang T, Huang S, Bao Z, Culpepper J, Dedeoglu V, Arablouei R (2024). Optimizing Data Acquisition to Enhance Machine Learning Performance. Proceedings of the VLDB Endowment 17:6 (1310-1323). Online publication date: 3-May-2024.
https://doi.org/10.14778/3648160.3648172
Chen K, Koudas N (2024). Unstructured Data Fusion for Schema and Data Extraction. Proceedings of the ACM on Management of Data 2:3 (1-26). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654984
Li Y, Yu X, Koudas N (2024). Data Acquisition for Improving Model Confidence. Proceedings of the ACM on Management of Data 2:3 (1-25). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654934
Gan Q, Wang M, Wipf D, Faloutsos C, Baeza-Yates R, Bonchi F (2024). Graph Machine Learning Meets Multi-Table Relational Data. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (6502-6512). Online publication date: 25-Aug-2024.
https://dl.acm.org/doi/10.1145/3637528.3671471
Ionescu A, Mouw Z, Aivaloglou E, Hai R, Katsifodimos A, Serra E, Spezzano F (2024). Human-in-the-Loop Feature Discovery for Tabular Data. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (5215-5219). Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679211
Leventidis A, Christensen M, Lissandrini M, Di Rocco L, Hose K, Miller R, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). A Large Scale Test Corpus for Semantic Table Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1142-1151). Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657877
Chang J, Cui B, Nargesian F, Asudeh A, Jagadish H (2024). Data distribution tailoring revisited: cost-efficient integration of representative data. The VLDB Journal 33:5 (1283-1306). Online publication date: 1-Sep-2024.
https://dl.acm.org/doi/10.1007/s00778-024-00849-w
Naik A, Thakkar A, Stein A, Alur R, Naik M (2023). Relational Query Synthesis ⋈ Decision Tree Learning. Proceedings of the VLDB Endowment 17:2 (250-263). Online publication date: 1-Oct-2023.
https://dl.acm.org/doi/10.14778/3626292.3626306
