research-article

ARDA: automatic relational data augmentation for machine learning

Published: 01 May 2020

Abstract

Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement.
We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that a predictive model trained on the augmented dataset achieves improved performance. Our system has two distinct components: (1) a framework to search for data and join it with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of the different system components and benchmark our feature selection algorithm on real-world datasets.
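The two components described above can be illustrated with a minimal sketch. It is not ARDA's implementation: the abstract does not specify the join or selection algorithms, so this example assumes a single-key left join and substitutes a simple correlation filter for the feature selection step. All names (`left_join`, `select_features`, `augment`, the `threshold` parameter) and the toy data are hypothetical.

```python
import math


def left_join(base_rows, other_rows, key):
    """Left-join two lists of dicts on `key`; unmatched rows get None."""
    lookup = {row[key]: row for row in other_rows}
    new_cols = sorted({c for row in other_rows for c in row if c != key})
    joined = []
    for row in base_rows:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in new_cols:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined, new_cols


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(rows, candidates, target, threshold=0.5):
    """Keep candidates whose |correlation| with the target meets the threshold.

    Assumption: a plain correlation filter stands in for the paper's
    feature selection algorithm, which the abstract does not describe.
    """
    kept = []
    for col in candidates:
        pairs = [(r[col], r[target]) for r in rows if r[col] is not None]
        if len(pairs) < 2:
            continue
        xs, ys = zip(*pairs)
        if abs(pearson(xs, ys)) >= threshold:
            kept.append(col)
    return kept


def augment(base_rows, repo_rows, key, target, threshold=0.5):
    """Join the repository table onto the base table, then prune weak features."""
    joined, new_cols = left_join(base_rows, repo_rows, key)
    kept = select_features(joined, new_cols, target, threshold)
    base_cols = list(base_rows[0].keys())
    return [{c: r[c] for c in base_cols + kept} for r in joined], kept


# Toy input dataset and a one-table "repository" (both invented for illustration).
base = [{"id": 1, "price": 10.0}, {"id": 2, "price": 20.0},
        {"id": 3, "price": 30.0}, {"id": 4, "price": 40.0}]
repo = [{"id": 1, "area": 1.0, "noise": 5.0}, {"id": 2, "area": 2.1, "noise": 1.0},
        {"id": 3, "area": 2.9, "noise": 4.0}, {"id": 4, "area": 4.0, "noise": 2.0}]

augmented, kept = augment(base, repo, key="id", target="price")
print(kept)  # prints ['area']
```

In this toy run the `area` column tracks `price` closely and survives the filter, while `noise` falls below the threshold and is pruned. A real system would have to search many candidate tables and handle non-numeric and one-to-many joins, which this sketch ignores.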

Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 9
May 2020
295 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
