research-article

AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications

Authors:

Qiang Yang

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 1936 - 1945

https://doi.org/10.1145/3292500.3330679

Published: 25 July 2019

Abstract

Feature crossing captures interactions among categorical features and is useful for enhancing learning from tabular data in real-world businesses. In this paper, we present AutoCross, an automatic feature crossing tool provided by 4Paradigm to its customers, ranging from banks and hospitals to Internet corporations. By performing beam search in a tree-structured space, AutoCross enables efficient generation of high-order cross features, which existing works have not yet explored. Additionally, we propose successive mini-batch gradient descent and multi-granularity discretization to further improve efficiency and effectiveness, while ensuring simplicity so that no machine learning expertise or tedious hyper-parameter tuning is required. Furthermore, the algorithms are designed to reduce the computational, transmission, and storage costs involved in distributed computing. Experimental results on both benchmark and real-world business datasets demonstrate the effectiveness and efficiency of AutoCross. The results show that AutoCross can significantly enhance the performance of both linear and deep models.
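As an illustration only, the ideas named in the abstract can be sketched in Python. This is not the AutoCross implementation: the function names (`discretize`, `cross_values`, `majority_accuracy`, `greedy_cross_search`), the equal-width binning, the training-set scoring rule, and the beam width of 1 are all assumptions made for this example. Each search step considers every pairwise cross of features already in the set and keeps the single best candidate, so repeated steps can build crosses of order three and higher.

```python
from collections import Counter, defaultdict
from itertools import combinations


def discretize(values, granularities=(10, 100)):
    """Bin one numeric column at several granularities (equal-width bins are an assumption)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return {
        g: [min(int((v - lo) / span * g), g - 1) for v in values]
        for g in granularities
    }


def cross_values(base, names):
    """Cross feature over the named base columns: one tuple of values per row."""
    return list(zip(*(base[n] for n in sorted(names))))


def majority_accuracy(feature, labels):
    """Hypothetical score: training accuracy of predicting the majority label per value."""
    groups = defaultdict(Counter)
    for value, label in zip(feature, labels):
        groups[value][label] += 1
    return sum(max(c.values()) for c in groups.values()) / len(labels)


def greedy_cross_search(base, labels, steps=2):
    """Beam search with beam width 1 over feature sets.

    Each feature is a frozenset of base column names. A child of the current
    feature set adds one pairwise cross (the union of two existing features).
    """
    features = {frozenset([name]) for name in base}
    added = []
    for _ in range(steps):
        candidates = {a | b for a, b in combinations(features, 2)} - features
        if not candidates:
            break
        scores = {
            c: majority_accuracy(cross_values(base, c), labels) for c in candidates
        }
        # Sort first so that ties are broken deterministically by column names.
        best = max(sorted(candidates, key=sorted), key=scores.get)
        features.add(best)
        added.append((tuple(sorted(best)), scores[best]))
    return added
```

Calling `greedy_cross_search(base, labels)` with `base` as a dict of equal-length categorical columns returns the crosses in the order they were added, each with its score. Scoring on the training rows favors high-cardinality crosses, so a practical version would more likely score each candidate with a model evaluated on held-out data.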

Index Terms

AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications
1. Computing methodologies
  1. Machine learning

Information & Contributors

Information

Published In

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2019

3305 pages

ISBN:9781450362016

DOI:10.1145/3292500

General Chairs: Ankur Teredesai (KenSci), Vipin Kumar (University of Minnesota)

Program Chairs: Ying Li (EV Analysis Corporation), Rómer Rosales (LinkedIn), Evimaria Terzi (Boston University), George Karypis (University of Minnesota)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD '19

KDD '19: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 4 - 8, 2019

Anchorage, AK, USA

Acceptance Rates

KDD '19 Paper Acceptance Rate 110 of 1,200 submissions, 9%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 61
Total Downloads: 1,147
Downloads (Last 12 months): 84
Downloads (Last 6 weeks): 4

Reflects downloads up to 12 Dec 2024

Citations

Cited By

董今 (2024). Cross Feature Engineering for Anti-Fraud Task in Insurance. Artificial Intelligence and Robotics Research 13:02 (467-477). Online publication date: 2024.
https://doi.org/10.12677/AIRR.2024.132048
Zhou Q, Zhang P, Gu H, Lu T, Gu N (2024). Exploring Cross-Site User Modeling without Cross-Site User Identity Linkage: A Case Study of Content Preference Prediction. ACM Transactions on Information Systems 43:1 (1-28). Online publication date: 1-Oct-2024.
https://dl.acm.org/doi/10.1145/3697832
Du Z, Wu C, Jia Q, Zhu J, Chen X (2024). A Tutorial on Feature Interpretation in Recommender Systems. Proceedings of the 18th ACM Conference on Recommender Systems (1281-1282). Online publication date: 8-Oct-2024.
https://dl.acm.org/doi/10.1145/3640457.3687094
Weng Y, Tang X, Xu Z, Lyu F, Liu D, Sun Z, He X, Serra E, Spezzano F (2024). OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (2523-2533). Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679712
Qu Y, Qu L, Chen T, Zhao X, Nguyen Q, Yin H, Serra E, Spezzano F (2024). Scalable Dynamic Embedding Size Search for Streaming Recommendation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (1941-1950). Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679638
Qu Y, Chen T, Nguyen Q, Yin H, Angélica L, Lattanzi S, Muñoz Medina A, Akoglu L, Gionis A, Vassilvitskii S (2024). Budgeted Embedding Table For Recommender Systems. Proceedings of the 17th ACM International Conference on Web Search and Data Mining (557-566). Online publication date: 4-Mar-2024.
https://dl.acm.org/doi/10.1145/3616855.3635778
Du Z, Chen J, Jia Q, Wu C, Zhu J, Dong Z, Tang R, Chua T, Ngo C, Kumar R, Lauw H, Ka-Wei Lee R (2024). LightCS: Selecting Quadratic Feature Crosses in Linear Complexity. Companion Proceedings of the ACM Web Conference 2024 (38-46). Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589335.3648300
Ye C, Lu G, Wang H, Li L, Wu S, Chen G, Zhao J, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H (2024). Towards Cross-Table Masked Pretraining for Web Data Mining. Proceedings of the ACM Web Conference 2024 (4449-4459). Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645707
Yao Y, Liu B, He H, Sheng D, Wang K, Xiao L, Cao H (2024). I-Razor: A Differentiable Neural Input Razor for Feature Selection and Dimension Search in DNN-Based Recommender Systems. IEEE Transactions on Knowledge and Data Engineering 36:9 (4736-4749). Online publication date: Sep-2024.
https://doi.org/10.1109/TKDE.2023.3332671
Qi D, Zheng W, Wang J (2024). FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (1805-1818). Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00146
