Abstract
We develop a novel framework that adds the regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad, and AdaHessian, and thereby create a new class of optimizers, named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad, Group AdaHessian, and so on. We establish convergence guarantees in the stochastic convex setting, based on primal-dual methods. We evaluate the regularizing effect of our new optimizers on three large-scale real-world ad-click datasets with state-of-the-art deep learning models. The experimental results show that, compared with the original optimizers combined with a magnitude-pruning post-processing step, our methods significantly improve model performance at the same sparsity level. Furthermore, compared with the cases without magnitude pruning, our methods achieve extremely high sparsity with significantly better or highly competitive performance.
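To make the idea concrete, the following is a minimal sketch, not the authors' released code, of how a sparse-group-lasso proximal step could be composed with a standard Adam update to obtain a "Group Adam"-style optimizer. It assumes each group is a single parameter vector (for example, one feature's embedding) and uses a simple proximal composition, whereas the paper derives its updates from primal-dual methods; the function names, hyperparameters, and regularization weights below are illustrative only.

```python
import numpy as np

def prox_sparse_group_lasso(w, lr, lam1, lam2):
    """Proximal step for the sparse group lasso penalty
    lam1 * ||w||_1 + lam2 * sqrt(d) * ||w||_2 on one group of size d.
    Illustrative sketch, not the paper's exact update rule."""
    d = w.size
    # Element-wise soft-thresholding (the l1 part).
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam1, 0.0)
    # Group shrinkage (the l2 part): zero the whole group if its norm
    # falls below the threshold, otherwise shrink it toward zero.
    norm = np.linalg.norm(w)
    thresh = lr * lam2 * np.sqrt(d)
    if norm <= thresh:
        return np.zeros_like(w)
    return (1.0 - thresh / norm) * w

def group_adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
                    eps=1e-8, lam1=1e-5, lam2=1e-5):
    """One hypothetical 'Group Adam' step: a standard Adam update on the
    gradient g, followed by the sparse-group-lasso proximal operator."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = prox_sparse_group_lasso(w, lr, lam1, lam2)
    return w, m, v
```

The group shrinkage step is what allows an entire embedding vector to be driven exactly to zero during training, which is the mechanism behind the high sparsity levels reported in the abstract.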
Notes
1. The code will be released if the paper is accepted.
2. We only use the data from seasons 2 and 3 because they share the same data schema.
3. See https://github.com/Atomu2014/Ads-RecSys-Datasets/ for details.
4. Due to limited training resources, we do not use the optimal hyperparameter settings of [23].
References
Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: Keeton, K., Roscoe, T. (eds.) 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, 2–4 November 2016, pp. 265–283. USENIX Association (2016). https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
Avazu: Avazu click-through rate prediction (2015). https://www.kaggle.com/c/avazu-ctr-prediction/data
Criteo: Criteo display ad challenge (2014). http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). https://doi.org/10.5555/1953048.2021068
Graepel, T., Candela, J.Q., Borchert, T., Herbrich, R.: Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In: Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on Machine Learning, ICML 2010, Haifa, Israel, 21–24 June 2010, pp. 13–20. Omnipress (2010). https://icml.cc/Conferences/2010/papers/901.pdf
Gupta, V., Koren, T., Singer, Y.: Shampoo: preconditioned stochastic tensor optimization. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018, vol. 80, pp. 1837–1845. PMLR (2018). http://proceedings.mlr.press/v80/gupta18a.html
Liao, H., Peng, L., Liu, Z., Shen, X.: IPinYou global RTB bidding algorithm competition (2013). https://www.kaggle.com/lastsummer/ipinyou
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA (2015)
Littlestone, N.: From on-line to batch learning. In: Rivest, R.L., Haussler, D., Warmuth, M.K. (eds.) Proceedings of the 2nd Annual Workshop on Computational Learning Theory, COLT 1989, Santa Cruz, CA, USA, 31 July–2 August 1989, pp. 269–284. Morgan Kaufmann (1989). http://dl.acm.org/citation.cfm?id=93365
McMahan, H.B.: Follow-the-regularized-leader and mirror descent: equivalence theorems and L1 regularization. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, FL, USA, vol. 15, pp. 525–533. PMLR (2011)
McMahan, H.B., et al.: Ad click prediction: a view from the trenches. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, Illinois, USA, pp. 1222–1230. ACM (2013)
McMahan, H.B., Streeter, M.J.: Adaptive bound optimization for online convex optimization. In: The 23rd Conference on Learning Theory, COLT 2010, Haifa, Israel, 27–29 June 2010, pp. 244–256. Omnipress (2010). http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf#page=252
Naumov, M., et al.: Deep learning recommendation model for personalization and recommendation systems. CoRR abs/1906.00091 (2019). http://arxiv.org/abs/1906.00091
Nesterov, Y.E.: Smooth minimization of non-smooth functions. Math. Program. 103, 127–152 (2005)
Nesterov, Y.E.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009). https://doi.org/10.1007/s10107-007-0149-x
Ni, X., et al.: Feature selection for Facebook feed ranking system via a group-sparsity-regularized training algorithm. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, pp. 2085–2088. ACM (2019)
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964). https://doi.org/10.1016/0041-5553(64)90137-5
Qu, Y., et al.: Product-based neural networks for user response prediction. In: Bonchi, F., Domingo-Ferrer, J., Baeza-Yates, R., Zhou, Z., Wu, X. (eds.) IEEE 16th International Conference on Data Mining, ICDM 2016, Barcelona, Spain, 12–15 December 2016, pp. 1149–1154. IEEE Computer Society (2016). https://doi.org/10.1109/ICDM.2016.0151
Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada. OpenReview.net (2018)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Statist. 22(3), 400–407 (1951)
Rockafellar, R.T.: Convex Analysis (Princeton Landmarks in Mathematics and Physics). Princeton University Press (1970)
Scardapane, S., Comminiello, D., Hussain, A., Uncini, A.: Group sparse regularization for deep neural networks. Neurocomputing 241, 43–52 (2017). https://doi.org/10.1016/j.neucom.2017.02.029
Wang, R., Fu, B., Fu, G., Wang, M.: Deep & cross network for ad click predictions. In: Proceedings of the ADKDD 2017, Halifax, NS, Canada, 13–17 August 2017, pp. 12:1–12:7. ACM (2017). https://doi.org/10.1145/3124749.3124754
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010). https://doi.org/10.5555/1756006.1953017
Yang, H., Xu, Z., King, I., Lyu, M.R.: Online learning for group lasso. In: Proceedings of the 27th International Conference on Machine Learning, ICML 2010, Haifa, Israel, pp. 1191–1198. Omnipress (2010)
Yao, Z., Gholami, A., Shen, S., Keutzer, K., Mahoney, M.W.: ADAHESSIAN: an adaptive second order optimizer for machine learning. CoRR abs/2006.00719 (2020). https://arxiv.org/abs/2006.00719
Zeiler, M.D.: ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701 (2012). https://arxiv.org/abs/1212.5701
Zhu, M., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model compression. In: 6th International Conference on Learning Representations, ICLR 2018, Workshop Track Proceedings Vancouver, BC, Canada, 30 April–3 May 2018. OpenReview.net (2018). https://openreview.net/forum?id=Sy1iIDkPM
Appendix. https://github.com/yadandan/adaptive_optimizers_with_sparse_group_lasso/blob/master/appendix.pdf
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Yue, Y. et al. (2021). Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_19
DOI: https://doi.org/10.1007/978-3-030-86523-8_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86522-1
Online ISBN: 978-3-030-86523-8