Abstract
Gradient Boosting Machines (GBMs) have achieved remarkable success on a wide range of problems by leveraging Taylor expansions in functional space. Second-order, Taylor-based GBMs such as XGBoost, which are rooted in Newton's method, consistently deliver state-of-the-art results in practice. However, the loss functions used in second-order GBMs must satisfy strict convexity requirements, namely a positive definite Hessian, which significantly narrows the range of admissible objectives and thus limits the application scenarios. In contrast, first-order GBMs are based on gradient descent and can handle a much broader class of loss functions, but their performance does not always meet expectations. To overcome this limitation, we introduce Trust-region Boosting (TRBoost), a new and versatile Gradient Boosting Machine that combines the performance of second-order GBMs with the generality of first-order GBMs. In each iteration, TRBoost approximates the objective with a constrained quadratic model and applies the trust-region algorithm to obtain a new learner. Unlike GBMs based on Newton's method, TRBoost does not require a positive definite Hessian, so it applies to more loss functions while achieving performance competitive with second-order algorithms. Convergence analysis and numerical experiments confirm that TRBoost is as versatile as first-order GBMs and delivers results competitive with second-order GBMs. Overall, TRBoost strikes a balance between performance and generality, making it a valuable addition to the toolkit of machine learning practitioners.
Code Availability
All datasets and code are publicly available on GitHub at https://github.com/Luojiaqimath/TRBoost.
Notes
GBDT & Random Forest & Decision Tree: https://scikit-learn.org/stable/index.html
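For reference, a minimal sketch of how these scikit-learn baselines are typically instantiated (classification variants shown; the hyperparameters used in the paper's experiments are not reproduced here, these are the library defaults):

```python
# Minimal sketch: the scikit-learn baselines referenced in the note above.
# Hyperparameters are library defaults, not the settings used in the paper.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

baselines = {
    "GBDT": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}
```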
References
Ait Hammou B, Ait Lahcen A, Mouline S (2019) A distributed group recommendation system based on extreme gradient boosting and big data technologies. Appl Intell 49:4128–4149
Biau G, Cadre B, Rouvìère L (2019) Accelerated gradient boosting. Mach Learn 108(6):971–992
Bickel PJ, Ritov Y, Zakai A et al (2006) Some theory for generalized boosting algorithms. J Mach Learn Res 7(5)
Borisov V, Leemann T, Seßler K et al (2022) Deep neural networks and tabular data: a survey. IEEE Trans Neural Netw Learn Syst
Breiman L, Friedman JH, Olshen RA et al (2017) Classification and regression trees. Routledge
Buehlmann P (2006) Boosting for high-dimensional linear models. The Annals of Statistics 34(2):559–583
Bühlmann P, Yu B (2003) Boosting with the L2 loss: regression and classification. J Am Stat Assoc 98(462):324–339
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 785–794
Davis J, Goadrich M (2006) The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on Machine learning, pp 233–240
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Duan T, Anand A, Ding DY et al (2020) Ngboost: Natural gradient boosting for probabilistic prediction. In: International conference on machine learning, PMLR, pp 2690–2700
Frénay B, Verleysen M (2013) Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learn Syst 25(5):845–869
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics 28(2):337–407
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29(5):1189–1232
Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378
Ghosh A, Kumar H, Sastry PS (2017) Robust loss functions under label noise for deep neural networks. In: Proceedings of the AAAI conference on artificial intelligence
Grinsztajn L, Oyallon E, Varoquaux G (2022) Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Info Process Syst 35:507–520
Grubb A, Bagnell JA (2011) Generalized boosting algorithms for convex optimization. In: Proceedings of the 28th international conference on international conference on machine learning, pp 1209–1216
Hastie T, Tibshirani R, Friedman JH et al (2009) Elements of statistical learning: data mining, inference, and prediction, vol 2. Springer, New York
Huang H, Jia R, Shi X et al (2021) Feature selection and hyper parameters optimization for short-term wind power forecast. Appl Intell 1–19
Ke G, Meng Q, Finley T et al (2017) Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Info Process Syst 30
Liu Y, Guo H (2020) Peer loss functions: learning from noisy labels without knowing noise rates. In: International conference on machine learning, PMLR, pp 6226–6236
Lu H, Karimireddy SP, Ponomareva N et al (2020) Accelerating gradient boosting machines. In: International conference on artificial intelligence and statistics, PMLR, pp 516–526
Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Frontiers in Neurorobotics 7:21
Nocedal J, Wright SJ (1999) Numerical optimization. Springer, New York
Prokhorenkova L, Gusev G, Vorobev A et al (2018) Catboost: unbiased boosting with categorical features. Adv Neural Info Process Syst 31
Schmid M, Hothorn T (2008) Boosting additive models using component-wise p-splines. Comput Stat Data Anal 53(2):298–311
Shrivastav LK, Jha SK (2021) A gradient boosting machine learning approach in modeling the impact of temperature and humidity on the transmission rate of covid-19 in india. Appl Intell 51:2727–2739
Sigrist F (2021) Gradient and Newton boosting for classification and regression. Expert Syst Appl 167:114080
Sun P, Zhang T, Zhou J (2014) A convergence rate analysis for logitboost, mart and their variant. In: International conference on machine learning, PMLR, pp 1251–1259
Telgarsky M, Singer Y (2012) A primal-dual convergence analysis of boosting. J Mach Learn Res 13(3)
Vanschoren J, Van Rijn JN, Bischl B et al (2014) Openml: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2):49–60
Verbraeken J, Wolting M, Katzy J et al (2020) A survey on distributed machine learning. ACM Comput Surv 53(2):1–33
Wang Y, Ma X, Chen Z et al (2019) Symmetric cross entropy for robust learning with noisy labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 322–330
Yuan Y (2000) A review of trust region algorithms for optimization. In: ICIAM, pp 271–282
Zhang Z, Sabuncu M (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. Adv Neural Info Process Syst 31
Funding
This work was partially supported by the National Natural Science Foundation of China (No. 12071190).
Author information
Contributions
Jiaqi Luo: Methodology, Software, Writing - original draft, Writing - review & editing. Zihao Wei: Software, Writing - original draft. Junkai Man: Software, Writing - original draft. Shixin Xu: Conceptualization, Methodology, Writing - original draft, Writing - review & editing.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of Theorem 2
A.1 One-instance example
Denote the loss at the current iteration by \(l=l^t(y, F)\) and the loss at the next iteration by \(l^{+} = l^{t+1}(y, F+f)\). Suppose the steps of gradient-descent GBMs, Newton's GBMs, and TRBoost are \(-\nu g\), \(\frac{-\nu g}{h}\), and \(\frac{-g}{h+\mu }\), respectively, where \(\nu \) is the learning rate, usually less than 1 to avoid overfitting.
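For concreteness, here is a minimal sketch (our own illustration, not the released TRBoost code) of the three one-instance update rules, using the squared loss \(\frac{1}{2}(F-y)^2\), for which \(g=F-y\) and \(h=1\):

```python
def gradient_step(g, nu=0.1):
    """Gradient-descent GBM step: -nu * g."""
    return -nu * g

def newton_step(g, h, nu=0.1):
    """Newton GBM step: -nu * g / h (needs h > 0)."""
    return -nu * g / h

def trust_region_step(g, h, mu=1.0):
    """Trust-region step -g / (h + mu); well defined even when h = 0."""
    return -g / (h + mu)

# Squared loss 0.5 * (F - y)**2: gradient g = F - y, Hessian h = 1.
y, F = 1.0, 0.0
g, h = F - y, 1.0
print(gradient_step(g))                   # 0.1
print(newton_step(g, h, nu=1.0))          # 1.0 (full Newton step reaches y)
print(trust_region_step(g, h, mu=0.0))    # 1.0 (equals the Newton step when mu = 0)
```

Note that with \(h=0\) and \(\mu =\frac{1}{\nu }\) the trust-region step reduces to the gradient step \(-\nu g\), which is exactly the degenerate case used in the MAE argument below.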
For regression tasks with the MAE loss, the Hessian is \(h=0\), so \(l^{+}=l+gf\). Let \(\mu =\frac{1}{\nu }\); then TRBoost degenerates into gradient descent and has the same convergence rate as the gradient descent algorithm, namely \(O\left( \frac{1}{T}\right) \).
Proof
Since \(g=1\) or \(-1\), substituting \(f=-\nu g\) into \(l^{+}=l+gf\) gives \(l^{+}=l-\nu g^{2}=l-\nu \).
Because \(l\) is monotonically decreasing and \(l > 0\), and \(\nu \) is a constant, there exists a constant \(0<c<1 \) such that \(\nu \ge c\,l^{2}\).
Hence we have \(l^{+} \le l-c\,l^{2}\).
According to Theorem 3 in Appendix B, the \(O(\frac{1}{T})\) convergence rate is obtained.\(\square \)
The MSE loss is a quadratic function, so we simply let \(\nu =1\) and \(\mu =0\); then, following the standard argument in the numerical optimization book [26], both Newton's method and the trust-region method achieve quadratic convergence.
We now discuss the convergence when the Hessian is not constant. We prove that the trust-region method attains a linear convergence rate, like Newton's method, but under fewer constraints.
Proof
By the Mean Value Theorem, we have \(l^{+}=l+gf+\frac{1}{2}h_{\xi }f^{2}\) (A4), where \(\xi \in [0, f]\) and \(h_{\xi }\) denotes the Hessian at \(\xi \). Substituting \(f=\frac{-g}{h+\mu }\) into (A4), we have \(l^{+}=l-\frac{g^{2}}{h+\mu }\left( 1-\frac{1}{2}\frac{h_{\xi }}{h+\mu }\right) \).
The analysis in [31] requires \(f \ge 0\) to ensure \(\frac{h_{\xi }}{h} \le 1\), but our method does not need this constraint: the Hessian \(h=4p(1-p)\) (Definition 1) is bounded because the predicted probability \(p \in [0,1]\), so a proper \(\mu \) (for example, \(\mu =1\)) always makes \(\frac{h_{\xi }}{h+\mu } \le 1\).
By applying Theorem 7, we have \(l^{+} \le \left( 1-\frac{1}{2}\frac{h}{h+\mu }\right) l\).
For any \(\mu \) that satisfies the inequality \(\frac{h_{\xi }}{h+\mu } \le 1\) and Assumption 2, \(0< 1-\frac{1}{2}\frac{h}{h+\mu } < 1\) always holds; hence we obtain linear convergence according to Theorem 4.\(\square \)
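A quick numerical sanity check (ours) of the claim that \(\mu =1\) suffices: the logistic Hessian \(h=4p(1-p)\) never exceeds 1, so \(\frac{h_{\xi }}{h+\mu } \le 1\) for any probabilities.

```python
import numpy as np

# Sanity check: h = 4p(1-p) <= 1 on [0, 1], so with mu = 1 the ratio
# h_xi / (h + mu) is at most 1 even in the worst case (largest h_xi, smallest h).
p = np.linspace(0.0, 1.0, 1001)
h = 4.0 * p * (1.0 - p)
mu = 1.0
worst_ratio = h.max() / (h.min() + mu)
print(h.max(), worst_ratio)   # 1.0 1.0  -> the bound h_xi/(h+mu) <= 1 holds
```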
A.2 Tree model case
We omit the proof for regression tasks with different loss functions since it is the same as in the one-instance case.
Proof
Define the total loss \(L(f)=\sum _{i=1}^{n}l(y_i, F_i+f_i)\). Writing it in matrix form and applying the Mean Value Theorem, we have
where \(\textbf{g}=(g_1, \cdots , g_{n} )^{\intercal }\) and \(\textbf{H}=diag(h_1,\cdots ,h_{n})\) denote the instance-wise gradient and Hessian, respectively. In a decision tree, instances falling into the same leaf must share a common fitted value. We can thus write \(\textbf{f}=\textbf{V}\textbf{s}\), where \( \textbf{s} = (s_1, \cdots , s_j, \cdots , s_J)^{\intercal }\in \mathbb {R}^{J}\) collects the fitted values on the \(J\) leaves and \(\textbf{V}=[\textbf{v}_1, \cdots , \textbf{v}_j, \cdots , \textbf{v}_{J}]\in \mathbb {R}^{n\times J}\) is a projection matrix with \(\textbf{v}_{j, i}=1\) if the \(j\)th leaf contains the \(i\)th instance and 0 otherwise. Substituting \(\textbf{f}=\textbf{V}\textbf{s}\) into \(L(\textbf{f})\) and reusing the notation \(L(\cdot )\) for \(\textbf{s}\), we have
Denote \(L(\textbf{s})\) by \(L^+\) and \(L(0)\) by \(L\), and replace \(\textbf{f}\) with the trust-region step
then we have
where \(\hat{\textbf{H}}=\textbf{V}^{\intercal }\textbf{H}\textbf{V}+\mu \textbf{I}\) and \(\textbf{I}\) is the identity matrix.
Using Lemmas 4 and 5 and Corollary 1 in the Appendix, we obtain the following inequality:
Since \(\tau \) and \(\gamma _{*}\) are constants and \(0 < \lambda \le \frac{h^{min}}{n(h^{max}+\mu )}\), a proper \(\lambda \) makes \(0< \lambda \gamma _{*}^2 (1-\frac{\tau }{2}) < 1\). According to Theorem 4 in the Appendix, the linear rate is achieved. \(\square \)
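To make the projection matrix \(\textbf{V}\) used in this proof concrete, here is a small sketch (ours, with hypothetical leaf assignments) that builds \(\textbf{V}\) from the leaf index of each instance and broadcasts the leaf values \(\textbf{s}\) to the per-instance fitted values \(\textbf{f}=\textbf{V}\textbf{s}\):

```python
import numpy as np

# V[i, j] = 1 if the j-th leaf contains the i-th instance, 0 otherwise.
leaf_of_instance = np.array([0, 2, 1, 0, 2])   # hypothetical assignments, n = 5, J = 3
n, J = leaf_of_instance.size, 3
V = np.zeros((n, J))
V[np.arange(n), leaf_of_instance] = 1.0

s = np.array([0.5, -0.2, 1.3])   # fitted values on the J leaves
f = V @ s                        # per-instance fitted values
print(f)                         # [ 0.5  1.3 -0.2  0.5  1.3]
```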
Appendix B: Related theory
B.1 Existing theoretical results
In this subsection, we introduce some related results proposed in [31].
Definition 1
The instance-wise loss \(l(\cdot , \cdot )\) is defined as \(l(y, F)=-y\log p-(1-y)\log (1-p)\),
where \(y \in \{0, 1\}\) and the probability estimate \(p \in [0, 1]\) is computed by a sigmoid-like function on \(F\), \(p=\frac{1}{1+e^{-2F}}\).
The gradient and Hessian of \(l(\cdot , F)\) are given by \(g=2(p-y)\) and \(h=4p(1-p)\).
Assumption 1
To avoid numerical stability problems, we clamp the output probability \(p_i, i=1,\cdots ,n\): \(\rho \le p_i \le 1-\rho \)
for some small constant \(\rho \).
Theorem 3
If a sequence \(\epsilon _0 \ge \cdots \ge \epsilon _{t-1} \ge \epsilon _{t} \ge \cdots \ge \epsilon _{T} > 0\) has the recurrence relation \(\epsilon _t \le \epsilon _{t-1}-c\epsilon _{t-1}^2\) for a small constant \(c>0\), then we have an \(O\left( \frac{1}{T}\right) \) convergence rate: \(\epsilon _T \le \frac{\epsilon _0}{1+c\epsilon _0T}\).
Theorem 4
If a sequence \(\epsilon _0 \ge \cdots \ge \epsilon _{t-1} \ge \epsilon _{t} \ge \cdots \ge \epsilon _{T} > 0\) has the recurrence relation \(\epsilon _t \le c\epsilon _{t-1}\) for a small constant \(0<c<1\), then we have a linear convergence rate: \(\epsilon _T \le \epsilon _0 c^T\).
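A short numerical illustration (ours) of the two recurrences: the Theorem 3 recurrence decays sublinearly, while the Theorem 4 recurrence decays geometrically.

```python
# Iterate both recurrences from eps_0 = 1 with c = 0.5 for T = 50 steps.
eps_sub, eps_lin, c, T = 1.0, 1.0, 0.5, 50
for _ in range(T):
    eps_sub = eps_sub - c * eps_sub ** 2   # Theorem 3: eps_T <= eps_0 / (1 + c*eps_0*T)
    eps_lin = c * eps_lin                  # Theorem 4: eps_T <= eps_0 * c**T
print(eps_sub)   # ~0.036, below the O(1/T) bound 1/26 ~ 0.038
print(eps_lin)   # 0.5**50 ~ 8.9e-16, far smaller
```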
Theorem 5
With the value clamping on \(p_i, i=1,\cdots ,n\), the Newton step \(-\frac{\bar{g}_j}{\bar{h}_j}\) is bounded such that \(|\bar{g}_j /\bar{h}_j| \le 1/(2\rho )\), where \(\bar{g}_j=\sum _{i\in I_j}g_i\) and \(\bar{h}_j=\sum _{i\in I_j}h_i\) denote the sums of the gradients and Hessians in the \(j\)th tree node.
Theorem 6
For GBMs, an \(O(\frac{1}{T})\) rate is achieved when using gradient descent, while a linear rate is achieved when using Newton descent.
Theorem 7
Let \(g\), \(h\), and \(l\) be shorthand for the gradient, Hessian, and loss, respectively. Then \(\forall p\) (and thus \(\forall F\)), the inequality \(\frac{g^2}{h}\ge l\) always holds.
Corollary 1
Connecting Newton's objective reduction and the loss reduction via Theorem 7, we have
Lemma 1
For a set of n instances with labels \(\{0, 1\}^n\) and non-negative weights \(\{w_1, \cdots , w_n\}\), there exists a J-leaf classification tree (i.e., outputting \(\{0, 1\}\) at each leaf) such that the weighted error rate at each leaf is strictly less than \(\frac{1}{2}\) by at least \(\delta > 0\).
Lemma 2
Let \(\textbf{g}=(g_1, \cdots , g_n )^{\intercal }\), \(\textbf{H}=diag(h_1,\cdots ,h_n)\). If the weak Learnability Assumption 1 holds, then there exists a J-leaf regression tree whose projection matrix \(\textbf{V}\in \mathbb {R}^{n\times J}\) satisfies
where \(\gamma _{*}^2 = \frac{4\delta ^2\rho }{n}\).
Lemma 3
For the current \(F \in \mathbb {R}\) and a step \(f \in \mathbb {R}\), the change of the Hessian can be bounded in terms of step size |f|:
B.2 New theoretical results proposed in this paper
Assumption 2
In Algorithm 2, \(\mu ^{t}\) is monotonically increasing and would grow without bound as the number of iterations increases, which makes the step \(\textbf{p}_k\) go to 0. To avoid this situation, in practice we assume \(\mu ^t\) is bounded, i.e., \(\mu ^t \in [\mu ^0, \mu ^{\max }]\), \(\forall t\).
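For illustration, a minimal hypothetical sketch (ours; the growth rule is a placeholder, not the update defined in the paper's Algorithm 2) of the bounded-\(\mu \) safeguard that Assumption 2 describes:

```python
# Hypothetical bounded-mu safeguard implementing Assumption 2: mu^t increases
# monotonically across iterations but is capped at mu_max so that the
# trust-region step -g/(h + mu) never collapses to zero. The growth factor
# is a placeholder, not the rule in the paper's Algorithm 2.
def next_mu(mu, mu_max=1e4, grow=1.1):
    return min(mu * grow, mu_max)

mu = 1.0
for t in range(200):
    mu = next_mu(mu)
print(mu)   # 10000.0 -- clipped to mu_max instead of growing without bound
```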
Theorem 8
Since \(\mu \) and the Hessian \(h\) are bounded, there exists a constant \(\lambda \) that satisfies
Proof
Let \(h^{\max }\) and \(h^{\min }\) be the maximum and minimum values of the Hessian, respectively. The minimum value of the left side of the inequality is \(\frac{1}{n(h^{\max }+\mu )}\), and the maximum value of the right side is \(\frac{1}{h^{\min }}\). Just let \(0 < \lambda \le \frac{h^{\min }}{n(h^{\max }+\mu )}\);
then the inequality always holds.\(\square \)
Lemma 4
Based on Assumption 1, we have
Proof
For the \(j\)th leaf \((j = 1, \cdots , J)\), the node-wise Hessian before the update is
while the mean value is
where \(\eta _j\) lies between 0 and the trust-region step \(\frac{\sum _{i\in I_j}g(F_i)}{\sum _{i \in I_j}(h(F_i)+\mu )}\). Applying Lemma 3, we have
According to Theorem 5, we have
Rewriting in matrix form, we have
where \(\tau = e^{\frac{1}{2\rho }}\).\(\square \)
Lemma 5
Let \(\textbf{g}=(g_1, \cdots , g_n )^{\intercal }\), \(\textbf{H}=diag(h_1,\cdots ,h_n)\), \(\hat{\textbf{H}}=\textbf{V}^{\intercal }\textbf{H}\textbf{V}+\mu \textbf{I}\). If the weak Learnability Assumption 2 holds, then there exists a J-leaf regression tree whose projection matrix \(\textbf{V}\in \mathbb {R}^{n\times J}\) satisfies
for some constant \(\gamma > 0\).
Proof
For each leaf j, according to Lemma 2, we have
By Theorem 8, for some \(\lambda \), we obtain
Rewriting in matrix form, we get the final result
\(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Luo, J., Wei, Z., Man, J. et al. TRBoost: a generic gradient boosting machine based on trust-region method. Appl Intell 53, 27876–27891 (2023). https://doi.org/10.1007/s10489-023-05000-w