Abstract
Gradient Boosting Machines (GBMs) have achieved remarkable success on a wide range of problems by leveraging Taylor expansions in functional space. Second-order, Taylor-based GBMs such as XGBoost, which are rooted in Newton's method, consistently deliver state-of-the-art results in practice. However, the loss functions used in second-order GBMs must satisfy strict convexity requirements, namely a positive definite Hessian, which significantly narrows the range of admissible objectives and thus limits the application scenarios. In contrast, first-order GBMs are based on gradient descent and can handle a much broader class of loss functions, but their performance does not always meet expectations. To overcome this limitation, we introduce Trust-region Boosting (TRBoost), a new and versatile Gradient Boosting Machine that combines the performance of second-order GBMs with the generality of first-order GBMs. In each iteration, TRBoost approximates the objective with a constrained quadratic model and applies the trust-region algorithm to obtain a new learner. Unlike GBMs based on Newton's method, TRBoost does not require a positive definite Hessian, so it applies to more loss functions while achieving performance competitive with second-order algorithms. Convergence analysis and numerical experiments confirm that TRBoost is as versatile as first-order GBMs and delivers results competitive with second-order GBMs. Overall, TRBoost strikes a balance between performance and generality, making it a valuable addition to the toolkit of machine learning practitioners.
Code Availability
All datasets and code are publicly available on GitHub at https://github.com/Luojiaqimath/TRBoost.
Notes
GBDT & Random Forest & Decision Tree: https://scikit-learn.org/stable/index.html
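For reference, a minimal sketch of how these scikit-learn baselines are typically instantiated (classification variants shown; the hyperparameters used in the paper's experiments are not reproduced here, these are the library defaults):

```python
# Minimal sketch: the scikit-learn baselines referenced in the note above.
# Hyperparameters are library defaults, not the settings used in the paper.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

baselines = {
    "GBDT": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}
```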
References
Ait Hammou B, Ait Lahcen A, Mouline S (2019) A distributed group recommendation system based on extreme gradient boosting and big data technologies. Appl Intell 49:4128–4149
Biau G, Cadre B, Rouvìère L (2019) Accelerated gradient boosting. Mach Learn 108(6):971–992
Bickel PJ, Ritov Y, Zakai A et al (2006) Some theory for generalized boosting algorithms. J Mach Learn Res 7(5)
Borisov V, Leemann T, Seßler K et al (2022) Deep neural networks and tabular data: a survey. IEEE Trans Neural Netw Learn Syst
Breiman L, Friedman JH, Olshen RA et al (2017) Classification and regression trees. Routledge
Buehlmann P (2006) Boosting for high-dimensional linear models. The Annals of Statistics 34(2):559–583
Bühlmann P, Yu B (2003) Boosting with the L2 loss: regression and classification. J Am Stat Assoc 98(462):324–339
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 785–794
Davis J, Goadrich M (2006) The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on Machine learning, pp 233–240
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Duan T, Anand A, Ding DY et al (2020) Ngboost: Natural gradient boosting for probabilistic prediction. In: International conference on machine learning, PMLR, pp 2690–2700
Frénay B, Verleysen M (2013) Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learn Syst 25(5):845–869
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics 28(2):337–407
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29(5):1189–1232
Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378
Ghosh A, Kumar H, Sastry PS (2017) Robust loss functions under label noise for deep neural networks. In: Proceedings of the AAAI conference on artificial intelligence
Grinsztajn L, Oyallon E, Varoquaux G (2022) Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Info Process Syst 35:507–520
Grubb A, Bagnell JA (2011) Generalized boosting algorithms for convex optimization. In: Proceedings of the 28th international conference on international conference on machine learning, pp 1209–1216
Hastie T, Tibshirani R, Friedman JH et al (2009) Elements of statistical learning: data mining, inference, and prediction, vol 2. Springer, New York
Huang H, Jia R, Shi X et al (2021) Feature selection and hyper parameters optimization for short-term wind power forecast. Appl Intell 1–19
Ke G, Meng Q, Finley T et al (2017) Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Info Process Syst 30
Liu Y, Guo H (2020) Peer loss functions: learning from noisy labels without knowing noise rates. In: International conference on machine learning, PMLR, pp 6226–6236
Lu H, Karimireddy SP, Ponomareva N et al (2020) Accelerating gradient boosting machines. In: International conference on artificial intelligence and statistics, PMLR, pp 516–526
Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Frontiers in Neurorobotics 7:21
Nocedal J, Wright SJ (1999) Numerical optimization. Springer, New York
Prokhorenkova L, Gusev G, Vorobev A et al (2018) Catboost: unbiased boosting with categorical features. Adv Neural Info Process Syst 31
Schmid M, Hothorn T (2008) Boosting additive models using component-wise p-splines. Comput Stat Data Anal 53(2):298–311
Shrivastav LK, Jha SK (2021) A gradient boosting machine learning approach in modeling the impact of temperature and humidity on the transmission rate of covid-19 in india. Appl Intell 51:2727–2739
Sigrist F (2021) Gradient and Newton boosting for classification and regression. Expert Syst Appl 167:114080
Sun P, Zhang T, Zhou J (2014) A convergence rate analysis for logitboost, mart and their variant. In: International conference on machine learning, PMLR, pp 1251–1259
Telgarsky M, Singer Y (2012) A primal-dual convergence analysis of boosting. J Mach Learn Res 13(3)
Vanschoren J, Van Rijn JN, Bischl B et al (2014) Openml: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2):49–60
Verbraeken J, Wolting M, Katzy J et al (2020) A survey on distributed machine learning. ACM Comput Surv 53(2):1–33
Wang Y, Ma X, Chen Z et al (2019) Symmetric cross entropy for robust learning with noisy labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 322–330
Yuan Y (2000) A review of trust region algorithms for optimization. In: ICIAM, pp 271–282
Zhang Z, Sabuncu M (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. Adv Neural Info Process Syst 31
Funding
This work was partially supported by the National Natural Science Foundation of China (No. 12071190).
Author information
Contributions
Jiaqi Luo: Methodology, Software, Writing - original draft, Writing - review & editing. Zihao Wei: Software, Writing - original draft. Junkai Man: Software, Writing - original draft. Shixin Xu: Conceptualization, Methodology, Writing - original draft, Writing - review & editing.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of Theorem 2
A.1 One-instance example
Denote the loss at the current iteration by \(l=l^t(y, F)\) and the loss at the next iteration by \(l^{+} = l^{t+1}(y, F+f)\). Suppose the steps of gradient-descent GBMs, Newton's GBMs, and TRBoost are \(-\nu g\), \(\frac{-\nu g}{h}\), and \(\frac{-g}{h+\mu }\), respectively, where \(\nu \) is the learning rate, usually less than 1 to avoid overfitting.
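For concreteness, here is a minimal sketch (our own illustration, not the released TRBoost code) of the three one-instance update rules, using the squared loss \(\frac{1}{2}(F-y)^2\), for which \(g=F-y\) and \(h=1\):

```python
def gradient_step(g, nu=0.1):
    """Gradient-descent GBM step: -nu * g."""
    return -nu * g

def newton_step(g, h, nu=0.1):
    """Newton GBM step: -nu * g / h (needs h > 0)."""
    return -nu * g / h

def trust_region_step(g, h, mu=1.0):
    """Trust-region step -g / (h + mu); well defined even when h = 0."""
    return -g / (h + mu)

# Squared loss 0.5 * (F - y)**2: gradient g = F - y, Hessian h = 1.
y, F = 1.0, 0.0
g, h = F - y, 1.0
print(gradient_step(g))                   # 0.1
print(newton_step(g, h, nu=1.0))          # 1.0 (full Newton step reaches y)
print(trust_region_step(g, h, mu=0.0))    # 1.0 (equals the Newton step when mu = 0)
```

Note that with \(h=0\) and \(\mu =\frac{1}{\nu }\) the trust-region step reduces to the gradient step \(-\nu g\), which is exactly the degenerate case used in the MAE argument below.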
For regression tasks with the MAE loss, the Hessian is \(h=0\), so \(l^{+}=l+gf\). Let \(\mu =\frac{1}{\nu }\); then TRBoost degenerates into gradient descent and has the same convergence rate as the gradient descent algorithm, namely \(O\left( \frac{1}{T}\right) \).
Proof
Since \(g=1\) or \(-1\), substituting \(f=-\nu g\) into \(l^{+}=l+gf\) gives \(l^{+}=l-\nu g^{2}=l-\nu \).
Because \(l\) is monotonically decreasing and \(l > 0\), and \(\nu \) is a constant, there exists a constant \(0<c<1 \) such that \(\nu \ge c\,l^{2}\).
Hence we have \(l^{+} \le l-c\,l^{2}\).
According to Theorem 3 in Appendix B, the \(O(\frac{1}{T})\) convergence rate is obtained.\(\square \)
The MSE loss is a quadratic function, so we simply let \(\nu =1\) and \(\mu =0\); then, following the standard argument in the numerical optimization book [26], both Newton's method and the trust-region method achieve quadratic convergence.
We now discuss the convergence when the Hessian is not constant. We prove that the trust-region method attains a linear convergence rate, like Newton's method, but under fewer constraints.
Proof
By the Mean Value Theorem, we have \(l^{+}=l+gf+\frac{1}{2}h_{\xi }f^{2}\) (A4), where \(\xi \in [0, f]\) and \(h_{\xi }\) denotes the Hessian at \(\xi \). Substituting \(f=\frac{-g}{h+\mu }\) into (A4), we have \(l^{+}=l-\frac{g^{2}}{h+\mu }\left( 1-\frac{1}{2}\frac{h_{\xi }}{h+\mu }\right) \).
The analysis in [31] requires \(f \ge 0\) to ensure \(\frac{h_{\xi }}{h} \le 1\), but our method does not need this constraint: the Hessian \(h=4p(1-p)\) (Definition 1) is bounded because the predicted probability \(p \in [0,1]\), so a proper \(\mu \) (for example, \(\mu =1\)) always makes \(\frac{h_{\xi }}{h+\mu } \le 1\).
By applying Theorem 7, we have \(l^{+} \le \left( 1-\frac{1}{2}\frac{h}{h+\mu }\right) l\).
For any \(\mu \) that satisfies the inequality \(\frac{h_{\xi }}{h+\mu } \le 1\) and Assumption 2, \(0< 1-\frac{1}{2}\frac{h}{h+\mu } < 1\) always holds; hence we obtain linear convergence according to Theorem 4.\(\square \)
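A quick numerical sanity check (ours) of the claim that \(\mu =1\) suffices: the logistic Hessian \(h=4p(1-p)\) never exceeds 1, so \(\frac{h_{\xi }}{h+\mu } \le 1\) for any probabilities.

```python
import numpy as np

# Sanity check: h = 4p(1-p) <= 1 on [0, 1], so with mu = 1 the ratio
# h_xi / (h + mu) is at most 1 even in the worst case (largest h_xi, smallest h).
p = np.linspace(0.0, 1.0, 1001)
h = 4.0 * p * (1.0 - p)
mu = 1.0
worst_ratio = h.max() / (h.min() + mu)
print(h.max(), worst_ratio)   # 1.0 1.0  -> the bound h_xi/(h+mu) <= 1 holds
```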
A.2 Tree model case
We omit the proof for regression tasks with different loss functions since it is the same as in the one-instance case.
Proof
Define the total loss \(L(f)=\sum _{i=1}^{n}l(y_i, F_i+f_i)\). Writing it in matrix form and applying the Mean Value Theorem, we have
where \(\textbf{g}=(g_1, \cdots , g_{n} )^{\intercal }\) and \(\textbf{H}=diag(h_1,\cdots ,h_{n})\) denote the instance-wise gradient and Hessian, respectively. In a decision tree, instances falling into the same leaf must share a common fitted value. We can thus write \(\textbf{f}=\textbf{V}\textbf{s}\), where \( \textbf{s} = (s_1, \cdots , s_j, \cdots , s_J)^{\intercal }\in \mathbb {R}^{J}\) collects the fitted values on the \(J\) leaves and \(\textbf{V}=[\textbf{v}_1, \cdots , \textbf{v}_j, \cdots , \textbf{v}_{J}]\in \mathbb {R}^{n\times J}\) is a projection matrix with \(\textbf{v}_{j, i}=1\) if the \(j\)th leaf contains the \(i\)th instance and 0 otherwise. Substituting \(\textbf{f}=\textbf{V}\textbf{s}\) into \(L(\textbf{f})\) and reusing the notation \(L(\cdot )\) for \(\textbf{s}\), we have
Denote \(L(\textbf{s})\) by \(L^+\) and \(L(0)\) by \(L\), and replace \(\textbf{f}\) with the trust-region step
then we have
where \(\hat{\textbf{H}}=\textbf{V}^{\intercal }\textbf{H}\textbf{V}+\mu \textbf{I}\) and \(\textbf{I}\) is the identity matrix.
Using Lemmas 4 and 5 and Corollary 1 in the Appendix, we obtain the following inequality:
Since \(\tau \) and \(\gamma _{*}\) are constants and \(0 < \lambda \le \frac{h^{min}}{n(h^{max}+\mu )}\), a proper \(\lambda \) makes \(0< \lambda \gamma _{*}^2 (1-\frac{\tau }{2}) < 1\). According to Theorem 4 in the Appendix, the linear rate is achieved. \(\square \)
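To make the projection matrix \(\textbf{V}\) used in this proof concrete, here is a small sketch (ours, with hypothetical leaf assignments) that builds \(\textbf{V}\) from the leaf index of each instance and broadcasts the leaf values \(\textbf{s}\) to the per-instance fitted values \(\textbf{f}=\textbf{V}\textbf{s}\):

```python
import numpy as np

# V[i, j] = 1 if the j-th leaf contains the i-th instance, 0 otherwise.
leaf_of_instance = np.array([0, 2, 1, 0, 2])   # hypothetical assignments, n = 5, J = 3
n, J = leaf_of_instance.size, 3
V = np.zeros((n, J))
V[np.arange(n), leaf_of_instance] = 1.0

s = np.array([0.5, -0.2, 1.3])   # fitted values on the J leaves
f = V @ s                        # per-instance fitted values
print(f)                         # [ 0.5  1.3 -0.2  0.5  1.3]
```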
Appendix B: Related theory
B.1 Existing theoretical results
In this subsection, we introduce some related results proposed in [31].
Definition 1
The instance-wise loss \(l(\cdot , \cdot )\) is defined as \(l(y, F)=-y\log p-(1-y)\log (1-p)\),
where \(y \in \{0, 1\}\) and the probability estimate \(p \in [0, 1]\) is computed by a sigmoid-like function on \(F\), \(p=\frac{1}{1+e^{-2F}}\).
The gradient and Hessian of \(l(\cdot , F)\) are given by \(g=2(p-y)\) and \(h=4p(1-p)\).
Assumption 1
To avoid numerical stability problems, we clamp the output probability \(p_i, i=1,\cdots ,n\): \(\rho \le p_i \le 1-\rho \)
for some small constant \(\rho \).
Theorem 3
If a sequence \(\epsilon _0 \ge \cdots \ge \epsilon _{t-1} \ge \epsilon _{t} \ge \cdots \ge \epsilon _{T} > 0\) has the recurrence relation \(\epsilon _t \le \epsilon _{t-1}-c\epsilon _{t-1}^2\) for a small constant \(c>0\), then we have an \(O\left( \frac{1}{T}\right) \) convergence rate: \(\epsilon _T \le \frac{\epsilon _0}{1+c\epsilon _0T}\).
Theorem 4
If a sequence \(\epsilon _0 \ge \cdots \ge \epsilon _{t-1} \ge \epsilon _{t} \ge \cdots \ge \epsilon _{T} > 0\) has the recurrence relation \(\epsilon _t \le c\epsilon _{t-1}\) for a small constant \(0<c<1\), then we have a linear convergence rate: \(\epsilon _T \le \epsilon _0 c^T\).
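A short numerical illustration (ours) of the two recurrences: the Theorem 3 recurrence decays sublinearly, while the Theorem 4 recurrence decays geometrically.

```python
# Iterate both recurrences from eps_0 = 1 with c = 0.5 for T = 50 steps.
eps_sub, eps_lin, c, T = 1.0, 1.0, 0.5, 50
for _ in range(T):
    eps_sub = eps_sub - c * eps_sub ** 2   # Theorem 3: eps_T <= eps_0 / (1 + c*eps_0*T)
    eps_lin = c * eps_lin                  # Theorem 4: eps_T <= eps_0 * c**T
print(eps_sub)   # ~0.036, below the O(1/T) bound 1/26 ~ 0.038
print(eps_lin)   # 0.5**50 ~ 8.9e-16, far smaller
```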
Theorem 5
With the value clamping on \(p_i, i=1,\cdots ,n\), the Newton step \(-\frac{\bar{g}_j}{\bar{h}_j}\) is bounded such that \(|\bar{g}_j /\bar{h}_j| \le 1/(2\rho )\), where \(\bar{g}_j=\sum _{i\in I_j}g_i\) and \(\bar{h}_j=\sum _{i\in I_j}h_i\) denote the sums of the gradients and Hessians in the \(j\)th tree node.
Theorem 6
For GBMs, an \(O(\frac{1}{T})\) rate is achieved when using gradient descent, while a linear rate is achieved when using Newton descent.
Theorem 7
Let \(g\), \(h\), and \(l\) be shorthand for the gradient, Hessian, and loss, respectively. Then \(\forall p\) (and thus \(\forall F\)), the inequality \(\frac{g^2}{h}\ge l\) always holds.
Corollary 1
Connecting Newton's objective reduction and the loss reduction via Theorem 7, we have
Lemma 1
For a set of n instances with labels \(\{0, 1\}^n\) and non-negative weights \(\{w_1, \cdots , w_n\}\), there exists a J-leaf classification tree (i.e., outputting \(\{0, 1\}\) at each leaf) such that the weighted error rate at each leaf is strictly less than \(\frac{1}{2}\) by at least \(\delta > 0\).
Lemma 2
Let \(\textbf{g}=(g_1, \cdots , g_n )^{\intercal }\), \(\textbf{H}=diag(h_1,\cdots ,h_n)\). If the weak Learnability Assumption 1 holds, then there exists a J-leaf regression tree whose projection matrix \(\textbf{V}\in \mathbb {R}^{n\times J}\) satisfies
where \(\gamma _{*}^2 = \frac{4\delta ^2\rho }{n}\).
Lemma 3
For the current \(F \in \mathbb {R}\) and a step \(f \in \mathbb {R}\), the change of the Hessian can be bounded in terms of step size |f|:
B.2 New theoretical results proposed in this paper
Assumption 2
In Algorithm 2, \(\mu ^{t}\) is monotonically increasing and would grow without bound as the number of iterations increases, which makes the step \(\textbf{p}_k\) go to 0. To avoid this situation, in practice we assume \(\mu ^t\) is bounded, i.e., \(\mu ^t \in [\mu ^0, \mu ^{\max }]\), \(\forall t\).
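For illustration, a minimal hypothetical sketch (ours; the growth rule is a placeholder, not the update defined in the paper's Algorithm 2) of the bounded-\(\mu \) safeguard that Assumption 2 describes:

```python
# Hypothetical bounded-mu safeguard implementing Assumption 2: mu^t increases
# monotonically across iterations but is capped at mu_max so that the
# trust-region step -g/(h + mu) never collapses to zero. The growth factor
# is a placeholder, not the rule in the paper's Algorithm 2.
def next_mu(mu, mu_max=1e4, grow=1.1):
    return min(mu * grow, mu_max)

mu = 1.0
for t in range(200):
    mu = next_mu(mu)
print(mu)   # 10000.0 -- clipped to mu_max instead of growing without bound
```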
Theorem 8
Since \(\mu \) and the Hessian \(h\) are bounded, there exists a constant \(\lambda \) that satisfies
Proof
Let \(h^{\max }\) and \(h^{\min }\) be the maximum and minimum values of the Hessian, respectively. The minimum value of the left side of the inequality is \(\frac{1}{n(h^{\max }+\mu )}\), and the maximum value of the right side is \(\frac{1}{h^{\min }}\). Just let \(0 < \lambda \le \frac{h^{\min }}{n(h^{\max }+\mu )}\);
then the inequality always holds.\(\square \)
Lemma 4
Based on Assumption 1, we have
Proof
For the \(j\)th leaf \((j = 1, \cdots , J)\), the node-wise Hessian before the update is
while the mean value is
where \(\eta _j\) lies between 0 and the trust-region step \(\frac{\sum _{i\in I_j}g(F_i)}{\sum _{i \in I_j}(h(F_i)+\mu )}\). Applying Lemma 3, we have
According to Theorem 5, we have
Rewriting in matrix form, we have
where \(\tau = e^{\frac{1}{2\rho }}\).\(\square \)
Lemma 5
Let \(\textbf{g}=(g_1, \cdots , g_n )^{\intercal }\), \(\textbf{H}=diag(h_1,\cdots ,h_n)\), \(\hat{\textbf{H}}=\textbf{V}^{\intercal }\textbf{H}\textbf{V}+\mu \textbf{I}\). If the weak Learnability Assumption 2 holds, then there exists a J-leaf regression tree whose projection matrix \(\textbf{V}\in \mathbb {R}^{n\times J}\) satisfies
for some constant \(\gamma > 0\).
Proof
For each leaf j, according to Lemma 2, we have
By Theorem 8, for some \(\lambda \), we obtain
Rewriting in matrix form, we get the final result
\(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Luo, J., Wei, Z., Man, J. et al. TRBoost: a generic gradient boosting machine based on trust-region method. Appl Intell 53, 27876–27891 (2023). https://doi.org/10.1007/s10489-023-05000-w