TRBoost: a generic gradient boosting machine based on trust-region method


Abstract

Gradient Boosting Machines (GBMs) have achieved remarkable success in effectively solving a wide range of problems by leveraging Taylor expansions in functional space. Second-order Taylor-based GBMs, such as XGBoost rooted in Newton’s method, consistently yield state-of-the-art results in practical applications. However, it is important to note that the loss functions used in second-order GBMs must strictly adhere to convexity requirements, specifically requiring a positive definite Hessian of the loss. This restriction significantly narrows the range of objectives, thus limiting the application scenarios. In contrast, first-order GBMs are based on the first-order gradient optimization method, enabling them to handle a diverse range of loss functions. Nevertheless, their performance may not always meet expectations. To overcome this limitation, we introduce Trust-region Boosting (TRBoost), a new and versatile Gradient Boosting Machine that combines the strengths of second-order GBMs and the versatility of first-order GBMs. In each iteration, TRBoost employs a constrained quadratic model to approximate the objective and applies the Trust-region algorithm to obtain a new learner. Unlike GBMs based on Newton’s method, TRBoost does not require a positive definite Hessian, enabling its application to more loss functions while achieving competitive performance similar to second-order algorithms. Convergence analysis and numerical experiments conducted in this study confirm that TRBoost exhibits similar versatility to first-order GBMs and delivers competitive results compared to second-order GBMs. Overall, TRBoost presents a promising approach that achieves a balance between performance and generality, rendering it a valuable addition to the toolkit of machine learning practitioners.


Code Availability

All datasets and code are publicly available on GitHub at https://github.com/Luojiaqimath/TRBoost.

Notes

  1. XGBoost: https://xgboost.readthedocs.io/en/stable/

  2. LightGBM: https://lightgbm.readthedocs.io/en/latest/

  3. GBDT & Random Forest & Decision Tree: https://scikit-learn.org/stable/index.html

  4. https://github.com/Luojiaqimath/TRBoost

References

  1. Ait Hammou B, Ait Lahcen A, Mouline S (2019) A distributed group recommendation system based on extreme gradient boosting and big data technologies. Appl Intell 49:4128–4149

  2. Biau G, Cadre B, Rouvìère L (2019) Accelerated gradient boosting. Mach Learn 108(6):971–992


  3. Bickel PJ, Ritov Y, Zakai A et al (2006) Some theory for generalized boosting algorithms. Journal of Machine Learning Research 7(5)

  4. Borisov V, Leemann T, Seßler K et al (2022) Deep neural networks and tabular data: a survey. IEEE Transactions on Neural Networks and Learning Systems

  5. Breiman L, Friedman JH, Olshen RA et al (2017) Classification and regression trees. Routledge


  6. Buehlmann P (2006) Boosting for high-dimensional linear models. The Annals of Statistics 34(2):559–583


  7. Bühlmann P, Yu B (2003) Boosting with the L2 loss: regression and classification. J Am Stat Assoc 98(462):324–339


  8. Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp 785–794

  9. Davis J, Goadrich M (2006) The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on Machine learning, pp 233–240

  10. Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

  11. Duan T, Anand A, Ding DY et al (2020) Ngboost: Natural gradient boosting for probabilistic prediction. In: International conference on machine learning, PMLR, pp 2690–2700

  12. Frénay B, Verleysen M (2013) Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learn Syst 25(5):845–869


  13. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139


  14. Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics 28(2):337–407


  15. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232

  16. Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378


  17. Ghosh A, Kumar H, Sastry PS (2017) Robust loss functions under label noise for deep neural networks. In: Proceedings of the AAAI conference on artificial intelligence

  18. Grinsztajn L, Oyallon E, Varoquaux G (2022) Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Info Process Syst 35:507–520


  19. Grubb A, Bagnell JA (2011) Generalized boosting algorithms for convex optimization. In: Proceedings of the 28th international conference on international conference on machine learning, pp 1209–1216

  20. Hastie T, Tibshirani R, Friedman JH et al (2009) Elements of statistical learning: data mining, inference, and prediction, vol 2. Springer, New York


  21. Huang H, Jia R, Shi X et al (2021) Feature selection and hyper parameters optimization for short-term wind power forecast. Appl Intell 1–19

  22. Ke G, Meng Q, Finley T et al (2017) Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Info Process Syst 30

  23. Liu Y, Guo H (2020) Peer loss functions: learning from noisy labels without knowing noise rates. In: International conference on machine learning, PMLR, pp 6226–6236

  24. Lu H, Karimireddy SP, Ponomareva N et al (2020) Accelerating gradient boosting machines. In: International conference on artificial intelligence and statistics, PMLR, pp 516–526

  25. Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Frontiers in Neurorobotics 7:21


  26. Nocedal J, Wright SJ (1999) Numerical optimization. Springer, New York


  27. Prokhorenkova L, Gusev G, Vorobev A et al (2018) Catboost: unbiased boosting with categorical features. Adv Neural Info Process Syst 31

  28. Schmid M, Hothorn T (2008) Boosting additive models using component-wise p-splines. Comput Stat Data Anal 53(2):298–311


  29. Shrivastav LK, Jha SK (2021) A gradient boosting machine learning approach in modeling the impact of temperature and humidity on the transmission rate of covid-19 in india. Appl Intell 51:2727–2739


  30. Sigrist F (2021) Gradient and newton boosting for classification and regression. Expert Syst Appl 167:114080


  31. Sun P, Zhang T, Zhou J (2014) A convergence rate analysis for logitboost, mart and their variant. In: International conference on machine learning, PMLR, pp 1251–1259

  32. Telgarsky M, Singer Y (2012) A primal-dual convergence analysis of boosting. J Mach Learn Res 13(3)

  33. Vanschoren J, Van Rijn JN, Bischl B et al (2014) Openml: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2):49–60


  34. Verbraeken J, Wolting M, Katzy J et al (2020) A survey on distributed machine learning. Acm Comput Surv 53(2):1–33


  35. Wang Y, Ma X, Chen Z et al (2019) Symmetric cross entropy for robust learning with noisy labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 322–330

  36. Yuan Y (2000) A review of trust region algorithms for optimization. In: ICIAM, pp 271–282

  37. Zhang Z, Sabuncu M (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. Adv Neural Info Process Syst 31


Funding

This work was partially supported by the National Natural Science Foundation of China no. 12071190.

Author information


Contributions

Jiaqi Luo: Methodology, Software, Writing - original draft, Writing - review & editing. Zihao Wei: Software, Writing - original draft. Junkai Man: Software, Writing - original draft. Shixin Xu: Conceptualization, Methodology, Writing - original draft, Writing - review & editing.

Corresponding author

Correspondence to Shixin Xu.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Theorem 2

A.1 One-instance example

Denote the loss at the current iteration by \(l=l^t(y, F)\) and that at the next iteration by \(l^{+} = l^{t+1}(y, F+f)\). Suppose the steps of gradient-descent GBMs, Newton's GBMs, and TRBoost are \(-\nu g\), \(-\frac{\nu g}{h}\), and \(-\frac{g}{h+\mu }\), respectively, where \(\nu \) is the learning rate, usually less than 1 to avoid overfitting.

For regression tasks with the MAE loss, since the Hessian \(h=0\), we have \(l^{+}=l+gf\). Let \(\mu =\frac{1}{\nu }\); then TRBoost degenerates into gradient descent and has the same convergence rate as the gradient descent algorithm, namely \(O\left( \frac{1}{T}\right) \).

Proof

Since \(g=1\) or \(-1\), substituting \(f=-\nu g\) into \(l^{+}=l+gf\), we have

$$\begin{aligned} l^{+}=l-\nu g^2=l-\nu < l. \end{aligned}$$
(A1)

Because \(l\) is monotonically decreasing with \(l > 0\) (so \(l^t \le l^0\) for all \(t\)) and \(\nu \) is a positive constant, there exists a constant \(0<c<1 \) such that

$$\begin{aligned} c(l^t)^2 < \nu , \forall t. \end{aligned}$$
(A2)

Hence we have

$$\begin{aligned} l^{+}=l-\nu < l-cl^2. \end{aligned}$$
(A3)

According to Theorem 3 in Appendix B, the \(O(\frac{1}{T})\) convergence rate is obtained. \(\square \)
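The MAE case can be made concrete with a minimal Python sketch (toy values; not the released TRBoost implementation): with \(h=0\) and \(\mu =1/\nu \), the trust-region step coincides with the gradient-descent step, while the Newton step \(-\nu g/h\) is undefined.

```python
import numpy as np

# One-instance MAE example: l(y, F) = |y - F|, so g = sign(F - y), h = 0.
# Toy values chosen only for illustration.
y, F = 1.0, 0.3
nu = 0.1            # learning rate
g = np.sign(F - y)  # first-order gradient of the MAE loss
h = 0.0             # MAE Hessian vanishes almost everywhere

step_gd = -nu * g            # gradient-descent GBM step
mu = 1.0 / nu                # trust-region parameter chosen as 1/nu
step_tr = -g / (h + mu)      # TRBoost step
assert np.isclose(step_gd, step_tr)  # identical updates when h = 0

# Newton's step -nu*g/h would divide by zero here, which is exactly
# the failure mode the trust-region step avoids.
```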

The MSE loss is a quadratic function. Hence, setting \(\nu =1\) and \(\mu =0\) and following the standard argument in the numerical optimization literature [26], both Newton's method and the trust-region method achieve quadratic convergence.
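As a concrete check with the squared-error loss \(l(y,F)=\frac{1}{2}(y-F)^{2}\) (a worked instance added here; so \(g=F-y\) and \(h=1\)), setting \(\nu =1\) and \(\mu =0\) gives

$$\begin{aligned} f=\frac{-g}{h+\mu }=y-F, \qquad F+f=y, \qquad l^{+}=\frac{1}{2}\bigl (y-(F+f)\bigr )^{2}=0, \end{aligned}$$

so the minimizer is reached in a single step, matching the Newton update.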

We now discuss the convergence when the Hessian is not constant. We prove that the trust-region method attains a linear convergence rate, like Newton's method, but under fewer constraints.

Proof

By the Mean Value Theorem, we have

$$\begin{aligned} l^+ = l + g f + \frac{1}{2}h_{\xi }f^2, \end{aligned}$$
(A4)

where \(\xi \) lies between \(0\) and \(f\), and \(h_{\xi }\) denotes the Hessian at \(\xi \). Substituting \(f=\frac{-g}{h+\mu }\) into (A4), we have

$$\begin{aligned} l^+ = l-\frac{g^2}{h+\mu } + \frac{1}{2}h_{\xi }\frac{g^2}{(h+\mu )^2}. \end{aligned}$$
(A5)

The proof in [31] requires \(f \ge 0\) to ensure \(\frac{h_{\xi }}{h} \le 1\). Our method does not need this constraint, since the Hessian \(h=4p(1-p)\) (Definition 1) is bounded by 1 for any predicted probability \(p\), so a proper \(\mu \) (for example \(\mu =1\)) can always make \(\frac{h_{\xi }}{h+\mu } \le 1\).

By applying Theorem 7 we have

$$\begin{aligned} l^+ &= l-\frac{g^2}{h+\mu } + \frac{1}{2}\frac{h_{\xi }}{h+\mu }\frac{g^2}{h+\mu } \\ &\le l-\frac{1}{2}\frac{g^2}{h+\mu } \\ &= l-\frac{1}{2}\frac{h}{h+\mu }\frac{g^2}{h} \\ &\le l-\frac{1}{2}\frac{h}{h+\mu }l \\ &= \left( 1-\frac{1}{2}\frac{h}{h+\mu }\right) l. \end{aligned}$$
(A6)

For any \(\mu \) that satisfies the inequality \(\frac{h_{\xi }}{h+\mu } \le 1\) and Assumption 2, \(0< 1-\frac{1}{2}\frac{h}{h+\mu } < 1\) always holds; hence we obtain linear convergence according to Theorem 4. \(\square \)
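To see the resulting contraction numerically, the following sketch (toy values; not the released TRBoost code) iterates the one-instance trust-region update with the logistic loss and link of Definition 1 (Appendix B) and \(\mu =1\), checking that the loss decreases at every step:

```python
import numpy as np

def psi(F):                      # probability estimate of Definition 1
    return np.exp(F) / (np.exp(F) + np.exp(-F))

def loss(y, F):
    p = psi(F)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

y, F, mu = 1.0, -2.0, 1.0        # hypothetical starting point
prev = loss(y, F)
for t in range(20):
    p = psi(F)
    g, h = 2 * (p - y), 4 * p * (1 - p)   # gradient and Hessian (Definition 1)
    F += -g / (h + mu)                    # one-instance trust-region step
    cur = loss(y, F)
    assert cur < prev                     # monotone decrease, as in (A6)
    prev = cur
```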

A.2 Tree model case

We omit the proof for regression tasks with different loss functions, since it is the same as in the one-instance case.

Proof

Define the total loss \(L(f)=\sum _{i=1}^{n}l(y_i, F_i+f_i)\). Writing it in matrix form and applying the Mean Value Theorem, we have

$$\begin{aligned} L(f) = L(0) + \textbf{g}^{\intercal }\textbf{f} + \frac{1}{2} \textbf{f}^{\intercal }\textbf{H}_{\xi }\textbf{f}, \end{aligned}$$
(A7)

where \(\textbf{g}=(g_1, \cdots , g_{n} )^{\intercal }\) and \(\textbf{H}=diag(h_1,\cdots ,h_{n})\) denote the instance-wise gradient and Hessian, respectively, and \(\textbf{H}_{\xi }\) is the Hessian evaluated at the intermediate point. In a decision tree, instances falling into the same leaf share a common fitted value. We can thus write \(\textbf{f}=\textbf{V}\textbf{s}\), where \( \textbf{s} = (s_1, \cdots , s_j, \cdots , s_J)^{\intercal }\in \mathbb {R}^{J}\) contains the fitted values on the \(J\) leaves and \(\textbf{V}=[\textbf{v}_1, \cdots , \textbf{v}_j, \cdots , \textbf{v}_{J}]\in \mathbb {R}^{n\times J}\) is a projection matrix with \(\textbf{v}_{j, i}=1\) if the \(j\)th leaf contains the \(i\)th instance and 0 otherwise. Substituting \(\textbf{f}=\textbf{V}\textbf{s}\) into \(L(f)\) and overloading the notation \(L(\cdot )\) for \(\textbf{s}\), we have

$$\begin{aligned} L(\textbf{s}) = L(0) + (\textbf{g}^{\intercal }\textbf{V})\textbf{s} + \frac{1}{2} \textbf{s}^{\intercal }(\textbf{V}^{\intercal }\textbf{H}_{\xi }\textbf{V})\textbf{s}. \end{aligned}$$
(A8)

Denote \(L(\textbf{s})\) by \(L^+\) and L(0) by L and replace \(\textbf{f}\) with the trust-region step

$$\begin{aligned} \textbf{f} = \textbf{V}\textbf{s} = -\textbf{V}(\textbf{V}^{\intercal }\textbf{H}\textbf{V}+\mu \textbf{I})^{-1} \textbf{V}^{\intercal } \textbf{g}, \end{aligned}$$
(A9)

then we have

$$\begin{aligned} L^{+} = L-(\textbf{V}^{\intercal }\textbf{g})^{\intercal }\hat{\textbf{H}}^{-1}\textbf{V}^{\intercal }\textbf{g} + \frac{1}{2} (\textbf{V}^{\intercal }\textbf{g})^{\intercal }\hat{\textbf{H}}^{-1}(\textbf{V}^{\intercal }\textbf{H}_{\xi }\textbf{V})\hat{\textbf{H}}^{-1}\textbf{V}^{\intercal }\textbf{g}, \end{aligned}$$
(A10)

where \(\hat{\textbf{H}}=\textbf{V}^{\intercal }\textbf{H}\textbf{V}+\mu \textbf{I}\) and \(\textbf{I}\) is the identity matrix.

Using Lemmas 4 and 5 and Corollary 1 in this Appendix, we obtain the following inequality:

$$\begin{aligned} L^{+} &\le L-(\textbf{V}^{\intercal }\textbf{g})^{\intercal }\hat{\textbf{H}}^{-1}\textbf{V}^{\intercal }\textbf{g} + \frac{\tau }{2}(\textbf{V}^{\intercal }\textbf{g})^{\intercal }\hat{\textbf{H}}^{-1}\textbf{V}^{\intercal }\textbf{g}~~~(Lemma~4)\\ &= L-\left( 1-\frac{\tau }{2}\right) (\textbf{V}^{\intercal }\textbf{g})^{\intercal }\hat{\textbf{H}}^{-1}\textbf{V}^{\intercal }\textbf{g} \\ &\le L-\lambda \gamma _{*}^2 \left( 1-\frac{\tau }{2}\right) \textbf{g}^{\intercal }\textbf{H}^{-1}\textbf{g} ~~~ (Lemma~5)\\ &\le L-\lambda \gamma _{*}^2 \left( 1-\frac{\tau }{2}\right) L ~~~(Corollary~1) \\ &= \left( 1-\lambda \gamma _{*}^2 \left( 1-\frac{\tau }{2}\right) \right) L. \end{aligned}$$
(A11)

Since \(\tau \) and \(\gamma _{*}\) are constants and \(0 < \lambda \le \frac{h^{\min }}{n(h^{\max }+\mu )}\), a proper \(\lambda \) makes \(0< \lambda \gamma _{*}^2 (1-\frac{\tau }{2}) < 1\). According to Theorem 4 in the Appendix, the linear rate is achieved. \(\square \)
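Because the leaves partition the instances, \(\textbf{V}^{\intercal }\textbf{H}\textbf{V}+\mu \textbf{I}\) is diagonal and the trust-region fitted value of leaf \(j\) reduces to \(s_j=-\frac{\sum _{i\in I_j}g_i}{\sum _{i\in I_j}h_i+\mu }\). The following sketch (hypothetical arrays, illustrative only; not the released implementation) computes the leaf values both ways and checks that they agree:

```python
import numpy as np

# Hypothetical per-instance gradients/Hessians and leaf assignments.
g = np.array([0.4, -0.2, 0.7, -0.5, 0.1])
h = np.array([0.3, 0.9, 0.2, 0.8, 0.5])
leaf = np.array([0, 0, 1, 1, 1])     # leaf index of each instance
J, mu = 2, 1.0

# Projection matrix V: V[i, j] = 1 if instance i falls into leaf j.
V = np.eye(J)[leaf]

# Matrix form: s = -(V^T H V + mu*I)^{-1} V^T g, as in (A9).
H = np.diag(h)
s_matrix = -np.linalg.solve(V.T @ H @ V + mu * np.eye(J), V.T @ g)

# Per-leaf form: s_j = -sum(g_i) / (sum(h_i) + mu).
s_leaf = np.array([-g[leaf == j].sum() / (h[leaf == j].sum() + mu)
                   for j in range(J)])
assert np.allclose(s_matrix, s_leaf)
```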

Appendix B: Related theory

B.1 Existing theoretical results

In this subsection, we introduce some related results from [31].

Definition 1

The instance-wise loss \(l(\cdot , \cdot )\) is defined as

$$\begin{aligned} l(y, F) = y\log \frac{1}{p}+(1-y)\log \frac{1}{1-p}, \end{aligned}$$
(B12)

where \(y \in \{0, 1\}\) and the probability estimate \(p \in [0, 1]\) is computed by a sigmoid-like function on F

$$\begin{aligned} p = \psi (F)=\frac{e^F}{e^F+e^{-F}}. \end{aligned}$$
(B13)

The gradient and Hessian of \(l(\cdot , F)\) are given by

$$\begin{aligned} g(F) = 2(p-y), h(F)=4p(1-p). \end{aligned}$$
(B14)
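These expressions follow from the chain rule (a short verification added here): since the link (B13) satisfies \(\frac{dp}{dF}=2p(1-p)\) and \(\frac{\partial l}{\partial p}=\frac{p-y}{p(1-p)}\),

$$\begin{aligned} g(F)=\frac{\partial l}{\partial p}\frac{dp}{dF}=\frac{p-y}{p(1-p)}\cdot 2p(1-p)=2(p-y), \qquad h(F)=\frac{\partial g}{\partial F}=2\frac{dp}{dF}=4p(1-p). \end{aligned}$$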

Assumption 1

In order to avoid numerical stability problems, we clamp the output probabilities \(p_i, i=1,\cdots ,n\):

$$\begin{aligned} p_i=\left\{ \begin{array}{ll} 1-\rho , & (y_i=0)\wedge (p_i > 1-\rho ) \\ \rho , & (y_i=1)\wedge (p_i < \rho )\\ p_i, & \text {otherwise} \end{array} \right. \end{aligned}$$

for some small constant \(\rho \).
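A minimal vectorized sketch of this clamping (an illustrative helper `clamp_probabilities` with a hypothetical \(\rho =10^{-4}\); not taken from the released code) is:

```python
import numpy as np

def clamp_probabilities(p, y, rho=1e-4):
    """Clamp predicted probabilities away from 0/1 as in Assumption 1."""
    p = p.copy()
    p[(y == 0) & (p > 1 - rho)] = 1 - rho   # over-confident negatives
    p[(y == 1) & (p < rho)] = rho           # under-confident positives
    return p

p = np.array([0.999999, 0.00001, 0.3])
y = np.array([0, 1, 1])
print(clamp_probabilities(p, y))   # values become 0.9999, 0.0001, 0.3
```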

Theorem 3

If a sequence \(\epsilon _0 \ge \cdots \ge \epsilon _{t-1} \ge \epsilon _{t} \ge \cdots \ge \epsilon _{T} > 0\) has the recurrence relation \(\epsilon _t \le \epsilon _{t-1}-c\epsilon _{t-1}^2\) for a small constant \(c>0\), then we have an \(O\left( \frac{1}{T}\right) \) convergence rate: \(\epsilon _T \le \frac{\epsilon _0}{1+c\epsilon _0T}\).
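For completeness, the standard argument behind Theorem 3 (not reproduced in [31]'s statement above): since \(\epsilon _t \le \epsilon _{t-1}(1-c\epsilon _{t-1})\) and \(\epsilon _t>0\) forces \(c\epsilon _{t-1}<1\),

$$\begin{aligned} \frac{1}{\epsilon _t} \ge \frac{1}{\epsilon _{t-1}(1-c\epsilon _{t-1})} \ge \frac{1+c\epsilon _{t-1}}{\epsilon _{t-1}} = \frac{1}{\epsilon _{t-1}}+c, \end{aligned}$$

using \(\frac{1}{1-x}\ge 1+x\) for \(0\le x<1\); summing over \(t=1,\cdots ,T\) gives \(\frac{1}{\epsilon _T}\ge \frac{1}{\epsilon _0}+cT\), which is the stated bound.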

Theorem 4

If a sequence \(\epsilon _0 \ge \cdots \ge \epsilon _{t-1} \ge \epsilon _{t} \ge \cdots \ge \epsilon _{T} > 0\) has the recurrence relation \(\epsilon _t \le c\epsilon _{t-1}\) for a small constant \(0<c<1\), then we have a linear convergence rate: \(\epsilon _T \le \epsilon _0 c^T\).

Theorem 5

With the value clamping on \(p_i, i=1,\cdots ,n\), the Newton step \(-\frac{\bar{g}_j}{\bar{h}_j}\) is bounded such that \(|\bar{g}_j /\bar{h}_j| \le 1/(2\rho )\). Here \(\bar{g}_j=\sum _{i\in I_j}g_i\) and \(\bar{h}_j=\sum _{i\in I_j}h_i\) denote the sums of the gradients and Hessians in the \(j\)th tree node.

Theorem 6

GBMs have an \(O(\frac{1}{T})\) convergence rate when using gradient descent, while a linear rate is achieved when using Newton's method.

Theorem 7

Let g, h, and l be the shorthand for gradient, Hessian, and loss, respectively. Then \(\forall p\) (and thus \(\forall F\)), the inequality \(\frac{g^2}{h}\ge l\) always holds.
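A quick check of Theorem 7 for \(y=1\) (the case \(y=0\) is symmetric): by Definition 1, \(\frac{g^2}{h}=\frac{4(p-1)^2}{4p(1-p)}=\frac{1-p}{p}\) and \(l=\log \frac{1}{p}\), so the claim reduces to the elementary inequality \(x-1\ge \log x\) with \(x=\frac{1}{p}\ge 1\):

$$\begin{aligned} \frac{g^2}{h}=\frac{1}{p}-1 \ge \log \frac{1}{p}=l. \end{aligned}$$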

Corollary 1

Connecting Newton's objective reduction and the loss reduction via Theorem 7, we have

$$\begin{aligned} \textbf{g}^{\intercal }\textbf{H}^{-1}\textbf{g} \ge L. \end{aligned}$$
(B15)

Lemma 1

For a set of n instances with labels \(\{0, 1\}^n\) and non-negative weights \(\{w_1, \cdots , w_n\}\), there exists a J-leaf classification tree (i.e., outputting \(\{0, 1\}\) at each leaf) such that the weighted error rate at each leaf is strictly less than \(\frac{1}{2}\) by at least \(\delta > 0\).

Lemma 2

Let \(\textbf{g}=(g_1, \cdots , g_n )^{\intercal }\), \(\textbf{H}=diag(h_1,\cdots ,h_n)\). If the weak learnability assumption holds, then there exists a J-leaf regression tree whose projection matrix \(\textbf{V}\in \mathbb {R}^{n\times J}\) satisfies

$$\begin{aligned} (\textbf{V}^{\intercal }\textbf{g})^{\intercal }(\textbf{V}^{\intercal }\textbf{H}\textbf{V})^{-1}(\textbf{V}^{\intercal }\textbf{g}) \ge \gamma _{*}^2 \textbf{g}^{\intercal }\textbf{H}^{-1}\textbf{g}, \end{aligned}$$
(B16)

where \(\gamma _{*}^2 = \frac{4\delta ^2\rho }{n}\).

Lemma 3

For the current \(F \in \mathbb {R}\) and a step \(f \in \mathbb {R}\), the change of the Hessian can be bounded in terms of step size |f|:

$$\begin{aligned} \frac{h(F+f)}{h(F)} \le e^{2|f|}. \end{aligned}$$
(B17)
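This bound can be seen directly from the link (B13): \(h(F)=4p(1-p)=\frac{4}{(e^{F}+e^{-F})^{2}}=\cosh ^{-2}(F)\), and since \(\cosh (F)\le e^{|f|}\cosh (F+f)\),

$$\begin{aligned} \frac{h(F+f)}{h(F)}=\frac{\cosh ^{2}(F)}{\cosh ^{2}(F+f)}\le e^{2|f|}. \end{aligned}$$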

B.2 New theoretical results proposed in this paper

Assumption 2

In Algorithm 2, \(\mu ^{t}\) is monotonically increasing and can grow without bound as the number of iterations increases, which drives the step \(\textbf{p}_k\) to 0. To avoid this situation, in practice we assume \(\mu ^t\) is bounded, i.e., \(\mu ^t \in [\mu ^0, \mu ^{\max }]\), \(\forall t\).
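Algorithm 2 itself is not reproduced in this appendix, so the following fragment only illustrates the bounding required by Assumption 2: a schematic update in which \(\mu \) is enlarged whenever the reduction ratio is poor and is then clipped to \([\mu ^0, \mu ^{\max }]\); the threshold and growth factor are placeholders, not the paper's exact rule.

```python
def update_mu(mu, ratio, mu0=1.0, mu_max=1e4, grow=2.0):
    """Schematic mu update, clipped to [mu^0, mu^max] per Assumption 2."""
    if ratio < 0.25:          # placeholder threshold, not the paper's rule
        mu = grow * mu        # poor agreement: shrink the step via larger mu
    return min(max(mu, mu0), mu_max)

mu = 1.0
for ratio in [0.1, 0.05, 0.6, 0.02, 0.01]:   # hypothetical reduction ratios
    mu = update_mu(mu, ratio)
print(mu)   # stays within [1.0, 1e4]
```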

Theorem 8

Since \(\mu \) and Hessian h are bounded, there exists a constant \(\lambda \) that satisfies

$$\begin{aligned} \frac{1}{\sum _{i\in I_j}(h_i+\mu )} \ge \lambda \frac{1}{\sum _{i\in I_j}h_i}, \quad \forall j=1,\cdots ,J. \end{aligned}$$
(B18)

Proof

Let \(h^{\max }\) and \(h^{\min }\) be the maximum and minimum values of the Hessian, respectively. The smallest value the left-hand side can take is \(\frac{1}{n(h^{\max }+\mu )}\), and the largest value of \(\frac{1}{\sum _{i\in I_j}h_i}\) on the right-hand side is \(\frac{1}{h^{\min }}\). Hence it suffices to let

$$\begin{aligned} \lambda \le \frac{h^{\min }}{n(h^{\max }+\mu )}, \end{aligned}$$
(B19)

then the inequality always holds.\(\square \)

Lemma 4

Based on Assumption 1, we have

$$\begin{aligned} (\textbf{V}^{\intercal }\textbf{H}_{\xi }\textbf{V})\hat{\textbf{H}}^{-1} \le \tau I. \end{aligned}$$
(B20)

Proof

For the jth leaf \((j = 1, \cdots , J)\), the node wise Hessian before update is

$$\begin{aligned} \bar{h}_j=\sum _{i \in I_j}(h(F_i)+\mu ) \end{aligned}$$
(B21)

while the corresponding mean-value (intermediate-point) Hessian is

$$\begin{aligned} \bar{h}_{\xi , j}=\sum _{i\in I_j}h(F_i+\eta _j), \end{aligned}$$
(B22)

where \(\eta _j\) lies between 0 and the trust-region step \(-\frac{\sum _{i\in I_j}g(F_i)}{\sum _{i \in I_j}(h(F_i)+\mu )}\). Applying Lemma 3, we have

$$\begin{aligned} \frac{\bar{h}_{\xi , j}}{\bar{h}_j} &= \frac{\sum _{i\in I_j}h(F_i+\eta _j)}{\sum _{i \in I_j}(h(F_i)+\mu )} \le \frac{\sum _{i\in I_j}h(F_i+\eta _j)}{\sum _{i \in I_j}h(F_i)} \\ &\le \frac{\sum _{i\in I_j}h(F_i)\,e^{2|\eta _j|}}{\sum _{i \in I_j}h(F_i)} = e^{2|\eta _j|}. \end{aligned}$$
(B23)

According to Theorem 5, we have

$$\begin{aligned} |\eta _j| \le \frac{\left| \sum _{i\in I_j}g(F_i)\right| }{\sum _{i \in I_j}(h(F_i)+\mu )} \le \frac{\left| \sum _{i\in I_j}g(F_i)\right| }{\sum _{i \in I_j}h(F_i)} \le \frac{1}{2\rho }. \end{aligned}$$
(B24)

Rewriting in matrix form, we have

$$\begin{aligned} (\textbf{V}^{\intercal }\textbf{H}_{\xi }\textbf{V})\hat{\textbf{H}}^{-1} \le \tau I, \end{aligned}$$
(B25)

where \(\tau = e^{\frac{1}{2\rho }}\).\(\square \)

Lemma 5

Let \(\textbf{g}=(g_1, \cdots , g_n )^{\intercal }\), \(\textbf{H}=diag(h_1,\cdots ,h_n)\), \(\hat{\textbf{H}}=\textbf{V}^{\intercal }\textbf{H}\textbf{V}+\mu \textbf{I}\). If the weak learnability assumption holds, then there exists a J-leaf regression tree whose projection matrix \(\textbf{V}\in \mathbb {R}^{n\times J}\) satisfies

$$\begin{aligned} (\textbf{V}^{\intercal }\textbf{g})^{\intercal }\hat{\textbf{H}}^{-1}(\textbf{V}^{\intercal }\textbf{g}) \ge \gamma \textbf{g}^{\intercal }\textbf{H}^{-1}\textbf{g} \end{aligned}$$
(B26)

for some constant \(\gamma > 0\).

Proof

For each leaf j, according to Lemma 2, we have

$$\begin{aligned} \frac{(\sum _{i\in I_j}g_i)^2}{\sum _{i \in I_j}h_i} \ge \gamma _{*}^2 \sum _{i\in I_j}\frac{g_i^2}{h_i}. \end{aligned}$$
(B27)

By Theorem 8, for some \(\lambda \), we obtain

$$\begin{aligned} \frac{(\sum _{i\in I_j}g_i)^2}{\sum _{i \in I_j}(h_i+\mu )} \ge \lambda \frac{(\sum _{i\in I_j}g_i)^2}{\sum _{i \in I_j}h_i} \ge \lambda \gamma _{*}^2 \sum _{i\in I_j}\frac{g_i^2}{h_i}. \end{aligned}$$
(B28)

Rewriting in matrix form, we obtain the final result

$$\begin{aligned} (\textbf{V}^{\intercal }\textbf{g})^{\intercal }\hat{\textbf{H}}^{-1}(\textbf{V}^{\intercal }\textbf{g}) \ge \lambda \gamma _{*}^2 \textbf{g}^{\intercal }\textbf{H}^{-1}\textbf{g}. \end{aligned}$$
(B29)

\(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Luo, J., Wei, Z., Man, J. et al. TRBoost: a generic gradient boosting machine based on trust-region method. Appl Intell 53, 27876–27891 (2023). https://doi.org/10.1007/s10489-023-05000-w

