Abstract
We introduce a framework for performing vector-valued regression in finite-dimensional Hilbert spaces. Using Lipschitz smoothness as our regularizer, we leverage Kirszbraun’s extension theorem for off-data prediction. We analyze the statistical and computational aspects of this method—to our knowledge, its first application to supervised learning. We decompose this task into two stages: training (which corresponds operationally to smoothing/regularization) and prediction (which is achieved via Kirszbraun extension). Both are solved algorithmically via a novel multiplicative weight updates (MWU) scheme, which, for our problem formulation, achieves significant runtime speedups over generic interior point methods. Our empirical results indicate a dramatic advantage over standard off-the-shelf solvers in our regression setting.
Change history
20 January 2024
A Correction to this paper has been published: https://doi.org/10.1007/s10107-024-02056-5
Notes
Even when the problem is formally infinite-dimensional, such as with SVM, the Representer Theorem [25] shows that the solution is spanned by the finite sample.
As explained in Sect. 3, there is no need to assume that L is given, as this hyper-parameter can be tuned via Structural Risk Minimization (SRM).
A further improvement via the use of spanners reduces the number of constraints m from \(O(n^2)\) to \(O(n)\), and hence also the ERM runtime, as detailed in Sect. 3.1.
References
Alman, J., Williams, V.V.: A refined laser method and faster matrix multiplication. In: Proceedings of the Symposium on Discrete Algorithms, pp. 522–539. SIAM (2021)
Arora, S., Hazan, E., Kale, S.: The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 8(1), 121–164 (2012)
Arya, S., Mount, D.M., Netanyahu, N., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In: Symposium on Discrete Algorithms, pp. 573–582 (1994)
Ashlagi, Y., Gottlieb, L., Kontorovich, A.: Functions with average smoothness: structure, algorithms, and learning. In: Belkin, M., Kpotufe, S. (eds.) Conference on Learning Theory, COLT 2021, 15–19 August 2021, Boulder, Colorado, USA, PMLR, Proceedings of Machine Learning Research, vol. 134, pp. 186–236. http://proceedings.mlr.press/v134/ashlagi21a.html (2021)
Borchani, H., Varando, G., Bielza, C., Larrañaga, P.: A survey on multi-output regression. Wiley Interdiscip Rev Data Min Knowl Discov 5(5), 216–233 (2015)
Boyd, S., Vandenberghe, L.: Convex Optimization. Information Science and Statistics. Cambridge University Press, Cambridge (2004)
Brualdi, R.A.: Introductory Combinatorics, 5th edn. Pearson Prentice Hall, Upper Saddle River (2010)
Brudnak, M.: Vector-valued support vector regression. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, pp. 1562–1569. IEEE (2006)
Bunch, J.R., Hopcroft, J.E.: Triangular factorization and inversion by fast matrix multiplication. Math. Comput. 28(125), 231–236 (1974)
Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 (1997)
Chen, S., Banerjee, A.: An improved analysis of alternating minimization for structured multi-response regression. In: Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, pp. 6617–6628. Curran Associates Inc., USA (2018). http://dl.acm.org/citation.cfm?id=3327757.3327768
Christiano, P., Kelner, J.A., Madry, A., Spielman, D.A., Teng, S.H.: Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In: Proceedings of Symposium on Theory of Computing, pp. 273–282 (2011)
Cohen, D.T., Kontorovich, A.: Learning with metric losses. In: Loh, P., Raginsky, M. (eds.) Conference on Learning Theory, 2–5 July 2022, London, UK, PMLR, Proceedings of Machine Learning Research, vol. 178, pp. 662–700 (2022). https://proceedings.mlr.press/v178/cohen22a.html
Cole, R., Gottlieb, L.A.: Searching dynamic point sets in spaces with bounded doubling dimension. In: Proceedings of the Symposium on Theory of Computing, pp. 574–583 (2006)
Davidson, R., MacKinnon, J.G.: Estimation and Inference in Econometrics. Oxford University Press, New York (1993)
Gottlieb, L.A., Kontorovich, A., Krauthgamer, R.: Adaptive metric dimensionality reduction (extended abstract: ALT 2013). Theoretical Computer Science, pp. 105–118 (2016)
Gottlieb, L.A., Kontorovich, A., Krauthgamer, R.: Efficient regression in metric spaces via approximate Lipschitz extension. IEEE Trans. Inf. Theory 63(8), 4838–4849 (2017)
Greene, W.H.: Econometric Analysis. Pearson Education India (2003)
Greene, W.H.: Econometric Analysis. Pearson, Boston (2012)
Györfi, L., Kohler, M., Krzyzak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer, Cham (2006)
Hanneke, S., Kontorovich, A., Kornowski, G.: Near-optimal learning with average Hölder smoothness. CoRR arXiv:2302.06005 (2023)
Har-Peled, S., Mendel, M.: Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comput. 35(5), 1148–1184 (2006)
Jain, P., Tewari, A.: Alternating minimization for regression problems with vector-valued outputs. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 1126–1134. Curran Associates, Inc. (2015). http://papers.nips.cc/paper/5820-alternating-minimization-for-regression-problems-with-vector-valued-outputs.pdf
Kakade, S., Tewari, A.: Dudley’s theorem, fat shattering dimension, packing numbers. Lecture 15, Toyota Technological Institute at Chicago (2008). http://ttic.uchicago.edu/~tewari/lectures/lecture15.pdf
Kimeldorf, G.S., Wahba, G.: A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Stat. 41(2), 495–502 (1970). https://doi.org/10.1214/aoms/1177697089
Kirszbraun, M.: Über die zusammenziehende und Lipschitzsche transformationen. Fundam. Math. 22(1), 77–108 (1934)
Koutis, I., Miller, G.L., Peng, R.: A fast solver for a class of linear systems. Commun. ACM 55(10), 99–107 (2012)
Mahabadi, S., Makarychev, K., Makarychev, Y., Razenshteyn, I.: Nonlinear dimension reduction via outer bi-lipschitz extensions. In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1088–1101. ACM (2018)
McShane, E.J.: Extension of range of functions. Bull. Am. Math. Soc. 40(12), 837–842 (1934). https://projecteuclid.org:443/euclid.bams/1183497871
Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations Of Machine Learning. The MIT Press, Cambridge (2012)
Nadaraya, E.A.: Nonparametric Estimation of Probability Densities and Regression Curves. Springer, Cham (1989)
Naor, A.: Metric embeddings and Lipschitz extensions (2015)
Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)
Rudin, W.: Principles of Mathematical Analysis. International Series in Pure and Applied Mathematics, 3rd edn. McGraw-Hill Book Co., New York (1976)
Spielman, D.A., Teng, S.H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the Symposium on Theory of Computing, vol. 4 (2004)
Whitney, H.: Analytic extensions of differentiable functions defined in closed sets. Trans. Am. Math. Soc. 36(1), 63–89 (1934). http://www.jstor.org/stable/1989708
Acknowledgements
AK was partially supported by the Israel Science Foundation (Grant No. 1602/19), the Ben-Gurion University Data Science Research Center, and an Amazon Research Award. HZ was an MSc student at Ben-Gurion University of the Negev during part of this research. YM was partially supported by NSF awards CCF-1718820, CCF-1955173, and CCF-1934843.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: Appendix E contained response of the referees, but it was inadvertently published. Now, it has been removed.
Appendices
Appendix A: The Arora–Hazan–Kale result
Intuitive explanation of the result Consider a feasibility problem over a convex domain \(\mathscr {P}\subseteq \mathbb {R}^n\) in which one must determine whether there exists \(\textbf{X}\in \mathscr {P}\) satisfying a finite set of constraints \(f_i(\textbf{X}) \ge 0\) for all \(i\in [m]\). In general, this problem is NP-hard. The Arora–Hazan–Kale result states that, under certain conditions, if we can approximately solve the simpler problem \(\exists ?\,\textbf{X}\in \mathscr {P}:\> \sum _i w_i f_i(\textbf{X}) \ge 0\), where \(\sum _i w_i = 1\), via an algorithm which we call the Oracle, then we can run T iterations over the simpler problem, where in each iteration we update the weights \(\{w_i\}_{i\in [m]}\) using the MWU framework and solve the updated problem using the Oracle. If a solution \(\textbf{X}^{(t)}\in \mathscr {P}\) exists in every iteration, then \(\textbf{X}^* = \frac{1}{T}\sum _{t=1}^T \textbf{X}^{(t)}\) is an approximate solution to the original feasibility problem. The conditions and definitions underlying this intuitive explanation are made precise below. We present our algorithm, show that our problem fits the paradigm of [2, Sec. 3.3.1, p. 137], show that our oracle solves the simpler problem (given in the algorithm), and prove that our algorithms solve the feasibility problems efficiently. For completeness, we quote here the relevant definitions and results from [2, Sec. 3.3.1, p. 137].
Consider the following feasibility problem. Let \(\mathscr {P}\subseteq \mathbb {R}^n\) be a convex domain and let \(f_i:\mathscr {P}\rightarrow \mathbb {R}\), \(i\in [m]\), be concave functions. The goal is to determine whether there exists \(\textbf{X}\in \mathscr {P}\) such that \(f_i(\textbf{X}) \ge 0\) for all \(i\in [m]\):
The multiplicative weights update method of Arora, Hazan, and Kale [2] provides an algorithm that solves (3) approximately, up to an additive error of \(\varepsilon \). We assume the existence of an oracle that, given a probability distribution \(\textbf{w} = (w_1,w_2,\ldots ,w_m)\), solves the following feasibility problem:
Definition 1
We say that an oracle Oracle is \((\ell ,\rho )\)-bounded if it always returns \(\textbf{X}\in \mathscr {P}\) such that \(f_i(\textbf{X}) \in [-\rho ,\ell ]\) for all \(i\in [m]\). The width of the oracle is \(\rho +\ell \).
Remark 1
Note that if \(f_i(\textbf{X}) \in [-\rho ,\ell ]\) for all \(\textbf{X}\in \mathscr {P}\) then every oracle for the problem is \((\ell ,\rho )\)-bounded.
Definition 2
We say that an oracle Oracle is \(\varepsilon \)-approximate if, given \(\textbf{w} = (w_1,\dots , w_m)\), it either finds a solution \(\textbf{X}\in \mathscr {P}\) such that \(\sum _i w_i f_i(\textbf{X}) \ge - \varepsilon \), or correctly concludes that (4) has no feasible solution.
Consider the following algorithm.
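Algorithm 4 itself is displayed as a figure; as a rough illustration of the scheme it implements, the following Python sketch shows a generic MWU feasibility loop around an abstract Oracle. All names are ours, and the parameter choices and update rule follow [2] and Theorem 6 only approximately; this is a sketch, not the paper's implementation.

```python
import numpy as np

def mwu_feasibility(oracle, fs, m, ell, rho, eps):
    """Generic MWU feasibility loop in the spirit of [2] (illustrative only).

    oracle(w) : returns some X in P with sum_i w_i * f_i(X) >= -eps/3,
                or None if the weighted relaxation is infeasible.
    fs        : list of the m concave constraint functions f_i.
    ell, rho  : the oracle is (ell, rho)-bounded, i.e. f_i(X) in [-rho, ell].
    eps       : target additive error.
    """
    eta = eps / (6.0 * ell)                                  # step size as in Theorem 6
    T = int(np.ceil(18.0 * rho * ell * np.log(m) / eps**2))  # iteration count as in Theorem 6
    w = np.ones(m) / m                                       # uniform initial weights
    iterates = []
    for _ in range(T):
        X = oracle(w)                          # solve the weighted relaxation
        if X is None:                          # relaxation infeasible =>
            return None                        # original problem infeasible
        iterates.append(X)
        vals = np.array([f(X) for f in fs])    # constraint values f_i(X)
        w = w * (1.0 - eta * vals / rho)       # well-satisfied constraints lose weight,
        w = w / w.sum()                        # violated ones gain weight; renormalize
    # by concavity of the f_i, the average iterate is eps-feasible
    return np.mean(np.stack(iterates), axis=0)
```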
We now state theorems providing performance guarantees for Algorithm 4. Theorems 5 and 6 cover the cases of an exact and an \(\varepsilon \)-approximate oracle, respectively.
Theorem 5
(Theorem 3.4 in [2], restated) Let \(\varepsilon >0\) be a given error parameter. Suppose that there exists an \((\ell ,\rho )\)-bounded Oracle for the feasibility problem (3) with \(\ell \ge \varepsilon /2\). Then Algorithm 4 either
-
solves problem (3) up to an additive error of \(\varepsilon \); that is, finds a solution \(\textbf{X}^*\in \mathscr {P}\) such that \(f_i(\textbf{X}^*) \ge -\varepsilon \) for all \(i\in [m]\),
-
or correctly concludes that problem (3) is infeasible,
making only \(O(\ell \rho \log (m)/\varepsilon ^2)\) calls to the Oracle, with an additional processing time of O(m) per call.
Theorem 6
(Theorem 3.5 in [2], restated) Let \(\varepsilon >0\) be a given error parameter. Suppose that there exists an \((\ell ,\rho )\)-bounded \((\varepsilon /3)\)-approximate Oracle for the feasibility problem (3) with \(\ell \ge \varepsilon /2\). Consider Algorithm 4 with adjusted parameters \(\eta = \frac{\varepsilon }{6\ell }\) and \(T = \lceil \frac{18 \rho \ell \ln m}{\varepsilon ^2}\rceil \). Then the algorithm either
-
solves problem (3) up to an additive error of \(\varepsilon \); that is, finds a solution \(\textbf{X}^*\in \mathscr {P}\) such that \(f_i(\textbf{X}^*) \ge -\varepsilon \) for all \(i\in [m]\),
-
or correctly concludes that problem (3) is infeasible,
making only \(O(\ell \rho \log (m)/\varepsilon ^2)\) calls to the Oracle, with an additional processing time of O(m) per call.
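For intuition, instantiating Theorem 6 with the oracle constructed in Appendix B below, which has \(\ell = 1\) and width parameter \(\rho = O(\sqrt{m/\varepsilon })\), gives an oracle-call count of

$$\begin{aligned} T \;=\; \left\lceil \frac{18\,\rho \,\ell \,\ln m}{\varepsilon ^{2}}\right\rceil \;=\; O\!\left( \frac{\sqrt{m/\varepsilon }\,\log m}{\varepsilon ^{2}}\right) \;=\; O\!\left( \frac{\sqrt{m}\,\log m}{\varepsilon ^{5/2}}\right) . \end{aligned}$$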
Appendix B: Approximate oracle
This section is a continuation of Sect. 3.1, the analysis of the LipschitzSmooth algorithm in Theorem 1. To use the MWU method (see Theorem 6, restating Theorem 3.5 in [2]), we design an approximate oracle for the following problem.
Problem 3
Given non-negative edge weights \(w_{\Phi }\) and \(w_{ij}\), which add up to 1, find \(\widetilde{\textbf{Y}}\) such that
If Problem 3 has a feasible solution, the oracle finds a solution \(\widetilde{\textbf{Y}}\) such that
Let \(\mu _{ij} = \frac{w_{ij} + \varepsilon /(m+1)}{M^2\Vert x_i - x_j\Vert ^2}\) and \(\lambda _i = \lambda = (w_{\Phi } + \varepsilon /(m+1))/\Phi _0\). We solve Laplace’s problem with parameters \(\mu _{ij}\) and \(\lambda _i\) (see Sect. 1 and Line 9 of the algorithm). We get a matrix \(\widetilde{\textbf{Y}} = (\tilde{y}_1,\dots , \tilde{y}_n)\) minimizing
Consider the optimal solution \(\tilde{y}_1^*,\dots , \tilde{y}_n^*\) for Lipschitz Smoothing. We have
We verify that \(\widetilde{\textbf{Y}}\) is a feasible solution for Problem 3. We have
as required.
Finally, we bound the width of the problem. We have \(h_{\Phi }(\widetilde{\textbf{Y}}) \le 1\) and \(h_{ij}(\widetilde{\textbf{Y}}) \le 1\). Then, using (7), we get
Therefore, \(-h_{\Phi }(\widetilde{\textbf{Y}}) \le O(\sqrt{m/\varepsilon })\).
Similarly,
Therefore, \(-h_{ij}(\widetilde{\textbf{Y}}) \le O(\sqrt{m/\varepsilon })\).
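The Laplace-type problem solved by the oracle reduces to a weighted least-squares problem whose optimality conditions form a graph-Laplacian linear system, shared across all output coordinates. The sketch below assumes the quadratic objective \(\lambda \sum _i \Vert \tilde{y}_i - y_i\Vert ^2 + \sum _{i<j} \mu _{ij}\Vert \tilde{y}_i - \tilde{y}_j\Vert ^2\); this form, and all names in the code, are our assumptions for illustration rather than the paper's exact routine.

```python
import numpy as np

def laplace_oracle(Y, mu, lam):
    """Illustrative solver for a Laplace-type smoothing problem of the assumed form

        minimize_{Ytil}  lam * sum_i ||ytil_i - y_i||^2
                         + sum_{i<j} mu[i, j] * ||ytil_i - ytil_j||^2.

    Y   : (n, b) array of observed labels y_i.
    mu  : (n, n) symmetric nonnegative array of pair weights mu_ij
          (zero where there is no constraint).
    lam : scalar fidelity weight lambda.

    Setting the gradient to zero gives the normal equations
    (lam * I + L) Ytil = lam * Y, where L is the graph Laplacian of mu.
    """
    n = Y.shape[0]
    L = np.diag(mu.sum(axis=1)) - mu     # graph Laplacian of the weights mu_ij
    A = lam * np.eye(n) + L              # symmetric positive definite system matrix
    return np.linalg.solve(A, lam * Y)   # one SPD solve, shared across all b coordinates
```

In practice one would solve this system with a fast Laplacian-type solver (cf. [38, 46]) rather than a dense factorization.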
Appendix C: Generalization bounds
Recall the following statistical setting of Sect. 3. We are given a labeled sample \((x_i,y_i)_{i\in [n]}\), where \(x_i\in X:=\mathbb {R}^a\) and \(y_i\in Y:=\mathbb {R}^b\). For a user-specified Lipschitz constant \(L>0\), we compute the (approximate) Empirical Risk Minimizer (ERM) \(\hat{f}:=\hbox {argmin}_{f\in F_L}\hat{R}_n(f)\) over \(F_L:=\{ f\in Y^X: \left\| f \right\| _{\text {{\tiny Lip }}}\le L \}\). A standard method for tuning \(L\) is via Structural Risk Minimization (SRM): One computes a generalization bound \(R(\hat{f})\le \hat{R}_n(\hat{f})+Q_n(a,b,L)\), where \(Q_n(a,b,L):=\sup _{f\in F_L}|R(f)-\hat{R}_n(f)|=O(L\,n^{-1/(a+b+1)})\), and chooses \(\hat{L}\) to minimize this. In this section, we will derive the aforementioned bound.
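Before doing so, here is a concrete illustration of the SRM tuning step just described. Both erm_fit and penalty below are placeholders (our notation) for the ERM procedure of Sect. 3 and the bound \(Q_n(a,b,L)\) derived in this appendix; the sketch only shows how \(\hat{L}\) would be selected from a grid.

```python
import numpy as np

def srm_select_L(X, Y, L_grid, erm_fit, penalty):
    """Structural Risk Minimization over a grid of candidate Lipschitz constants.

    erm_fit(X, Y, L) -> (f_hat, emp_risk) : placeholder for the (approximate) ERM
                                            procedure of Sect. 3 over F_L (assumed).
    penalty(n, a, b, L) -> float          : placeholder for the bound Q_n(a, b, L).
    Returns (f_hat, L_hat) minimizing empirical risk + penalty over the grid.
    """
    n, a = X.shape
    b = Y.shape[1]
    best, best_score = None, np.inf
    for L in L_grid:
        f_hat, emp_risk = erm_fit(X, Y, L)       # approximate ERM over F_L
        score = emp_risk + penalty(n, a, b, L)   # SRM objective: R_hat(f_hat) + Q_n
        if score < best_score:
            best, best_score = (f_hat, L), score
    return best
```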
Let \( B_X \subset \mathbb {R}^k \) and \(B_Y \subset \mathbb {R}^\ell \) be the unit balls of their respective Hilbert spaces (each endowed with the \(\ell _2\) norm \(||\cdot ||\) and corresponding inner product) and \(\mathscr {H}_L \subset {B_Y}^{B_X}\) be the set of all L-Lipschitz mappings from \(B_X\) to \(B_Y\). In particular, every \(h\in \mathscr {H}_L\) satisfies
Let \(\mathscr {F}_L\subset \mathbb {R}^{B_X \times B_Y}\) be the loss class associated with \(\mathscr {H}_L\):
In particular, every \(f\in \mathscr {F}_L\) satisfies \(0\le f\le 2\).
Our goal is to bound the Rademacher complexity of \(\mathscr {F}_L\). We do this via a covering numbers approach:
The empirical Rademacher complexity of a collection of functions \(\mathscr {F}\), mapping some set \({\textbf {A}} = (a_1,\dots ,a_n)\in A^n\) to \(\mathbb {R}\), is defined by

$$\begin{aligned} \hat{\mathscr {R}}_n(\mathscr {F}) \;=\; \mathop {\mathbb {E}}_{\sigma }\left[ \sup _{f\in \mathscr {F}}\frac{1}{n}\sum _{i=1}^{n}\sigma _i f(a_i)\right] , \end{aligned}$$

where \(\sigma _1, \sigma _2, \dots , \sigma _n\) are independent random variables drawn from the Rademacher distribution: \(\Pr (\sigma _{i}=+1)=\Pr (\sigma _{i}=-1)=1/2\). Recall the relevance of Rademacher complexities to uniform deviation estimates for the risk functional \(R(\cdot )\) [30, Theorem 3.1]: for every \(\delta >0\), with probability at least \(1-\delta \), for each \(h\in \mathscr {H}_L\):
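Up to absolute constants \(c_1, c_2\) (our notation, since the losses here take values in \([0,2]\) rather than \([0,1]\)), such a bound has the standard form

$$\begin{aligned} R(h) \;\le \; \hat{R}_n(h) \;+\; c_1\,\hat{\mathscr {R}}_n(\mathscr {F}_L) \;+\; c_2\sqrt{\frac{\log (1/\delta )}{n}}. \end{aligned}$$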
Define \(Z=B_X\times B_Y\) and endow it with the norm \(\left\| (x,y) \right\| _Z=\left\| x \right\| +\left\| y \right\| \); note that \((Z,\left\| \cdot \right\| _Z)\) is a Banach but not a Hilbert space. First, we observe that the functions in \(\mathscr {F}_L\) are Lipschitz under \(\left\| \cdot \right\| _Z\). Indeed, choose any \(f=f_h\in \mathscr {F}_L\) and \(x,x'\in B_X\), \(y,y'\in B_Y\). Then
where \(a\vee b:=\max \left\{ a,b\right\} \). We conclude that any \(f\in \mathscr {F}_L\) is \((L\vee 1)\)-Lipschitz, and a fortiori \((L+1)\)-Lipschitz, under \(\left\| \cdot \right\| _Z\).
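For concreteness, assuming (consistently with the range \([0,2]\) noted above) that the loss functions take the form \(f_h(x,y)=\Vert h(x)-y\Vert \), the chain of inequalities behind this conclusion is

$$\begin{aligned} |f_h(x,y)-f_h(x',y')|&= \bigl |\,\Vert h(x)-y\Vert -\Vert h(x')-y'\Vert \,\bigr | \le \Vert h(x)-h(x')\Vert +\Vert y-y'\Vert \\&\le L\Vert x-x'\Vert +\Vert y-y'\Vert \le (L\vee 1)\bigl (\Vert x-x'\Vert +\Vert y-y'\Vert \bigr ) = (L\vee 1)\,\Vert (x,y)-(x',y')\Vert _Z. \end{aligned}$$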
Since we restricted the domain and range of \(\mathscr {H}_L\), respectively, to the unit balls \(B_X\) and \(B_Y\), the domain of \(\mathscr {F}_L\) becomes \(B_Z:=B_X\times B_Y\) and its range is [0, 2]. Let us recall some basic facts about the \(\ell _2\) covering of the k-dimensional unit ball
an analogous bound holds for \(\mathscr {N}(t,B_Y,\left\| \cdot \right\| )\). Now if \(\mathscr {C}_X\) is a collection of balls, each of diameter at most t, that covers \(B_X\) and \(\mathscr {C}_Y\) is a similar collection covering \(B_Y\), then clearly the collection of sets
covers \(B_Z\). Moreover, each \(E\in \mathscr {C}_Z\) is a ball of diameter at most 2t in \((Z,\left\| \cdot \right\| _Z)\). It follows that
Next, we endow \(\mathscr {F}_L\) with the \(\ell _\infty \) norm, and use a Kolmogorov–Tihomirov type covering estimate (see, e.g., [16, Lemma 4.2]):
Finally, we invoke a standard result bounding the Rademacher complexity in terms of the covering numbers via the so-called Dudley entropy integral [24],
The estimate in (12) is computed, e.g., in [16, Theorem 4.3]:
Putting \(d = k+\ell \) and combining (12) with (11) yields our generalization bound: with probability at least \(1-\delta \),
Appendix D: Additional experiments
For completeness, we include here the comparison of results from the experiment with \(f(x) = \sin (x)\) for \( x\in [-2\pi ,2\pi ]\) (Tables 7, 8, 9, 10, 11 and Fig. 1).
About this article
Cite this article
Zaichyk, H., Biess, A., Kontorovich, A. et al. Efficient Kirszbraun extension with applications to regression. Math. Program. 207, 617–642 (2024). https://doi.org/10.1007/s10107-023-02023-6