Abstract
Hybrid switching—in which a high bandwidth circuit switch (optical or wireless) is used in conjunction with a low bandwidth packet switch—is a promising alternative to interconnect servers in today’s large-scale data centers. Circuit switches offer a very high link rate, but incur a non-trivial reconfiguration delay which makes their scheduling challenging. In this paper, we demonstrate a lightweight, simple and nearly optimal scheduling algorithm that trades off reconfiguration costs with the benefits of reconfiguration that match the traffic demands. Seen alternatively, the algorithm provides a fast and approximate solution toward a constructive version of Carathéodory’s Theorem for the Birkhoff polytope. The algorithm also has strong connections to submodular optimization, achieves a performance at least half that of the optimal schedule and strictly outperforms the state of the art in a variety of traffic demand settings. These ideas naturally generalize: we see that indirect routing leads to exponential connectivity; this is another phenomenon of the power of multi-hop routing, distinct from the well-known load balancing effects.
Similar content being viewed by others
Notes
Servers with 10 Gbps network interfaces are common today and 40/100 Gbps servers are being deployed.
Our work is orthogonal to how the controller obtains the traffic demand estimate. For example, the nodes could simply report their backlogs before each scheduling window, or a more sophisticated prediction algorithm could be used.
Note that in addition to the matching edges E, G also contains edges connecting nodes representing the same server across multiple matching rounds.
This also implies the LP can be solved in polynomial time using the Ellipsoidal method [27].
Despite fast algorithms to compute the schedule under fixed matchings, it is not clear how to perform the search over the space of matching sequences efficiently.
This requirement can be accommodated in practice since networks are often severely over-provisioned.
Indirect routing in a distributed setting but without consideration of switch reconfiguration delay was studied in a recent work [10].
References
Ahuja, R., Magnanti, T., Orlin, J.: Network Flows: Theory, Algorithms, and Applications. Prentice Hall (1993). https://books.google.com/books?id=WnZRAAAAMAAJ
Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., Vahdat, A.: Hedera: dynamic flow scheduling for data center networks. NSDI 10, 19–19 (2010)
Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M.: Data center TCP (DCTCP). In: SIGCOMM (2011)
Arora, S., Hazan, E., Kale, S.: The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 8(1), 121–164 (2012)
Azar, Y., Gamzu, I.: Efficient submodular function maximization under linear packing constraints. In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) Automata, Languages, and Programming, pp. 38–50. Springer, Berlin, Heidelberg (2012)
Barman, S.: Approximating Nash equilibria and dense bipartite subgraphs via an approximate version of Carathéodory’s theorem. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 361–369. ACM (2015)
Barnhart, C., Sheffi, Y.: A network-based primal-dual heuristic for the solution of multicommodity network flow problems. Transp. Sci. 27(2), 102–117 (1993)
Benson, T., Akella, A., Maltz, D.A.: Network traffic characteristics of data centers in the wild. In: SIGCOMM (2010)
Bienstock, D., Chopra, S., Günlük, O., Tsai, C.Y.: Minimum cost capacity installation for multicommodity network flows. Math. Program. 81(2), 177–199 (1998)
Cao, Z., Kodialam, M., Lakshman, T.: Joint static and dynamic traffic scheduling in data center networks. In: INFOCOM (2014)
Chang, C.S., Chen, W.J., Huang, H.Y.: Birkhoff–von Neumann input buffered crossbar switches. In: INFOCOM (2000)
Chang, C.S., Lee, D.S., Jou, Y.S.: Load balanced Birkhoff–von Neumann switches. In: 2001 IEEE Workshop on High Performance Switching and Routing, pp. 276–280. IEEE (2001)
Dasylva, A., Srikant, R.: Optimal wdm schedules for optical star networks. IEEE/ACM Trans. Netw. (TON) 7(3), 446–456 (1999)
Duan, R., Pettie, S.: Linear-time approximation for maximum weight matching. J. ACM 61(1), 1:1–1:23 (2014). doi:10.1145/2529989
Duan, R., Su, H.H.: A scaling algorithm for maximum weight matching in bipartite graphs. In: SODA (2012)
Farrington, N.: Optics in data center network architecture. Ph.D. thesis, Citeseer (2012)
Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H.H., Subramanya, V., Fainman, Y., Papen, G., Vahdat, A.: Helios: a hybrid electrical/optical switch architecture for modular data centers. In: SIGCOMM (2011)
Felzenszwalb, P.F., Zabih, R.: Dynamic programming and graph algorithms in computer vision. IEEE Trans. Pattern Anal. Mach. Intell. 33(4), 721–740 (2011)
Ferreira, R.P.M., Luna, H.P.L., Mahey, P., Souza, M.C.D.: Global optimization of capacity expansion and flow assignment in multicommodity networks. Pesquisa Operacional 33(2), 217–234 (2013)
Fleischer, L.K.: Approximating fractional multicommodity flow independent of the number of commodities. SIAM J. Discrete Math. 13(4), 505–520 (2000)
Fredman, M.L., Tarjan, R.E.: Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM (JACM) 34(3), 596–615 (1987)
Fu, S., Wu, B., Jiang, X., Pattavina, A., Zhang, L., Xu, S.: Cost and delay tradeoff in three-stage switch architecture for data center networks. In: HPSR (2013)
Garg, N., Koenemann, J.: Faster and simpler algorithms for multicommodity flow and other fractional packing problems. SIAM J. Comput. 37(2), 630–652 (2007)
Giaccone, P., Prabhakar, B., Shah, D.: Randomized scheduling algorithms for high-aggregate bandwidth switches. IEEE J. Sel. Areas Commun. 21(4), 546–559 (2003)
Gopal, I.S., Wong, C.K.: Minimizing the number of switchings in an SS/TDMA system. IEEE Trans. Commun. 33(6), 497–501 (1985)
Greenberg, A., Lahiri, P., Maltz, D.A., Patel, P., Sengupta, S.: Towards a next generation data center architecture: scalability and commoditization. In: Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow, pp. 57–62. ACM (2008)
Grötschel, M., Lovász, L., Schrijver, A.: The ellipsoid method and its consequences in combinatorial optimization. Combinatorica 1(2), 169–197 (1981)
Grötschel, M., Lovász, L., Schrijver, A.: Geometric Algorithms and Combinatorial Optimization. Algorithms and Combinatorics. Springer (1993). https://books.google.com/books?id=agLvAAAAMAAJ
Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S.R., Longtin, J.P., Shah, H., Tanwer, A.: Firefly: a reconfigurable wireless data center fabric using free-space optics. In: SIGCOMM (2014)
Hoory, S., Linial, N., Wigderson, A.: Expander graphs and their applications. Bull. Am. Math. Soc. 43(4), 439–561 (2006)
Inukai, T.: An efficient SS/TDMA time slot assignment algorithm. IEEE Trans. Commun. 27(10), 1449–1455 (1979)
Kandula, S., Padhye, J., Bahl, P.: Flyways to de-congest data center networks. In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) Springer, Berlin, Heidelberg (2009)
Keslassy, I., Chang, C.S., McKeown, N., Lee, D.S.: Optimal load-balancing. In: INFOCOM (2005). https://www.microsoft.com/en-us/research/publication/flyways-to-de-congest-data-center-networks/
Keslassy, I., Kodialam, M., Lakshman, T., Stiliadis, D.: On guaranteed smooth scheduling for input-queued switches. In: INFOCOM (2003)
Keslassy, I., Zhang-Shen, R., McKeown, N.: Maximum size matching is unstable for any packet switch. IEEE Commun. Lett. 7(10), 496–498 (2003)
Leighton, T., Makedon, F., Plotkin, S., Stein, C., Tardos, É., Tragoudas, S.: Fast approximation algorithms for multicommodity flow problems. J. Comput. Syst. Sci. 50(2), 228–243 (1995)
Li, X., Hamdi, M.: On scheduling optical packet switches with reconfiguration delay. IEEE J. Sel. Areas Commun. 21(7), 1156–1164 (2003)
Li, Y., Panwar, S., Chao, H.J.: Frame-based matching algorithms for optical switches. In: 2003 HPSR Workshop on High Performance Switching and Routing, pp. 97–102. IEEE (2003)
Liu, H., Lu, F., Forencich, A., Kapoor, R., Tewari, M., Voelker, G.M., Papen, G., Snoeren, A.C., Porter, G.: Circuit switching under the radar with reactor. In: NSDI (2014)
Liu, H., Mukerjee, M.K., Li, C., Feltman, N., Papen, G., Savage, S., Seshan, S., Voelker, G.M., Andersen, D.G., Kaminsky, M., Porter, G., Snoeren, A.C.: Scheduling techniques for hybrid circuit/packet networks. In: ACM CoNEXT (2015)
Mahey, P., Benchakroun, A., Boyer, F.: Capacity and flow assignment of data networks by generalized benders decomposition. J. Global Optim. 20(2), 169–189 (2001)
McKeown, N.: The iSLIP scheduling algorithm for input-queued switches. IEEE/ACM Trans. Netw. 7(2), 188–201 (1999)
McKeown, N., Mekkittikul, A., Anantharam, V., Walrand, J.: Achieving 100% throughput in an input-queued switch. IEEE Trans. Commun. 47(8), 1260–1267 (1999)
Mekkittikul, A., McKeown, N.: A practical scheduling algorithm to achieve 100% throughput in input-queued switches. In: INFOCOM (1998)
Mirrokni, V., Leme, R.P., Vladu, A., Wong, S.C.W.: Tight bounds for approximate Carathéodory and beyond. arXiv preprint arXiv:1512.08602 (2015)
Pettie, S., Sanders, P.: A simpler linear time 2/3-\(\varepsilon \) approximation for maximum weight matching. Inf. Process. Lett. 91(6), 271–276 (2004)
Pinsker, M.S.: On the complexity of a concentrator. In: 7th International Telegraffic Conference, Vol. 4, pp. 1–318. Citeseer (1973)
Porter, G., Strong, R., Farrington, N., Forencich, A., Chen-Sun, P., Rosing, T., Fainman, Y., Papen, G., Vahdat, A.: Integrating microsecond circuit switching into the data center. In: SIGCOMM (2013)
Prabhakar, B., McKeown, N.: On the speedup required for combined input-and output-queued switching. Automatica 35(12), 1909–1920 (1999)
Rabin, M.O.: Efficient dispersal of information for security, load balancing, and fault tolerance. J. ACM (JACM) 36(2), 335–348 (1989)
Roy, A., Zeng, H., Bagga, J., Porter, G., Snoeren, A.C.: Inside the social network’s (datacenter) network. In: SIGCOMM (2015)
Schrijver, A.: Combinatorial Optimization—Polyhedra and Efficiency. Springer, Berlin (2003)
Shieh, A., Kandula, S., Greenberg, A.G., Kim, C.: Seawall: performance isolation for cloud datacenter networks. In: HotCloud (2010)
Singla, A., Singh, A., Chen, Y.: OSA: an optical switching architecture for data center networks with unprecedented flexibility. In: NSDI (2012)
Srikant, R., Ying, L.: Communication Networks: An Optimization, Control and Stochastic Networks Perspective. Cambridge University Press, New York, NY (2014)
Towles, B., Dally, W.J.: Guaranteed scheduling for switches with configuration overhead. IEEE/ACM Trans. Netw. 11(5), 835–847 (2003)
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
Wang, C.H., Javidi, T.: Adaptive policies for scheduling with reconfiguration delay: an end-to-end solution for all-optical data centers. ArXiv preprint arXiv:1511.03417 (2015)
Wang, C.H., Javidi, T., Porter, G.: End-to-end scheduling for all-optical data centers. In: 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 406–414. IEEE (2015)
Wang, G., Andersen, D.G., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M., Ryan, M.: c-through: part-time optics in data centers. In: SIGCOMM (2011)
Wu, B., Yeung, K.L.: Nxg05-6: minimum delay scheduling in scalable hybrid electronic/optical packet switches. In: GLOBECOM (2006)
Wu, B., Yeung, K.L., Wang, X.: Nxg06-4: improving scheduling efficiency for high-speed routers with optical switch fabrics. In: GLOBECOM (2006)
Zhou, X., Zhang, Z., Zhu, Y., Li, Y., Kumar, S., Vahdat, A., Zhao, B.Y., Zheng, H.: Mirror mirror on the ceiling: flexible wireless links for data centers. In: SIGCOMM (2012)
Acknowledgements
The authors would like to thank Prof. Chandra Chekuri, Prof. George Porter and Prof. R. Srikant for the many helpful discussions. This work was partially supported by NSF Grants CCF-1409106, NeTS-1718270 and Army Grant W911NF-14-1-0220.
Author information
Authors and Affiliations
Corresponding author
Appendix: Direct routing—proofs
Appendix: Direct routing—proofs
1.1 Proof of Theorem 1
Proof
We first note that the throughput of an empty schedule is zero, i.e., \(f(\{\}) = 0\). Also, for any \(S \subseteq S' \in 2^{\mathbb {Z}\times \mathcal {M}}\), we have \(\min \left\{ \sum _{(\alpha ,M) \in S}\alpha M, T\right\} \le \min \left\{ \sum _{(\alpha ,M) \in S'}\alpha M, T\right\} \), implying \(f(S) \le f(S')\). Hence, f is normalized and monotone. Next, using the identify
for nonnegative reals a, b, c, we have, for \(S\in 2^{\mathbb {Z} \times \mathcal {M}}\) and \((\alpha _0,M_0)\notin S\),
where \(f_\mathrm{S}((\alpha _0,M_0))\) denotes the incremental marginal value of adding \((\alpha _0,M_0)\) to the set S (see Sect. 3.2). Finally, for \(S \subseteq S' \in 2^{\mathbb {Z}\times \mathcal {M}}\) and \((\alpha _0,M_0) \notin S'\), we have
Combining the above equation with Eq. (13) we get
or, in other words, f is submodular. \(\square \)
1.2 Proof of Theorem 2
Proof
Recall the submodular sum-throughput function f defined in Eq. (2). Let \(\{(\alpha _1,M_1),\ldots ,(\alpha _k,M_k)\}\) be the schedule returned by Algorithm 2. Let \(S_i = \{(\alpha _1,M_1),\ldots ,(\alpha _i,M_i)\}\) denote the schedule computed at the end of i iterations of the while loop and let \(S^*\) denote the optimal schedule. Now, since in the (\(i+1\))-th iteration \((\alpha _{i+1},M_{i+1})\) maximizes \(\frac{\Vert \min (\alpha M, T_\mathrm {rem}(i+1)) \Vert }{(\alpha + \delta )} = \frac{f_{S_i}(\{(\alpha , M)\})}{(\alpha + \delta )}\), we have, for any \((\alpha , M) \notin S_i\),
Now consider \(\mathtt {OPT} - f(S_i)\) for some \(i<k\). Since f is monotone, we have
where \(J^* := S^* \backslash S_i \) denotes the set of matchings that are present in the optimal solution but not in \(S_i\), Eq. (15) follows from Eq. (14), and Eq. (16) follows because \(\sum _{(\alpha ,M)\in J^*}(\alpha + \delta ) \le \sum _{(\alpha ,M)\in S^*}(\alpha + \delta ) \le W\). Next, observe that
where Eq. (17) follows from Eqs. (16) and (18) follows because of the identity \(1-x \le \hbox {e}^{-x}\). Now, since after the k-th iteration the while loop terminates, this implies \(\sum _{i'=1}^k (\alpha _{i'}+\delta ) > W\). However, if the entries of the input traffic matrix T are bounded by \(\epsilon W + \delta \), then no matching has a duration longer than \(\epsilon W\). In particular, \(\alpha _k + \delta \le \epsilon W \Rightarrow \sum _{i'=1}^{k-1}(\alpha _{i'}+\delta ) \ge W(1-\epsilon )\). Thus, setting \(i = k-2\) in Eq. (18), we have
Hence, we conclude \(\mathtt {ALG2} \ge \mathtt {OPT}(1 - \hbox {e}^{-(1-\epsilon )})\). \(\square \)
1.3 Correctness
Consider any traffic matrix \(T\in \mathbb {Z}^{n\times n}\). Let \(\mathcal {T}=\{T(i,j): i,j\le [n]\}\) denote the distinct entries in the matrix T. Then, in the following, we show that the maximizer in
occurs for \(\alpha \in \mathcal {T}\). To do this, for any matching \(M\in \mathcal {M}\) let us define \(f_M(\alpha ) \triangleq \Vert \min (\alpha M, T)\Vert _1\) and let \(f(\alpha ) \triangleq \max _{M\in \mathcal {M}} \frac{f_M(\alpha )}{\alpha + \delta }\). We then have the following proposition.
Proposition 1
\(f_M(\alpha )\) is (i) non-decreasing, (ii) piece-wise linear with corner points from \(\mathcal {T}\) and (iii) concave.
Proof
It is easy to see (i) because if \(\alpha _1\le \alpha _2\) then \(\min (\alpha _1 M,T)\le \min (\alpha _2 M,T)\) entrywise, and hence, \(f_M(\alpha _1) \le f_M(\alpha _2)\). To see (ii), consider any \(t_1 < t_2 \in \mathcal {T}\) such that no other element of \(\mathcal {T}\) is between \(t_1\) and \(t_2\). Then for \(t_1\le \alpha \le t_2\) we have
Thus, \(f_M(\cdot )\) is linear for \(t_1\le \alpha \le t_2\) and (ii) follows. (iii) also follows from Eq. (20) by observing that
for any \(t_1 < t_2\in \mathcal {T}\). Hence, the slope of the piece-wise linear function \(f_M(\alpha )\) is non-increasing as \(\alpha \) increases. In other words, \(f_M(\alpha )\) is concave. \(\square \)
Next, we consider a case where the matching is fixed.
Proposition 2
For a fixed matching M, we have \(\arg \max _{\alpha } \frac{f_M(\alpha )}{\alpha + \delta } \in \mathcal {T}\).
Proof
This follows from Proposition 1(ii). Let \(f_M(\alpha )\) be linear for \(\alpha \in [t_1,t_2]\). Then it can be written as \(f_M(\alpha ) = f_M(t_1) + m(\alpha - t_1)\) for some slope \(m\ge 0\). Now, consider the derivation of the function \(f_M(\alpha )/(\alpha + \delta )\) in the interval \([t_1,t_2]\):
Note that the numerator of Eq. (21) is independent of \(\alpha \) and the denominator is strictly positive. Hence, the sign (i.e., \(>0, <0\) or \(=0\)) of the slope of \(f_M(\alpha )/(\alpha + \delta )\) is the same in the interval \([t_1,t_2]\). This proves that the maximum must occur at either of the extreme points \(t_1\) or \(t_2\). By Proposition 1(ii), we know that \(f_M(\alpha )\) is piece-wise linear with the corner points from the set \(\mathcal {T}\). Thus, we can conclude that the maximum must occur at one of the points in \(\mathcal {T}\). \(\square \)
We are now ready to show that the maximizer of Eq. (19) occurs for \(\alpha \in \mathcal {T}\).
Theorem 4
\(\arg \max _{\alpha } f(\alpha ) \in \mathcal {T}\).
Proof
This follows directly from Proposition 2. Notice that
The maximizer of \(f_M(\alpha )/(\alpha + \delta )\) belongs to \(\mathcal {T}\) for any M. Hence, we conclude that the maximizer of \(f(\alpha )\) also belongs to \(\mathcal {T}\) and the theorem follows. \(\square \)
Rights and permissions
About this article
Cite this article
Bojja Venkatakrishnan, S., Alizadeh, M. & Viswanath, P. Costly circuits, submodular schedules and approximate Carathéodory Theorems. Queueing Syst 88, 311–347 (2018). https://doi.org/10.1007/s11134-017-9546-x
Received:
Revised:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11134-017-9546-x
Keywords
- Data center networks
- Bridges and switches
- Circuit networks
- Network flows
- Submodular optimization
- Approximation algorithms