Research article (public access)
DOI: 10.1145/3230543.3230577

Masking failures from application performance in data center networks with shareable backup

Published: 07 August 2018

Abstract

Shareable backup is an economical and effective way to mask failures from application performance. A small number of backup switches are shared network-wide for repairing failures on demand so that the network quickly recovers to its full capacity without applications noticing the failures. This approach avoids the complications and ineffectiveness of rerouting. We propose ShareBackup as a prototype architecture to realize this concept and present the detailed design. We implement ShareBackup on a hardware testbed. Its failure recovery takes merely 0.73 ms, causing no disruption to routing, and it accelerates Spark and Tez jobs by up to 4.1x under failures. Large-scale simulations with real data center traffic and a failure model show that ShareBackup reduces the percentage of job flows prolonged by failures from 47.2% to as little as 0.78%. In all our experiments, the results for ShareBackup differ little from the no-failure case.


Published In

SIGCOMM '18: Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication
August 2018, 604 pages
ISBN: 9781450355674
DOI: 10.1145/3230543

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. circuit switching
2. data center network
3. failure recovery


Conference

SIGCOMM '18: ACM SIGCOMM 2018 Conference
August 20 - 25, 2018
Budapest, Hungary

Acceptance Rates

Overall acceptance rate: 462 of 3,389 submissions (14%)
