
Partial Network Partitioning

Published: 18 December 2023

Abstract

We present an extensive study of partial network partitioning, a fault that disrupts the communication between some, but not all, nodes in a cluster. First, we conduct a comprehensive study of system failures caused by this fault in 13 popular systems. Our study reveals that the studied failures are catastrophic (e.g., they lead to data loss), manifest easily, and are mainly due to design flaws. Our analysis identifies vulnerabilities in core system mechanisms, including scheduling, membership management, and ZooKeeper-based configuration management.
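
To make the fault model concrete, the following is a minimal Python sketch, not taken from the paper, of what distinguishes a partial partition from a complete one: some node pairs lose direct connectivity, yet the cluster's connectivity graph remains a single component. The node names are hypothetical.

```python
from collections import deque

def reachable(links: dict[str, set[str]], start: str) -> set[str]:
    """Nodes reachable from `start` via breadth-first search."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen

def is_partial_partition(links: dict[str, set[str]]) -> bool:
    """True if some pair cannot talk directly, yet every node can
    still reach every other node through intermediaries."""
    nodes = list(links)
    fully_connected = all(
        b in links[a] for a in nodes for b in nodes if a != b)
    still_one_component = reachable(links, nodes[0]) == set(nodes)
    return not fully_connected and still_one_component

# "b" and "c" cannot reach each other directly, but "a" reaches both:
# a partial partition, not a complete one.
cluster = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
assert is_partial_partition(cluster)
```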
Second, we dissect the design of nine popular systems and identify four principled approaches for tolerating partial partitions. Unfortunately, our analysis shows that the implemented fault tolerance techniques are inadequate for modern systems: they either patch a particular mechanism or lead to a complete cluster shutdown, even when alternative network paths exist.
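
As a hedged illustration of why per-mechanism patches fall short, the sketch below shows a failure-verification-style check (one plausible tolerance technique, not necessarily one of the paper's four): before declaring a peer dead, a node asks witness nodes whether they can still reach it. The `ping` abstraction, node names, and witness set are all hypothetical.

```python
from typing import Callable

def suspect_is_dead(ping: Callable[[str, str], bool],
                    me: str, suspect: str, witnesses: list[str]) -> bool:
    """Declare `suspect` dead only if no witness can reach it either.

    Under a partial partition, `me` may lose its link to `suspect`
    while a witness still reaches it, avoiding a false positive. If
    every witness happens to sit on our side of the cut, the check
    still errs, which is why such point fixes remain inadequate.
    """
    if ping(me, suspect):
        return False
    return not any(ping(w, suspect) for w in witnesses)

# Partial partition: "a" cannot reach "c", but witness "b" still can,
# so "a" refrains from evicting a healthy node.
links = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")}
ping = lambda src, dst: (src, dst) in links
assert not suspect_is_dead(ping, "a", "c", witnesses=["b"])
```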
Finally, our findings motivate us to build Nifty, a transparent communication layer that masks partial network partitions. Nifty builds an overlay between nodes to detour packets around partial partitions. Nifty also provides a way for applications to optimize their operation during a partial partition. We demonstrate the benefit of this approach by integrating Nifty with VoltDB, HDFS, and Kafka.
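
The detour idea at the core of an overlay like Nifty can be sketched as a shortest-path computation over the cluster's current connectivity graph. The Python sketch below is illustrative only (Nifty's real data path forwards packets transparently; node names are hypothetical): it returns the relay hop a source should forward through when its direct link to a destination is severed.

```python
from collections import deque

def detour_next_hop(links: dict[str, set[str]],
                    src: str, dst: str) -> str | None:
    """First hop on a shortest src->dst path, or None when dst is
    unreachable (i.e., the partition is complete, not partial)."""
    if src == dst:
        return src
    parent: dict[str, str | None] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            while parent[node] != src:  # walk back to the hop next to src
                node = parent[node]
            return node
        for peer in links[node]:
            if peer not in parent:
                parent[peer] = node
                queue.append(peer)
    return None

# "a" and "c" lost their direct link; packets detour through "b".
links = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
assert detour_next_hop(links, "a", "c") == "b"
```

In a live overlay this computation would be re-run as connectivity changes, so traffic returns to the direct path once the partition heals.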


Cited By

  • (2024) Slicify: Fault Injection Testing for Network Partitions. In 2024 32nd International Conference on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 1–8. DOI: 10.1109/MASCOTS64422.2024.10786337. Online publication date: 21-Oct-2024.
  • (2023) CASPR: Connectivity-Aware Scheduling for Partition Resilience. In 2023 42nd International Symposium on Reliable Distributed Systems (SRDS), 70–81. DOI: 10.1109/SRDS60354.2023.00017. Online publication date: 25-Sep-2023.


Published In

ACM Transactions on Computer Systems, Volume 41, Issue 1-4
November 2023, 188 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/3637801

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 December 2023
Online AM: 19 December 2022
Accepted: 17 November 2022
Revised: 22 March 2022
Received: 14 September 2021
Published in TOCS Volume 41, Issue 1-4


Author Tags

  1. Network failures
  2. fault tolerance
  3. partial network partitions
  4. distributed systems
  5. reliability

Qualifiers

  • Research-article

Funding Sources

  • NSERC Discovery grant
  • Canada Foundation for Innovation (CFI) grant
  • NSERC Collaborative Research and Development (CRD) grant
  • Waterloo-Huawei Joint Innovation Lab grant
  • IBM Ph.D. fellowship

