Abstract
Big data applications based on graphs need to be scalable enough for handling immense growth in size of graphs, efficiently. Scalable graph processing typically handles the high workload by increasing the number of computing nodes. However, this increases the chances of single or multiple node (multi-node) failures. Failures may occur during normal job execution, as well as during recovery. Most of the systems for failure detection either follow checkpoint-based recovery which has high computation cost, or follows replication that has high memory overhead. In this work, we have proposed an unsupervised learning-based failure-recovery scheme for graph processing systems that detects different kinds of failures and allows node recovery within a shorter amount of time. It has been able to provide enhanced performance as compared to traditional failure-recovery models with respect to simultaneous recovery from single and multi-node failures, memory overload and computational latency. Evaluating its performance on four benchmark datasets has reinforced its strength and makes the proposed model completely fit in with the status quo.
Similar content being viewed by others
Data availability
Datasets 1 and 2 can be freely downloaded from https://snap.stanford.edu/data/email-Eu-core-temporal.html. Dataset 3 is available at https://snap.stanford.edu/data/ego-Facebook.html. Dataset 4 can be downloaded from https://snap.stanford.edu/data/com-LiveJournal.html.
Code availability
The proposed failure recovery mechanism has been implemented in Python3 and is freely available at https://github.com/aradhita1988/Failure-recovery.
References
Huang J, Qin W, Wang X, Chen W (2020) Survey of external memory large-scale graph processing on a multi-core system. J Supercomput 76(1):549–579
Chen R, Yao Y, Wang P, Zhang K, Wang Z, Guan H, Zang B, Chen H (2017) Replication-based fault-tolerance for large-scale graph processing. IEEE Trans Parallel Distrib Syst 29(7):1621–1635
Le QV (2013) Building high-level features using large scale unsupervised learning. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, pp. 8595–8598
Dobre C, Xhafa F (2014) Parallel programming paradigms and frameworks in big data era. Int J Parallel Prog 42(5):710–738
Low Y, Gonzalez JE, Kyrola A, Bickson D, Guestrin CE, Hellerstein J (2014) Graphlab: a new framework for parallel machine learning. arXiv preprint arXiv:1408.2041
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146
Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) Powergraph: distributed graph-parallel computation on natural graphs. In: 10th \(\{\)USENIX\(\}\) Symposium on Operating Systems Design and Implementation (\(\{\)OSDI\(\}\) 12), pp. 17–30
Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2012) Distributed graphlab: a framework for machine learning in the cloud. arXiv preprint arXiv:1204.6078
Lu W, Shen Y, Wang T, Zhang M, Jagadish HV, Du X (2018) Fast failure recovery in vertex-centric distributed graph processing systems. IEEE Trans Knowl Data Eng 31(4):733–746
Zhao Y, Yoshigoe K, Xie M, Bian J, Xiong K (2020) L-powergraph: a lightweight distributed graph-parallel communication mechanism. J Supercomput 76(3):1850–1879
Shen Y, Chen G, Jagadish H, Lu W, Ooi BC, Tudor BM (2014) Fast failure recovery in distributed graph processing systems. Proc VLDB Endow 8(4):437–448
Margo D, Seltzer M (2015) A scalable distributed graph partitioner. Proc VLDB Endow 8(12):1478–1489
Robinson DC, Hand JA, Madsen MB, McKelvey KR (2018) The Dat Project, an open and decentralized research data tool. Scientific data 5(1):1–4
Blähser J, Göller T, Böhmer M (2021) Thine-approach for a fault tolerant distributed packet manager based on hypercore protocol. In: 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), IEEE, pp. 1778–1782
Robinson DC, Hand JA, Madsen MB, McKelvey KR (2018) The dat project, an open and decentralized research data tool. Sci Data 5:180221. https://doi.org/10.1038/sdata.2018.221
Tarr D, Lavoie E, Meyer A, Tschudin C (2019) Secure scuttlebutt: an identity-centric protocol for subjective and decentralized applications. In: Proceedings of the 6th ACM Conference on Information-Centric Networking, pp. 1–11
Tsipenyuk GY (2018) Evaluation of decentralized email architecture and social network analysis based on email attachment sharing. Tech. rep., University of Cambridge, Computer Laboratory, https://doi.org/10.17863/CAM.21035
Sandoval IV, Atashpendar A, Lenzini G, Ryan PY (2021) Pakemail: authentication and key management in decentralized secure email and messaging via pake. arXiv preprint arXiv:2107.06090
Kermarrec AM, Lavoie E, Tschudin C (2020) Gossiping with append-only logs in secure-scuttlebutt. In: Proceedings of the 1st International Workshop on Distributed Infrastructure for Common Good, pp. 19–24
Paul HS, Gupta A, Sharma A (2006) Finding a suitable checkpoint and recovery protocol for a distributed application. J Parallel Distrib Comput 66(5):732–749
Dathathri R, Gill G, Hoang L, Pingali K (2019) Phoenix: a substrate for resilient distributed graph analytics. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 615–630
Tschudin C (2019) A broadcast-only communication model based on replicated append-only logs. ACM SIGCOMM Comput Commun Rev 49(2):37–43
Singh A, Ngan TW, Druschel P, Wallach DS (2006) Eclipse attacks on overlay networks: threats and defenses. In: Proceedings IEEE INFOCOM 2006 25TH IEEE International Conference on Computer Communications, pp. 1–12
Roy C, Chakraborty D, Debnath S, Mukherjee A, Chaki N (2021) Single failure recovery in distributed social network. In: Hong T, Wojtkiewicz K, Chawuthai R, Sitek P (eds) Recent Challenges in Intelligent Information and Database Systems - 13th Asian Conference, ACIIDS 2021, Phuket, Thailand, April 7-10, 2021, Proceedings, Springer, Communications in Computer and Information Science, vol. 1371, pp. 203–215, https://doi.org/10.1007/978-981-16-1685-3_17
Peluso S, Romano P, Quaglia F (2012) Score: a scalable one-copy serializable partial replication protocol. In: ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, Springer, pp. 456–475
Schiper N, Sutra P, Pedone F (2010) P-store: genuine partial replication in wide area networks. In: 2010 29th IEEE Symposium on Reliable Distributed Systems, IEEE, pp. 214–224
Kalavri V, Vlassov V, Haridi S (2017) High-level programming abstractions for distributed graph processing. IEEE Trans Knowl Data Eng 30(2):305–324
Murtagh F, Contreras P (2017) Algorithms for hierarchical clustering: an overview, ii. Wiley Interdiscipl Rev Data Mining Knowl Discov 7(6):e1219
Day WH, Edelsbrunner H (1984) Efficient algorithms for agglomerative hierarchical clustering methods. J Classif 1(1):7–24
Shahapure KR, Nicholas C (2020) Cluster quality analysis using silhouette score. In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, pp. 747–748
Wang X, Xu Y (2019) An improved index for clustering validation based on silhouette index and calinski-harabasz index. In: IOP Conference Series: Materials Science and Engineering, IOP Publishing, vol. 569, p. 052024
Paranjape A, Benson AR, Leskovec J (2017) Motifs in temporal networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 601–610
Leskovec J, Mcauley J (2012) Learning to discover social circles in ego networks. In: Pereira F, Burges C, Bottou L, Weinberger K (eds) Advances in neural information processing systems, vol 25. Curran Associates Inc., Red Hook
Yang J, Leskovec J (2012) Defining and evaluating network communities based on ground-truth. In: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, pp. 1–8
Besta M, Podstawski M, Groner L, Solomonik E, Hoefler T (2017) To push or to pull: On reducing communication and synchronization in graph computations. In: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pp. 93–104
Chatterjee M, Mitra A, Setua SK, Roy S (2020) Gossip-based fault-tolerant load balancing algorithm with low communication overhead. Comput Electr Eng 81:106517
Acknowledgements
AM is a Senior Research Fellow supported by the Visvesvaraya Ph.D. Scheme for Electronics and IT, under Ministry of Electronics and Information Technology, Government of India. NC acknowledges the DST, ICPS project grant T-884 on “Connected Smart Health Services for Rural India”.
Funding
This work has not received any funding.
Author information
Authors and Affiliations
Contributions
Conceptualization of Methodology: AM, RC, NC. Data Curation, Data Analysis, Formal analysis, Visualization, Investigation, Implementation, Validation, Original draft preparation: AM. Methodology Validation, Reviewing and Editing: RC, NC. Overall Supervision: RC, NC.F
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mukherjee, A., Chaki, R. & Chaki, N. An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems. J Supercomput 79, 9383–9408 (2023). https://doi.org/10.1007/s11227-022-05028-8
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-022-05028-8