An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems

Aradhita Mukherjee¹,
Rituparna Chaki² &
Nabendu Chaki¹

276 Accesses
Explore all metrics

Abstract

Big data applications based on graphs need to be scalable enough for handling immense growth in size of graphs, efficiently. Scalable graph processing typically handles the high workload by increasing the number of computing nodes. However, this increases the chances of single or multiple node (multi-node) failures. Failures may occur during normal job execution, as well as during recovery. Most of the systems for failure detection either follow checkpoint-based recovery which has high computation cost, or follows replication that has high memory overhead. In this work, we have proposed an unsupervised learning-based failure-recovery scheme for graph processing systems that detects different kinds of failures and allows node recovery within a shorter amount of time. It has been able to provide enhanced performance as compared to traditional failure-recovery models with respect to simultaneous recovery from single and multi-node failures, memory overload and computational latency. Evaluating its performance on four benchmark datasets has reinforced its strength and makes the proposed model completely fit in with the status quo.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing

An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations

Article 09 March 2017

Load Balancing and Fault Tolerance Mechanisms for Scalable and Reliable Big Data Analytics

Data availability

Datasets 1 and 2 can be freely downloaded from https://snap.stanford.edu/data/email-Eu-core-temporal.html. Dataset 3 is available at https://snap.stanford.edu/data/ego-Facebook.html. Dataset 4 can be downloaded from https://snap.stanford.edu/data/com-LiveJournal.html.

Code availability

The proposed failure recovery mechanism has been implemented in Python3 and is freely available at https://github.com/aradhita1988/Failure-recovery.

References

Huang J, Qin W, Wang X, Chen W (2020) Survey of external memory large-scale graph processing on a multi-core system. J Supercomput 76(1):549–579
Article Google Scholar
Chen R, Yao Y, Wang P, Zhang K, Wang Z, Guan H, Zang B, Chen H (2017) Replication-based fault-tolerance for large-scale graph processing. IEEE Trans Parallel Distrib Syst 29(7):1621–1635
Article Google Scholar
Le QV (2013) Building high-level features using large scale unsupervised learning. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, pp. 8595–8598
Dobre C, Xhafa F (2014) Parallel programming paradigms and frameworks in big data era. Int J Parallel Prog 42(5):710–738
Article Google Scholar
Low Y, Gonzalez JE, Kyrola A, Bickson D, Guestrin CE, Hellerstein J (2014) Graphlab: a new framework for parallel machine learning. arXiv preprint arXiv:1408.2041
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146
Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) Powergraph: distributed graph-parallel computation on natural graphs. In: 10th \(\{\)USENIX\(\}\) Symposium on Operating Systems Design and Implementation (\(\{\)OSDI\(\}\) 12), pp. 17–30
Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2012) Distributed graphlab: a framework for machine learning in the cloud. arXiv preprint arXiv:1204.6078
Lu W, Shen Y, Wang T, Zhang M, Jagadish HV, Du X (2018) Fast failure recovery in vertex-centric distributed graph processing systems. IEEE Trans Knowl Data Eng 31(4):733–746
Article Google Scholar
Zhao Y, Yoshigoe K, Xie M, Bian J, Xiong K (2020) L-powergraph: a lightweight distributed graph-parallel communication mechanism. J Supercomput 76(3):1850–1879
Article Google Scholar
Shen Y, Chen G, Jagadish H, Lu W, Ooi BC, Tudor BM (2014) Fast failure recovery in distributed graph processing systems. Proc VLDB Endow 8(4):437–448
Article Google Scholar
Margo D, Seltzer M (2015) A scalable distributed graph partitioner. Proc VLDB Endow 8(12):1478–1489
Article Google Scholar
Robinson DC, Hand JA, Madsen MB, McKelvey KR (2018) The Dat Project, an open and decentralized research data tool. Scientific data 5(1):1–4
Article Google Scholar
Blähser J, Göller T, Böhmer M (2021) Thine-approach for a fault tolerant distributed packet manager based on hypercore protocol. In: 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), IEEE, pp. 1778–1782
Robinson DC, Hand JA, Madsen MB, McKelvey KR (2018) The dat project, an open and decentralized research data tool. Sci Data 5:180221. https://doi.org/10.1038/sdata.2018.221
Article Google Scholar
Tarr D, Lavoie E, Meyer A, Tschudin C (2019) Secure scuttlebutt: an identity-centric protocol for subjective and decentralized applications. In: Proceedings of the 6th ACM Conference on Information-Centric Networking, pp. 1–11
Tsipenyuk GY (2018) Evaluation of decentralized email architecture and social network analysis based on email attachment sharing. Tech. rep., University of Cambridge, Computer Laboratory, https://doi.org/10.17863/CAM.21035
Sandoval IV, Atashpendar A, Lenzini G, Ryan PY (2021) Pakemail: authentication and key management in decentralized secure email and messaging via pake. arXiv preprint arXiv:2107.06090
Kermarrec AM, Lavoie E, Tschudin C (2020) Gossiping with append-only logs in secure-scuttlebutt. In: Proceedings of the 1st International Workshop on Distributed Infrastructure for Common Good, pp. 19–24
Paul HS, Gupta A, Sharma A (2006) Finding a suitable checkpoint and recovery protocol for a distributed application. J Parallel Distrib Comput 66(5):732–749
Article MATH Google Scholar
Dathathri R, Gill G, Hoang L, Pingali K (2019) Phoenix: a substrate for resilient distributed graph analytics. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 615–630
Tschudin C (2019) A broadcast-only communication model based on replicated append-only logs. ACM SIGCOMM Comput Commun Rev 49(2):37–43
Article Google Scholar
Singh A, Ngan TW, Druschel P, Wallach DS (2006) Eclipse attacks on overlay networks: threats and defenses. In: Proceedings IEEE INFOCOM 2006 25TH IEEE International Conference on Computer Communications, pp. 1–12
Roy C, Chakraborty D, Debnath S, Mukherjee A, Chaki N (2021) Single failure recovery in distributed social network. In: Hong T, Wojtkiewicz K, Chawuthai R, Sitek P (eds) Recent Challenges in Intelligent Information and Database Systems - 13th Asian Conference, ACIIDS 2021, Phuket, Thailand, April 7-10, 2021, Proceedings, Springer, Communications in Computer and Information Science, vol. 1371, pp. 203–215, https://doi.org/10.1007/978-981-16-1685-3_17
Peluso S, Romano P, Quaglia F (2012) Score: a scalable one-copy serializable partial replication protocol. In: ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, Springer, pp. 456–475
Schiper N, Sutra P, Pedone F (2010) P-store: genuine partial replication in wide area networks. In: 2010 29th IEEE Symposium on Reliable Distributed Systems, IEEE, pp. 214–224
Kalavri V, Vlassov V, Haridi S (2017) High-level programming abstractions for distributed graph processing. IEEE Trans Knowl Data Eng 30(2):305–324
Article Google Scholar
Murtagh F, Contreras P (2017) Algorithms for hierarchical clustering: an overview, ii. Wiley Interdiscipl Rev Data Mining Knowl Discov 7(6):e1219
Article Google Scholar
Day WH, Edelsbrunner H (1984) Efficient algorithms for agglomerative hierarchical clustering methods. J Classif 1(1):7–24
Article MATH Google Scholar
Shahapure KR, Nicholas C (2020) Cluster quality analysis using silhouette score. In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, pp. 747–748
Wang X, Xu Y (2019) An improved index for clustering validation based on silhouette index and calinski-harabasz index. In: IOP Conference Series: Materials Science and Engineering, IOP Publishing, vol. 569, p. 052024
Paranjape A, Benson AR, Leskovec J (2017) Motifs in temporal networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 601–610
Leskovec J, Mcauley J (2012) Learning to discover social circles in ego networks. In: Pereira F, Burges C, Bottou L, Weinberger K (eds) Advances in neural information processing systems, vol 25. Curran Associates Inc., Red Hook
Google Scholar
Yang J, Leskovec J (2012) Defining and evaluating network communities based on ground-truth. In: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, pp. 1–8
Besta M, Podstawski M, Groner L, Solomonik E, Hoefler T (2017) To push or to pull: On reducing communication and synchronization in graph computations. In: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pp. 93–104
Chatterjee M, Mitra A, Setua SK, Roy S (2020) Gossip-based fault-tolerant load balancing algorithm with low communication overhead. Comput Electr Eng 81:106517
Article Google Scholar

Download references

Acknowledgements

AM is a Senior Research Fellow supported by the Visvesvaraya Ph.D. Scheme for Electronics and IT, under Ministry of Electronics and Information Technology, Government of India. NC acknowledges the DST, ICPS project grant T-884 on “Connected Smart Health Services for Rural India”.

Funding

This work has not received any funding.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Calcutta, JD - II, Sector III, Salt Lake City, Calcutta, 700106, India
Aradhita Mukherjee & Nabendu Chaki
A. K. Choudhury School of Information Technology, University of Calcutta, JD - II, Sector III, Salt Lake City, Calcutta, 700106, India
Rituparna Chaki

Authors

Aradhita Mukherjee
View author publications
You can also search for this author in PubMed Google Scholar
Rituparna Chaki
View author publications
You can also search for this author in PubMed Google Scholar
Nabendu Chaki
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualization of Methodology: AM, RC, NC. Data Curation, Data Analysis, Formal analysis, Visualization, Investigation, Implementation, Validation, Original draft preparation: AM. Methodology Validation, Reviewing and Editing: RC, NC. Overall Supervision: RC, NC.F

Corresponding author

Correspondence to Aradhita Mukherjee.

Ethics declarations

Conflict of interest

The authors declare no competing interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Mukherjee, A., Chaki, R. & Chaki, N. An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems. J Supercomput 79, 9383–9408 (2023). https://doi.org/10.1007/s11227-022-05028-8

Download citation

Accepted: 29 December 2022
Published: 13 January 2023
Issue Date: June 2023
DOI: https://doi.org/10.1007/s11227-022-05028-8

An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing

An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations

Load Balancing and Fault Tolerance Mechanisms for Scalable and Reliable Big Data Analytics

Data availability

Code availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing

An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations

Load Balancing and Fault Tolerance Mechanisms for Scalable and Reliable Big Data Analytics

Data availability

Code availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation