Research article | Open access | DOI: 10.1145/3458817.3476211

FedAT: a high-performance and communication-efficient federated learning system with asynchronous tiers

Published: 13 November 2021

Abstract

Federated learning (FL) trains a model over a massive number of distributed devices while keeping the training data localized and private. This form of collaborative learning exposes new tradeoffs among model convergence speed, model accuracy, balance across clients, and communication cost, and raises two new challenges: (1) the straggler problem, where clients lag behind due to data or resource (compute and network) heterogeneity, and (2) the communication bottleneck, where a large number of clients communicate their local updates to a central server and overwhelm it. Many existing FL methods optimize along only a single dimension of this tradeoff space. Existing solutions use asynchronous model updating or tiering-based synchronous mechanisms to tackle the straggler problem. However, asynchronous methods can easily create a communication bottleneck, while tiering may introduce bias that favors faster tiers with shorter response latencies.
To address these issues, we present FedAT, a novel Federated learning system with Asynchronous Tiers under Non-i.i.d. training data. FedAT synergistically combines synchronous intra-tier training with asynchronous cross-tier training. By bridging synchronous and asynchronous training through tiering, FedAT mitigates the straggler effect while improving convergence speed and test accuracy. FedAT uses a straggler-aware, weighted aggregation heuristic to steer and balance training across clients, further improving accuracy. FedAT compresses uplink and downlink communication using an efficient, polyline-encoding-based compression algorithm, minimizing communication cost. Compared to state-of-the-art FL methods, FedAT improves prediction performance by up to 21.09% and reduces communication cost by up to 8.5×.
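
To make the tiered design concrete, the following minimal sketch (Python/NumPy) illustrates how synchronous intra-tier training can be bridged with asynchronous cross-tier aggregation. It is an illustration of the general technique, not the authors' implementation: every name (train_local, tier_round, NUM_TIERS) and the inverse-update-frequency weighting rule are hypothetical stand-ins.

    # Minimal sketch of tiered federated training (hypothetical, not FedAT's code).
    # Clients are grouped into tiers by response latency. Each tier runs a
    # synchronous FedAvg-style round internally; tiers report to the server
    # asynchronously, and the server reweights tiers so stragglers are not
    # drowned out by fast tiers that update more often.
    import numpy as np

    NUM_TIERS = 3
    MODEL_DIM = 10

    def train_local(model, client_id):
        # Placeholder for local SGD on the client's private (non-i.i.d.) data.
        return model - 0.01 * np.random.randn(MODEL_DIM)

    def tier_round(global_model, clients):
        # Synchronous intra-tier step: all clients in the tier start from the
        # same snapshot and their results are averaged (FedAvg-style).
        updates = [train_local(global_model, c) for c in clients]
        return np.mean(updates, axis=0)

    def cross_tier_aggregate(tier_models, tier_counts):
        # Asynchronous cross-tier step: weight each tier's latest model
        # inversely to how often it has reported so far. This is a
        # straggler-aware heuristic in the spirit of the paper, not its exact rule.
        inv = 1.0 / np.asarray(tier_counts, dtype=float)
        weights = inv / inv.sum()
        return sum(w * m for w, m in zip(weights, tier_models))

    tiers = [[0, 1], [2, 3], [4]]                  # client ids grouped by speed
    global_model = np.zeros(MODEL_DIM)
    tier_models = [global_model.copy() for _ in range(NUM_TIERS)]
    tier_counts = [1] * NUM_TIERS                  # updates seen per tier so far

    for step in range(20):
        t = step % NUM_TIERS                       # stand-in for "whichever tier finishes next"
        tier_models[t] = tier_round(global_model, tiers[t])
        tier_counts[t] += 1
        global_model = cross_tier_aggregate(tier_models, tier_counts)

Because a slow tier never blocks a round, fast tiers keep the model moving, while the inverse-frequency weights keep the slow tiers' potentially distinct data represented in the global model.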
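
The compression component is based on the Encoded Polyline Algorithm Format, a lossy delta-plus-varint scheme originally defined for GPS coordinates. The sketch below applies that format to a flat vector of model weights; the function names and the 1e5 precision factor are illustrative assumptions, not the paper's exact configuration.

    # Polyline encoding of a float vector (hypothetical helper, standard algorithm):
    # quantize each value, delta-encode against the previous one, zigzag the sign
    # into the low bit, and emit the result as 5-bit base-64-style characters.
    def polyline_encode(values, precision=1e5):
        chars, prev = [], 0
        for v in values:
            q = int(round(v * precision))          # lossy quantization step
            delta, prev = q - prev, q
            delta = ~(delta << 1) if delta < 0 else (delta << 1)
            while delta >= 0x20:                   # continuation chunks
                chars.append(chr((0x20 | (delta & 0x1f)) + 63))
                delta >>= 5
            chars.append(chr(delta + 63))          # final chunk
        return "".join(chars)

    def polyline_decode(encoded, precision=1e5):
        values, prev, i = [], 0, 0
        while i < len(encoded):
            result, shift = 0, 0
            while True:                            # reassemble 5-bit chunks
                b = ord(encoded[i]) - 63
                i += 1
                result |= (b & 0x1f) << shift
                shift += 5
                if b < 0x20:
                    break
            delta = ~(result >> 1) if result & 1 else (result >> 1)
            prev += delta
            values.append(prev / precision)
        return values

    weights = [0.12345, 0.12344, -0.50001, -0.49999]   # toy model weights
    blob = polyline_encode(weights)
    decoded = polyline_decode(blob)
    assert all(abs(a - b) < 1e-5 for a, b in zip(weights, decoded))

The scheme rewards sequences whose consecutive values are close, since small deltas encode to one or two characters each.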

Supplementary Material

MP4 File (FedAT A High Performance and Communication-Efficient Federated Learning System With Asynchronous Tiers 232 Afternoon 3.mp4)
Presentation video



Published In

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021
1493 pages
ISBN:9781450384421
DOI:10.1145/3458817
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2021


Author Tags

  1. asynchronous distributed learning
  2. communication efficiency
  3. federated learning
  4. tiering
  5. weighted aggregation

Qualifiers

  • Research-article

Conference

SC '21

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Article Metrics

  • Downloads (Last 12 months)590
  • Downloads (Last 6 weeks)99
Reflects downloads up to 10 Dec 2024

Cited By
  • (2025) Function Placement for In-network Federated Learning. Computer Networks, 256 (110900). DOI: 10.1016/j.comnet.2024.110900. Online publication date: Jan-2025
  • (2024) Federated Learning with Efficient Aggregation via Markov Decision Process in Edge Networks. Mathematics, 12:6 (920). DOI: 10.3390/math12060920. Online publication date: 20-Mar-2024
  • (2024) Communication Efficiency and Non-Independent and Identically Distributed Data Challenge in Federated Learning: A Systematic Mapping Study. Applied Sciences, 14:7 (2720). DOI: 10.3390/app14072720. Online publication date: 24-Mar-2024
  • (2024) Federated Learning Security and Privacy-Preserving Algorithm and Experiments Research Under Internet of Things Critical Infrastructure. Tsinghua Science and Technology, 29:2 (400-414). DOI: 10.26599/TST.2023.9010007. Online publication date: Apr-2024
  • (2024) A multi-agent adaptive deep learning framework for online intrusion detection. Cybersecurity, 7:1. DOI: 10.1186/s42400-023-00199-0. Online publication date: 1-May-2024
  • (2024) FedCaSe: Enhancing Federated Learning with Heterogeneity-aware Caching and Scheduling. Proceedings of the 2024 ACM Symposium on Cloud Computing, 52-68. DOI: 10.1145/3698038.3698559. Online publication date: 20-Nov-2024
  • (2024) On the Impact of Heterogeneity on Federated Learning at the Edge with DGA Malware Detection. Proceedings of the Asian Internet Engineering Conference 2024, 10-17. DOI: 10.1145/3674213.3674215. Online publication date: 9-Aug-2024
  • (2024) RoleML: a Role-Oriented Programming Model for Customizable Distributed Machine Learning on Edges. Proceedings of the 25th International Middleware Conference, 279-291. DOI: 10.1145/3652892.3700765. Online publication date: 2-Dec-2024
  • (2024) Accelerating Asynchronous Federated Learning Convergence via Opportunistic Mobile Relaying. IEEE Transactions on Vehicular Technology, 73:7 (10668-10680). DOI: 10.1109/TVT.2024.3384061. Online publication date: Jul-2024
  • (2024) Multi-Attribute Auction-Based Grouped Federated Learning. IEEE Transactions on Services Computing, 17:3 (1056-1071). DOI: 10.1109/TSC.2024.3387734. Online publication date: May-2024
