DOI: 10.1145/3577193.3593711

Roar: A Router Microarchitecture for In-network Allreduce

Published: 21 June 2023

Abstract

The allreduce operation is the most commonly used collective operation in distributed and parallel applications. It aggregates data contributed by distributed hosts and broadcasts the aggregated result back to them. In-network computing can accelerate allreduce by offloading the operation to network devices. However, existing in-network solutions struggle to sustain high throughput, to aggregate large messages efficiently, and to produce reproducible results. In this work, we propose a simple and effective router microarchitecture for in-network allreduce that uses an RDMA protocol to improve throughput, and we discuss strategies for tackling the remaining challenges. Experiments show that our approach not only outperforms state-of-the-art in-network solutions but also accelerates allreduce to a near-optimal level compared with host-based algorithms.

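For readers unfamiliar with the operation, the sketch below illustrates allreduce semantics in plain Python: an elementwise reduction over the vectors contributed by every host, followed by a broadcast so that each host ends up holding the same aggregate. This is only a reference model of the collective's semantics, not the paper's router microarchitecture; the host count, vector contents, and function name are illustrative.

from functools import reduce

def allreduce(host_vectors, op=lambda a, b: a + b):
    """Reference semantics of allreduce: elementwise-reduce the vectors
    contributed by all hosts, then hand every host a copy of the result
    (i.e., a reduction followed by a broadcast)."""
    # Elementwise reduction: combine the i-th element of every host's vector.
    aggregated = [reduce(op, column) for column in zip(*host_vectors)]
    # "Broadcast": every host receives an identical copy of the aggregate.
    return [list(aggregated) for _ in host_vectors]

# Four hosts each contribute a 3-element, gradient-like vector.
contributions = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]
results = allreduce(contributions)
assert all(r == [22.0, 26.0, 30.0] for r in results)

The reproducibility challenge mentioned in the abstract follows from this model: floating-point addition is not associative, so an in-network implementation that combines contributions in an arrival-dependent order can produce bitwise-different sums across otherwise identical runs.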

Published In

cover image ACM Conferences
ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
June 2023
505 pages
ISBN: 9798400700569
DOI: 10.1145/3577193
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. in-network computing
  2. allreduce
  3. router
  4. RDMA

Qualifiers

  • Research-article

Conference

ICS '23

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

  • Downloads (Last 12 months): 267
  • Downloads (Last 6 weeks): 22
Reflects downloads up to 13 Dec 2024

Cited By

  • (2024) Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning. Proceedings of the 53rd International Conference on Parallel Processing, 148-157. DOI: 10.1145/3673038.3673140. Online publication date: 12-Aug-2024.
  • (2024) PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 245-260. DOI: 10.1109/ISCA59077.2024.00027. Online publication date: 29-Jun-2024.
  • (2024) A lightweight RDMA Connection Protocol based on Post-hoc Confirmation. Journal of Parallel and Distributed Computing, 104991. DOI: 10.1016/j.jpdc.2024.104991. Online publication date: Oct-2024.
  • (2023) PiN: Processing in Network-on-Chip. IEEE Design & Test 40(6), 30-38. DOI: 10.1109/MDAT.2023.3307943. Online publication date: Dec-2023.
  • (2023) DFAR: Dynamic-threshold Fault-tolerant Adaptive Routing for Fat Tree Networks. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 721-728. DOI: 10.1109/ICPADS60453.2023.00110. Online publication date: 17-Dec-2023.
