research-article

sPIN: High-performance streaming Processing In the Network

Authors: Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ron Brightwell

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 59, Pages 1 - 16

https://doi.org/10.1145/3126908.3126970

Published: 12 November 2017

Abstract

Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an ecosystem that can significantly speed up applications and system services.

Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2017

801 pages

ISBN:9781450351140

DOI:10.1145/3126908

General Chair: Bernd Mohr (Jülich Supercomputing Center, Jülich, Germany)

Program Chair: Padma Raghavan (Vanderbilt University, Nashville, TN)

Copyright © 2017 ACM.

© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC '17

Sponsor:

SIGHPC

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2017

Denver, Colorado

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 48

Total Downloads: 794

Downloads (Last 12 months): 80

Downloads (Last 6 weeks): 9

Reflects downloads up to 20 Dec 2024
