DOI: 10.1109/ISCA.2018.00048
research-article

DCS-ctrl: a fast and flexible device-control mechanism for device-centric server architecture

Published: 02 June 2018

Abstract

Modern high-performance servers leverage a large number of emerging peripheral devices (e.g., data processing accelerators, non-volatile memory storage, high-bandwidth network cards) to meet the ever-increasing performance demands of server applications. However, because such servers experience severe kernel overhead from frequently invoked device operations (e.g., buffer management and data copy), server architects have proposed various hardware and software approaches to enable direct communication among the devices. Unfortunately, existing direct device-to-device (D2D) communication schemes still suffer from low performance and a lack of flexibility. First, software-based schemes depend on complicated kernel routines and necessitate multiple hardware-software and user-kernel boundary crossings, which significantly limit the performance improvement opportunities from direct D2D communication. On the other hand, hardware-based schemes require tight integration and custom-built devices, preventing architects from flexibly adding off-the-shelf devices.
In this paper, we propose DCS-ctrl, a novel Hardware-based Device-Control (HDC) mechanism for Device-Centric Server (DCS) architecture that provides fast and CPU-efficient direct D2D communication among a large number of off-the-shelf peripheral devices. The key idea of DCS-ctrl is to implement a low-cost and flexible device-control mechanism on an independent FPGA device called the HDC Engine. As the HDC Engine manages all data and control transfers among devices at the hardware level, the server achieves high performance, scalability, and flexibility. First, optimizing both data and control paths at the hardware level minimizes the latency of inter-device communication. Second, implementing FPGA-based reconfigurable device controllers enables direct D2D communication among commodity devices and thus improves per-device flexibility. Third, merging heterogeneous device operations with support for intermediate data processing creates more opportunities for direct inter-device communication in server applications. Our DCS-ctrl prototype reduces the latency of software-based direct D2D communication by 42% and the CPU utilization by 52%.
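
To make the reported figures concrete, the following minimal sketch applies the two reductions to a baseline. It assumes both percentages are relative to the software-based D2D baseline; the abstract does not say whether the CPU-utilization figure is relative or in percentage points. The baseline values and the helper name `apply_reduction` are hypothetical and are not taken from the paper.

```python
def apply_reduction(baseline: float, reduction_pct: float) -> float:
    """Return what remains of `baseline` after a relative cut of `reduction_pct` percent."""
    return baseline * (1.0 - reduction_pct / 100.0)


# Hypothetical baselines for a software-based D2D scheme (NOT from the paper).
baseline_latency_us = 100.0    # assumed end-to-end latency in microseconds
baseline_cpu_util_pct = 50.0   # assumed CPU utilization in percent

# Reductions stated in the abstract: 42% latency, 52% CPU utilization.
new_latency_us = apply_reduction(baseline_latency_us, 42.0)
new_cpu_util_pct = apply_reduction(baseline_cpu_util_pct, 52.0)

print(f"latency: {baseline_latency_us:.1f} us -> {new_latency_us:.1f} us")
print(f"CPU utilization: {baseline_cpu_util_pct:.1f}% -> {new_cpu_util_pct:.1f}%")
```

Under these assumed baselines, a 42% reduction leaves 58% of the original latency and a 52% reduction leaves 48% of the original CPU utilization.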

Cited By

  • (2022) SmartFVM: A Fast, Flexible, and Scalable Hardware-based Virtualization for Commodity Storage Devices. ACM Transactions on Storage 18:2 (1-27). DOI: 10.1145/3511213. Online publication date: 12-Apr-2022
  • (2020) FVM. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (955-971). DOI: 10.5555/3488766.3488820. Online publication date: 4-Nov-2020
  • (2019) FIDR. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (239-252). DOI: 10.1145/3352460.3358303. Online publication date: 12-Oct-2019

      Published In

      ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture
      June 2018
      884 pages
      ISBN:9781538659847

      Publisher

      IEEE Press

      Author Tags

      1. FPGA-based accelerator
      2. I/O optimization
      3. device-to-device communication
      4. server architecture

      Qualifiers

      • Research-article

      Conference

      ISCA '18

      Acceptance Rates

      Overall Acceptance Rate 543 of 3,203 submissions, 17%
