Abstract
We present the hardware design and implementation of a local memory system for individual processors inside future chip multiprocessors (CMPs). Our memory system supports both implicit communication via caches and explicit communication via directly accessible local (“scratchpad”) memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks that lie near each processor, so that portions of them operate as second-level (local) cache while the rest operate as scratchpad. We also strive to merge the communication subsystems required by the cache and the scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuits. The processor interacts with the NI at user level through virtualized command areas in scratchpad; the NI uses a similar access mechanism to provide efficient support for two hardware synchronization primitives: counters and queues. We describe the NI design, the hardware cost, and the latencies of our FPGA-based prototype implementation, which integrates four MicroBlaze processors, each with 64 KBytes of local SRAM, a crossbar NoC, and a DRAM controller. One-way, end-to-end, user-level communication completes within about 20 clock cycles for short transfer sizes.
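The counter synchronization primitive mentioned in the abstract can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the authors' hardware: the class `NICounter`, the function `rdma_write`, the byte-count arming semantics, and the 32-byte packet size are all hypothetical. The model assumes that a consumer arms a counter with the number of bytes it expects, that each arriving RDMA packet adds its size to the counter, and that the NI signals completion once the expected total is reached, so the consumer waits on a single flag rather than inspecting the buffer itself.

```python
from dataclasses import dataclass


@dataclass
class NICounter:
    """Hypothetical model of an NI synchronization counter (not the prototype's register layout)."""
    expected: int = 0       # bytes the consumer is waiting for
    received: int = 0       # bytes delivered so far
    notified: bool = False  # set once when the expected total has arrived

    def arm(self, expected_bytes: int) -> None:
        # Assumption: software arms the counter before the transfer starts.
        self.expected = expected_bytes
        self.received = 0
        self.notified = False

    def on_write(self, size: int) -> None:
        # In hardware this read-modify-write would be performed atomically by the NI.
        self.received += size
        if not self.notified and self.received >= self.expected:
            self.notified = True  # stands in for a notification to the waiting processor


def rdma_write(dst: bytearray, dst_off: int, src: bytes, packet: int, counter: NICounter) -> None:
    """Copy src into a scratchpad buffer in packet-sized pieces, updating the counter per packet."""
    for off in range(0, len(src), packet):
        chunk = src[off:off + packet]
        dst[dst_off + off:dst_off + off + len(chunk)] = chunk
        counter.on_write(len(chunk))


scratchpad = bytearray(100)
payload = bytes(range(100))
ctr = NICounter()
ctr.arm(len(payload))

rdma_write(scratchpad, 0, payload[:60], 32, ctr)
print(ctr.notified)  # False: only 60 of the 100 expected bytes have arrived

rdma_write(scratchpad, 60, payload[60:], 32, ctr)
print(ctr.notified)  # True: the counter reached its expected total
```

In this model, completion detection lives in the NI: the producer sends no separate "done" message, because the data writes themselves advance the counter.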
Notes
1. A write-back policy can also be used, provided that coherence between L1 and L2 is maintained. However, the write-through policy simplifies coherence without any performance loss. The inclusion property assumed here is more intuitive than exclusion, which would require moving locked lines between the cache levels.
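The footnote's argument can be made concrete with a toy two-level model. This is a hypothetical illustration, not the prototype's cache controller: the class `TwoLevelCache`, the dictionary-based "lines", and the method `remote_update` (standing in for an incoming remote write that lands in L2) are assumptions. With write-through, L1 never holds the only up-to-date copy of a line, so keeping L1 coherent reduces to invalidating its copy; with inclusion, every line present in L1 is also present in L2.

```python
class TwoLevelCache:
    """Toy model: write-through L1 over an inclusive L2 (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.l1: dict[int, int] = {}  # address -> value
        self.l2: dict[int, int] = {}  # address -> value; inclusion: every L1 key is also an L2 key

    def write(self, addr: int, value: int) -> None:
        # Write-through: L2 is updated on every store, so it always has the latest value.
        self.l2[addr] = value
        self.l1[addr] = value

    def read(self, addr: int) -> int:
        if addr in self.l1:
            return self.l1[addr]
        value = self.l2.get(addr, 0)  # 0 stands in for a fetch from DRAM
        self.l2[addr] = value         # fill L2 before L1 to preserve inclusion
        self.l1[addr] = value
        return value

    def invalidate_l1(self, addr: int) -> None:
        # Dropping an L1 line never loses data: write-through leaves no dirty data only in L1.
        self.l1.pop(addr, None)

    def remote_update(self, addr: int, value: int) -> None:
        # An incoming remote write updates L2; the stale L1 copy is simply invalidated.
        self.l2[addr] = value
        self.invalidate_l1(addr)


cache = TwoLevelCache()
cache.write(0x40, 7)
cache.remote_update(0x40, 9)
print(cache.read(0x40))  # 9: the stale L1 copy was dropped and L2 held the new value
```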
Acknowledgments
This work was supported by the European Commission in the context of the projects SARC (FP6 IP #27648) and UNiSIX (Marie-Curie #509595). We also thank, for their assistance in designing the architecture and developing the prototype: Dimitris Nikolopoulos, Alex Ramirez, Georgi Gaydadjiev, Spyros Lyberis, Christos Sotiriou, Dimitris Tsaliagos, and Michael Ligerakis.
Copyright information
© 2019 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Kalokerinos, G. et al. (2019). Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability. In: Silvano, C., Bertels, K., Schulte, M. (eds) Transactions on High-Performance Embedded Architectures and Compilers V. Lecture Notes in Computer Science, vol 11225. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58834-5_6
DOI: https://doi.org/10.1007/978-3-662-58834-5_6
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58833-8
Online ISBN: 978-3-662-58834-5
eBook Packages: Computer Science, Computer Science (R0)