Abstract
Dataflow is a parallel and generic model of computation that is agnostic of the underlying multi/many-core architecture executing it. State-of-the-art frameworks allow fast development of dataflow applications, providing memory, communication, and computation optimizations through design-time exploration. However, these frameworks usually do not consider cache memory behavior when generating code. A generally accepted idea is that bigger and multi-level caches improve the performance of applications. This work evaluates that hypothesis in a broad experimental campaign covering different multi-core configurations that vary the number of cores and the cache parameters (size, sharing, controllers). The results show that bigger is not always better, and that the foreseen future of more cores and bigger caches does not guarantee better performance for dataflow applications without software changes. Additionally, this work investigates two memory management strategies for dataflow applications: Copy-on-Write (CoW) and Non-Temporal Memory transfers (NTM). Experimental results on state-of-the-art applications show that NTM and CoW can reduce execution time by 5.3% and 15.8%, respectively. CoW in particular reduces energy consumption by up to 21.8%, with an average reduction of 16.8% across 22 different cache configurations.
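As a rough illustration of the general copy-on-write idea named in the abstract, the sketch below uses a private file mapping from Python's standard `mmap` module. It is not the mechanism evaluated in this work, which the abstract does not detail. The function name `cow_demo` and the payload are hypothetical and chosen only for the example.

```python
import mmap
import os
import tempfile


def cow_demo(payload=b"dataflow token"):
    """Write through a copy-on-write mapping.

    Returns (contents seen through the mapping, contents of the backing file).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # ACCESS_COPY requests a private, copy-on-write mapping:
        # writes change the in-memory view but never reach the file.
        with mmap.mmap(fd, len(payload), access=mmap.ACCESS_COPY) as view:
            view[0:1] = b"D"      # the write lands in a private copy
            modified = view[:]    # snapshot of the mapped bytes
        with open(path, "rb") as f:
            original = f.read()   # backing file is left untouched
        return modified, original
    finally:
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    print(cow_demo())
```

The reader of the mapping shares the original data until the first write, and only then pays for a private copy. This deferred copy is the property that makes copy-on-write attractive when a consumer rarely modifies the buffer it receives.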
Availability of Data and Material
Not applicable.
Code Availability
Not applicable.
Notes
SIGSEGV is a synchronously-generated signal and is guaranteed to be delivered to the causing POSIX thread [22].
References
Furtunato, A. F. A., Georgiou, K., Eder, K., & Xavier-De-Souza, S. (2020). When parallel speedups hit the memory wall. IEEE Access, 8, 79225–79238. https://doi.org/10.1109/ACCESS.2020.2990418
Pelcat, M., Desnos, K., Heulot, J., Guy, C., Nezan, J., & Aridhi, S. (2014). Preesm: A dataflow-based rapid prototyping framework for simplifying multicore DSP programming. In: European Embedded Design in Education and Research Conference (EDERC), pp. 36–40. https://doi.org/10.1109/EDERC.2014.6924354
Carlson, T. E., Heirman, W., & Eeckhout, L. (2011). Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–12. https://doi.org/10.1145/2063384.2063454
Slingerland, N., & Smith, A. (2001). Cache Performance for Multimedia Applications. In: International Conference on Supercomputing (ICS), ICS ’01, pp. 204–217. ACM, New York. https://doi.org/10.1145/377792.377833
Alves, M. A. Z., Freitas, H. C., & Navaux, P. O. A. (2009). Investigation of shared L2 cache on many-core processors. In: International Conference on Architecture of Computing Systems, pp. 1–10
Garcia, V., Gomez-Luna, J., Grass, T., Rico, A., Ayguade, E., & Pena, A. (2016). Evaluating the effect of last-level cache sharing on integrated GPU-CPU systems with heterogeneous applications. In: IEEE International Symposium on Workload Characterization (IISWC), pp. 1–10. IEEE, New York. https://doi.org/10.1109/IISWC.2016.7581277
Domagala, L., van Amstel, D., & Rastello, F. (2016). Generalized Cache Tiling for Dataflow Programs. In: SIGPLAN/SIGBED, LCTES, pp. 52–61. ACM, New York. https://doi.org/10.1145/2907950.2907960
Maghazeh, A., Chattopadhyay, S., Eles, P., & Peng, Z. (2019). Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications. In: Design, Automation, and Test in Europe (DATE), pp. 570–575. IEEE, Florence. https://doi.org/10.23919/DATE.2019.8714861
Stoutchinin, A., & Benini, L. (2019). Streamdrive: A dynamic dataflow framework for clustered embedded architectures. Journal of Signal Processing Systems, 91(3–4), 275–301. https://doi.org/10.1007/s11265-018-1351-1
Fraguela, B. B., & Andrade, D. (2021). A software cache autotuning strategy for dataflow computing with UPC++ DepSpawn. Computational and Mathematical Methods, 1(1), 1–14. https://doi.org/10.1002/cmm4.1148
Bovet, D. P., & Cesati, M. (2006). Understanding the Linux kernel, 3rd edn., chap. 10, p. 295. O’Reilly
Intel Corporation. (2020). Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes. Intel Corporation
Le, Q. T., Stern, J., & Brenner, S. (2020). Fast memcpy with SPDK and Intel® I/OAT DMA Engine. Retrieved March 15, 2021. https://software.intel.com/content/www/us/en/develop/articles/fast-memcpy-using-spdk-and-ioat-dma-engine.html
Desnos, K., Pelcat, M., Nezan, J. F., & Aridhi, S. (2016). On memory reuse between inputs and outputs of dataflow actors. ACM Transactions on Embedded Computing Systems 15(2). https://doi.org/10.1145/2871744
Kurd, N., Mosalikanti, P., Neidengard, M., Douglas, J., & Kumar, R. (2009). Next generation Intel Core micro-architecture (Nehalem) clocking. IEEE Journal of Solid-State Circuits, 44(4), 1121–1129. https://doi.org/10.1109/JSSC.2009.2014023
Kim, T., Sun, Z., Chen, H., Wang, H., & Tan, S. X. (2017). Energy and lifetime optimizations for dark silicon manycore microprocessor considering both hard and soft errors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25(9), 2561–2574. https://doi.org/10.1109/TVLSI.2017.2707401
Rathore, V., Chaturvedi, V., Singh, A., Srikanthan, T., & Shafique, M. (2020). Longevity framework: Leveraging online integrated aging-aware hierarchical mapping and VF-selection for lifetime reliability optimization in manycore processors. IEEE Transactions on Computers, pp. 1–1. https://doi.org/10.1109/TC.2020.3006571
PREESM. (2021). PREESM Applications Repository (https://github.com/preesm/preesm-apps).
Hamzah, R., & Ibrahim, H. (2015). Literature Survey on Stereo Vision Disparity Map Algorithms. Journal of Sensors, 16(1), 1–23. https://doi.org/10.1155/2016/8742920
Lowe, D. G. (1999). Object recognition from local scale-invariant features. In: IEEE International Conference on Computer Vision (ICCV), vol. 2, pp. 1150–1157. https://doi.org/10.1109/ICCV.1999.790410
Li, S., Ahn, J. H., Strong, R. D., Brockman, J. B., Tullsen, D. M., & Jouppi, N. P. (2009). McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In: International Symposium on Microarchitecture (MICRO), pp. 469–480. IEEE, New York, NY, USA.
IEEE. (2017). IEEE Standard for Information Technology–Portable Operating System Interface (POSIX(R)) Base Specifications, Issue 7. IEEE Std 1003.1-2017 1(1), 1–3951. https://doi.org/10.1109/IEEESTD.2018.8277153
Funding
This work is supported by the Agence Nationale de la Recherche under Grant No. ANR-17-CE24-0018. We would like to give special thanks to the PREESM and Sniper communities for actively participating in the development of the tools that provide a solid foundation for this work.
Ethics declarations
Conflicts of Interest
The authors declare that they have no conflict of interest.
Cite this article
Ghasemi, A., Ruaro, M., Cataldo, R. et al. The Impact of Cache and Dynamic Memory Management in Static Dataflow Applications. J Sign Process Syst 94, 721–738 (2022). https://doi.org/10.1007/s11265-021-01730-7
DOI: https://doi.org/10.1007/s11265-021-01730-7