Abstract
The unpredictable nature of irregular memory accesses in mixed-memory applications, such as deep learning workloads, poses significant communication challenges. A multi-GPU node handling a large number of simultaneous memory requests typically spends almost 80% of its processing time on memory mapping. This calls for a characterization of mixed regular and irregular memory accesses so that memory divergence can be simplified and performance improved. In this paper, using the large deviations principle, it is shown that mixed regular and irregular memory accesses can be viewed as a combination of continuous and discrete functions. This viewpoint is shown to yield better performance through a characterization of memory divergence on a multi-GPU node using the sub-additivity property. Further, a detection test procedure based on a quenched large deviations model is proposed, which generates threshold values for optimizing memory mapping in data-intensive applications and hence improves performance.
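To make the threshold-generation idea concrete, the following is a minimal illustrative sketch (not the paper's actual procedure): it uses the classical Cramér large-deviations rate function for a Bernoulli model of per-request irregularity, and derives the smallest observed irregular-access fraction at which the large-deviations tail bound exp(-n·I(a)) falls below a chosen false-alarm level. The baseline fraction `p`, sample count `n`, and level `alpha` are hypothetical parameters chosen for illustration.

```python
import math

def bernoulli_rate(a, p):
    """Cramér rate function I(a) for i.i.d. Bernoulli(p) samples:
    I(a) = a*ln(a/p) + (1-a)*ln((1-a)/(1-p))."""
    if a <= 0.0 or a >= 1.0:
        return float("inf")
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def divergence_threshold(p, n, alpha=1e-3):
    """Smallest empirical irregular-access fraction a > p whose
    large-deviations tail bound exp(-n*I(a)) is at most alpha.
    Uses bisection, since I(a) is increasing on [p, 1)."""
    target = -math.log(alpha) / n      # require I(a) >= target
    lo, hi = p, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bernoulli_rate(mid, p) < target:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical example: 10% of sampled requests are irregular on
# average; with 10,000 sampled requests, flag a kernel for remapping
# when the observed irregular fraction exceeds the threshold.
thr = divergence_threshold(p=0.10, n=10_000, alpha=1e-3)
```

With these numbers the threshold sits only slightly above the 10% baseline (roughly 11%), reflecting how quickly exponential concentration sharpens the detection boundary as the sample count grows.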
Tamizharasan, P.S., Ramasubramanian, N. Analysis of large deviations behavior of multi-GPU memory access in deep learning. J Supercomput 74, 2199–2212 (2018). https://doi.org/10.1007/s11227-018-2246-4