Abstract
The unpredictable nature of irregular memory accesses in mixed-memory applications, such as deep learning workloads, poses significant communication challenges. A multi-GPU node handling a large number of simultaneous memory requests typically spends almost 80% of its processing time on memory mapping. This calls for a characterization of mixed regular and irregular memory accesses so that memory divergence can be simplified and performance improved. In this paper, using the large deviations principle, it is shown that mixed regular and irregular memory accesses can be viewed as a combination of continuous and discrete functions. This viewpoint is shown to yield better performance through a characterization of memory divergence on a multi-GPU node using the sub-additivity property. Further, a detection test procedure based on a quenched large deviations model is proposed, which generates threshold values for optimizing memory mapping in data-intensive applications and hence improves performance.
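To make the threshold-generation idea concrete, the following is a minimal illustrative sketch (not the paper's actual procedure): it uses the classical Cramér large-deviations rate function for a Bernoulli model of per-request irregularity, and derives the smallest observed irregular-access fraction at which the large-deviations tail bound exp(-n·I(a)) falls below a chosen false-alarm level. The baseline fraction `p`, sample count `n`, and level `alpha` are hypothetical parameters chosen for illustration.

```python
import math

def bernoulli_rate(a, p):
    """Cramér rate function I(a) for i.i.d. Bernoulli(p) samples:
    I(a) = a*ln(a/p) + (1-a)*ln((1-a)/(1-p))."""
    if a <= 0.0 or a >= 1.0:
        return float("inf")
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def divergence_threshold(p, n, alpha=1e-3):
    """Smallest empirical irregular-access fraction a > p whose
    large-deviations tail bound exp(-n*I(a)) is at most alpha.
    Uses bisection, since I(a) is increasing on [p, 1)."""
    target = -math.log(alpha) / n      # require I(a) >= target
    lo, hi = p, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bernoulli_rate(mid, p) < target:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical example: 10% of sampled requests are irregular on
# average; with 10,000 sampled requests, flag a kernel for remapping
# when the observed irregular fraction exceeds the threshold.
thr = divergence_threshold(p=0.10, n=10_000, alpha=1e-3)
```

With these numbers the threshold sits only slightly above the 10% baseline (roughly 11%), reflecting how quickly exponential concentration sharpens the detection boundary as the sample count grows.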
Tamizharasan, P.S., Ramasubramanian, N. Analysis of large deviations behavior of multi-GPU memory access in deep learning. J Supercomput 74, 2199–2212 (2018). https://doi.org/10.1007/s11227-018-2246-4