research-article

Deep reinforcement learning in loop fusion problem

Published: 07 April 2022

Abstract

The execution time and resource consumption of loops are vital concerns when evaluating complex scientific or computational algorithms, and they have motivated loop performance optimization techniques such as fusion. In the literature, loop fusion merges loops while accounting for a set of properties of the loops and of the system on which the resulting code will execute. The number of these factors and their interactions, on the one hand, and the high runtime of existing comprehensive approaches, on the other, reveal the need for a new method that can make further progress on this NP-hard problem. This article introduces Deep Reinforcement Learning Loop Fusion (DRLLF), the first approach of its kind, as a solution to this challenge. In the proposed framework, a matrix encoding the problem's information, namely data dependencies, data reuse, loop types, and the register size of the computer system, is configured as the input of a deep neural network. Randomly generated matrices of this form are used in the training phase, in which reinforcement learning acquires the experience needed to predict a profitable distribution over the loops' possible fusion orders. In our evaluations, the proposed algorithm matched or outperformed the methods under study in terms of speedup, achieving results that were on average approximately 7.36 percent better. This considerable improvement, together with the low runtime, demonstrates the comprehensiveness and effectiveness of the approach.
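The idea described above can be illustrated with a minimal sketch. The abstract does not specify the paper's exact matrix layout or network architecture, so everything below is hypothetical: the four per-loop features, the linear scoring weights, and the brute-force enumeration of permutations (which only stands in for the deep policy network, and is feasible only for a handful of loops) are all illustrative assumptions, not the authors' method.

```python
import itertools
import math

def encode_loops(loops, register_size=16):
    """Build a feature matrix for a set of loops.

    Hypothetical encoding: one row per loop with columns for the number of
    data dependencies, a data-reuse score, the loop type, and a
    register-pressure ratio (the four kinds of information the abstract
    names as inputs).
    """
    matrix = []
    for lp in loops:
        matrix.append([
            float(lp["deps"]),             # data dependencies
            float(lp["reuse"]),            # data-reuse score
            float(lp["kind"]),             # loop type (0 = serial, 1 = parallel)
            lp["regs"] / register_size,    # register-pressure ratio
        ])
    return matrix

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fusion_order_distribution(matrix, weights):
    """Score every permutation (candidate fusion order) of the loops with a
    simple position-weighted linear policy, then softmax the scores into a
    distribution over fusion orders. A trained deep network would replace
    this hand-written scorer."""
    n = len(matrix)
    orders = list(itertools.permutations(range(n)))
    scores = []
    for order in orders:
        s = 0.0
        for pos, idx in enumerate(order):
            # Earlier positions in the fusion order weigh more heavily.
            s += sum(w * f for w, f in zip(weights, matrix[idx])) / (pos + 1)
        scores.append(s)
    return orders, softmax(scores)

# Three toy loops with made-up feature values.
loops = [
    {"deps": 2, "reuse": 1.0, "kind": 1, "regs": 8},
    {"deps": 0, "reuse": 0.5, "kind": 0, "regs": 4},
    {"deps": 1, "reuse": 2.0, "kind": 1, "regs": 12},
]
orders, probs = fusion_order_distribution(encode_loops(loops),
                                          weights=[0.5, 1.0, 0.2, -0.3])
```

A reinforcement-learning training loop would then sample an order from this distribution (e.g. with `random.choices(orders, weights=probs)`), measure the resulting speedup as a reward, and adjust the policy's parameters accordingly.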




        Published In

        Neurocomputing  Volume 481, Issue C
        Apr 2022
        358 pages

        Publisher

        Elsevier Science Publishers B. V.

        Netherlands


        Author Tags

        1. Parallelizing compilers
        2. Loop optimization
        3. Loop fusion
        4. Data dependence
        5. Deep reinforcement learning
