More Web Proxy on the site http://driver.im/

article

A Dynamic Modulo Scheduling with Binary Translation: Loop optimization with software compatibility

Authors:

Ricardo Ferreira,

Monica Pereira,

Carlos A. LisbźA,

Luigi CarroAuthors Info & Claims

Journal of Signal Processing Systems, Volume 85, Issue 1

Pages 45 - 66

https://doi.org/10.1007/s11265-015-0974-8

Published: 01 October 2016 Publication History

Abstract

In the past years, many works have demonstrated the applicability of Coarse-Grained Reconfigurable Array (CGRA) accelerators to optimize loops by using software pipelining approaches. They are proven to be effective in reducing the total execution time of multimedia and signal processing applications. However, the run-time reconfigurability of CGRAs is hampered overheads introduced by the needed translation and mapping steps. In this work, we present a novel run-time translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. We propose a greedy approach, since the modulo scheduling for CGRA is an NP-complete problem. In addition to read-after-write dependencies, the dynamic modulo scheduling faces new challenges, such as register insertion to solve recurrence dependences and to balance the pipelining paths. Our results demonstrate that the greedy run-time algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for a 16-issue VLIW processor. The proposed mechanism ensures software compatibility as it supports different source ISAs. As proof of concept of scaling, a change in the memory bandwidth has been evaluated. In this analysis it is demonstrated that when changing from one memory access per cycle to two memory accesses per cycle, the modulo scheduling algorithm is able to exploit this increase in memory bandwidth and enhance performance accordingly. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA (with 16 processing elements and including an 4-issue VLIW host processor) is only 1.11 bigger than a standalone 8-issue VLIW softcore processor.

References

[1]

Ahn, M., Yoon, J.W., Paek, Y., Kim, Y., Kiemb, M., Choi, K. (2006). A spatial mapping algorithm for heterogeneous coarse-grained reconfigurable architectures. In: Proceedings DATE, pp. 363---368.

Digital Library

[2]

Arnold, O., Matus, E., Noethen, B., Winter, M., Limberg, T., Fettweis, G. (2014). Tomahawk: Parallelism and heterogeneity in communications signal processing mpsocs. ACM Transactions on Embedded Computing Systems, 13(3s), 107:1---107:241.

Digital Library

[3]

Beck, A.C.S., Rutzig, M.B., Gaydadjiev, G., Carro, L. (2008). Transparent reconfigurable acceleration for heterogeneous embedded applications. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1208---1213.

Digital Library

[4]

Bispo, J., Paulino, N., Cardoso, J.M., Ferreira, J.C. (2013). Transparent runtime migration of loop-based traces of processor instructions to reconfigurable processing units. International Journal of Reconfigurable Computing

[5]

Bispo, J., Paulino, N., Ferreira, J., Cardoso, J. (2012). Transparent trace-based binary acceleration for reconfigurable hw/sw systems. IEEE Transactions on Industrial Informatics.

[6]

Boppu, S., Hannig, F., Teich, J. (2014). Compact code generation for tightly-coupled processor arrays. Journal of Signal Processing Systems, 77(1-2), 5---29.

Digital Library

[7]

Bouwens, F., Berekovic, M., Kanstein, A., Gaydadjiev, G. (2007). Architectural exploration of the adres coarse-grained reconfigurable array. In Proceedings ARC pp. 1---13.

Digital Library

[8]

Chen, L., & Mitra, T. (2012). Graph minor approach for application mapping on cgras. In: Proceedings FPT.

[9]

De Sutter, B., Coene, P., Vander Aa, T., Mei, B. (2008). Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays. In Proceedings LCTES (pp. 151---160).

Digital Library

[10]

Ferreira, R., Duarte, V., Meireles, W., Pereira, M., Carro, L., Wong, S. (2013). A just-in-time modulo scheduling for virtual coarse-grained reconfigurable architectures. In SAMOS XIII.

[11]

Ferreira, R., Vendramini, J.G., Mucida, L., Pereira, M.M., Carro, L. (2011). An fpga-based heterogeneous coarse-grained dynamically reconfigurable architecture. In: Proceedings CASES.

Digital Library

[12]

Friedman, S., Carroll, A., Van Essen, B., Ylvisaker, B., Ebeling, C., Hauck, S. (2009). Spr: an architecture-adaptive cgra mapping tool. In: Proceeding of the ACM/SIGDA international symposium on field programmable gate arrays, FPGA '09 (pp. 191---200). New York: ACM

Digital Library

[13]

Goel, N., Kumar, A., Panda, P.R. (2014). Shared-port register file architecture for low-energy vliw processors. ACM Transactions Architectural Code Optimization, 11(1).

Digital Library

[14]

Hamzeh, M., Shrivastava, A., Vrudhula, S. (2012). Epimap: Using epimorphism to map applications on CGRAs. In Proceeding of DAC conference (pp. 1280---1287).

Digital Library

[15]

Hamzeh, M., Shrivastava, A., Vrudhula, S. (2014). Branch-aware loop mapping on CGRAs. In Proceeding of DAC conference on design automation conference (pp. 1---6). ACM.

Digital Library

[16]

Hamzeh, M., Shrivastava, A., Vrudhula, S.B. (2013). Regimap: register-aware application mapping on coarse-grained reconfigurable architectures (CGRAs). In Proceeding of DAC conference (p. 18).

Digital Library

[17]

Hartenstein, R. (2001). Coarse grain reconfigurable architecture (embedded tutorial). In: Proceedings of the 2001 asia and south pacific design automation conference, ASP-DAC '01.

Digital Library

[18]

Hatanaka, A., & Bagherzadeh, N. (2007). A modulo scheduling algorithm for a coarse-grain reconfigurable array template. In IPDPS 2007 (pp. 1---8).

[19]

Hoogerbrugge, J., & Corporaal, H. (1994). Register file port requirements of transport triggered architectures. In: Proceedings of the 27th annual international symposium on microarchitecture. (pp. 191---195). ACM.

Digital Library

[20]

Jääskeläinen, P., Kultala, H., Viitanen, T., Takala, J. (2014). Code density and energy efficiency of exposed datapath architectures. Journal of Signal Processing Systems, 1---16.

[21]

Kim, Y., Lee, J., Shrivastava, A., Yoon, J., Cho, D., Paek, Y. (2011). High throughput data mapping for coarse-grained reconfigurable architectures. IEEE Transactions on CAD of International Circuits and Systems, 30 (11), 1599 ---1609.

Digital Library

[22]

Laboratories, & H.P. (2014). Vex toolchain. http://www.hpl.hp.com/downloads/vex/.

[23]

Lee, G., Choi, K., Dutt, N. (2011). Mapping multi-domain applications onto coarse-grained reconfigurable architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(5), 637---650.

Digital Library

[24]

Lin, T.J., Chen, S.K., Kuo, Y.T., Liu, C.W., Hsiao, P.C. (2008). Design and implementation of a high-performance and complexity-effective vliw dsp for multimedia applications. Journal of Signal Processing Systems, 51(3), 209---223.

Digital Library

[25]

Loeffler, C., Ligtenberg, A., Moschytz, G.S. (1989). Practical fast 1-d dct algorithms with 11 multiplications. IEEE international conference on acoustics, speech, and signal processing, 1989. ICASSP-89 (pp. 988---991).

[26]

McCool, M. (2007). Signal processing and general-purpose computing and gpus {exploratory dsp}. IEEE Signal Processing Magazine, 24(3), 109---114.

[27]

Mei, B., Vernalde, S., Verkest, D., De Man, H., Lauwereins, R. (2002). Dresc: a retargetable compiler for coarse-grained reconfigurable architectures. In Proceedings FPT pp. 166---173.

[28]

Mei, B., Vernalde, S., Verkest, D., Man, H.D., Lauwereins, R. (2003). Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In: Proceedings DATE.

Digital Library

[29]

Oh, T., Egger, B., Park, H., Mahlke, S. (2009). Recurrence cycle aware modulo scheduling for coarse-grained reconfigurable architectures. In Proceedings LCTES pp. 21---30.

Digital Library

[30]

Paek, J.K., Choi, K., Lee, J. (2011). Binary acceleration using coarse-grained reconfigurable architecture. SIGARCH Computers Architecture News, 38(4), 33---39.

Digital Library

[31]

Park, H., Fan, K., Kudlur, M., Mahlke, S. (2006). Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures. In Proceedings CASES (pp. 136---146).

Digital Library

[32]

Park, H., Fan, K., Mahlke, S.A., Oh, T., Kim, H., Kim, H.s. (2008). Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In: Proceedings PACT.

Digital Library

[33]

Park, H., Park, Y., Mahlke, S. (2009). Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications. In Proceedings MICRO (pp. 370---380).

Digital Library

[34]

Rau, B.R. (1994). Iterative modulo scheduling: an algorithm for software pipelining loops. In Proceedings MICRO (pp. 63---74).

Digital Library

[35]

Wong, S., Van As, T., Brown, G. (2008). ¿-vex: A reconfigurable and extensible softcore vliw processor. In International conference on field-programmable technology FPT (pp. 369---372). IEEE.

[36]

Yoon, J., Shrivastava, A., Park, S., Ahn, M., Jeyapaul, R., Paek, Y. (2008). Spkm: A novel graph drawing based algorithm for application mapping onto coarse-grained reconfigurable architectures. In Proceedings ASPDAC (pp. 776---782).

Digital Library

[37]

Zhou, L., Liu, H., Zhang, J. (2013). Loop acceleration by cluster-based cgra. IEICE Electronics Express, 10(16).

A Dynamic Modulo Scheduling with Binary Translation: Loop optimization with software compatibility
1. General and reference
  1. Cross-computing tools and techniques
2. Software and its engineering
  1. Software notations and tools

Recommendations

Low overhead dynamic binary translation on ARM
PLDI '17

The ARMv8 architecture introduced AArch64, a 64-bit execution mode with a new instruction set, while retaining binary compatibility with previous versions of the ARM architecture through AArch32, a 32-bit execution mode. Most hardware implementations ...
Low overhead dynamic binary translation on ARM
PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation

The ARMv8 architecture introduced AArch64, a 64-bit execution mode with a new instruction set, while retaining binary compatibility with previous versions of the ARM architecture through AArch32, a 32-bit execution mode. Most hardware implementations ...
Register Constrained Modulo Scheduling

Abstract--Software pipelining is an instruction scheduling technique that exploits the instruction level parallelism (ILP) available in loops by overlapping operations from various successive loop iterations. The main drawback of aggressive software ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Signal Processing Systems

Journal of Signal Processing Systems Volume 85, Issue 1

October 2016

176 pages

ISSN:1939-8018

EISSN:1939-8115

Issue’s Table of Contents

Copyright © Copyright © 2016 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 October 2016

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 18 Dec 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents