
A Vector-Length Agnostic Compiler for the Connex-S Accelerator with Scratchpad Memory

Published: 03 October 2020

Abstract

Compiling sequential C programs for Connex-S is challenging: Connex-S is a competitive, scalable, and customizable wide vector accelerator for intensive embedded applications, with 32 to 4,096 16-bit integer lanes and a limited-capacity local scratchpad memory.
Our compiler toolchain uses the LLVM framework and targets OPINCAA, a C++ just-in-time (JIT) vector assembler and coordination library through which Connex-S accelerates computations for an arbitrary host CPU. In the compiler middle end we therefore address efficient vectorization, communication, and synchronization. We perform quantitative static analysis of the program, useful, among other things, for the symbolic-size compiler memory allocator and for the OPINCAA coordination mechanism. We also discuss the LLVM back end for the Connex-S processor and the methodology for automatically generating instruction selection code that efficiently emulates arithmetic and logical operations on non-native types such as 32-bit integer and 16-bit floating point.
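To make the non-native type emulation concrete, the following minimal C sketch (not taken from the compiler or from OPINCAA) shows how a 32-bit addition can be carried out with only 16-bit additions and a carry test, which is the kind of lowering the back end must produce for the 16-bit Connex-S lanes; the u32_halves type and add32_emulated function are illustrative names introduced here.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical illustration: a 32-bit value split into two 16-bit halves,
     * roughly how i32 arithmetic can be lowered onto 16-bit lanes. */
    typedef struct { uint16_t lo, hi; } u32_halves;

    /* Add two 32-bit values using only 16-bit additions and a carry test. */
    static u32_halves add32_emulated(u32_halves a, u32_halves b) {
        u32_halves r;
        r.lo = (uint16_t)(a.lo + b.lo);          /* wraps modulo 2^16 */
        uint16_t carry = (r.lo < a.lo);          /* unsigned overflow => carry */
        r.hi = (uint16_t)(a.hi + b.hi + carry);  /* propagate carry to high half */
        return r;
    }

    int main(void) {
        u32_halves a = { 0xFFFF, 0x0001 };       /* 0x0001FFFF */
        u32_halves b = { 0x0001, 0x0000 };       /* 0x00000001 */
        u32_halves s = add32_emulated(a, b);
        printf("0x%04X%04X\n", s.hi, s.lo);      /* prints 0x00020000 */
        return 0;
    }

Other 32-bit operations (subtraction, comparison, shifts) decompose similarly into pairs of 16-bit lane operations.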
By using JIT vector assembling and by encoding the Connex-S vector length as a parameter in the generated OPINCAA program, we achieve vector-length agnosticism: the same program supports execution on distinct embedded devices, such as several digital cameras with different resolutions, each equipped with a custom-width Connex-S accelerator meant to save energy on the image processing kernels.
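The C fragment below is a hedged illustration of the strip-mining idea behind vector-length agnosticism, not the actual OPINCAA output: the lane count vlen is a run-time parameter, so the same kernel logic covers a 32-lane and a 4,096-lane Connex-S instance. The scale_add_vla kernel and its signature are assumptions made only for this example.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical vector-length agnostic strip-mining: vlen is the lane count
     * of whatever accelerator is attached, read at run time. The inner loop
     * stands in for a single vector operation issued over vlen lanes. */
    void scale_add_vla(int16_t *y, const int16_t *x, int16_t a,
                       size_t n, size_t vlen) {
        for (size_t i = 0; i < n; i += vlen) {
            size_t chunk = (n - i < vlen) ? n - i : vlen;  /* last, partial strip */
            for (size_t j = 0; j < chunk; ++j)             /* one "vector op" */
                y[i + j] = (int16_t)(a * x[i + j] + y[i + j]);
        }
    }

Calling such a kernel with vlen = 128 or vlen = 4096 changes only the strip width, not the program itself.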
Since Connex-S normally has a limited-capacity local scratchpad memory of 256 KB, we also use the PPCG C-to-C code generator to perform data tiling, minimizing total kernel execution time subject to fitting larger program data in the local memory. We devise an accurate cost model for the Connex-S accelerator to choose performance-optimal tile sizes at compile time.
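As a simplified illustration of the capacity constraint involved in tile-size selection, assume a kernel that keeps three square 16-bit operand tiles resident in the 256 KB scratchpad at once; the paper's cost model additionally weighs DMA transfer and compute time, which this sketch omits. A compile-time search over that constraint could look like this:

    #include <stdio.h>

    /* Hypothetical capacity-only tile-size choice: three T x T tiles of
     * 16-bit elements must all fit in the 256 KB scratchpad at once. */
    #define SCRATCHPAD_BYTES (256u * 1024u)
    #define ELEM_BYTES       2u    /* 16-bit integer elements */
    #define TILES_RESIDENT   3u    /* e.g., two input tiles and one output tile */

    static unsigned largest_fitting_tile(void) {
        unsigned t = 1;
        while (TILES_RESIDENT * (t + 1u) * (t + 1u) * ELEM_BYTES <= SCRATCHPAD_BYTES)
            ++t;
        return t;
    }

    int main(void) {
        unsigned t = largest_fitting_tile();
        printf("largest square tile: %u x %u\n", t, t);  /* 209 x 209 here */
        return 0;
    }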
We successfully compile several simple benchmarks frequently used, for example, in high-performance and embedded computer vision applications. We report speedup factors of up to 11.33 when running them on a Connex-S accelerator with 128 16-bit integer lanes, relative to the dual-core ARM Cortex-A9 host, which is clocked at a 6.67 times higher frequency and has a total of two 128-bit NEON SIMD units.



Information

Published In

ACM Transactions on Embedded Computing Systems, Volume 19, Issue 6
Special Issue on LCETES, Part 2, Learning, Distributed, and Optimizing Compilers
November 2020
271 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3427195
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 October 2020
Accepted: 01 June 2020
Revised: 01 March 2020
Received: 01 November 2019
Published in TECS Volume 19, Issue 6

Author Tags

  1. Connex-S vector (array) accelerator
  2. LLVM
  3. OPINCAA JIT vector assembler and coordination library
  4. quantitative static analysis
  5. vector-length agnostic compiler for the custom-width Connex-S accelerator
  6. vectorization

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 20
  • Downloads (last 6 weeks): 3
Reflects downloads up to 28 Dec 2024

Cited By

  • (2024) MuDP: Multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency. Frontiers of Computer Science 19:5. DOI: 10.1007/s11704-023-3566-y. Online publication date: 22-Nov-2024.
  • (2023) Mira: A Program-Behavior-Guided Far Memory System. Proceedings of the 29th Symposium on Operating Systems Principles, 692-708. DOI: 10.1145/3600006.3613157. Online publication date: 23-Oct-2023.
  • (2022) Compiling for Vector Extensions With Stream-Based Specialization. IEEE Micro 42:5, 49-58. DOI: 10.1109/MM.2022.3173405. Online publication date: 1-Sep-2022.
  • (2022) Compilation of Parallel Data Access for Vector Processor in Radio Base Stations. IEEE Embedded Systems Letters 14:1, 11-14. DOI: 10.1109/LES.2021.3085664. Online publication date: Mar-2022.
  • (2022) Python-Based Programming Framework for a Heterogeneous MapReduce Architecture. 2022 14th International Conference on Communications (COMM), 1-6. DOI: 10.1109/COMM54429.2022.9817183. Online publication date: 16-Jun-2022.
