Abstract
We propose automatic synthesis of application-specific instruction set processors (ASIPs). We use pipelined execution of multi-op machine instructions; for example, \(*(\mathtt{reg1} * \mathtt{reg2}) = (*\mathtt{reg3}) + (*\mathtt{reg4})\) (C syntax) is an instruction with three memory pipeline stages and two arithmetic stages. The problem is, for a given set of loops, to find a pipeline configuration and a multi-op ISA that maximize the IPC (instructions per cycle) while minimizing the resource usage and the cost of interconnections to the register file of the resulting CPU. The algorithm is based on finding an efficient cover of a large graph by a small set of convex sub-graphs (called \(g_i\)s) that are consistent with a given set of pipeline units. Unlike in previous works, the \(g_i\)s are not synthesized to circuits that are executed in a co-processor mode; rather, both the \(g_i\)s and the rest of the program are executed by the same set of multi-op pipeline units. In this way we eliminate the overhead associated with the co-processor mode of regular ASIPs while maintaining the high IPC values of these ASIPs. The main advantage of pipelined execution of multi-op instructions over VLIW instructions is shown to be the reduced cost of interconnections between the CPU’s execution units and the register file. Once the pipeline configuration and the cover \(g_1 \cup \cdots \cup g_n = G\) have been computed, the Verilog RTL of the corresponding CPU (extended with branch instructions) is generated and synthesized to an FPGA. The results show that, for a set of selected kernels, the resulting ASIP (called Ocpu) obtains higher IPC values compared to an equivalent compilation to an ARM CPU while obtaining similar clock frequencies.
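To make the notion of a convex sub-graph concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the usual definition: a node set is convex if no dataflow path between two of its nodes passes through a node outside the set. The node names (`ld3`, `ld4`, `mul`, `add`, `st`) and the function names are hypothetical labels chosen here for the example instruction \(*(\mathtt{reg1} * \mathtt{reg2}) = (*\mathtt{reg3}) + (*\mathtt{reg4})\).

```python
from collections import defaultdict


def reachable(adjacency, sources):
    """Return every node reachable from `sources` by one or more edges."""
    seen = set()
    stack = [v for s in sources for v in adjacency.get(s, ())]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adjacency.get(v, ()))
    return seen


def is_convex(edges, subset):
    """Check convexity of `subset` in the DAG given by `edges`.

    The subset is convex when no outside node is both a descendant
    and an ancestor of subset nodes (i.e. lies on a path between them).
    """
    subset = set(subset)
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    below = reachable(succ, subset)  # descendants of the subset
    above = reachable(pred, subset)  # ancestors of the subset
    return not ((below & above) - subset)


# Hypothetical dataflow graph of the example instruction:
# two loads feed the add, the multiply produces the store address.
EDGES = [("ld3", "add"), ("ld4", "add"), ("add", "st"), ("mul", "st")]

print(is_convex(EDGES, {"ld3", "add", "st"}))  # True
print(is_convex(EDGES, {"ld3", "st"}))         # False: the path goes through "add"
```

Convexity matters for instruction selection because a non-convex candidate would need a value that is produced outside the instruction after the instruction has already started.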
Notes
Using Vivado \(+\) Kintex-7, we compared the Ocpu (\(p=2\), \(k=5\)) with Amber (a free clone of ARM-7) and found that the Ocpu required 4.73 W versus 1.2 W for the Amber CPU. This is reasonable, as the Ocpu contains 10 times more functional operations than the Amber.
This uses a technique similar to the one used in Dilworth’s theorem [13], wherein it was shown that a DAG G of width K (analogous to a VLIW execution) can be covered by K chains (analogous to a pipeline execution).
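The chain cover promised by Dilworth’s theorem can be computed with a standard construction, sketched below under stated assumptions; it illustrates the theorem and is not the covering algorithm of the paper. The sketch builds the transitive closure of the DAG and runs a simple augmenting-path bipartite matching, so that the number of chains equals the number of nodes minus the matching size. All function names and the example graph are hypothetical.

```python
def transitive_closure(nodes, edges):
    """Map each node to the set of all nodes reachable from it."""
    succ = {v: set() for v in nodes}
    for u, v in edges:
        succ[u].add(v)
    reach = {}

    def visit(u):
        if u not in reach:
            acc = set()
            for v in succ[u]:
                acc.add(v)
                acc |= visit(v)
            reach[u] = acc
        return reach[u]

    for v in nodes:
        visit(v)
    return reach


def min_chain_cover(nodes, edges):
    """Partition the DAG's nodes into a minimum number of chains."""
    reach = transitive_closure(nodes, edges)
    prev_in_chain = {}  # v -> u: u directly precedes v in its chain

    def try_augment(u, visited):
        for v in reach[u]:
            if v in visited:
                continue
            visited.add(v)
            if v not in prev_in_chain or try_augment(prev_in_chain[v], visited):
                prev_in_chain[v] = u
                return True
        return False

    for u in nodes:
        try_augment(u, set())

    next_in_chain = {u: v for v, u in prev_in_chain.items()}
    chains = []
    for head in nodes:
        if head not in prev_in_chain:  # no predecessor: start of a chain
            chain = [head]
            while chain[-1] in next_in_chain:
                chain.append(next_in_chain[chain[-1]])
            chains.append(chain)
    return chains


# Hypothetical DAG of width 3: {"ld3", "ld4", "mul"} is a largest antichain.
NODES = ["ld3", "ld4", "mul", "add", "st"]
EDGES = [("ld3", "add"), ("ld4", "add"), ("add", "st"), ("mul", "st")]

chains = min_chain_cover(NODES, EDGES)
print(len(chains))  # 3
```

Consecutive elements of a chain are ordered in the transitive closure, so they need not be joined by a direct edge of the DAG. Which nodes end up in which chain depends on the matching found; only the number of chains is fixed by the width.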
References
Aho, A.V., Sethi, R., Ullman, J.D.: Compilers Principles, Techniques and Tools. Addison-Wesley, Reading, MA (1986)
Atasu, K., Pozzi, L., Ienne, P.: Automatic application-specific instruction-set extensions under microarchitectural constraints. In: Proceedings of the 40th Annual Design Automation Conference (2003)
Battista, G.D., Eades, P., Tamassia, R., Tollis, I.G.: Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall PTR, Upper Saddle River (1998)
Ben-Asher, Y., Lipov, I., Tartakovsky, V., Tiv, D.: Using multi-op instructions as a way to generate aggressive asips. In: The 22nd IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM (POSTER) (2014)
Biswas, P., Dutt, N.D.: Code size reduction in heterogeneous-connectivity-based DSPs using instruction set extensions. IEEE Trans. Comput. 54, 1216–1226 (2005)
Biswas, P., Dutt, N.D., Pozzi, L., Ienne, P.: Introduction of architecturally visible storage in instruction set extensions. IEEE Trans. Comput-Aided Des. Integr. Circuits Syst. 26(3), 435–446 (2007)
Bollobás, B., Brightwell, G.: The height of a random partial order: concentration of measure. Ann. Appl. Probab. 2(4), 1009–1018 (1992)
Callahan, T.J., Hauser, J.R., Wawrzynek, J.: The garp architecture and C compiler. Computer 33, 62–69 (2000)
Chattopadhyay, A., Ahmed, W., Karari, K., Kammler, D., Leupers, R., Ascheid, G., Meyr, H.: Design space exploration of partially re-configurable embedded processors. In: Proceedings of the Conference on Design, Automation and Test in Europe, DATE’07 (2007)
Clark, N.T., Zhong, H., Mahlke, S.A.: Automated custom instruction generation for domain-specific processor acceleration. IEEE Trans. Comput. 54(10), 1258–1270 (2005)
Cong, J., Fan, Y., Han, G., Zhang, Z.: Application-specific instruction generation for configurable processor architectures. In: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (2004)
Cong, J., Han, G., Zhang, Z.: Architecture and compiler optimizations for data bandwidth improvement in configurable processors. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 14(9), 986–997 (2006)
Dilworth, R.P.: A decomposition theorem for partially ordered sets. Ann. Math. 51(1), 161–166 (1950)
Galuzzi, C., Bertels, K.: The instruction-set extension problem: a survey. In: Reconfigurable Computing: Architectures, Tools and Applications, pp. 209–220. Springer (2008)
Hauck, S., Fry, T.W., Hosler, M.M., Kao, J.P.: The chimaera reconfigurable functional unit. In: Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines (1997)
Jain, M.K., Balakrishnan, M., Kumar, A.: Asip design methodologies: survey and issues. In: Proceedings of the 14th International Conference on VLSI Design (VLSID’01) (2001)
Kastner, R., Kaplan, A., Memik, S.O., Bozorgzadeh, E.: Instruction generation for hybrid reconfigurable systems. ACM Trans. Des. Autom. Electron. Syst. 7, 605–627 (2002)
Kohler, S., Braunes, J., Spallek, R.G., Sawitzki, S.: Improving code efficiency for reconfigurable vliw processors. In: IEEE Computer Society (IPDPS.2002) (2002)
Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. In: Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04). Palo Alto, California, March (2004)
Leibson, S.: Designing SOCs with Configured Cores: Unleashing the Tensilica Xtensa and Diamond Cores. Morgan Kaufmann, Los Altos, CA (2006)
Liao, S., Devadas, S., Keutzer, K., Tjiang, S.: Instruction selection using binate covering for code size optimization. In: Conference on Computer-Aided Design, ICCAD-95, pp. 393–399. IEEE (1995)
Peymandoust, A., Pozzi, L., Ienne, P., De Micheli, G.: Automatic instruction set extension and utilization for embedded processors. In: Application-Specific Systems, Architectures, and Processors, 2003. Proceedings. IEEE International Conference on, pp. 108–118. IEEE (2003)
Pozzi, L., Ienne, P.: Exploiting pipelining to relax register-file port constraints of instruction-set extensions. In: Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES’05 (2005)
Pricopi, M., Mitra, T.: Bahurupi: a polymorphic heterogeneous multi-core architecture. ACM Trans. Archit. Code Optim. (TACO) 8(4), 22 (2012)
Radhakrishnan, S., Guo, H., Parameswaran, S.: Dual-pipeline heterogeneous asip design. In: Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (2004)
Thite, S.: On covering a graph optimally with induced subgraphs. ArXiv preprint arXiv:cs/0604013 (2006)
VanAken, J.R., Zick, G.L.: The expression processor: a pipelined, multiple-processor architecture. IEEE Trans. Comput. 100(8), 525–536 (1981)
Verkest, D., Van Rompaey, K., Bolsens, I., De Man, H.: Coware design environment for heterogeneous hardware/software systems. Des. Autom. Embed. Syst. 1(4), 357–386 (1996)
Villa, T., Kam, T., Brayton, R.K., Sangiovanni-Vincentelli, A.L.: Explicit and implicit algorithms for binate covering problems. IEEE Trans. Comput-Aided Des. Integr. Circuits Syst. 16(7), 677–691 (1997)
Yu, P., Mitra, T.: Disjoint pattern enumeration for custom instructions identification. In: Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pp. 273–278. IEEE (2007)
Additional information
This work is supported by the Israel Ministry of Science, Grant No. 3-10894.
Cite this article
Ben Asher, Y., Lipov, I., Tartakovsky, V. et al. Generating ASIPs with Reduced Number of Connections to the Register-File. Int J Parallel Prog 45, 1461–1487 (2017). https://doi.org/10.1007/s10766-017-0491-4
DOI: https://doi.org/10.1007/s10766-017-0491-4