Hardware–Software Co-Design for Decimal Multiplication
Figure 1. Area-delay relationship with precision for decimal arithmetic.
Figure 2. Decimal multiplication process. Hardware–software co-design solutions are considered in the dotted rectangle in this paper.
Figure 3. Software decimal multiplication.
Figure 4. Hardware decimal multiplication.
Figure 5. Relationship among binary-coded decimal (BCD), densely packed decimal (DPD), and base-billion formats.
Figure 6. Flow of the proposed methods: (a) Method-1, (b) Method-2, (c) Method-3, and (d) Method-4. White and gray blocks indicate software parts and hardware parts, respectively.
Figure 7. The 64-bit partial product selector.
Figure 8. Parallel Accumulator for 64-bit operation.
Figure 9. Base-billion to BCD converter.
Figure 10. Overview of RISC-V-based evaluation environment. The dotted region is the framework proposed in Reference [21].
Figure 11. High-level architecture of Rocket Chip with Rocket Custom Coprocessor (RoCC) interface with accelerator.
Figure 12. Execution cycle distributions for the proposed methods and the software solution for a 64-bit format. Curves of probability density functions for the proposed methods and software and full hardware solutions are depicted.
Figure 13. Comparison of the execution cycles for different input types. Each bar shows the average execution cycle for each pair of solution and input type. The black color indicates the cycles used by the hardware, e.g., the bar with blue and black colors shows the total execution time for Method-1, and the black area shows the cycles for the hardware.
Figure 14. Area-delay tradeoff for the 64-bit precision.
Abstract
1. Introduction
- Method-1 is the most area-efficient solution: addition is supported by decimal hardware, and most operations are processed sequentially. It does not use any binary arithmetic and hence does not require time-consuming decimal-to-binary or binary-to-decimal conversion. It accelerates execution efficiently compared with software-based solutions.
- Method-2 achieves the highest speed by replacing the sequential accumulation that uses a decimal adder in Method-1 with a parallel decimal multiplier. However, it requires a large area overhead.
- Method-3 partially relies on binary arithmetic: only the multiplicand undergoes decimal-to-binary and binary-to-decimal conversion, and a parallel decimal multiplier is used as in Method-2. It achieves a moderate speedup but requires a large area overhead.
- Method-4 achieves its speedup by adopting a hardware binary-to-decimal converter. The basic operations for multiplication rely on well-optimized binary arithmetic. It also achieves good acceleration with a relatively low area overhead.
2. Decimal Floating-Point Multiplication
- The sign and temporary exponent are calculated from the signs and exponents of the operands X and Y.
- A coefficient multiplication is performed.
- Finally, the exponents are adjusted accordingly.
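The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Dfp` tuple, the 16-digit default precision (the decimal64 coefficient width), and the round-half-even rule are assumptions made for the example, and special values (NaN, infinity) and exponent range checks are omitted.

```python
from typing import NamedTuple


class Dfp(NamedTuple):
    """Hypothetical unpacked decimal floating-point value: (-1)^sign * coefficient * 10^exponent."""
    sign: int          # 0 = positive, 1 = negative
    coefficient: int   # non-negative integer coefficient
    exponent: int      # power of ten


def dfp_multiply(x: Dfp, y: Dfp, precision: int = 16) -> Dfp:
    # Step 1: sign and temporary exponent from the operands.
    sign = x.sign ^ y.sign
    exponent = x.exponent + y.exponent

    # Step 2: coefficient multiplication (exact integer product).
    coefficient = x.coefficient * y.coefficient

    # Step 3: if the product has too many digits, drop the low ones
    # (round half to even, an assumption here) and adjust the exponent.
    excess = len(str(coefficient)) - precision
    if excess > 0:
        divisor = 10 ** excess
        quotient, remainder = divmod(coefficient, divisor)
        half = divisor // 2
        if remainder > half or (remainder == half and quotient % 2 == 1):
            quotient += 1
        if quotient == 10 ** precision:  # rounding carried into an extra digit
            quotient //= 10
            excess += 1
        coefficient = quotient
        exponent += excess
    return Dfp(sign, coefficient, exponent)
```

For example, `dfp_multiply(Dfp(0, 25, -1), Dfp(1, 4, 0))` multiplies 2.5 by -4 and yields a negative result with coefficient 100 and exponent -1.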
2.1. Software Library
2.2. Hardware Solutions
3. Proposed Methods for Decimal Multiplication
3.1. Method-1
3.2. Method-2
3.3. Method-3
3.4. Method-4
3.5. Decimal Hardware Component Design
3.5.1. BCD-CLA
3.5.2. Partial Product Selector
3.5.3. Parallel Accumulator
3.5.4. BCD Converter
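As a functional reference for what a base-billion to BCD converter computes, the sketch below unpacks base-billion limbs into packed BCD. It assumes each limb holds nine decimal digits (a value from 0 to 999,999,999) and that limbs are stored least significant first; the function name and the sequential digit loop are illustrative only and say nothing about how the hardware converter is organized.

```python
def base_billion_to_bcd(limbs):
    """Convert little-endian base-billion limbs to a packed-BCD integer.

    Each limb contributes exactly nine 4-bit BCD digits, so the hexadecimal
    representation of the result spells out the decimal digits.
    """
    bcd = 0
    shift = 0
    for limb in limbs:
        if not 0 <= limb < 10 ** 9:
            raise ValueError("each limb must be in the range 0..999999999")
        for _ in range(9):
            limb, digit = divmod(limb, 10)  # peel off the lowest decimal digit
            bcd |= digit << shift           # place it in the next 4-bit nibble
            shift += 4
    return bcd
```

Reading the result in hexadecimal gives the decimal value directly: the limbs `[123456789, 42]` represent 42,123,456,789, and the function returns `0x42123456789`.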
3.5.5. Rounding Logic
4. Experimental Results
4.1. Experiment Setup
4.2. Area and Delay Tradeoff
4.3. Discussion
5. Conclusions
Future Vision
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
DPD | Densely packed decimal |
BID | Binary integer decimal |
BCD | Binary-coded decimal |
DFP | Decimal floating-point |
PP | Partial product |
PPG | Partial product generation |
PPA | Partial product accumulation |
CLA | Carry-lookahead adder |
CSA | Carry-save adder |
PA | Parallel accumulator |
SW | Software-based decimal computing |
HW | Hardware intensive decimal computing |
References
- Cowlishaw, M.F. Decimal floating-point: Algorism for computers. In Proceedings of the IEEE International Symposium on Computer Arithmetic, Santiago de Compostela, Spain, 15–18 June 2003; pp. 104–111.
- IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2008. 2008. Available online: https://standards.ieee.org/standard/754-2008.html (accessed on 11 June 2019).
- Oracle America, Inc. Java Class BigDecimal. 2015. Available online: http://javadoc.scijava.org/Java/java/math/BigDecimal.html (accessed on 11 June 2019).
- Cowlishaw, M. The decNumber C library, Version 3.68. 2010. Available online: http://speleotrove.com/decimal/decnumber.html (accessed on 11 June 2019).
- Cornea, M. Intel Decimal Floating-Point Math Library. 2011. Available online: https://software.intel.com/en-us/articles/intel-decimal-floating-point-math-library (accessed on 11 June 2019).
- Schulte, M.J.; Lindberg, N.; Laxminarain, A. Performance evaluation of decimal floating-point arithmetic. In Proceedings of the IBM Austin Center for Advanced Studies Conference, Austin, TX, USA, 18–20 October 2005; pp. 1–4.
- Schwarz, E.M.; Kaepernick, J.S.; Cowlishaw, M.F. The IBM z900 decimal arithmetic unit. In Proceedings of the Asilomar Conference on Signals Systems and Computers, Pacific Grove, CA, USA, 4–7 November 2001; pp. 1335–1339.
- Schwarz, E.M.; Kapernick, J.S.; Cowlishaw, M.F. Decimal floating-point support on the IBM System z10 processor. IBM J. Res. Dev. 2009, 53, 4.1–4.10.
- Carlough, S.; Collura, A.; Mueller, S.; Kroener, M. The IBM zEnterprise-196 decimal floating-point accelerator. In Proceedings of the IEEE International Symposium on Computer Arithmetic, Tübingen, Germany, 25–27 July 2011; pp. 139–146.
- Eisen, L.; Ward, J.W.; Test, H.; Mading, N.; Leenstra, J.; Mueller, S.M.; Jacobi, C.; Preiss, J.; Schwarz, E.M.; Carlough, S.R. IBM POWER6 accelerators: VMX and DFU. IBM J. Res. Dev. 2007, 51, 663–683.
- Yoshida, T.; Maruyama, T.; Akizuki, Y.; Kan, R.; Kiyota, N.; Ikenishi, K.; Itou, S.; Watahiki, T.; Okano, H. SPARC64 X: Fujitsu’s new-generation 16-core processor for Unix servers. IEEE Micro 2013, 33, 16–24.
- Wang, L.K.; Erle, M.A.; Tsen, C.; Schwarz, E.M.; Schulte, M.J. A survey of hardware designs for decimal arithmetic. IBM J. Res. Dev. 2010, 54, 8.1–8.15.
- Erle, M.A.; Schulte, M.J.; Linebarger, J.M. Potential speedup using decimal floating-point hardware. In Proceedings of the Asilomar Conference on Signals Systems and Computers, Pacific Grove, CA, USA, 3–6 November 2002; pp. 1073–1077.
- Erle, M.A.; Schulte, M.J.; Hickmann, B.J. Decimal floating-point multiplication via carry-save addition. In Proceedings of the IEEE International Symposium on Computer Arithmetic, Montpellier, France, 25–27 June 2007; pp. 46–55.
- Erle, M.A.; Hickmann, B.J.; Schulte, M.J. Decimal floating-point multiplication. IEEE Trans. Comput. 2009, 58, 902–916.
- Gorgin, S.; Jaberipur, G. Fully redundant decimal arithmetic. In Proceedings of the IEEE International Symposium on Computer Arithmetic, Portland, OR, USA, 8–10 June 2009; pp. 145–152.
- Cui, X.; Lombardi, F. A parallel decimal multiplier using hybrid binary coded decimal (BCD) codes. In Proceedings of the IEEE International Symposium on Computer Arithmetic, Santa Clara, CA, USA, 10–13 July 2016; pp. 150–155.
- Neto, H.C.; Vestias, M.P. Decimal multiplier on FPGA using embedded binary multiplier. In Proceedings of the International Conference on Field Programmable Logic and Application, Heidelberg, Germany, 8–10 September 2008; pp. 197–202.
- Bhattacharya, J.; Gupta, A.; Singh, A. A High Performance Binary to BCD Converter for Decimal Multiplication. In Proceedings of the 2010 International Symposium on VLSI Design, Automation and Test, Hsin Chu, Taiwan, 26–29 April 2010; pp. 315–318.
- Al-Khaleel, O.; Al-Qudah, Z.; Al-Khaleel, M.; Papachristou, C.A.; Wolff, F.G. Fast and compact binary-to-BCD conversion circuits for decimal multiplication. In Proceedings of the International Conference on Computer Design, Amherst, MA, USA, 9–12 October 2011; pp. 226–231.
- Mian, R.; Shintani, M.; Inoue, M. Cycle-accurate evaluation of software-hardware co-design of decimal computation in RISC-V ecosystem. In Proceedings of the IEEE International System on Chip Conference, Singapore, 3–6 September 2019; pp. 412–417.
- Cowlishaw, M. Densely packed decimal encoding. IEE Proc. Comput. Digit. Tech. 2002, 149, 102–104.
- Anderson, M.J.; Tsen, C. Performance analysis of decimal floating-point libraries and its impact on decimal hardware and software solutions. In Proceedings of the IEEE International Conference on Computer Design, Lake Tahoe, CA, USA, 4–7 October 2009; pp. 465–471.
- Cornea, M.; Harrison, J.; Anderson, C.; Tak, P.; Tang, P. A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format. IEEE Trans. Comput. 2009, 58, 148–162.
- Gonzalez-Navarro, S.; Tsen, C.; Schulte, M.J. Binary integer decimal-based floating-point multiplication. IEEE Trans. Comput. 2013, 62, 1460–1466.
- Cornea, M. IEEE 754-2008 Decimal floating-point for Intel architecture processors. In Proceedings of the IEEE International Symposium on Computer Arithmetic, Portland, OR, USA, 8–10 June 2009; pp. 325–328.
- Gorgin, S.; Jaberipur, G. Sign-magnitude encoding for efficient VLSI realization of decimal multiplication. IEEE Trans. Very Large Scale Integr. Syst. 2017, 25, 75–86.
- Vazquez, A.; Antelo, E.; Bruguera, J.D. Fast radix-10 multiplication using redundant BCD codes. IEEE Trans. Comput. 2014, 63, 1902–1914.
- Jaberipur, G.; Kaivani, A. Improving the speed of parallel decimal multiplication. IEEE Trans. Comput. 2009, 58, 1539–1552.
- Cui, X.; Dong, W.; Liu, W.; Swartzlander, E.E.; Lombardi, F. High performance parallel decimal multipliers using hybrid BCD codes. IEEE Trans. Comput. 2017, 66, 1994–2004.
- Zhu, M.; Jiang, Y.; Yang, M.; Chen, T. On high-performance parallel decimal fixed-point multiplier designs. Comput. Electr. Eng. 2014, 40, 2126–2138.
- Vazquez, A.; Antelo, E.; Montuschi, P. Improved design of high-performance parallel decimal multipliers. IEEE Trans. Comput. 2010, 59, 679–693.
- The RISC-V Instruction Set Manual, Volume i: Base User-Level ISA. 2011. Available online: https://riscv.org/ (accessed on 16 August 2019).
- Asanović, K. The Rocket Chip Generator; Technical Report UCB/EECS-2016-17; EECS Department, University of California: Berkeley, CA, USA, 2016.
- Design Compiler User Guide Version I-2013.06. Available online: https://www.synopsys.com/implementation-and-signoff.html (accessed on 11 June 2019).
- Goldman, R.; Bartleson, K.; Wood, T.; Kranen, K.; Cao, C.; Melikyan, V.; Markosyan, G. Synopsys’ open educational design kit: Capabilities, deployment and future. In Proceedings of the IEEE International Conference on Microelectronic Systems Education, San Francisco, CA, USA, 25–27 July 2009; pp. 20–24.
- Sayed-Ahmed, A.A.R.; Fahmy, H.A.H.; Hassan, M.Y. Three engines to solve verification constraints of decimal floating-point operation. In Proceedings of the Asilomar Conference on Signals Systems and Computers, Pacific Grove, CA, USA, 7–10 November 2010; pp. 1153–1157.
- Sayed-Ahmed, A.A. Verification of Decimal Floating-Point Operations. Master’s Thesis, Faculty of Engineering, Cairo University, Cairo, Egypt, 2011.
- IEEE Standard for Floating-Point Arithmetic; IEEE Std-754-2019 (Revision IEEE-754-2008); IEEE: Piscataway, NJ, USA, 2019; pp. 1–84.
Method | Block | Component |
---|---|---|
Method-1, 2 | pp[i + 1] = pp[i] + pp[1] | BCD-CLA |
Method-1 | Decimal left shift result; result = result + pp[k] | BCD-CLA |
Method-2, 3 | Select partial products | Partial Product Selector |
Method-2, 3 | Accumulate partial products | Parallel Accumulator |
Method-2, 3 | Round result | Rounding Logic |
Method-3, 4 | Convert base billion to BCD | BCD Converter |
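The Method-1 blocks listed in the table above (build the multiples with `pp[i + 1] = pp[i] + pp[1]`, then repeatedly shift the result left by one decimal digit and add `pp[k]`) can be modeled in software as below. This is a behavioral sketch under stated assumptions: `bcd_add` is a digit-serial stand-in for the BCD-CLA and does not model its carry-lookahead structure, the multiplier digits are processed most significant first, and all function names are hypothetical.

```python
def to_digits(n):
    """Little-endian list of decimal digits of a non-negative integer."""
    return [int(c) for c in reversed(str(n))]


def from_digits(digits):
    """Inverse of to_digits; leading zero digits are harmless."""
    return int("".join(str(d) for d in reversed(digits)))


def bcd_add(a, b):
    """Digit-wise decimal addition; functional model of a BCD adder."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        carry, digit = divmod(s, 10)  # s is at most 19, so carry is 0 or 1
        out.append(digit)
    if carry:
        out.append(carry)
    return out


def method1_multiply(x, y):
    # Partial product generation: pp[k] = k * X, built by repeated addition.
    pp = [[0], to_digits(x)]
    for i in range(1, 9):
        pp.append(bcd_add(pp[i], pp[1]))  # pp[i + 1] = pp[i] + pp[1]

    # Partial product accumulation, one multiplier digit per iteration.
    result = [0]
    for k in (int(c) for c in str(y)):
        result = [0] + result             # decimal left shift of result
        result = bcd_add(result, pp[k])   # result = result + pp[k]
    return from_digits(result)
```

Only decimal digit additions are used, which matches the description of Method-1 as avoiding binary arithmetic and format conversion; the sequential loop over multiplier digits is also why its delay grows quickly with precision.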
Method | Avg. Number of Cycles (Hardware) | Avg. Number of Cycles (Total) | Speed Up | Performance Loss |
---|---|---|---|---|
Software [4] | 0.00 | 4078.83 | - | 64.9% |
Method-1 | 509.12 | 2288.74 | 1.78× | 37.4% |
Method-2 | 286.91 | 1723.72 | 2.37× | 16.8% |
Method-3 | 337.92 | 2420.01 | 1.69× | 40.8% |
Method-4 | 106.32 | 2852.50 | 1.43× | 49.7% |
Hardware intensive | 86.40 | 1433.60 | 2.85× | - |
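The last two columns of the table above can be reproduced from the total cycle counts. The formulas below are inferred from the listed numbers rather than quoted from the text: speed-up is taken as the software total divided by the method's total, and performance loss as the share of a method's cycles that exceeds the hardware-intensive total.

```python
# Average total execution cycles, copied from the table above.
TOTAL_CYCLES = {
    "Software": 4078.83,
    "Method-1": 2288.74,
    "Method-2": 1723.72,
    "Method-3": 2420.01,
    "Method-4": 2852.50,
    "Hardware intensive": 1433.60,
}


def speed_up(method):
    """Speed-up relative to the software-only solution (inferred formula)."""
    return TOTAL_CYCLES["Software"] / TOTAL_CYCLES[method]


def performance_loss(method):
    """Percentage of cycles lost relative to the hardware-intensive solution (inferred formula)."""
    total = TOTAL_CYCLES[method]
    return 100.0 * (total - TOTAL_CYCLES["Hardware intensive"]) / total


for name in TOTAL_CYCLES:
    print(f"{name}: {speed_up(name):.2f}x, loss {performance_loss(name):.1f}%")
```

With these definitions, Method-2 evaluates to a 2.37× speed-up and a 16.8% loss, in agreement with the table.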
Method | Architecture | 64-Bit A (μm²) | 64-Bit D (ns) | 128-Bit A (μm²) | 128-Bit D (ns) |
---|---|---|---|---|---|
[27] | Signed-digit | 109,822 | 20.91 | 380,336 | 35.90 |
[30] | XS-3 | 106,078 | 19.39 | 371,940 | 25.21 |
[31] | 8421-BCD | 137,593 | 43.00 | 276,545 | 99.58 |
M-1 | BCD-CLA | 5459 | 63.00 | 11,270 | 117.10 |
M-2 | BCD-CLA, PPS | 88,290 | 67.33 | 228,094 | 123.66 |
M-3 | Converter, PPS | 99,428 | 4.66 | 250,734 | 6.56 |
Method | Architecture | 64-Bit A (μm²) | 64-Bit D (ns) | 128-Bit A (μm²) | 128-Bit D (ns) |
---|---|---|---|---|---|
[27] | Signed-digit | 93,794 | 26.46 | 207,720 | 33.14 |
[30] | Hybrid BCD | 94,268 | 36.68 | 230,929 | 130.25 |
[31] | 8421 CLA tree | 87,729 | 59.34 | 225,982 | 123.87 |
M-1 | BCD-CLA | 5459 | 112.00 | 11,270 | 416.40 |
M-2 | PA | 87,729 | 59.34 | 225,982 | 123.87 |
M-3 | PA | 87,729 | 59.34 | 225,982 | 123.87 |
Method | Architecture | 64-Bit A (μm²) | 64-Bit R (%) | 64-Bit D (ns) | 128-Bit A (μm²) | 128-Bit R (%) | 128-Bit D (ns) | Ratio A (128/64) | Ratio D (128/64) |
---|---|---|---|---|---|---|---|---|---|
[27] | Signed-digit | 216,789 | −1.53 | 47.37 | 614,779 | −16.16 | 69.04 | 2.89 | 1.46 |
[30] | XS-3 and Hybrid BCD | 213,519 | 0.00 | 56.07 | 629,592 | −18.95 | 155.46 | 3.01 | 2.77 |
[31] | 8421 addition, CLA tree | 238,495 | −10.47 | 102.34 | 529,250 | 0.00 | 229.83 | 2.23 | 2.25 |
M-1 | BCD-CLA | 5459 | 97.44 | 175.00 | 11,270 | 97.87 | 533.50 | 2.06 | 3.05 |
M-2 | BCD-CLA, PP selector, PA | 187,308 | 12.27 | 126.67 | 476,653 | 9.93 | 247.53 | 2.58 | 1.95 |
M-3 | PP selector, PA | 198,446 | 7.06 | 64.00 | 499,293 | 5.66 | 130.43 | 2.55 | 2.04 |
M-4 | BB to BCD converter | 16,596 | 92.22 | 42.53 | 33,910 | 93.59 | 150.60 | 2.04 | 3.54 |
Method | Avg. # Cycles | Area (μm²) |
---|---|---|
Software [4] | 4078.83 | - |
DPD to base-billion conversion | 606.36 | - |
exception check (NaN, Infinity, etc.) | 109.32 | - |
base-billion multiplication | 647.66 | - |
base-billion to BCD conversion | 1474.68 | - |
rounding | 795.70 | - |
Method-1 (total) | 2288.74 | 5459 |
addition in partial product generation | 178.48 | 5459 |
shift & addition in partial product accumulation | 330.63 | - |
Method-2 (total) | 1723.72 | 187,308 |
addition in partial product generation | 178.48 | 5459 |
partial product selection | 81.22 | 82,831 |
partial product accumulation | (included in selection) | 87,727 |
rounding | 27.20 | 11,288 |
Method-3 (total) | 2420.01 | 198,446 |
conversion to BCD | 229.50 | 16,596 |
partial product selection | 81.22 | 82,831 |
partial product accumulation | (included in selection) | 87,729 |
rounding | 27.20 | 11,288 |
Method-4 (total) | 2852.50 | 16,596 |
conversion to BCD | 106.32 | 16,596 |
Hardware intensive | 1433.60 | 235,424 |
coefficient multiplication | 86.40 | 235,424 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Mian, R.-u.-h.; Shintani, M.; Inoue, M. Hardware–Software Co-Design for Decimal Multiplication. Computers 2021, 10, 17. https://doi.org/10.3390/computers10020017
Mian R-u-h, Shintani M, Inoue M. Hardware–Software Co-Design for Decimal Multiplication. Computers. 2021; 10(2):17. https://doi.org/10.3390/computers10020017
Chicago/Turabian Style: Mian, Riaz-ul-haque, Michihiro Shintani, and Michiko Inoue. 2021. "Hardware–Software Co-Design for Decimal Multiplication" Computers 10, no. 2: 17. https://doi.org/10.3390/computers10020017
APA Style: Mian, R.-u.-h., Shintani, M., & Inoue, M. (2021). Hardware–Software Co-Design for Decimal Multiplication. Computers, 10(2), 17. https://doi.org/10.3390/computers10020017