A Reconfigurable Posit Tensor Unit with Variable-Precision Arithmetic and Automatic Data Streaming

Published in: Journal of Signal Processing Systems

Abstract

The increased adoption of DNN applications has driven the emergence of dedicated tensor computing units that accelerate multi-dimensional matrix multiplication. Although these units deploy highly efficient computing architectures, they often lack support for more general-purpose application domains, both because of their consolidated computation scheme (restricted to matrix multiplication) and because of their frequent adoption of low-precision/custom floating-point formats (unsuited for general application domains). In contrast, this paper proposes a new Reconfigurable Tensor Unit (RTU) that deploys an array of variable-precision Vector Multiply-Accumulate (VMA) units. Each VMA unit leverages the new Posit floating-point format and supports the full range of standardized posit precisions in a single SIMD unit, with variable vector-element width. Moreover, the proposed RTU exploits the Posit format's support for fused operations, together with spatial and time-multiplexing reconfiguration mechanisms, to fuse and combine multiple VMAs and map high-level and complex operations. The RTU is also supported by an automatic data streaming infrastructure and a pipelined data transfer scheme, allowing it to accelerate most data-parallel patterns commonly found in vectorizable applications. The proposed RTU is shown to outperform state-of-the-art tensor and SIMD units present in off-the-shelf platforms, as well as dedicated FPGA-based accelerators, in turn resulting in significant energy-efficiency improvements.
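For readers unfamiliar with the format, the sketch below decodes a posit value from its bit pattern (sign, run-length-encoded regime, es exponent bits, hidden-one fraction). It is an illustrative software model only, not the paper's hardware decoder, and it assumes the pre-2022 draft-standard convention in which posit8/16/32 use es = 0/1/2.

```python
def decode_posit(bits: int, n: int, es: int) -> float:
    """Decode an n-bit posit with es exponent bits to a float (illustrative)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")  # NaR (Not a Real)
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (-bits) & mask          # negate via two's complement
    rest = (bits << 1) & mask          # drop the sign bit
    # Regime: a run of identical bits terminated by its complement.
    first = rest >> (n - 1)
    run = 0
    while run < n - 1 and (rest >> (n - 1 - run)) & 1 == first:
        run += 1
    k = run - 1 if first else -run     # regime value
    rest = (rest << (run + 1)) & mask  # drop regime bits + terminator
    # Exponent: the next es bits; fraction: the remainder with hidden 1.
    exp = (rest >> (n - es)) if es else 0
    frac_bits = (rest << es) & mask
    frac = 1.0 + frac_bits / (1 << n)
    useed = 1 << (1 << es)             # useed = 2^(2^es)
    return sign * (useed ** k) * (2.0 ** exp) * frac
```

For example, the posit8 (es = 0) pattern 0x50 decodes as regime k = 0 and fraction 0.5, i.e. the value 1.5; the same decode path, with wider fields, covers all the precisions a variable-precision VMA must handle.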


Figures 1–19 (available in the full article).


Notes

  1. Although it is out of the scope of this work, deploying a VMA (or the RTU) either as a CPU functional unit or as a dedicated accelerator only requires connecting each controller to a centralized mechanism that facilitates its programming.

  2. The adopted NVIDIA tensor core was used as a representative platform in the domain of tensor accelerators, not only due to its accessibility, but also because it constitutes a fair and valid comparison basis, since its topology is close to that of the RTU base architecture.
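The fused operations mentioned in the abstract amount to accumulating products exactly and rounding only once at the end (the quire mechanism of the posit format). The sketch below models that behavior with exact rationals; it is an illustrative model of the numerical semantics, not the RTU's VMA datapath, and the name fused_dot is ours.

```python
from fractions import Fraction

def fused_dot(a, b):
    """Quire-style fused dot product: accumulate exactly, round once at the end."""
    acc = Fraction(0)
    for x, y in zip(a, b):
        acc += Fraction(x) * Fraction(y)  # exact product and sum, no intermediate rounding
    return float(acc)                     # single final rounding
```

A conventional SIMD unit instead rounds after every multiply and every add, so long accumulations (e.g. the dot products inside a matrix multiplication) can drift; deferring rounding to one final step is what makes fusing multiple VMAs numerically attractive.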



Author information

Correspondence to Nuno Neves.

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under projects UIDB/50021/2020 and PTDC/EEI-HAC/30485/2017.


About this article


Cite this article

Neves, N., Tomás, P. & Roma, N. A Reconfigurable Posit Tensor Unit with Variable-Precision Arithmetic and Automatic Data Streaming. J Sign Process Syst 93, 1365–1385 (2021). https://doi.org/10.1007/s11265-021-01687-7

