For the last few years, single-thread performance has been improving at a snail’s pace. Power limitations, increasing relative memory latency, and the exhaustion of gains from instruction-level parallelism are forcing microprocessor architects to examine new processor design strategies. In this dissertation, I examine a technology that can improve the efficiency of modern microprocessors: vectors. Vectors are a simple, power-efficient way to take advantage of common data-level parallelism in an extensible, easily programmable manner. My work focuses on the process of transitioning from traditional scalar microprocessors to computers that can take advantage of vectors.
First, I describe a process for extending existing single-instruction, multiple-data instruction sets to support full vector processing in a way that remains binary compatible with existing applications. Initial implementations can be low cost, yet can be transparently extended to higher performance later.
I also describe ViVA, the Virtual Vector Architecture. ViVA adds vector-style memory operations to existing microprocessors but does not include arithmetic datapaths; instead, memory instructions work with a new buffer placed between the core and the second-level cache. ViVA is a low-cost way to obtain much of the performance of a full vector memory hierarchy while avoiding the complexity of adding a full vector system.
Finally, I test the performance of ViVA by modifying a cycle-accurate full-system simulator to support ViVA’s operation. After extensive calibration, I test the basic performance of ViVA using a series of microbenchmarks. I compare the performance of a variety of ViVA configurations for corner turn, used in processing multidimensional data, and sparse matrix-vector multiplication, used in many scientific applications. Results show that ViVA can give significant benefit for a variety of memory access patterns, without relying on a costly hardware prefetcher.
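The two kernels named above can be sketched in a few lines to show the memory access patterns involved. This is a minimal scalar illustration and not the dissertation's benchmark code: the function names, the flat row-major layout, and the choice of compressed sparse row (CSR) storage are all assumptions made for the example.

```python
def corner_turn(matrix, rows, cols):
    """Transpose a rows x cols matrix stored row-major in a flat list.

    Hypothetical helper: layout and name are assumptions for illustration.
    """
    out = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # Reads walk memory with unit stride; writes jump by `rows`
            # elements each step, which is a strided access pattern.
            out[c * rows + r] = matrix[r * cols + c]
    return out


def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A * x with A in CSR form (an assumed storage format)."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx, so the access is indexed (a gather)
            # and its locality depends on the sparsity structure.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

In the corner turn, one side of the copy is always strided; in the sparse product, the reads of `x` are data-dependent. Strided and indexed accesses of this kind are what vector-style memory operations are designed to express directly.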