A performance study of memory consistency models
Recent advances in technology are such that the speed of processors is increasing faster than memory latency is decreasing. Therefore the relative cost of a cache miss is becoming more important. However, the full cost of a cache miss need not be paid ...
Lazy release consistency for software distributed shared memory
Relaxed memory consistency models, such as release consistency, were introduced in order to reduce the impact of remote memory access latency in both software and hardware distributed shared memory (DSM). However, in a software DSM, it is also important ...
Hiding memory latency using dynamic scheduling in shared-memory multiprocessors
The large latency of memory accesses is a major impediment to achieving high performance in large-scale shared-memory multiprocessors. Relaxing the memory consistency model is an attractive technique for hiding this latency by allowing the overlap of ...
Effects of building blocks on the performance of super-scalar architecture
The inherent low level parallelism of Super-Scalar architectures plays an important role in the processing power provided by these machines: independent functional units promote opportunities for executing several machine operations simultaneously. From ...
Limits of control flow on parallelism
This paper discusses three techniques useful in relaxing the constraints imposed by control flow on parallelism: control dependence analysis, executing multiple flows of control simultaneously, and speculative execution. We evaluate these techniques by ...
The expandable split window paradigm for exploiting fine-grain parallelism
We propose a new processing paradigm, called the Expandable Split Window (ESW) paradigm, for exploiting fine-grain parallelism. This paradigm considers a window of instructions (possibly having dependencies) as a single unit, and exploits fine-grain ...
Towards a shared-memory massively parallel multiprocessor
A set of ultra-high-throughput (more than one gigabit per second) serial links used as a processor-memory network can enable the construction of a shared-memory massively parallel multiprocessor. The bandwidth of the network is far beyond values found in ...
Comparative performance evaluation of cache-coherent NUMA and COMA architectures
Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use ...
The DASH prototype: implementation and performance
The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design ...
Performance evaluation of a decoded instruction cache for variable instruction-length computers
A Decoded INstruction Cache (DINC) serves as a buffer between the instruction decoder and the other instruction-pipeline stages. In this paper we explain how techniques that reduce the branch penalty based on such a cache can improve CPU performance. ...
A simulation based study of TLB performance
This paper presents the results of a simulation-based study of various translation lookaside buffer (TLB) architectures, in the context of a modern VLSI RISC processor. The simulators used address traces, generated by instrumented versions of the SPEC ...
Alternative implementations of two-level adaptive branch prediction
As the issue rate and depth of pipelining of high-performance superscalar processors increase, an excellent branch predictor becomes more vital to delivering the potential performance of a wide-issue, deeply pipelined microarchitecture. ...
An elementary processor architecture with simultaneous instruction issuing from multiple threads
- Hiroaki Hirata,
- Kozo Kimura,
- Satoshi Nagamine,
- Yoshiyuki Mochizuki,
- Akio Nishimura,
- Yoshimori Nakase,
- Teiji Nishizawa
In this paper, we propose a multithreaded processor architecture which improves machine throughput. In our processor architecture, instructions from different threads (not a single thread) are issued simultaneously to multiple functional units, and ...
Thread-based programming for the EM-4 hybrid dataflow machine
In this paper, we present a thread-based programming model for the EM-4 hybrid dataflow machine, where parallelism and synchronization among threads of sequential execution are described explicitly by the programmer. Although EM-4 was originally ...
*T: a multithreaded massively parallel architecture
What should the architecture of each node in a general purpose, massively parallel architecture (MPA) be? We frame the question in concrete terms by describing two fundamental problems that must be solved well in any general purpose MPA. From this, we ...
Adjustable block size coherent caches
Several studies have shown that the performance of coherent caches depends on the relationship between the granularity of sharing and locality exhibited by the program and the cache block size. Large cache blocks exploit processor and spatial locality, ...
Performance optimization of pipelined primary cache
The CPU cycle time of a high-performance processor is usually determined by the access time of the primary cache. As processor speeds increase, designers will have to increase the number of pipeline stages used to fetch data from the cache in order to ...
Cache replacement with dynamic exclusion
Most recent cache designs use direct-mapped caches to provide the fast access time required by modern high-speed CPUs. Unfortunately, direct-mapped caches have higher miss rates than set-associative caches, largely because direct-mapped caches are more ...
Processor coupling: integrating compile time and runtime scheduling for parallelism
The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-...
Improved multithreading techniques for hiding communication latency in multiprocessors
Shared-memory multiprocessors are considered among the easiest parallel computers to program. However, building shared-memory machines with thousands of processors has proved difficult because of the inevitably long memory latencies. Much previous ...
Instruction-level parallelism in Prolog: analysis and architectural support
The demand for increasing computational power for symbolic processing has given a strong impulse to the development of ASICs dedicated to the execution of Prolog. Unlike past microcoded implementations based on the Warren machine model, novel trends in high ...
Memory latency effects in decoupled architectures with a single data memory module
Decoupled computer architectures partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. These architectures make use of an access processor to ...
Interleaved parallel schemes: improving memory throughput on supercomputers
On many commercial supercomputers, several vector register processors share a global highly interleaved memory in a MIMD mode. When all the processors are working on a single vector loop, a significant part of the potential memory throughput may be ...
Active messages: a mechanism for integrated communication and computation
The design challenge for large-scale multiprocessors is (1) to minimize communication overhead, (2) to allow communication to overlap computation, and (3) to coordinate the two without sacrificing processor cost/performance. We show that existing message ...
Planar-adaptive routing: low-cost adaptive networks for multiprocessors
Network throughput can be increased by allowing multipath, adaptive routing. Adaptive routing allows more freedom in the paths taken by messages, spreading load over physical channels more evenly. The flexibility of adaptive routing introduces new ...
The turn model for adaptive routing
We present a model for designing wormhole routing algorithms that are deadlock free, livelock free, minimal or nonminimal, and maximally adaptive. A unique feature of this model is that it is not based on adding physical or virtual channels to network ...
Low-latency message communication support for the AP1000
Low-latency communication is the key to achieving a high-performance parallel computer. When using state-of-the-art processors, we must take cache memory into account. This paper presents an architecture for low-latency message communication and ...
Futurebus+ as an I/O bus: profile B
The IEEE Futurebus+ is a very fast (3 GB/sec), industry-standard backplane bus specification for computer systems. Futurebus+ was designed independently of any CPU architecture, so it is truly open. With this open architecture, Futurebus+ can be applied to ...
A study of I/O system organizations
With increasing processing speeds, it has become important to design powerful and efficient I/O systems. In this paper, we look at several design options for an I/O system and study their impact on performance. Specifically, we use ...
Comparison of sparing alternatives for disk arrays
This paper explores how choice of sparing methods impacts the performance of RAID level 5 (or parity striped) disk arrays. The three sparing methods examined are dedicated sparing, distributed sparing, and parity sparing. For database type workloads ...