A performance study of memory consistency models
Recent advances in technology are such that the speed of processors is increasing faster than memory latency is decreasing. Therefore the relative cost of a cache miss is becoming more important. However, the full cost of a cache miss need not be paid ...
Lazy release consistency for software distributed shared memory
Relaxed memory consistency models, such as release consistency, were introduced in order to reduce the impact of remote memory access latency in both software and hardware distributed shared memory (DSM). However, in a software DSM, it is also important ...
Hiding memory latency using dynamic scheduling in shared-memory multiprocessors
The large latency of memory accesses is a major impediment to achieving high performance in large-scale shared-memory multiprocessors. Relaxing the memory consistency model is an attractive technique for hiding this latency by allowing the overlap of ...
Effects of building blocks on the performance of super-scalar architecture
The inherent low level parallelism of Super-Scalar architectures plays an important role in the processing power provided by these machines: independent functional units promote opportunities for executing several machine operations simultaneously. From ...
Limits of control flow on parallelism
This paper discusses three techniques useful in relaxing the constraints imposed by control flow on parallelism: control dependence analysis, executing multiple flows of control simultaneously, and speculative execution. We evaluate these techniques by ...
The expandable split window paradigm for exploiting fine-grain parallelism
We propose a new processing paradigm, called the Expandable Split Window (ESW) paradigm, for exploiting fine-grain parallelism. This paradigm considers a window of instructions (possibly having dependencies) as a single unit, and exploits fine-grain ...
Towards a shared-memory massively parallel multiprocessor
A set of ultra-high-throughput (more than one gigabit per second) serial links used as a processor-memory network can enable the construction of a shared-memory massively parallel multiprocessor. The bandwidth of the network is far beyond values found in ...
Comparative performance evaluation of cache-coherent NUMA and COMA architectures
Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use ...
The DASH prototype: implementation and performance
The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design ...
Performance evaluation of a decoded instruction cache for variable instruction-length computers
A Decoded INstruction Cache (DINC) serves as a buffer between the instruction decoder and the other instruction-pipeline stages. In this paper we explain how techniques that reduce the branch penalty based on such a cache can improve CPU performance. ...
A simulation based study of TLB performance
This paper presents the results of a simulation-based study of various translation lookaside buffer (TLB) architectures, in the context of a modern VLSI RISC processor. The simulators used address traces, generated by instrumented versions of the SPEC ...
Alternative implementations of two-level adaptive branch prediction
As the issue rate and depth of pipelining of high-performance superscalar processors increase, an excellent branch predictor becomes more vital to delivering the potential performance of a wide-issue, deeply pipelined microarchitecture. ...
An elementary processor architecture with simultaneous instruction issuing from multiple threads
- Hiroaki Hirata,
- Kozo Kimura,
- Satoshi Nagamine,
- Yoshiyuki Mochizuki,
- Akio Nishimura,
- Yoshimori Nakase,
- Teiji Nishizawa
In this paper, we propose a multithreaded processor architecture which improves machine throughput. In our processor architecture, instructions from different threads (not a single thread) are issued simultaneously to multiple functional units, and ...
Thread-based programming for the EM-4 hybrid dataflow machine
In this paper, we present a thread-based programming model for the EM-4 hybrid dataflow machine, where parallelism and synchronization among threads of sequential execution are described explicitly by the programmer. Although EM-4 was originally ...
*T: a multithreaded massively parallel architecture
What should the architecture of each node in a general purpose, massively parallel architecture (MPA) be? We frame the question in concrete terms by describing two fundamental problems that must be solved well in any general purpose MPA. From this, we ...
Adjustable block size coherent caches
Several studies have shown that the performance of coherent caches depends on the relationship between the granularity of sharing and locality exhibited by the program and the cache block size. Large cache blocks exploit processor and spatial locality, ...
Performance optimization of pipelined primary cache
The CPU cycle time of a high-performance processor is usually determined by the access time of the primary cache. As processor speeds increase, designers will have to increase the number of pipeline stages used to fetch data from the cache in order to ...
Cache replacement with dynamic exclusion
Most recent cache designs use direct-mapped caches to provide the fast access time required by modern high-speed CPUs. Unfortunately, direct-mapped caches have higher miss rates than set-associative caches, largely because direct-mapped caches are more ...
Processor coupling: integrating compile time and runtime scheduling for parallelism
The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-...
Improved multithreading techniques for hiding communication latency in multiprocessors
Shared-memory multiprocessors are considered among the easiest parallel computers to program. However, building shared-memory machines with thousands of processors has proved difficult because of the inevitably long memory latencies. Much previous ...
Instruction-level parallelism in Prolog: analysis and architectural support
The demand for increasing computational power for symbolic processing has given a strong impulse to the development of ASICs dedicated to the execution of Prolog. Unlike past microcoded implementations based on the Warren machine model, novel trends in high ...
Memory latency effects in decoupled architectures with a single data memory module
Decoupled computer architectures partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. These architectures make use of an access processor to ...
Interleaved parallel schemes: improving memory throughput on supercomputers
On many commercial supercomputers, several vector register processors share a global highly interleaved memory in a MIMD mode. When all the processors are working on a single vector loop, a significant part of the potential memory throughput may be ...
Active messages: a mechanism for integrated communication and computation
The design challenge for large-scale multiprocessors is (1) to minimize communication overhead, (2) to allow communication to overlap computation, and (3) to coordinate the two without sacrificing processor cost/performance. We show that existing message ...
Planar-adaptive routing: low-cost adaptive networks for multiprocessors
Network throughput can be increased by allowing multipath, adaptive routing. Adaptive routing allows more freedom in the paths taken by messages, spreading load over physical channels more evenly. The flexibility of adaptive routing introduces new ...
The turn model for adaptive routing
We present a model for designing wormhole routing algorithms that are deadlock free, livelock free, minimal or nonminimal, and maximally adaptive. A unique feature of this model is that it is not based on adding physical or virtual channels to network ...
Low-latency message communication support for the AP1000
Low-latency communication is the key to achieving a high-performance parallel computer. When using state-of-the-art processors, we must take cache memory into account. This paper presents an architecture for low-latency message communication and ...
Futurebus+ as an I/O bus: profile B
The IEEE Futurebus+ is a very fast (3 GB/sec), industry-standard backplane bus specification for computer systems. Futurebus+ was designed independently of any CPU architecture, so it is truly open. With this open architecture, Futurebus+ can be applied to ...
A study of I/O system organizations
With increasing processing speeds, it has become important to design powerful and efficient I/O systems. In this paper, we look at several design options for an I/O system and study their impact on performance. Specifically, we use ...
Comparison of sparing alternatives for disk arrays
This paper explores how choice of sparing methods impacts the performance of RAID level 5 (or parity striped) disk arrays. The three sparing methods examined are dedicated sparing, distributed sparing, and parity sparing. For database type workloads ...