Volume 20, Issue 2, May 1992. Special Issue: Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA '92)
A performance study of memory consistency models

Recent advances in technology are such that the speed of processors is increasing faster than memory latency is decreasing. Therefore the relative cost of a cache miss is becoming more important. However, the full cost of a cache miss need not be paid ...

Lazy release consistency for software distributed shared memory

Relaxed memory consistency models, such as release consistency, were introduced in order to reduce the impact of remote memory access latency in both software and hardware distributed shared memory (DSM). However, in a software DSM, it is also important ...

Hiding memory latency using dynamic scheduling in shared-memory multiprocessors

The large latency of memory accesses is a major impediment to achieving high performance in large-scale shared-memory multiprocessors. Relaxing the memory consistency model is an attractive technique for hiding this latency by allowing the overlap of ...

Effects of building blocks on the performance of super-scalar architecture

The inherent low-level parallelism of super-scalar architectures plays an important role in the processing power provided by these machines: independent functional units promote opportunities for executing several machine operations simultaneously. From ...

Limits of control flow on parallelism

This paper discusses three techniques useful in relaxing the constraints imposed by control flow on parallelism: control dependence analysis, executing multiple flows of control simultaneously, and speculative execution. We evaluate these techniques by ...

The expandable split window paradigm for exploiting fine-grain parallelism

We propose a new processing paradigm, called the Expandable Split Window (ESW) paradigm, for exploiting fine-grain parallelism. This paradigm considers a window of instructions (possibly having dependencies) as a single unit, and exploits fine-grain ...

Towards a shared-memory massively parallel multiprocessor

A set of ultra-high-throughput (more than one gigabit per second) serial links used as the processor-memory network can form the basis of a shared-memory massively parallel multiprocessor. The bandwidth of the network is far beyond values found in ...

Comparative performance evaluation of cache-coherent NUMA and COMA architectures

Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use ...

The DASH prototype: implementation and performance

The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design ...

Performance evaluation of a decoded instruction cache for variable instruction-length computers

A Decoded INstruction Cache (DINC) serves as a buffer between the instruction decoder and the other instruction-pipeline stages. In this paper we explain how techniques that reduce the branch penalty based on such a cache can improve CPU performance. ...

A simulation based study of TLB performance

This paper presents the results of a simulation-based study of various translation lookaside buffer (TLB) architectures, in the context of a modern VLSI RISC processor. The simulators used address traces, generated by instrumented versions of the SPEC ...

Alternative implementations of two-level adaptive branch prediction

As the issue rate and depth of pipelining of high-performance superscalar processors increase, an excellent branch predictor becomes increasingly vital to delivering the potential performance of a wide-issue, deeply pipelined microarchitecture. ...
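A two-level predictor pairs a branch history register with a pattern history table of 2-bit counters. The sketch below is a minimal GAg-style variant for illustration; the sizes and names are assumptions, not details from the paper.

```python
# Minimal sketch of a two-level adaptive branch predictor (GAg-style):
# a global branch history register (BHR) indexes a pattern history table
# (PHT) of 2-bit saturating counters. Sizes are illustrative only.

class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.bhr = 0                              # global branch history register
        self.pht = [1] * (1 << history_bits)      # 2-bit counters, weakly not-taken

    def predict(self):
        # Counter values 2 and 3 predict "taken".
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        ctr = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        # Shift the actual outcome into the history register.
        mask = (1 << self.history_bits) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask

# After a short warm-up, the predictor learns a repeating pattern like T,T,T,N.
p = TwoLevelPredictor()
pattern = [True, True, True, False] * 20
hits = 0
for outcome in pattern:
    hits += (p.predict() == outcome)
    p.update(outcome)
```

With a 4-bit history, each phase of the T,T,T,N pattern selects a distinct PHT entry, so after the counters train the predictor is correct on every access.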

An elementary processor architecture with simultaneous instruction issuing from multiple threads

In this paper, we propose a multithreaded processor architecture which improves machine throughput. In our processor architecture, instructions from different threads (not a single thread) are issued simultaneously to multiple functional units, and ...

Thread-based programming for the EM-4 hybrid dataflow machine

In this paper, we present a thread-based programming model for the EM-4 hybrid dataflow machine, where parallelism and synchronization among threads of sequential execution are described explicitly by the programmer. Although EM-4 was originally ...

*T: a multithreaded massively parallel architecture

What should the architecture of each node in a general purpose, massively parallel architecture (MPA) be? We frame the question in concrete terms by describing two fundamental problems that must be solved well in any general purpose MPA. From this, we ...

Adjustable block size coherent caches

Several studies have shown that the performance of coherent caches depends on the relationship between the granularity of sharing and locality exhibited by the program and the cache block size. Large cache blocks exploit processor and spatial locality, ...

Performance optimization of pipelined primary cache

The CPU cycle time of a high-performance processor is usually determined by the access time of the primary cache. As processor speeds increase, designers will have to increase the number of pipeline stages used to fetch data from the cache in order to ...

Cache replacement with dynamic exclusion

Most recent cache designs use direct-mapped caches to provide the fast access time required by modern high-speed CPUs. Unfortunately, direct-mapped caches have higher miss rates than set-associative caches, largely because direct-mapped caches are more ...
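The conflict misses behind that higher miss rate can be seen with a toy simulation (this illustrates the background problem, not the paper's dynamic-exclusion mechanism): two addresses that map to the same set evict each other in a direct-mapped cache, while a 2-way set-associative cache holds both.

```python
# Toy miss-rate comparison: direct-mapped vs. 2-way set-associative LRU.
# Addresses and cache sizes are illustrative only.

def misses_direct_mapped(addrs, num_sets):
    cache = [None] * num_sets              # one tag per set
    misses = 0
    for a in addrs:
        s, tag = a % num_sets, a // num_sets
        if cache[s] != tag:
            misses += 1
            cache[s] = tag
    return misses

def misses_two_way_lru(addrs, num_sets):
    cache = [[] for _ in range(num_sets)]  # up to two tags per set, MRU last
    misses = 0
    for a in addrs:
        s, tag = a % num_sets, a // num_sets
        ways = cache[s]
        if tag in ways:
            ways.remove(tag)               # hit: refresh to MRU position
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)                # evict the LRU way
        ways.append(tag)
    return misses

# Block addresses 0 and 8 both map to set 0 of an 8-set cache.
stream = [0, 8] * 10
dm = misses_direct_mapped(stream, 8)       # thrashes: every access misses
sa = misses_two_way_lru(stream, 8)         # only the two cold misses
```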

Processor coupling: integrating compile time and runtime scheduling for parallelism

The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-...

Improved multithreading techniques for hiding communication latency in multiprocessors

Shared memory multiprocessors are considered among the easiest parallel computers to program. However, building shared memory machines with thousands of processors has proved difficult because of the inevitably long memory latencies. Much previous ...

Instruction-level parallelism in Prolog: analysis and architectural support

The demand for increasing computation power for symbolic processing has given a strong impulse to the development of ASICs dedicated to the execution of Prolog. Unlike past microcoded implementations based on the Warren machine model, novel trends in high ...

Memory latency effects in decoupled architectures with a single data memory module

Decoupled computer architectures partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. These architectures make use of an access processor to ...

Interleaved parallel schemes: improving memory throughput on supercomputers

On many commercial supercomputers, several vector register processors share a global highly interleaved memory in a MIMD mode. When all the processors are working on a single vector loop, a significant part of the potential memory throughput may be ...

Active messages: a mechanism for integrated communication and computation

The design challenge for large-scale multiprocessors is (1) to minimize communication overhead, (2) to allow communication to overlap computation, and (3) to coordinate the two without sacrificing processor cost/performance. We show that existing message ...

Planar-adaptive routing: low-cost adaptive networks for multiprocessors

Network throughput can be increased by allowing multipath, adaptive routing. Adaptive routing allows more freedom in the paths taken by messages, spreading load over physical channels more evenly. The flexibility of adaptive routing introduces new ...

The turn model for adaptive routing

We present a model for designing wormhole routing algorithms that are deadlock free, livelock free, minimal or nonminimal, and maximally adaptive. A unique feature of this model is that it is not based on adding physical or virtual channels to network ...
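One algorithm in the turn-model family is west-first routing: a packet takes all of its westward hops first and never turns back west, which eliminates the turns needed to close a deadlock cycle on a 2D mesh. The sketch below is a minimal deterministic variant for illustration, not the paper's full adaptive formulation.

```python
# Minimal sketch of west-first routing on a 2D mesh (turn-model family).
# All westward hops come first; afterwards the packet only moves east,
# north, or south. Coordinates are (x, y) with west = decreasing x.

def west_first_path(src, dst):
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x > dx:                  # phase 1: all westward hops
        x -= 1
        path.append((x, y))
    while x < dx or y != dy:       # phase 2: east/north/south, never west
        if x < dx:
            x += 1
        elif y < dy:
            y += 1
        else:
            y -= 1
        path.append((x, y))
    return path

# The packet moves west to column 1 before heading north.
route = west_first_path((3, 1), (1, 3))
```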

Low-latency message communication support for the AP1000

Low-latency communication is the key to achieving a high-performance parallel computer. In using state-of-the-art processors, we must take cache memory into account. This paper presents an architecture for low-latency message communication and ...

Futurebus+ as an I/O bus: profile B

The IEEE Futurebus+ is a very fast (3GB/sec.), industry-standard backplane bus specification for computer systems. Futurebus+ was designed independently of any CPU architecture, so it is truly open. With this open architecture, Futurebus+ can be applied to ...

A study of I/O system organizations

With increasing processing speeds, it has become important to design powerful and efficient I/O systems. In this paper, we look at several design options in designing an I/O system and study their impact on performance. Specifically, we use ...

Comparison of sparing alternatives for disk arrays

This paper explores how choice of sparing methods impacts the performance of RAID level 5 (or parity striped) disk arrays. The three sparing methods examined are dedicated sparing, distributed sparing, and parity sparing. For database type workloads ...
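Whichever sparing scheme is chosen, rebuilding a failed disk onto the spare relies on RAID-5's XOR parity: each lost block is the XOR of the surviving blocks in its stripe. A minimal sketch with illustrative 4-byte blocks (the block contents and stripe width are assumptions):

```python
# RAID-5 rebuild in miniature: the parity block is the byte-wise XOR of the
# data blocks, so any one lost block equals the XOR of the survivors.
from functools import reduce

def parity(blocks):
    # Byte-wise XOR across all blocks in the stripe.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A stripe of three data blocks plus their parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\x0f\x0e\x0d\x0c"]
p = parity(data)

# The disk holding data[1] fails: reconstruct it onto the spare from the
# surviving data blocks and the parity block.
rebuilt = parity([data[0], data[2], p])
```

The three sparing methods differ in where the spare capacity lives, not in this reconstruction rule.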
