OMP: a RISC-based multiprocessor using orthogonal-access memories and multiple spanning buses
- K. Hwang,
- M. Dubois,
- D. K. Panda,
- S. Rao,
- S. Shang,
- A. Uresin,
- W. Mao,
- H. Nair,
- M. Lytwyn,
- F. Hsieh,
- J. Liu,
- S. Mehrotra,
- C. M. Cheng
This paper presents the architectural design and RISC-based implementation of a prototype supercomputer, namely the Orthogonal MultiProcessor (OMP). The OMP system is constructed with 16 Intel i860 RISC microprocessors and 256 parallel memory modules, ...
A basic architecture supporting LGDG computation
In order to combine the benefits of dataflow and control-flow computation while avoiding the pitfalls of both, the authors propose a two-level model of large-grain dataflow computation, called LGDG computation. A formalism has been provided in a ...
An efficient caching support for critical sections in large-scale shared-memory multiprocessors
Directory-based and software-assisted schemes are the two main approaches to solving the cache coherence problem in large scale shared-memory multiprocessors. Until now, the emphasis in software-assisted schemes has been on ascertaining consistency ...
An improvement of I/O function for auxiliary storage: parallel I/O for a large scale supercomputing
A new I/O technique for external auxiliary storage (magnetic disk units) has been developed to improve I/O performance on HITAC VOS3/ES1 with conventional hardware architecture. Since the I/O technique is based on the idea that the sequence of I/O processes ...
Analysis of a variant hypercube topology
Each node of a hypercube system, when fabricated, comes with a fixed number of links designed for a maximum sized construction. Very often, there are links left unused at each node in a real system. In this article, we study the hypercube in which extra ...
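The variant topology's extra-link wiring is not specified in this excerpt, but the standard binary-hypercube adjacency the study starts from is easy to state: each node is a bit string, and its links flip one bit each. A minimal sketch (function name is illustrative):

```python
def hypercube_neighbors(node: int, dim: int) -> list[int]:
    """Neighbors of `node` in a standard binary `dim`-cube: flip one bit."""
    return [node ^ (1 << i) for i in range(dim)]

# A 3-cube built from nodes fabricated with 4 link ports leaves one
# spare port per node; the paper studies putting such spare links to use.
assert hypercube_neighbors(0b000, 3) == [0b001, 0b010, 0b100]
```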
Parallel ODE solvers
Use of parallel level 3 BLAS in LU factorization on three vector multiprocessors the ALLIANT FX/80, the CRAY-2, and the IBM 3090 VF
Schur complement preconditioned conjugate gradient methods for spline collocation equations
We are interested in the efficient solution of linear second order Partial Differential Equation (PDE) problems on rectangular domains. The PDE discretisation scheme used is of Finite Element type and is based on quadratic splines and the collocation ...
Cost-optimal parallel B-spline interpolations
We show how to transform the B-spline curve and surface fitting problems into suffix computations of continued fractions. Then a parallel substitution scheme is introduced to compute the suffix values on a newly proposed mesh-of-unshuffle network. The ...
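The reduction to suffix computations of continued fractions is scan-friendly because each tail value t_k = a_k + b_{k+1}/t_{k+1} is a Möbius map, representable as a 2×2 matrix product, and matrix product is associative. Below is a minimal sequential sketch of those suffix values (names hypothetical); the paper's parallel substitution scheme on a mesh-of-unshuffle network is not reproduced here:

```python
def suffix_cf_values(a, b):
    """Tail values t_k of the continued fraction
        a[0] + b[1] / (a[1] + b[2] / (a[2] + ...)),
    computed right-to-left via the homogeneous form t_k = p/q with
    (p, q) <- M_k (p, q), M_k = [[a_k, b_{k+1}], [1, 0]].  Since the
    matrix product is associative, a parallel scan could compute all
    suffixes in O(log n) steps; b[0] is unused."""
    n = len(a)
    p, q = a[-1], 1.0          # base case: t_{n-1} = a[n-1]
    tails = [p / q]
    for k in range(n - 2, -1, -1):
        p, q = a[k] * p + b[k + 1] * q, p   # t_k = a_k + b_{k+1}/t_{k+1}
        tails.append(p / q)
    tails.reverse()
    return tails
```

For example, with all a_k = b_k = 1 the leading suffix converges to the golden ratio.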
Solving general sparse linear systems using conjugate gradient-type methods
The problem of finding an approximation of x̂ = A†b (where A† is the pseudo-inverse of A ∈ ℝ^{m×n} with m ≥ n and rank(A) = n) is discussed. It is assumed that A is sparse but has neither a special pattern (as bandedness) nor a special property (as ...
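One generic conjugate gradient-type method for this least-squares problem is CG applied to the normal equations AᵀAx = Aᵀb (CGNR). The sketch below is a textbook version in plain Python for a small dense example, not the paper's specific variant:

```python
def cgnr(A, b, iters=50, tol=1e-12):
    """Conjugate gradient on the normal equations A^T A x = A^T b,
    minimizing ||Ax - b|| for A (m x n, m >= n, full column rank).
    A is a list of m rows; a sparse code would store only nonzeros."""
    m, n = len(A), len(A[0])
    def Av(v):                      # A v
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    def Atv(v):                     # A^T v
        return [sum(A[i][j] * v[i] for i in range(m)) for j in range(n)]
    x = [0.0] * n
    r = Atv(b)                      # normal-equation residual at x = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = Av(p)
        alpha = rs / sum(w * w for w in Ap)       # rs / ||A p||^2
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * w for ri, w in zip(r, Atv(Ap))]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

For an overdetermined 3×2 system the iteration reaches the least-squares solution in at most n = 2 steps (exact arithmetic).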
Dataflow computer development in Japan
This paper describes the research activity on dataflow computing in Japan focusing on dataflow computer development at the Electrotechnical Laboratory (ETL). First, the history of dataflow computer development in Japan is outlined. Some distinguished ...
POSC—a partitioning and optimizing SISAL compiler
Single-assignment languages like SISAL offer parallelism at all levels—among arbitrary operations, conditionals, loop iterations, and function calls. All control and data dependencies are local, and can be easily determined from the program. Various ...
Loop optimization for horizontal microcoded machines
Long Instruction Word (LIW) architectures exploit parallelism between various functional units. In order to produce efficient code for such an architecture, the microcode compiler will have to expose a relatively large degree of fine grain parallelism ...
Compiler techniques for data synchronization in nested parallel loops
The major source of parallelism in ordinary programs is do loops. When loop iterations of parallelized loops are executed on multiprocessors, the cross-iteration data dependencies need to be enforced by synchronization between processors. Existing data ...
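The cross-iteration synchronization the abstract refers to is typically compiled into post/wait pairs: iteration i waits until the iteration it depends on has posted. A minimal DOACROSS-style sketch using threads and events (the `body`/`dist` interface is illustrative, not the paper's scheme):

```python
import threading

def run_doacross(n, dist, body):
    """Run iterations 0..n-1 concurrently, where iteration i consumes a
    value produced by iteration i - dist.  Each iteration waits on its
    dependence source and posts its own completion."""
    done = [threading.Event() for _ in range(n)]

    def iteration(i):
        if i >= dist:
            done[i - dist].wait()   # wait: dependence source finished
        body(i)
        done[i].set()               # post: iteration i + dist may proceed

    threads = [threading.Thread(target=iteration, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With a recurrence such as a[i] = a[i-2] + 1, iterations of each parity chain serialize while the two chains overlap.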
Compiler techniques for data partitioning of sequentially iterated parallel loops
This paper uses bottom-up, static program partitioning to minimize the execution time of parallel programs by reducing interprocessor communication. Program partitioning is applied to a parallel programming construct known as a sequentially iterated ...
On the perfect accuracy of an approximate subscript analysis test
The Banerjee test is commonly considered to be the more accurate of the two major approximate data dependence tests used in automatic vectorization/parallelization of loops, the other being the GCD test. From its derivation, however, there is no simple ...
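For context, the GCD test mentioned here checks divisibility only: a dependence equation a·i − b·j = c can have integer solutions only if gcd(a, b) divides c. A minimal sketch (loop bounds, which the Banerjee test does consider, are ignored):

```python
from math import gcd

def gcd_test(a, b, c):
    """GCD data dependence test for array accesses A(a*i) and A(b*j + c):
    returns False only when a*i - b*j = c has no integer solution, i.e.
    the accesses are provably independent; True means 'dependence not
    ruled out' (the test is approximate)."""
    g = gcd(abs(a), abs(b))
    if g == 0:
        return c == 0
    return c % g == 0

# A(2*i) vs A(2*j + 1): 2i - 2j = 1, but gcd(2, 2) = 2 does not divide 1
assert gcd_test(2, 2, 1) is False   # provably independent
assert gcd_test(2, 2, 4) is True    # dependence possible
```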
A hardware-based performance monitor for the Intel iPSC/2 hypercube
The complexity of parallel computer systems makes a priori performance prediction difficult and experimental performance analysis crucial. A complete characterization of software and hardware dynamics, needed to understand the performance of high-...
Performance degradation due to multiprogramming and system overheads in real workloads: case study on a shared memory multiprocessor
In this paper, performance degradation specifically due to the multiprogramming (MP) overhead in a parallel execution environment is quantified. In addition, total system overhead is also measured. A methodology, which estimates the MP overhead present ...
SPARK: a benchmark package for sparse computations
As the diversity of novel architectures expands rapidly there is a growing interest in studying the behavior of these architectures for computations arising in different applications. There have been significant efforts in evaluating the performance of ...
Supercomputer performance evaluation and the Perfect Benchmarks
In the past three years, the Perfect Benchmark™ Suite has evolved from a supercomputer performance evaluation plan, presented by Kuck and Sameh at the 1987 International Conference on Supercomputing, to a vigorous international activity. This paper ...
Strategies for large-scale structural problems on high-performance computers
Novel computational strategies are presented for the analysis of large and complex structures. The strategies are based on generating the response of the complex structure using large perturbations from that of a simpler model, associated with a simpler ...
Elastodynamics on clustered vector multiprocessors
We present the parallelization of an elastodynamic code on a firmly coupled configuration consisting of two IBM 3090-600 VF, a total of 12 processors, joined with a connection facility. The programming environment used is Clustered FORTRAN which is a ...
Implementation of 5-point/9-point multi-level methods on hypercube architectures
Computational complexity of implementing 5/9-point multi-level methods on hypercube architectures is considered. The embedding of the nested red/black structures of these methods is described, and an analysis is made of data distances involved.
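The red/black structure at the heart of these methods colors grid points by the parity of i + j, so all points of one color can be relaxed simultaneously. A minimal sequential sketch of one 5-point red/black Gauss-Seidel sweep (the hypercube embedding analyzed in the paper is not shown):

```python
def red_black_sweep(u, f, h):
    """One red/black Gauss-Seidel sweep for the 5-point Laplacian
    -lap(u) = -f on a square grid with mesh width h and fixed boundary.
    Points with (i + j) even ('red') update first, then 'black' points;
    each color is fully parallel since it reads only the other color."""
    n = len(u)
    for color in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == color:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1]
                                      - h * h * f[i][j])
    return u
```

On a multi-level (multigrid) solver this sweep serves as the smoother on each grid level.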
Supercomputer-based visualization systems used for analyzing output data of a numerical weather prediction model
Comparison of two supercomputer-based visualization systems developed over a half-year period shows that visualization/animation efficiency depends largely on the efficiencies of the individual computers, networking, and memory management. Using a ...
Parallel automated wire-routing with a number of competing processors
The purpose of automated wire routing for VLSI and printed circuit board design is to connect a number of terminal pairs distributed throughout the wiring plane with net paths that do not intersect each other. Although maze running and line search are ...
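The maze running named here is classically Lee's algorithm: a breadth-first wavefront expansion that finds a shortest obstacle-avoiding path for one net. A minimal single-net sketch (the paper's parallel scheme with competing processors is not reproduced):

```python
from collections import deque

def maze_route(grid, src, dst):
    """Lee-style maze running on a grid where True cells are blocked:
    BFS wavefront from src; returns a shortest path as a list of (row,
    col) cells, or None if dst is unreachable."""
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}                       # also serves as visited set
    q = deque([src])
    while q:
        cell = q.popleft()
        if cell == dst:                      # retrace the wavefront
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and not grid[nr][nc] and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                q.append((nr, nc))
    return None
```

Routing several nets then amounts to repeating this search while marking each committed path as blocked, which is where contention between nets (and between processors) arises.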
Hierarchical algorithms and architectures for parallel scientific computing
There has been a recent emergence of many interesting and highly efficient hierarchical (multilevel) algorithms (e.g. multigrid, domain decomposition, wavelets, multilevel preconditioning, the fast multipole algorithms, etc.) for solving numerical ...
Incremental dependence analysis for interactive parallelization
Incrementally updating dependence information during interactive parallelization is a difficult proposition. We have developed a tool (PAT) that maintains dependence information during incremental transformations to a Fortran program, including loop ...
Parallelization of FORTRAN code on distributed-memory parallel processors
This paper presents some preliminary results toward the automatic parallelization of uniprocessor FORTRAN code on distributed-memory parallel processors (DMPPs). The paper introduces Oxygen, a compiler for a DMPP under development at the Laboratory. The ...