Archive-name: os-research/part3
Version: $Revision: 1.3 $
Posting-Frequency: monthly
Last-Modified: Tue Aug 13 21:03:20 1996
URL: http://www.serpentine.com/~bos/os-faq/
Answers to frequently asked questions
for comp.os.research: part 3 of 3
Copyright (C) 1994--1996
Bryan O'Sullivan
TABLE OF CONTENTS
1. Distributed systems
1.1. What is the current status of the (insert name) project?
1.2. How do approaches to load balancing differ?
1.3. Fault tolerance in distributed systems
1.4. Naming in distributed systems
1.5. Distributed shared memory
1.5.1. Data consistency
1.5.1.1. Strictly consistent systems
1.5.1.2. Relaxing consistency
1.5.1.3. Application-specific coherence
1.5.2. Access synchronisation
1.5.3. Transfer and caching granularity
1.5.4. Address space structure
1.5.5. Fault tolerance
1.5.6. A brief bibliography on distributed shared memory
1.6. What have we learned?
2. Needful things
Subject: [1] Distributed systems
From: Distributed systems
A great deal of the high-profile research carried out in operating
systems these days deals with distributed computing. Not
surprisingly, discussions of distributed systems make up a large
amount of the traffic on comp.os.research.
Subject: [1.1] What is the current status of the (insert name) project?
From: Distributed systems
See the section on `available software' for information on
distributions of some of the systems mentioned here.
- The Amoeba project is still going. There are roughly 20 people
working on it, but most of these are no longer kernel hackers. They
are working on using it for parallel programming, wide-area
distributed systems, and other things. Amoeba is used in over 100
universities at the moment, and is also used at commercial
institutions.
- Brazil is the new research operating system being developed at AT&T
Bell Labs. Research topics being addressed in Brazil center on
higher-performance machines and, particularly, networks. A new
in-house 300 megabit/s switched fiber network increases the
potential bandwidth between machines by at least an order of
magnitude; our aim is to realize and exploit that bandwidth. The
overall design is to eliminate unnecessary overhead, particularly by
restructuring and redesigning where necessary to avoid copying data
from element to element along the communications path. Most of this
software (except the operating system kernel) is written in a new
concurrent systems programming language, Alef, which makes it easy
to write multi-process servers and applications that can communicate
using messages or shared memory, as appropriate. A paper on Alef is
available from the Plan 9 ftp site; see part 2 of this FAQ for a
pointer.
- Cronus is still under development at BBN. The current public
release is 3.0. The project currently has two thrusts---as the base
for advanced distributed system R&D, and as a platform for
constructing and deploying sophisticated distributed applications.
Ongoing research topics include the integration of Cronus and Mach
technology, the exploration of techniques for the construction of
WAN-based and multi-organisational applications, investigation into
the integration of distributed systems and network management
systems, and work in high-performance distributed computing.
- Horus is being developed by the same group that worked on Isis; the
head of this group is Robbert van Renesse.
- Isis is no longer being developed at Cornell; it is now managed as a
commercial product.
- Mach is no longer being developed at CMU. Current work on Mach is
being carried out by the OSF Research Institute and at the
University of Utah.
- Plan 9 is no longer in development at AT&T Bell Labs. fibre-optic
network. The operating systems research group at Bell Labs has
moved on to a new project, called Brazil, which addresses portable
computing and distributed applications programming.
- QNX is a commercial POSIX-certified realtime OS with an installed
base of over 250,000 systems. It is used extensively in process
control, factory automation, medical instrumentation, communications
and point-of-sale. A number of universities are also doing research
with QNX.
- The Sprite network operating system project has ended.
Subject: [1.2] How do approaches to load balancing differ?
From: Distributed systems
Load-balancing policy falls into two broad groups: static and dynamic.
Static policies use algorithms which operate without regard to
run-time loads across a system, while dynamic policies use the
run-time performance of various parts of a system in order to make
more `informed' decisions about balancing.
[92-11-06-12-53.57] A dynamic load-balancing policy is one which uses
run-time state information in making scheduling decisions.
There are two kinds of dynamic policies: adaptive and non-adaptive.
The latter always use the same (fixed, load-dependent) policy; the
former may adjust policy parameters in order to gradually improve
their performance.
The key point is that while non-adaptive policies use only the
information about the run-time state, adaptive policies use, in
addition to this, information about current performance.
In adaptive policies, the rules for adjusting policy parameters may be
static or dynamic. An example of the former might be: `shift to a
conservative migration rule when system-wide load patterns are varying
too rapidly'. An example of the latter could be: `increase
sender-side threshold when migrated jobs cause slowdown rather than
speedup'. Some researchers refer to the performance-driven adaptation
exhibited by the second policy as `learning'.
Since both non-adaptive policies and adaptive policies with static
rules really use only load information, it is confusing to distinguish
between them. One way to avoid such confusion is to restrict the use
of the word `adaptive' to policies that use performance feedback in
order to drive their adjustment of policy parameters.
Subject: [1.3] Fault tolerance in distributed systems
From: Distributed systems
One approach to providing fault tolerance in distributed systems
involves the use of redundant services, such that standby facilities
can become active in the event of the failure of, or loss of
connection to, a primary service.
Another approach is to provide multiple paths of connectivity between
the computers that make up the distributed system. The QNX system,
for example, supports multiple network drivers per node. The purpose
of the network connection under QNX is to merge the microkernels on
the LAN into a single logical kernel. Hence, if multiple LAN
connections per node are present, the networking code can load balance
the LAN traffic on the paths available. It can also route around
failed links, providing both greater LAN bandwidth and better fault
tolerance.
See below for treatment of fault tolerance in systems which make use
of distributed shared memory.
Subject: [1.4] Naming in distributed systems
From: Distributed systems
[Material on naming and/or global naming sought.]
Subject: [1.5] Distributed shared memory
From: Distributed systems
Distributed computer systems have evolved using message passing as
their main method of communication. Other communication systems used
in loosely coupled distributed systems, such as RPC, are usually
implemented on top of an underlying message passing system. On the
other hand, in tightly coupled systems, such as a multi-processor
machine, the communication method used is usually shared memory.
In distributed shared memory (DSM) systems [Nitzberg & Lo, 91],
processes share data transparently across node boundaries; data
faulting, location, and movement is handled by the underlying system.
Among other things, this allows parallel programs designed to use
shared memory to execute transparently on a loosely coupled
distributed system. While the performance implications cannot be
ignored, the advantages of the shared memory programming model are
well known:
- Shared memory programs are usually shorter and easier to understand
than equivalent message passing programs.
- Large or complex data structures may easily be communicated.
- Shared memory gives transparent process-to-process communication.
- Programming with shared memory is a well-understood problem.
Shared-memory (or `procedure-oriented') and message-oriented operating
systems are, in some sense, equivalent [Lauer & Needham, 78], though
it has been claimed that the former are `more powerful' [Tam et al.,
90].
Subject: [1.5.1] Data consistency
From: Distributed systems
Despite recent advances in both local and wide-area networking
technologies, network latency is still a major factor in distributed
systems and likely to remain so. All DSM systems provide some sort of
caching in an attempt to improve the performance beyond that provided
by doing a network access on every reference to a non-local data item.
Each system must decide whether or not to attempt to keep the data
coherent, and, if so, what coherence strategy to use. The coherence
semantics which may be provided to the programmer include:
- `strict' consistency, where a read always returns the value written
by the most recent write
- a `loosely' consistent system where the system enforces some form of
weak consistency guarantees and the application (or compiler or
user) can indicate synchronisation points where consistency must be
enforced;
- no automatic consistency mechanism, but provide the user with the
facilities necessary to implement user level synchronisation and
consistency.
Subject: [1.5.1.1] Strictly consistent systems
From: Distributed systems
Older, strictly consistent systems tend to enforce a single writer,
multiple reader model, where at any time data will be held either at a
single node (which may have write access) or several nodes (none of
which may have write access).
Given this model, we must be able to locate a copy of our data when it
is not resident. The method most frequently used is to assign an
`owner' to each item of data, where the owner has either the only
writeable copy of the data, or one of the read-only copies. Ownership
may remain fixed throughout the life of a datum, or it may change
dynamically. In the latter case, the problem arises of locating the
owner. A database of locations may be maintained by centralised
managers, or ownership information can be distributed among nodes of
the system [Li and Hudak, 89].
In a strictly consistent system, we must also be able to synchronise
writes. The two major solutions to this problem are:
- Write broadcast. The effects of every write are broadcast to ever
node that has a copy of the data being written; this effectively
implements a replication algorithm. Write broadcast is usually
considered too expensive to be used as a general solution.
- Write invalidation. Each node in the system holding a read-only
copy of the data being written is sent an invalidation message.
Subject: [1.5.1.2] Relaxing consistency
From: Distributed systems
Permitting temporary inconsistencies is a common method of increasing
performance in distributed systems. Memory is said to be loosely
coherent if the value returned by a read operation is the value
written by an update operation to the same object that `could' have
immediately preceded the read operation in some legal schedule of the
threads in execution [Bennett et al., 90].
Using loose coherence, more than one thread may have write access to
the same object, provided that the programmer knows that the writes
will not conflict.
Another memory consistency model is `release consistency'
[Gharachorloo et al., 90], in which memory accesses are divided into
ordinary and synchronisation-related accesses. The latter are further
divided into `acquire' and `release' operations. The `acquire'
operation indicates that shared data is needed, and a processor's
updates are not guaranteed to be performed at other nodes until a
`release' is performed. The primary advantage of this form of
consistency is that it allows consistency updates to be tied to
synchronisation events, and therefore to be delayed until actually
needed by applications. However, most release consistent systems
require the programmer to make explicit use of `acquire' and `release'
operations.
A DSM system called Midway introduces another new consistency model,
`entry consistency' [Bershad et al., 93]. Entry consistency is weaker
than many of the other models suggested, including release
consistency; it requires explicit annotations to associate
synchronisation objects and data. On an `acquire', only the data
associated with the synchronisation object is guaranteed to be
consistent. This extra weakness permits higher performance
implementations of the underlying consistency protocols to be written.
Midway also supports stronger consistency models, so that the
application programmer can trade-off performance against the extra
effort required to write entry consistent programs.
Subject: [1.5.1.3] Application-specific coherence
From: Distributed systems
From [Cheriton, 86]:
`Problem-oriented shared memory' is a shared memory that implements
fetch and store operations specialised to the particular problem or
application it is supporting. In particular, a problem-oriented
shared memory commonly provides a specialised form of consistency
and consistency maintenance that exploits application-specific
semantics.
Cheriton goes on to propose that consistency constraints be relaxed
and more use be made of problem semantics. He suggests that, in some
cases, stale data may be detected on use by the client, and the client
may then recover. A example would be hint caching. In some
applications, stale data may actually be sufficiently accurate,
provided that the client can obtain up to date information when
necessary. In other applications, some data may be optional in the
sense that the client can continue without it. Other applications may
tolerate having the results of store operations being lost or undone,
for example, an application that regularly updates the entire data
set.
Another approach is presented by the designers of Munin, where the
runtime system accepts hints from the compiler or user to determine
the coherence mechanism to be used for each object. The default, in
the absence of hints, is to use a general read-write consistency
mechanism, much like that employed by IVY. Munin supports several
different object types that are based on the results of a survey of
shared memory access characteristics. The results of the survey
showed that a very small percentage of all accesses to shared data
fall under the general read-write type. The Munin designers also note
that a program moves through various stages of execution, and the
types associated with objects change as time progresses
Subject: [1.5.2] Access synchronisation
From: Distributed systems
Most parallel applications will use some sort of synchronisation
system to order and control accesses to shared data before actually
accessing the data. The most important thing to note in DSM systems
is that just blindly using standard test and set operations on bytes
in shared pages will produce a high fault rate; faults are usually
expensive, making this approach unacceptable.
Clouds merges locking with the cache consistency protocol, so that the
user may obtain both a lock and the data in one network transaction.
This system has the advantage that no invalidation messages are
required, since the granting of the lock guarantees that there are no
conflicting copies; it has the disadvantage that an explicit
unlock/discard operation is required to release access to the data.
This is acceptable in Clouds, as the DSM system was designed
specifically to support object invocation, so it is easy to discard on
a return.
Munin provides a distributed lock mechanism using `proxy objects' to
reduce network load. Proxy objects are maintained by a lock server on
each node; when a thread wants to obtain a lock on an object, it
attempts to lock the proxy instead. The server obtains the global
lock if it is not already held locally. Global locking is done by
negotiating with all the other lock servers in the system. Each lock
may be migrated from server to server, and part of the Munin system
allows objects to be migrated along with their locks.
Other systems, such as IVY and Mermaid, use modified versions of classic
multiprocessor synchronisation facilities.
Subject: [1.5.3] Transfer and caching granularity
From: Distributed systems
When caching objects in local memory, it is necessary to decide what
level of granularity to use. All current systems use a fixed block
size in the cache, rather than varying the granularity based on object
size. Usually this is due to constraints imposed by the system
hardware and memory management.
The choice of the block size in the cache depends on several issues.
- Cost of communication: for example, on many local area networks
there is little difference between the time required to send a
one-byte message and that required to send a 1024-byte message.
Transmitting bulk changes rather than single-byte modifications
would therefore seem desirable.
- The choice of granularity also depends on the locality of reference
in the application, as thrashing may occur when two machines are
both accessing the same block (this is also known as the `ping-pong
effect'). This would seem to argue for a smaller block size. It
should be noted that many object-oriented systems exhibit very poor
locality of reference.
In practice, a compromise must be achieved, as with conventional
virtual memory systems. Most systems use a block size which is the
same as that of the virtual memory management unit on the system, or a
multiple thereof. Among other things, it allows the hardware to be
used to help in the maintenance of consistency. The choice is
complicated somewhat when heterogeneous machines are being used, but
in these cases, the lowest common multiple of hardware supported page
sizes can usually be used.
The only major system that doesn't use a large block size is Memnet,
in which a hardware based DSM system was implemented on a high speed
token ring; a 32-byte block size was used instead [Delp & Farber].
The choice of a small block size is appropriate, as the system is much
closer to a shared memory multi-processor than it is to a software DSM
system. This is because the entire processor is blocked on a cache
miss; the processor is not actually aware of the distributed nature of
its address space. Also, the ratio between remote and local memory
access times is much lower than in the software based systems due to
the dedicated token ring (200Mbps) and hardware assistance.
Subject: [1.5.4] Address space structure
From: Distributed systems
In a single shared address space system, the system appears as a set
of threads executing in a shared distributed address space. Objects
always appear at the same addresses on all nodes. Single address
space systems have had a resurgence in popularity with the arrival of
64-bit processors. A number of researchers believe that a 64-bit
address space is large enough to act as a single global address space
for all the memory (both primary and secondary) in a distributed
system. Examples of such systems include Angel, Mungi, and Opal.
Security and protection are a major problem in such systems, and
current approaches either rely on hardware assistance or stochastic
algorithms, or ignore the problem.
Another approach is to divide each process's address space into
different fixed regions, some of which are private and not shared, and
some of which are shared with some other processes. Ra, the Clouds
kernel, takes this approach using O, P, and K address regions, with
the O region shared between all processes executing in a given object;
the P and K regions are local to a process and kernel, respectively.
Here objects always appear at the same address but may not be visible
from every address space. By contrast, some systems, including Mirage
and Mach, allow shared data to exist at differing addresses in
different processes address spaces. However, neither system does
transparent pointer translation, so the address changes are not
entirely transparent to the application.
As for the structuring of the shared region itself, some systems --
for example, IVY and Mether -- use a single flat region: one
continuous range of virtual addresses represent the shared address
space and are managed by the DSM system. This single address space is
usually sub-divided into pages. Most systems use paged segmentation:
the shared region consists of disjoint pieces, which are usually
managed separately and are not all mapped in any one process.
Frequently, the segments (sometimes called memory objects, or windows)
are related to the backing store. For example, in Clouds, the object
address space consists of windows onto larger segments; these segments
are usually maintained on secondary storage.
Subject: [1.5.5] Fault tolerance
From: Distributed systems
Most DSM systems ignore the fault tolerance issue or maintain that it
is an operating system issue and should be handled by the underlying
system. However, it would appear that in practice a DSM system would
strongly effect the fault tolerance of a system. For example, in a
system where several systems are sharing access to a set of data, the
failure of any one of them could lead to the failure of all the
connected sites (or, at least, some of the processes on each site).
We are also presented with an unusual failure handling problem. It is
fairly easy to see how to handle a failed message or RPC, but how do
you handle a failed page fault?
The original Clouds system provided recoverability using shadowing of
segments and a transactional system using commits. The recovery
system was not really integrated with the DSM system and was merely
implemented at the segment storage site. In order to maintain a
consistent view of data when one transaction is active at multiple
nodes, they have more recently been forced to integrate the
transaction system with the DSM support system.
Subject: [1.5.6] A brief bibliography on distributed shared memory
From: Distributed systems
[Nitzberg & Lo, 1991]
Nitzberg, W. and Lo, V., `Distributed shared memory: a survey of
issues and algorithms', IEEE Computer, August 91, pp. 52-60
[Lauer & Needham, 1978]
[Tam et al., 90]
Tam, M.-C., Smith, J. M. & Farber, D. J., `A taxonomy-based
comparison of several distributed shared memory systems', ACM
Operating Systems Review 24(3), July 90, pp. 40-67
[Li and Hudak, 89]
Li, K. & Hudak, P., `Memory coherence in shared virtual memory
systems', ACM Transactions on Computer Systems 7(4), November 89,
pp. 321-359
[Bennett et al., 90]
Bennett, J. K., Carter, J. B. & Zwaenopoel, W., `Munin:
distributed shared memory based on type-specific memory
coherence', Proceedings of the 2nd ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, SIGPLAN Notices
25(3), March 90, pp. 168-176
[Gharachorloo et al., 90]
Gharachorloo, K., et al., `Memory consistency and event ordering in
scalable shared-memory multiprocessors', ACM SIGARCH News 18(2),
June 90
[Bershad et al., 93]
Bershad, B. N., et al., `The Midway distributed shared memory
system', Technical Report CMU-CS-93-119, School of Computer
Science, Carnegie Mellon University, 1993. Available via
anonymous ftp from
<URL:ftp://ftp.cs.cmu.edu/project/mach/public/doc/published/>.
[Cheriton, 86]
Cheriton, D. R., `Problem-oriented shared memory: a decentralized
approach to distributed system design', Proceedings of the 6th
International Conference on Distributed Computing Systems, May 86,
pp. 190-197
[Delp & Farber]
Delp, G. S. & Farber, D. J., `Memnet -- a different approach to a
network', Technical Report, Department of Electrical Engineering,
University of Delaware, ???
Subject: [1.6] What have we learned?
From: Distributed systems
Andy Tanenbaum started a (very long) thread on this topic in
comp.os.research in April of 1992 [92-04-03-17-10.05]. The interested
reader is directed to the comp.os.research archives, since this thread
proved rather divisive (i.e. nobody really agreed on any issue).
Subject: [2] Needful things
From: Needful things
This FAQ is incomplete, and will probably remain in this state to a
greater or lesser extent for ever and ever. Should you feel willing
to contribute some material, the following is a list of topics which
``urgently'' require treatment (some of which I may get around to
covering myself at some point):
- naming in distributed systems
|