research-article

Recoverable Distributed Shared Virtual Memory

Authors:

W. Kent Fuchs

IEEE Transactions on Computers, Volume 39, Issue 4

Pages 460 - 469

https://doi.org/10.1109/12.54839

Published: 01 April 1990

Abstract

The problem of rollback recovery in distributed shared virtual memory environments, in which the shared memory is implemented in software in a loosely coupled distributed multicomputer system, is examined. A user-transparent checkpointing recovery scheme and a new twin-page disk storage management technique are presented for implementing recoverable distributed shared virtual memory. The checkpointing scheme can be integrated with the memory coherence protocol for managing the shared virtual memory. The twin-page disk design allows checkpointing to proceed in an incremental fashion without an explicit undo at the time of recovery. The recoverable distributed shared virtual memory allows the system to restart computation from a checkpoint without a global restart.
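The abstract does not give the on-disk layout, so the following is only a minimal sketch of the general twin-page idea, not the paper's actual design. It assumes each logical page owns two disk slots, each tagged with the checkpoint interval that wrote it, and that a small record of committed interval ids decides which slot is valid. The names `Slot`, `TwinPageStore`, `committed_ids`, `write`, `checkpoint`, `recover`, and `read_committed` are all hypothetical, and an in-memory dictionary stands in for the disk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Slot:
    """One of the two physical copies of a page (hypothetical layout)."""
    version: int = -1              # interval id that wrote this copy; -1 = never written
    data: Optional[bytes] = None


@dataclass
class TwinPageStore:
    """Toy in-memory model of a twin-page disk; an illustration, not the paper's design."""
    current: int = 1                                    # id of the open (uncommitted) interval
    committed_ids: Set[int] = field(default_factory=set)
    pages: Dict[int, List[Slot]] = field(default_factory=dict)

    def _slots(self, page_id: int) -> List[Slot]:
        return self.pages.setdefault(page_id, [Slot(), Slot()])

    def _valid_index(self, page_id: int) -> Optional[int]:
        """Index of the newest copy written by a committed interval, or None."""
        slots = self._slots(page_id)
        best = None
        for i, slot in enumerate(slots):
            if slot.data is not None and slot.version in self.committed_ids:
                if best is None or slot.version > slots[best].version:
                    best = i
        return best

    def write(self, page_id: int, data: bytes) -> None:
        """Write into the slot that does NOT hold the last committed copy."""
        valid = self._valid_index(page_id)
        target = 0 if valid is None else 1 - valid
        self._slots(page_id)[target] = Slot(version=self.current, data=data)

    def checkpoint(self) -> None:
        """Commit the open interval; only pages written since the last checkpoint changed."""
        self.committed_ids.add(self.current)
        self.current += 1

    def recover(self) -> None:
        """Abandon the open interval. No slot is touched, so there is no undo pass."""
        self.current += 1

    def read_committed(self, page_id: int) -> Optional[bytes]:
        """Return the page contents as of the last committed checkpoint."""
        valid = self._valid_index(page_id)
        return None if valid is None else self._slots(page_id)[valid].data
```

In this sketch, a write never overwrites the committed copy of a page, so a failure between checkpoints leaves the previous checkpoint intact. Recovery only abandons the open interval id; the stale copies it wrote are ignored on read and reused by later writes, which is one way to get incremental checkpointing without an explicit undo.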

Published In

IEEE Transactions on Computers Volume 39, Issue 4

April 1990

188 pages

ISSN:0018-9340

Editor:
Ming T. Liu
Ohio State Univ., Columbus

Copyright © 1990 IEEE. All Rights Reserved.

Publisher

IEEE Computer Society

United States
