DOI: 10.1145/782814.782834

Modeling and optimization of non-blocking checkpointing for optimistic simulation on myrinet clusters

Published: 23 June 2003

Abstract

Checkpointing and Communication Library (CCL) is recently developed software that implements CPU-offloaded checkpointing in support of optimistic parallel simulation on Myrinet clusters. Specifically, CCL implements a non-blocking execution mode for the memory-to-memory data copy associated with checkpoint operations, based on the data transfer capabilities of a programmable DMA engine on board Myrinet network cards. Re-synchronization between CPU and DMA activities must sometimes be employed for several reasons, such as maintaining data consistency, which adds some overhead to non-blocking checkpoint operations that are otherwise free of CPU cost. In this paper we present a cost model for non-blocking checkpointing and derive a performance-effective re-synchronization semantic, which we call minimum cost re-synchronization (MC). With this semantic, an occurrence of re-synchronization either commits the ongoing DMA-based checkpoint operation (suspending CPU activities) or aborts it (possibly increasing the expected rollback cost because fewer checkpoints are committed), whichever the cost model predicts to have the lower expected overhead. We have implemented MC within CCL, and we report experimental results for a Personal Communication System (PCS) simulation application that demonstrate the performance benefits of this optimized re-synchronization semantic in terms of increased execution speed.
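
The commit-or-abort choice described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual cost model, so the two cost expressions below are assumptions made only for illustration: committing is assumed to cost the CPU the remaining DMA copy time, and aborting is assumed to cost the probability of a rollback that would have used the checkpoint times the extra state-reconstruction time incurred without it. The names `ResyncContext`, `commit_cost`, `abort_cost`, and `mc_resync` are hypothetical and are not part of CCL.

```python
from dataclasses import dataclass


@dataclass
class ResyncContext:
    """Hypothetical inputs to the decision; not taken from the paper."""
    remaining_dma_time: float      # time the CPU would stall until the DMA copy ends
    rollback_probability: float    # estimated chance a rollback needs this checkpoint
    extra_recovery_time: float     # added reconstruction time if the checkpoint is lost


def commit_cost(ctx: ResyncContext) -> float:
    # Assumed cost of committing: the CPU is suspended for the rest of the copy.
    return ctx.remaining_dma_time


def abort_cost(ctx: ResyncContext) -> float:
    # Assumed cost of aborting: expected increase in rollback cost.
    return ctx.rollback_probability * ctx.extra_recovery_time


def mc_resync(ctx: ResyncContext) -> str:
    """Pick the action with the lower expected overhead (ties favor commit)."""
    return "commit" if commit_cost(ctx) <= abort_cost(ctx) else "abort"


# A nearly finished copy is worth waiting for; a long remaining copy is not.
almost_done = ResyncContext(remaining_dma_time=5.0,
                            rollback_probability=0.1,
                            extra_recovery_time=100.0)
just_started = ResyncContext(remaining_dma_time=20.0,
                             rollback_probability=0.1,
                             extra_recovery_time=100.0)
decisions = (mc_resync(almost_done), mc_resync(just_started))
```

Under these assumed costs, the first context commits and the second aborts. A real implementation would estimate the inputs from runtime statistics, and the paper's model may weigh them differently.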



Published In

ICS '03: Proceedings of the 17th annual international conference on Supercomputing
June 2003
380 pages
ISBN: 1581137338
DOI: 10.1145/782814

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DMA
  2. checkpointing
  3. optimistic simulation
  4. performance optimization

Qualifiers

  • Article

Conference

ICS03: International Conference on Supercomputing 2003
June 23-26, 2003
San Francisco, CA, USA

Acceptance Rates

ICS '03 Paper Acceptance Rate 36 of 171 submissions, 21%;
Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

  • Total Citations: 0
  • Total Downloads: 352
  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 17 Jan 2025
