Accelerating multicore reuse distance analysis with sampling and parallelization
DL Schuff, M Kulkarni, VS Pai - … of the 19th international conference on …, 2010 - dl.acm.org
Proceedings of the 19th international conference on Parallel architectures …, 2010 • dl.acm.org
Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.
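As a rough illustration of the quantities the abstract discusses, the sketch below computes reuse distances (the number of distinct other addresses touched between two accesses to the same address) in two ways: a full analysis that maintains the whole reuse stack, and a sampled analysis that tracks only randomly chosen accesses. This is a minimal single-threaded sketch under stated assumptions, not the paper's implementation: the function names, the `rate` and `seed` parameters, and the per-address set are all hypothetical, and the O(1) thread-private data structures mentioned in the abstract are not reproduced here.

```python
import random
from collections import OrderedDict


def reuse_distances(trace):
    """Full analysis: one entry per access.

    The entry is the number of distinct other addresses touched since the
    previous access to the same address, or None for a first access.
    """
    stack = OrderedDict()  # reuse stack; most recently used address is last
    out = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # Count addresses that sit above addr in the stack.
            out.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            out.append(None)
            stack[addr] = True
    return out


def sampled_reuse_distances(trace, rate, seed=0):
    """Sampled analysis (illustrative assumption, not the paper's method).

    Each access is selected for tracking with probability `rate`. A tracked
    address records the distinct addresses seen until it is reused, so the
    full reuse stack is never needed.
    """
    rng = random.Random(seed)
    tracked = {}  # addr -> set of distinct addresses seen since it was sampled
    out = []
    for addr in trace:
        if addr in tracked:
            # Reuse of a sampled address: its distance is the set size.
            out.append(len(tracked.pop(addr)))
        for seen in tracked.values():
            seen.add(addr)
        if addr not in tracked and rng.random() < rate:
            tracked[addr] = set()
    return out


trace = list("abcabb")
full = reuse_distances(trace)
sampled_all = sampled_reuse_distances(trace, rate=1.0)
```

With `rate=1.0` every access is tracked, so the sampled distances match the non-`None` entries of the full analysis; lower rates trade completeness of the profile for less tracking work.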