---

Ph.D. Theses

Cache Performance Optimization of Parallel Programs

By Wesley K. Kaplow
Advisor: Boleslaw K. Szymanski
April 13, 1998

Recent advances in computer architecture and technology have reduced the number of cycles required per instruction and increased the raw clock rate of processors. Technology has also increased memory capacity, but not main memory speed. Computer architectures increase memory throughput by using cache memories, which enable the processor to compute at full speed only when programs exhibit enough data locality for the caches to be effective. This Thesis examines the issue of cache performance prediction for high-performance computing.

Cache performance prediction is critical both for the run-time performance estimation required for parallel program scheduling and for determining the parameters used to optimize loop nest codes. In this Thesis we start by examining methods for cache performance estimation, and show that very accurate prediction of code execution performance can be obtained using a method based on compile-time analysis. This technique generates at compile time nearly the exact sequence of memory references that the code follows when executed on a target machine. These compile-time references are fed into an architecturally accurate cache model of the target, allowing direct measurement of cache performance under various parameter settings. A heuristic search for optimal parameter values can then enhance the code optimization process.
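The idea above can be illustrated with a toy sketch (my own hypothetical construction, not the thesis's actual implementation): for a simple loop nest, the address trace can be enumerated analytically, without running the real computation, and fed into a small direct-mapped cache model. Comparing parameter choices, here just the traversal order, then directly measures cache behavior. All function names and cache parameters below are illustrative assumptions.

```python
# Hypothetical sketch: enumerate a loop nest's address trace at "compile
# time" and feed it to a simple direct-mapped cache model.

def trace_matrix_sum(n, row_major=True, elem_size=8, base=0):
    """Yield the byte addresses touched by: for i: for j: s += A[i][j]
    (or A[j][i] when row_major is False)."""
    for i in range(n):
        for j in range(n):
            r, c = (i, j) if row_major else (j, i)
            yield base + (r * n + c) * elem_size

def simulate_direct_mapped(trace, cache_bytes=1024, line_bytes=64):
    """Count the misses of a direct-mapped cache on an address trace."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # one resident block tag per cache line
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        idx = block % n_lines
        if tags[idx] != block:       # miss: fill the line
            tags[idx] = block
            misses += 1
    return misses

n = 64
row = simulate_direct_mapped(trace_matrix_sum(n, row_major=True))
col = simulate_direct_mapped(trace_matrix_sum(n, row_major=False))
print(row, col)   # prints 512 4096: column order misses on every reference
```

With 8-byte elements and 64-byte lines, the row-major sweep misses once per line (512 misses for 4096 references), while the column-major sweep strides across lines and misses every time; a parameter search over such a model picks the better variant without ever executing the real code.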

Next, we show that the performance of the compile-time method can be substantially improved by generating only those references that may cause a cache miss, ignoring those that are guaranteed to be in the cache. Using the discrete-event simulation technique, we introduce a novel method called miss-driven cache simulation that takes advantage of this fact. Results show that this technique improves simulation performance by an order of magnitude, making it viable for compile-time use.
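A minimal sketch of the filtering idea (my own illustration, far simpler than the thesis's miss-driven method): in a unit-stride sweep, any reference to the same cache line as the immediately preceding reference is a guaranteed hit, so it never needs to reach the simulator. The trace shrinks while the simulated miss count stays identical. Function names and parameters are assumptions for this example.

```python
# Hypothetical sketch: prune guaranteed hits from a trace before simulating.

def simulate(trace, cache_bytes=1024, line_bytes=64):
    """Miss count of a direct-mapped cache on an address trace."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        idx = block % n_lines
        if tags[idx] != block:
            tags[idx] = block
            misses += 1
    return misses

def full_trace(n=64, elem=8):
    """Row-major sweep of an n x n array: for i: for j: ... A[i][j]."""
    for i in range(n):
        for j in range(n):
            yield (i * n + j) * elem

def miss_candidates(trace, line_bytes=64):
    """Drop references that trivially hit: a repeat of the immediately
    preceding cache line is a guaranteed hit, so skip it."""
    prev = None
    for addr in trace:
        block = addr // line_bytes
        if block != prev:
            yield addr
        prev = block

full = list(full_trace())
cand = list(miss_candidates(full))
print(len(full), len(cand))            # prints 4096 512: 8x fewer references
assert simulate(full) == simulate(cand)   # identical miss count
```

The thesis's miss-driven simulator prunes far more aggressively than this one-line-lookback filter, but the principle is the same: only potential misses need to be generated and simulated.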

Unfortunately, the methods described above, as well as other cache performance estimation and improvement methodologies, do not work for program loop nests containing array references whose subscripts have run-time data dependencies. These codes have data reference patterns that are not determined until run time, so purely compile-time methods can neither predict nor improve their cache performance. For these irregular problems we present a novel method called run-time reference clustering. This method makes compile-time code modifications so that the critical data sets of the program are referenced in clusters, improving data reference locality. The contents of the clusters are determined at run time using the same irregular partitioning techniques used to decompose programs for parallel execution. We show how this method can dramatically improve the performance of several common scientific programs.
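The effect of run-time clustering can be sketched with a deliberately simple stand-in (first-touch renumbering, not the partitioning-based clustering the thesis actually uses): once the irregular subscript array is known at run time, the data elements are renumbered so that elements referenced close together in time become adjacent in memory. All names and parameters below are hypothetical.

```python
# Hypothetical sketch: run-time data renumbering for an irregular access
# pattern x[idx[k]], whose subscripts are unknown at compile time.
import random

def simulate(trace, cache_bytes=1024, line_bytes=64):
    """Miss count of a direct-mapped cache on an address trace."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        idx = block % n_lines
        if tags[idx] != block:
            tags[idx] = block
            misses += 1
    return misses

def indirect_trace(indices, elem=8):
    """Addresses touched by: for k: ... x[indices[k]] ..."""
    for i in indices:
        yield i * elem

random.seed(1)
n = 4096
idx = [random.randrange(n) for _ in range(n)]   # irregular subscripts

# Run-time clustering (first-touch stand-in): assign each element a new
# position in the order it is first referenced, so temporally close
# references become spatially close.
new_id = {}
for i in idx:
    if i not in new_id:
        new_id[i] = len(new_id)
clustered = [new_id[i] for i in idx]

before = simulate(indirect_trace(idx))
after = simulate(indirect_trace(clustered))
print(before, after)
assert after < before   # clustering cuts the miss count
```

The real method determines cluster contents with the same irregular partitioning used for parallel decomposition, which groups references far more effectively than first-touch order; this sketch only shows why any such run-time reordering pays off in the cache.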



---