With the increasing gap between processor speed and memory speed, a sophisticated memory hierarchy is key to high performance. However, the operating system tends to use the memory hierarchy poorly. This thesis presents a comprehensive characterization and optimization of the performance of multiprocessor memory hierarchies for operating systems. The operating system instruction cache misses are reduced by 81% using a code reorganization scheme tailored to the operating system, guarded sequential prefetching, and stream buffers. The operating system data cache misses are reduced by 53% using a DMA-like pipelined block transfer engine, a selective update protocol, data relocation and privatization, and data prefetching in miss hot spots. The overall OS time is reduced by 32%. The cost-performance trade-offs of the software/hardware optimization schemes are also discussed.